SOFT8033 Assignment 2 - Spark Algorithms

It’s been a few weeks since you started your short-term internship in the Data Analytics Department of the start-up OptimiseYourJourney, which will enter the market next year with a clear goal in mind: “leverage Big Data technologies to improve the user experience in transportation”. Your contribution in Assignment 1 has proven the potential OptimiseYourJourney can obtain by applying MapReduce to analyse large-scale public transportation datasets, such as the one from the New York City Bike Sharing System: https://www.citibikenyc.com/

 

 

In the department meeting that has just finished, your boss was particularly happy again:

•      The very same dataset from Assignment 1 (let’s call it my_dataset_1) provides an opportunity to leverage other large-scale data analysis libraries, such as Spark SQL.   

•      The graph structure of the dataset allows you to explore the potential of Spark GraphFrames, a small Spark library specialised in the parallel execution of classical graph algorithms. To do so, two small graph examples (let’s call them my_dataset_2 and my_dataset_3) are provided to explore the classical algorithms of:

o Dijkstra – for finding the shortest path from a source node to the remaining nodes.

o PageRank – for assigning a value to each node based on its neighbourhood.
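As a point of reference for the kind of computation involved, a minimal single-source Dijkstra over an adjacency list can be sketched in plain Python (the 4-node graph below is a made-up example, not my_dataset_2):

```python
import heapq

def dijkstra(graph, source):
    """Return the cost of the shortest path from source to every reachable node.

    graph: dict mapping node -> list of (neighbour, edge_weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]  # priority queue of (distance, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neigh, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                heapq.heappush(heap, (nd, neigh))
    return dist

# Hypothetical undirected graph, each edge stored in both directions
graph = {
    1: [(2, 1), (3, 4)],
    2: [(1, 1), (3, 2)],
    3: [(1, 4), (2, 2), (4, 1)],
    4: [(3, 1)],
}
print(dijkstra(graph, 1))  # {1: 0, 2: 1, 3: 3, 4: 4}
```

GraphFrames parallelises this kind of traversal across a cluster; the sequential version above only illustrates the result the algorithm computes.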

 

 

 

 

DATASET 1:   

This dataset occupies ~80MB and contains 73 files. Each file contains all the trips registered in the CitiBike system for a concrete day:

•      2019_05_01.csv => All trips registered on the 1st of May of 2019.  

•      2019_05_02.csv => All trips registered on the 2nd of May of 2019.  

•      … 

•      2019_07_12.csv => All trips registered on the 12th of July of 2019. 

 

Altogether, the files contain 444,110 rows. Each row contains the following fields:

start_time , stop_time , trip_duration , start_station_id , start_station_name , start_station_latitude , start_station_longitude , stop_station_id , stop_station_name , stop_station_latitude , stop_station_longitude , bike_id , user_type , birth_year , gender , trip_id 

 

•      (00) start_time  

! A String representing the time the trip started at.  

<%Y/%m/%d %H:%M:%S>  

! Example: “2019/05/02 10:05:00” 

•      (01) stop_time  

! A String representing the time the trip finished at.   

<%Y/%m/%d %H:%M:%S>  

! Example: “2019/05/02 10:10:00” 

•      (02) trip_duration  

! An Integer representing the duration of the trip. 

! Example: 300 

•      (03) start_station_id 

! An Integer representing the ID of the CitiBike station the trip started from. 

! Example: 150 

•      (04) start_station_name  

! A String representing the name of the CitiBike station the trip started from. 

! Example: “E 2 St & Avenue C”. 

•      (05) start_station_latitude 

! A Float representing the latitude of the CitiBike station the trip started from. 

! Example: 40.7208736 

•      (06) start_station_longitude 

! A Float representing the longitude of the CitiBike station the trip started from. 

! Example:  -73.98085795 

 

 

•      (07) stop_station_id 

! An Integer representing the ID of the CitiBike station the trip stopped at. 

! Example: 150 

•      (08) stop_station_name  

! A String representing the name of the CitiBike station the trip stopped at.  

! Example: “E 2 St & Avenue C”. 

•      (09) stop_station_latitude 

! A Float representing the latitude of the CitiBike station the trip stopped at. 

! Example: 40.7208736 

•      (10) stop_station_longitude 

! A Float representing the longitude of the CitiBike station the trip stopped at. 

! Example:  -73.98085795 

•      (11) bike_id 

! An Integer representing the id of the bike used in the trip.  

! Example:  33882. 

•      (12) user_type 

! A String representing the type of user using the bike (it can be either “Subscriber” or “Customer”).  

! Example: “Subscriber”. 

•      (13) birth_year 

! An Integer representing the birth year of the user using the bike.  

! Example:  1990. 

•      (14) gender 

! An Integer representing the gender of the user using the bike (0 => unknown; 1 => male; 2 => female).  

! Example:  2. 

•      (15) trip_id 

! An Integer representing the id of the trip.  

! Example:  190. 
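To make the 16-field layout concrete, one way to parse a single row into typed values with the Python standard library (the sample line below is made up, following the formats listed above):

```python
import csv
from io import StringIO

# One hypothetical row in the 16-field format described above
sample = ('2019/05/02 10:05:00,2019/05/02 10:10:00,300,150,"E 2 St & Avenue C",'
          '40.7208736,-73.98085795,150,"E 2 St & Avenue C",40.7208736,'
          '-73.98085795,33882,Subscriber,1990,2,190')

fields = next(csv.reader(StringIO(sample)))
assert len(fields) == 16

trip = {
    "start_time": fields[0],                      # String, %Y/%m/%d %H:%M:%S
    "stop_time": fields[1],
    "trip_duration": int(fields[2]),              # Integer
    "start_station_id": int(fields[3]),
    "start_station_name": fields[4],
    "start_station_latitude": float(fields[5]),
    "start_station_longitude": float(fields[6]),
    "stop_station_id": int(fields[7]),
    "stop_station_name": fields[8],
    "stop_station_latitude": float(fields[9]),
    "stop_station_longitude": float(fields[10]),
    "bike_id": int(fields[11]),
    "user_type": fields[12],                      # "Subscriber" or "Customer"
    "birth_year": int(fields[13]),
    "gender": int(fields[14]),                    # 0 unknown, 1 male, 2 female
    "trip_id": int(fields[15]),
}
print(trip["trip_duration"], trip["user_type"])
```

In the actual exercises the same typing is handled declaratively by the Spark SQL schema when the dataset is loaded with 'read'.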

 

 

 

DATASET 2:   

This dataset consists of the file tiny_graph.txt, which contains 26 directed edges (that is, 13 undirected edges, each stored once in each direction) in a graph with 8 nodes.

 

DATASET 3:   

This dataset consists of the file tiny_graph.txt, which contains 18 directed edges (that is, 9 undirected edges, each stored once in each direction) in a graph with 6 nodes.
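The "one edge in each direction" convention can be illustrated in plain Python (the edge list below is hypothetical, not the actual contents of tiny_graph.txt):

```python
# Hypothetical undirected, weighted edges (source, target, weight);
# NOT the actual contents of tiny_graph.txt.
undirected = [(1, 2, 3), (1, 3, 1), (2, 3, 2)]

# The datasets store each undirected edge as two directed edges.
directed = []
for src, dst, weight in undirected:
    directed.append((src, dst, weight))
    directed.append((dst, src, weight))

print(len(directed))  # twice the number of undirected edges
```

This doubling is why Dataset 2 lists 26 edges for 13 undirected ones, and Dataset 3 lists 18 for 9.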

 

 

 

 

 

 

 

 

TASKS / EXERCISES. 

The tasks / exercises to be completed as part of the assignment are described on the following pages:

 

•      The following exercises are placed in the folder my_code: 

1.      A02_Part1/A02_Part1.py   

2.      A02_Part2/A02_Part2.py

3.      A02_Part3/A02_Part3.py

4.      A02_Part4/A02_Part4.py   

5.      A02_Part5/A02_Part5.py   

 

Marks are as follows: 

1.      A02_Part1/A02_Part1.py => 18 marks

2.      A02_Part2/A02_Part2.py => 18 marks

3.      A02_Part3/A02_Part3.py => 18 marks

4.      A02_Part4/A02_Part4.py => 18 marks

5.      A02_Part5/A02_Part5.py => 28 marks

 

 

Tasks: 

1.      A02_Part1/A02_Part1.py 

2.      A02_Part2/A02_Part2.py 

3.      A02_Part3/A02_Part3.py 

Complete the function my_main of the Python program.  

Do not modify the name of the function nor the parameters it receives.

The entire work must be done within Spark SQL:

§  The function my_main must start with the creation operation 'read', loading the dataset into Spark SQL.

§  The function my_main must finish with the action operation 'collect', gathering the result of the Spark SQL job and printing it to the screen.

§  The function my_main must not contain any action operation 'collect' other than the one appearing at the very end of the function.

§  The resVAL iterator returned by 'collect' must be printed straight away; you cannot edit it to alter its format for printing.

 

4. A02_Part4/A02_Part4.py 

Complete the function compute_page_rank of the Python program.  

Do not modify the name of the function nor the parameters it receives.

The function must return a dictionary with (key, value) pairs, where:

§  Each key represents a node id.  

§  Each value represents the pagerank value computed for this node id.  
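As a point of comparison for the dictionary compute_page_rank must return, the classical iterative PageRank over an adjacency list can be sketched in plain Python (the damping factor 0.85, the iteration count, and the tiny graph are assumptions for illustration):

```python
def compute_page_rank_sketch(edges, num_iterations=50, damping=0.85):
    """Return {node_id: pagerank_value} for a directed edge list [(src, dst), ...]."""
    nodes = {n for edge in edges for n in edge}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(num_iterations):
        # Every node keeps the teleport share, then receives from its in-links.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: redistribute its rank uniformly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical 3-node graph: node 1 is the most linked-to node.
ranks = compute_page_rank_sketch([(1, 2), (2, 1), (2, 3), (3, 1)])
print({node: round(value, 4) for node, value in sorted(ranks.items())})
```

The ranks always sum to 1, and nodes with more (or better-ranked) in-links score higher; the assignment's GraphFrames-based version computes the same quantity in parallel.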

 

5. A02_Part5/A02_Part5.py 

Complete the function my_main of the Python program.  

Do not modify the name of the function nor the parameters it receives.

The entire work must be done within Spark SQL:

§  The function my_main must start with the creation operation 'read', loading the dataset into Spark SQL.

§  The function my_main must finish with the action operation 'collect', gathering the result of the Spark SQL job and printing it to the screen.

§  The function my_main must not contain any action operation 'collect' other than the one appearing at the very end of the function.

§  The resVAL iterator returned by 'collect' must be printed straight away; you cannot edit it to alter its format for printing.
