It has already been a few weeks since you started your short-term internship in the Data Analytics Department of the start-up OptimiseYourJourney, which will enter the market next year with a clear goal in mind: “leverage Big Data technologies to improve the user experience in transportation”. Your contribution in Assignment 1 has proven the potential OptimiseYourJourney can unlock by applying MapReduce to analyse large-scale public transportation datasets such as the one from the New York City Bike Sharing System: https://www.citibikenyc.com/
In the department meeting that has just finished, your boss was particularly happy again, and highlighted two new opportunities:
• The very same dataset from Assignment 1 (let’s call it my_dataset_1) provides an opportunity to leverage other large-scale data analysis libraries, such as Spark SQL.
• The graph structure of the dataset allows you to explore the potential of Spark GraphFrames, a small Spark library specialised in the parallel execution of classical graph algorithms. To do so, two small graph examples (let’s call them my_dataset_2 and my_dataset_3) are provided to explore the classical algorithms of:
o Dijkstra – for finding the shortest path from a source node to the remaining nodes.
o PageRank – for assigning a value to each node based on its neighbourhood.
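As a hedged illustration of the GraphFrames workflow (the toy graph, session name and parameter values below are placeholder assumptions, not part of the assignment), both algorithms are driven from a GraphFrame object built out of a vertex DataFrame and an edge DataFrame:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("graph_sketch").getOrCreate()

    # Hypothetical toy graph (not my_dataset_2 or my_dataset_3).
    # GraphFrames expects a vertex DataFrame with an 'id' column and
    # an edge DataFrame with 'src' and 'dst' columns.
    vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
    g = GraphFrame(vertices, edges)

    # PageRank: assigns each node a score based on its neighbourhood.
    pr = g.pageRank(resetProbability=0.15, maxIter=10)
    pr.vertices.select("id", "pagerank").show()

    # Shortest paths (in hops) from every node to the landmark nodes.
    # Note: this built-in is unweighted; a weighted Dijkstra has to be
    # implemented separately (see the sketch under DATASET 3 below).
    sp = g.shortestPaths(landmarks=["a"])
    sp.show()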
DATASET 1:
This dataset occupies ~80MB and contains 73 files. Each file contains all the trips registered in the CitiBike system on a specific day:
• 2019_05_01.csv => All trips registered on the 1st of May of 2019.
• 2019_05_02.csv => All trips registered on the 2nd of May of 2019.
• …
• 2019_07_12.csv => All trips registered on the 12th of July of 2019.
Altogether, the files contain 444,110 rows. Each row contains the following fields:
start_time, stop_time, trip_duration, start_station_id, start_station_name, start_station_latitude, start_station_longitude, stop_station_id, stop_station_name, stop_station_latitude, stop_station_longitude, bike_id, user_type, birth_year, gender, trip_id
• (00) start_time
! A String representing the time the trip started at.
<%Y/%m/%d %H:%M:%S>
! Example: “2019/05/02 10:05:00”
• (01) stop_time
! A String representing the time the trip finished at.
<%Y/%m/%d %H:%M:%S>
! Example: “2019/05/02 10:10:00”
• (02) trip_duration
! An Integer representing the duration of the trip (in seconds).
! Example: 300
• (03) start_station_id
! An Integer representing the ID of the CitiBike station the trip started from.
! Example: 150
• (04) start_station_name
! A String representing the name of the CitiBike station the trip started from.
! Example: “E 2 St & Avenue C”.
• (05) start_station_latitude
! A Float representing the latitude of the CitiBike station the trip started from.
! Example: 40.7208736
• (06) start_station_longitude
! A Float representing the longitude of the CitiBike station the trip started from.
! Example: -73.98085795
• (07) stop_station_id
! An Integer representing the ID of the CitiBike station the trip stopped at.
! Example: 150
• (08) stop_station_name
! A String representing the name of the CitiBike station the trip stopped at.
! Example: “E 2 St & Avenue C”.
• (09) stop_station_latitude
! A Float representing the latitude of the CitiBike station the trip stopped at.
! Example: 40.7208736
• (10) stop_station_longitude
! A Float representing the longitude of the CitiBike station the trip stopped at.
! Example: -73.98085795
• (11) bike_id
! An Integer representing the id of the bike used in the trip.
! Example: 33882.
• (12) user_type
! A String representing the type of user using the bike (it can be either “Subscriber” or “Customer”).
! Example: “Subscriber”.
• (13) birth_year
! An Integer representing the birth year of the user using the bike.
! Example: 1990.
• (14) gender
! An Integer representing the gender of the user using the bike (0 => Unknown; 1 => Male; 2 => Female).
! Example: 2.
• (15) trip_id
! An Integer representing the id of the trip.
! Example: 190.
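The field list above maps directly onto an explicit Spark SQL schema. Below is a minimal loading sketch under the assumption that the day files are headerless CSVs stored in a folder named my_dataset_1/ (both assumptions are illustrative, not taken from the assignment):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType, FloatType)

    spark = SparkSession.builder.appName("citibike_sketch").getOrCreate()

    # One StructField per field (00)-(15), in the order listed above.
    my_schema = StructType([
        StructField("start_time", StringType(), True),
        StructField("stop_time", StringType(), True),
        StructField("trip_duration", IntegerType(), True),
        StructField("start_station_id", IntegerType(), True),
        StructField("start_station_name", StringType(), True),
        StructField("start_station_latitude", FloatType(), True),
        StructField("start_station_longitude", FloatType(), True),
        StructField("stop_station_id", IntegerType(), True),
        StructField("stop_station_name", StringType(), True),
        StructField("stop_station_latitude", FloatType(), True),
        StructField("stop_station_longitude", FloatType(), True),
        StructField("bike_id", IntegerType(), True),
        StructField("user_type", StringType(), True),
        StructField("birth_year", IntegerType(), True),
        StructField("gender", IntegerType(), True),
        StructField("trip_id", IntegerType(), True),
    ])

    # Load every CSV day file in the dataset folder at once
    # (assumes no header row; adjust if the files carry one).
    inputDF = spark.read.format("csv") \
        .option("header", "false") \
        .schema(my_schema) \
        .load("my_dataset_1/")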
DATASET 2:
This dataset consists of the file tiny_graph.txt, which contains 26 directed edges (in fact, 13 undirected edges, each stored once in each direction) in a graph with 8 nodes.
DATASET 3:
This dataset consists of the file tiny_graph.txt, which contains 18 directed edges (in fact, 9 undirected edges, each stored once in each direction) in a graph with 6 nodes.
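Since Dijkstra is one of the two classical algorithms to be explored on these graphs, the following plain-Python sketch may help fix the idea. The adjacency-dictionary representation and the toy graph are illustrative assumptions; how tiny_graph.txt is actually parsed depends on its own format:

    # A minimal Dijkstra sketch over an adjacency dictionary (illustrative
    # only; the toy graph below is hypothetical, not the one in tiny_graph.txt).
    import heapq

    def dijkstra(graph, source):
        # graph: {node: [(neighbour, weight), ...]}
        # Returns {node: shortest distance from source}.
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue  # stale queue entry
            for neighbour, weight in graph.get(node, []):
                new_d = d + weight
                if new_d < dist.get(neighbour, float("inf")):
                    dist[neighbour] = new_d
                    heapq.heappush(heap, (new_d, neighbour))
        return dist

    # Toy usage with a 3-node graph:
    toy = {1: [(2, 4), (3, 1)], 2: [(3, 2)], 3: [(2, 1)]}
    print(dijkstra(toy, 1))  # {1: 0, 2: 2, 3: 1}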
TASKS / EXERCISES.
The tasks / exercises to be completed as part of the assignment are described on the following pages:
• The following exercises are placed in the folder my_code:
1. A02_Part1/A02_Part1.py
2. A02_Part2/A02_Part2.py
3. A02_Part3/A02_Part3.py
4. A02_Part4/A02_Part4.py
5. A02_Part5/A02_Part5.py
Marks are as follows:
1. A02_Part1/A02_Part1.py => 18 marks
2. A02_Part2/A02_Part2.py => 18 marks
3. A02_Part3/A02_Part3.py => 18 marks
4. A02_Part4/A02_Part4.py => 18 marks
5. A02_Part5/A02_Part5.py => 28 marks
Tasks:
1. A02_Part1/A02_Part1.py
2. A02_Part2/A02_Part2.py
3. A02_Part3/A02_Part3.py
Complete the function my_main of the Python program.
Do not modify the name of the function nor the parameters it receives.
The entire work must be done within Spark SQL:
§ The function my_main must start with the creation operation 'read', loading the dataset into Spark SQL.
§ The function my_main must finish with the action operation 'collect', gathering and printing to the screen the result of the Spark SQL job.
§ The function my_main must not contain any action operation 'collect' other than the one appearing at the very end of the function.
§ The resVAL iterator returned by 'collect' must be printed straight away; you cannot edit it to alter its format for printing.
A minimal structural sketch of this my_main shape is given after this task list.
4. A02_Part4/A02_Part4.py
Complete the function compute_page_rank of the Python program.
Do not modify the name of the function nor the parameters it receives.
The function must return a dictionary with (key, value) pairs, where:
§ Each key represents a node id.
§ Each value represents the PageRank value computed for this node id.
A plain-Python sketch of iterative PageRank is given after this task list.
5. A02_Part5/A02_Part5.py
Complete the function my_main of the Python program.
Do not modify the name of the function nor the parameters it receives.
The entire work must be done within Spark SQL:
§ The function my_main must start with the creation operation 'read', loading the dataset into Spark SQL.
§ The function my_main must finish with the action operation 'collect', gathering and printing to the screen the result of the Spark SQL job.
§ The function my_main must not contain any action operation 'collect' other than the one appearing at the very end of the function.
§ The resVAL iterator returned by 'collect' must be printed straight away; you cannot edit it to alter its format for printing.
The structural sketch after this list applies to this part as well.
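For Parts 1, 2, 3 and 5, the requirements above pin down the shape of my_main. The following is a minimal structural sketch under placeholder assumptions (the parameter names, dataset path handling and the groupBy query are illustrative, not the actual exercises):

    import pyspark.sql.functions as F

    def my_main(spark, my_dataset_dir):
        # 1. Creation operation 'read': load the dataset into Spark SQL.
        inputDF = spark.read.format("csv") \
            .option("header", "false") \
            .load(my_dataset_dir)

        # 2. Any number of transformation operations (placeholder example:
        #    count the rows per value of the first column).
        resultDF = inputDF.groupBy("_c0").agg(F.count("*").alias("num_rows"))

        # 3. A single action operation 'collect' at the very end.
        resVAL = resultDF.collect()

        # 4. Print the collected rows straight away, without reformatting.
        for item in resVAL:
            print(item)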
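For Part 4, the following plain-Python sketch shows one standard way to compute iterative PageRank and return the required dictionary. The function name, damping factor and iteration count below are illustrative assumptions; the real compute_page_rank must keep its given name and parameters:

    # A minimal plain-Python PageRank sketch (illustrative only; the edge
    # list, damping factor and iteration count are placeholder assumptions).
    def compute_page_rank_sketch(edges, num_iterations=10, d=0.85):
        # edges: list of (src, dst) directed edges.
        nodes = {n for e in edges for n in e}
        out_degree = {n: 0 for n in nodes}
        incoming = {n: [] for n in nodes}
        for src, dst in edges:
            out_degree[src] += 1
            incoming[dst].append(src)

        n = len(nodes)
        rank = {node: 1.0 / n for node in nodes}
        for _ in range(num_iterations):
            new_rank = {}
            for node in nodes:
                # Sum the contributions from nodes linking to this one.
                contrib = sum(rank[src] / out_degree[src]
                              for src in incoming[node])
                new_rank[node] = (1 - d) / n + d * contrib
            rank = new_rank
        return rank  # {node_id: pagerank_value}

    # Toy usage with a 3-node cycle: every node ends up with rank ~1/3.
    print(compute_page_rank_sketch([(1, 2), (2, 3), (3, 1)]))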