It has already been a few weeks since you started your short-term internship in the Data Analytics Department of the start-up OptimiseYourJourney, which will enter the market next year with a clear goal in mind: “leverage Big Data technologies to improve the user experience in transportation”. Your contribution in Assignment 1 has proven the potential OptimiseYourJourney can unlock by applying MapReduce to analyse large-scale public transportation datasets such as the one from the New York City Bike Sharing System: https://www.citibikenyc.com/
In the department meeting that has just finished, your boss was particularly happy again, and highlighted two new opportunities:
• The very same dataset from Assignment 1 (let’s call it my_dataset_1) provides an opportunity to leverage other large-scale data analysis libraries, such as Spark SQL.
• The graph structure of the dataset allows you to explore the potential of Spark GraphFrames, a small Spark library specialised in the parallel execution of classical graph algorithms. To do so, two small graph examples (let’s call them my_dataset_2 and my_dataset_3) are provided to explore the classical algorithms of:
o Dijkstra – for finding the shortest path from a source node to the remaining nodes.
o PageRank – for assigning a value to each node based on its neighbourhood.
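As a hedged illustration of the GraphFrames workflow (the toy graph, session name and parameter values below are placeholder assumptions, not part of the assignment), both algorithms are driven from a GraphFrame object built out of a vertex DataFrame and an edge DataFrame:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("graph_sketch").getOrCreate()

    # Hypothetical toy graph (not my_dataset_2 or my_dataset_3).
    # GraphFrames expects a vertex DataFrame with an 'id' column and
    # an edge DataFrame with 'src' and 'dst' columns.
    vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
    g = GraphFrame(vertices, edges)

    # PageRank: assigns each node a score based on its neighbourhood.
    pr = g.pageRank(resetProbability=0.15, maxIter=10)
    pr.vertices.select("id", "pagerank").show()

    # Shortest paths (in hops) from every node to the landmark nodes.
    # Note: this built-in is unweighted; a weighted Dijkstra has to be
    # implemented separately (see the sketch under DATASET 3 below).
    sp = g.shortestPaths(landmarks=["a"])
    sp.show()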
DATASET 1:
This dataset occupies ~80MB and contains 73 files. Each file contains all the trips registered in the CitiBike system on a specific day:
• 2019_05_01.csv => All trips registered on the 1st of May of 2019.
• 2019_05_02.csv => All trips registered on the 2nd of May of 2019.
• …
• 2019_07_12.csv => All trips registered on the 12th of July of 2019.
Altogether, the files contain 444,110 rows. Each row contains the following fields:
start_time, stop_time, trip_duration, start_station_id, start_station_name, start_station_latitude, start_station_longitude, stop_station_id, stop_station_name, stop_station_latitude, stop_station_longitude, bike_id, user_type, birth_year, gender, trip_id
• (00) start_time
! A String representing the time the trip started at.
<%Y/%m/%d %H:%M:%S>
! Example: “2019/05/02 10:05:00”
• (01) stop_time
! A String representing the time the trip finished at.
<%Y/%m/%d %H:%M:%S>
! Example: “2019/05/02 10:10:00”
• (02) trip_duration
! An Integer representing the duration of the trip (in seconds).
! Example: 300
• (03) start_station_id
! An Integer representing the ID of the CitiBike station the trip started from.
! Example: 150
• (04) start_station_name
! A String representing the name of the CitiBike station the trip started from.
! Example: “E 2 St & Avenue C”.
• (05) start_station_latitude
! A Float representing the latitude of the CitiBike station the trip started from.
! Example: 40.7208736
• (06) start_station_longitude
! A Float representing the longitude of the CitiBike station the trip started from.
! Example: -73.98085795
• (07) stop_station_id
! An Integer representing the ID of the CitiBike station the trip stopped at.
! Example: 150
• (08) stop_station_name
! A String representing the name of the CitiBike station the trip stopped at.
! Example: “E 2 St & Avenue C”.
• (09) stop_station_latitude
! A Float representing the latitude of the CitiBike station the trip stopped at.
! Example: 40.7208736
• (10) stop_station_longitude
! A Float representing the longitude of the CitiBike station the trip stopped at.
! Example: -73.98085795
• (11) bike_id
! An Integer representing the id of the bike used in the trip.
! Example: 33882.
• (12) user_type
! A String representing the type of user using the bike (it can be either “Subscriber” or “Customer”).
! Example: “Subscriber”.
• (13) birth_year
! An Integer representing the birth year of the user using the bike.
! Example: 1990.
• (14) gender
! An Integer representing the gender of the user using the bike (0 => Unknown; 1 => Male; 2 => Female).
! Example: 2.
• (15) trip_id
! An Integer representing the id of the trip.
! Example: 190.
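The field list above maps directly onto an explicit Spark SQL schema. Below is a minimal loading sketch under the assumption that the day files are headerless CSVs stored in a folder named my_dataset_1/ (both assumptions are illustrative, not taken from the assignment):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType, FloatType)

    spark = SparkSession.builder.appName("citibike_sketch").getOrCreate()

    # One StructField per field (00)-(15), in the order listed above.
    my_schema = StructType([
        StructField("start_time", StringType(), True),
        StructField("stop_time", StringType(), True),
        StructField("trip_duration", IntegerType(), True),
        StructField("start_station_id", IntegerType(), True),
        StructField("start_station_name", StringType(), True),
        StructField("start_station_latitude", FloatType(), True),
        StructField("start_station_longitude", FloatType(), True),
        StructField("stop_station_id", IntegerType(), True),
        StructField("stop_station_name", StringType(), True),
        StructField("stop_station_latitude", FloatType(), True),
        StructField("stop_station_longitude", FloatType(), True),
        StructField("bike_id", IntegerType(), True),
        StructField("user_type", StringType(), True),
        StructField("birth_year", IntegerType(), True),
        StructField("gender", IntegerType(), True),
        StructField("trip_id", IntegerType(), True),
    ])

    # Load every CSV day file in the dataset folder at once
    # (assumes no header row; adjust if the files carry one).
    inputDF = spark.read.format("csv") \
        .option("header", "false") \
        .schema(my_schema) \
        .load("my_dataset_1/")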
DATASET 2:
This dataset consists of the file tiny_graph.txt, which contains 26 directed edges (in fact, 13 undirected edges, each stored once in each direction) in a graph with 8 nodes.
DATASET 3:
This dataset consists of the file tiny_graph.txt, which contains 18 directed edges (in fact, 9 undirected edges, each stored once in each direction) in a graph with 6 nodes.
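Since Dijkstra is one of the two classical algorithms to be explored on these graphs, the following plain-Python sketch may help fix the idea. The adjacency-dictionary representation and the toy graph are illustrative assumptions; how tiny_graph.txt is actually parsed depends on its own format:

    # A minimal Dijkstra sketch over an adjacency dictionary (illustrative
    # only; the toy graph below is hypothetical, not the one in tiny_graph.txt).
    import heapq

    def dijkstra(graph, source):
        # graph: {node: [(neighbour, weight), ...]}
        # Returns {node: shortest distance from source}.
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue  # stale queue entry
            for neighbour, weight in graph.get(node, []):
                new_d = d + weight
                if new_d < dist.get(neighbour, float("inf")):
                    dist[neighbour] = new_d
                    heapq.heappush(heap, (new_d, neighbour))
        return dist

    # Toy usage with a 3-node graph:
    toy = {1: [(2, 4), (3, 1)], 2: [(3, 2)], 3: [(2, 1)]}
    print(dijkstra(toy, 1))  # {1: 0, 2: 2, 3: 1}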
TASKS / EXERCISES.
The tasks / exercises to be completed as part of the assignment are described on the following pages:
• The following exercises are placed in the folder my_code:
1. A02_Part1/A02_Part1.py
2. A02_Part2/A02_Part2.py
3. A02_Part3/A02_Part3.py
4. A02_Part4/A02_Part4.py
5. A02_Part5/A02_Part5.py
Marks are as follows:
1. A02_Part1/A02_Part1.py => 18 marks
2. A02_Part2/A02_Part2.py => 18 marks
3. A02_Part3/A02_Part3.py => 18 marks
4. A02_Part4/A02_Part4.py => 18 marks
5. A02_Part5/A02_Part5.py => 28 marks
Tasks:
1. A02_Part1/A02_Part1.py
2. A02_Part2/A02_Part2.py
3. A02_Part3/A02_Part3.py
Complete the function my_main of the Python program.
Do not modify the name of the function nor the parameters it receives.
The entire work must be done within Spark SQL:
§ The function my_main must start with the creation operation 'read', loading the dataset into Spark SQL.
§ The function my_main must finish with the action operation 'collect', gathering and printing to the screen the result of the Spark SQL job.
§ The function my_main must not contain any action operation 'collect' other than the one appearing at the very end of the function.
§ The resVAL iterator returned by 'collect' must be printed straight away; you cannot edit it to alter its format for printing.
A minimal structural sketch of this my_main shape is given after this task list.
4. A02_Part4/A02_Part4.py
Complete the function compute_page_rank of the Python program.
Do not modify the name of the function nor the parameters it receives.
The function must return a dictionary with (key, value) pairs, where:
§ Each key represents a node id.
§ Each value represents the PageRank value computed for this node id.
A plain-Python sketch of iterative PageRank is given after this task list.
5. A02_Part5/A02_Part5.py
Complete the function my_main of the Python program.
Do not modify the name of the function nor the parameters it receives.
The entire work must be done within Spark SQL:
§ The function my_main must start with the creation operation 'read', loading the dataset into Spark SQL.
§ The function my_main must finish with the action operation 'collect', gathering and printing to the screen the result of the Spark SQL job.
§ The function my_main must not contain any action operation 'collect' other than the one appearing at the very end of the function.
§ The resVAL iterator returned by 'collect' must be printed straight away; you cannot edit it to alter its format for printing.
The structural sketch after this list applies to this part as well.
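For Parts 1, 2, 3 and 5, the requirements above pin down the shape of my_main. The following is a minimal structural sketch under placeholder assumptions (the parameter names, dataset path handling and the groupBy query are illustrative, not the actual exercises):

    import pyspark.sql.functions as F

    def my_main(spark, my_dataset_dir):
        # 1. Creation operation 'read': load the dataset into Spark SQL.
        inputDF = spark.read.format("csv") \
            .option("header", "false") \
            .load(my_dataset_dir)

        # 2. Any number of transformation operations (placeholder example:
        #    count the rows per value of the first column).
        resultDF = inputDF.groupBy("_c0").agg(F.count("*").alias("num_rows"))

        # 3. A single action operation 'collect' at the very end.
        resVAL = resultDF.collect()

        # 4. Print the collected rows straight away, without reformatting.
        for item in resVAL:
            print(item)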
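For Part 4, the following plain-Python sketch shows one standard way to compute iterative PageRank and return the required dictionary. The function name, damping factor and iteration count below are illustrative assumptions; the real compute_page_rank must keep its given name and parameters:

    # A minimal plain-Python PageRank sketch (illustrative only; the edge
    # list, damping factor and iteration count are placeholder assumptions).
    def compute_page_rank_sketch(edges, num_iterations=10, d=0.85):
        # edges: list of (src, dst) directed edges.
        nodes = {n for e in edges for n in e}
        out_degree = {n: 0 for n in nodes}
        incoming = {n: [] for n in nodes}
        for src, dst in edges:
            out_degree[src] += 1
            incoming[dst].append(src)

        n = len(nodes)
        rank = {node: 1.0 / n for node in nodes}
        for _ in range(num_iterations):
            new_rank = {}
            for node in nodes:
                # Sum the contributions from nodes linking to this one.
                contrib = sum(rank[src] / out_degree[src]
                              for src in incoming[node])
                new_rank[node] = (1 - d) / n + d * contrib
            rank = new_rank
        return rank  # {node_id: pagerank_value}

    # Toy usage with a 3-node cycle: every node ends up with rank ~1/3.
    print(compute_page_rank_sketch([(1, 2), (2, 3), (3, 1)]))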