From the window you can see the sun shining on a lovely spring morning.
It's Monday, 10am, and you are in the open-plan office of a new start-up, OptimiseYourJourney, which will enter the market next year with a clear goal in mind: “leverage Big Data-based technologies for improving the user experience in transportation”.
The start-up is at a very early stage and has no clear product in mind yet. However, they have offered a short-term internship in their Data Analytics Department to help them explore the datasets, technologies and techniques that can be applied in their future products. They do not pay very well (0€ per hour), but you see this as a good opportunity to complement your knowledge of the module Big Data & Analytics you are studying at the moment, so you have decided to give it a go.
In the department meeting that has just finished, your boss was particularly happy. During the weekend he was exploring public datasets on the internet and found a dataset about the New York City Bike Sharing System: https://www.citibikenyc.com/
The official website (https://www.citibikenyc.com/system-data) provides publicly available datasets on a monthly basis. Each of these datasets amalgamates all the bike trips of the month. For example, the files “201905-citibike-tripdata.csv.zip”, “201906-citibike-tripdata.csv.zip” and “201907-citibike-tripdata.csv.zip” contain information for all the trips that took place in May, June and July of 2019, respectively.
Your boss thinks this dataset provides a great opportunity to explore the potential of MapReduce in analysing large datasets. He has already cleaned the dataset for you to perform some data analysis on it.
DATASET:
This dataset occupies ~80MB and contains 73 files. Each file contains all the trips registered in the CitiBike system on a specific day:
• 2019_05_01.csv => All trips registered on the 1st of May of 2019.
• 2019_05_02.csv => All trips registered on the 2nd of May of 2019.
• …
• 2019_07_12.csv => All trips registered on the 12th of July of 2019.
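As a quick sanity check of the naming scheme above, the following minimal sketch generates one file name per day between the first and last dates listed and counts them. The helper name daily_file_names is hypothetical and is not part of the provided code.

```python
from datetime import date, timedelta


def daily_file_names(first, last):
    """Return one 'YYYY_MM_DD.csv' name per day from first to last, inclusive."""
    n_days = (last - first).days + 1
    return [(first + timedelta(days=i)).strftime("%Y_%m_%d.csv")
            for i in range(n_days)]


# Date range taken from the file list above: 1st of May to 12th of July 2019.
names = daily_file_names(date(2019, 5, 1), date(2019, 7, 12))
print(len(names), names[0], names[-1])
```

The count matches the 73 files mentioned above (31 days of May, 30 of June and 12 of July).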
Altogether, the files contain 444,110 rows. Each row contains the following fields:
start_time , stop_time , trip_duration , start_station_id , start_station_name , start_station_latitude , start_station_longitude , stop_station_id , stop_station_name , stop_station_latitude , stop_station_longitude , bike_id , user_type , birth_year , gender , trip_id
• (00) start_time
o A String representing the time the trip started at, in the format <%Y/%m/%d %H:%M:%S>.
o Example: “2019/05/02 10:05:00”
• (01) stop_time
o A String representing the time the trip finished at, in the format <%Y/%m/%d %H:%M:%S>.
o Example: “2019/05/02 10:10:00”
• (02) trip_duration
o An Integer representing the duration of the trip, in seconds.
o Example: 300
• (03) start_station_id
o An Integer representing the ID of the CitiBike station the trip started from.
o Example: 150
• (04) start_station_name
o A String representing the name of the CitiBike station the trip started from.
o Example: “E 2 St & Avenue C”
• (05) start_station_latitude
o A Float representing the latitude of the CitiBike station the trip started from.
o Example: 40.7208736
• (06) start_station_longitude
o A Float representing the longitude of the CitiBike station the trip started from.
o Example: -73.98085795
• (07) stop_station_id
o An Integer representing the ID of the CitiBike station the trip stopped at.
o Example: 150
• (08) stop_station_name
o A String representing the name of the CitiBike station the trip stopped at.
o Example: “E 2 St & Avenue C”
• (09) stop_station_latitude
o A Float representing the latitude of the CitiBike station the trip stopped at.
o Example: 40.7208736
• (10) stop_station_longitude
o A Float representing the longitude of the CitiBike station the trip stopped at.
o Example: -73.98085795
• (11) bike_id
o An Integer representing the ID of the bike used in the trip.
o Example: 33882
• (12) user_type
o A String representing the type of user using the bike (either “Subscriber” or “Customer”).
o Example: “Subscriber”
• (13) birth_year
o An Integer representing the birth year of the user using the bike.
o Example: 1990
• (14) gender
o An Integer representing the gender of the user using the bike (0 => unknown; 1 => male; 2 => female).
o Example: 2
• (15) trip_id
o An Integer representing the ID of the trip.
o Example: 190
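To make the field layout concrete, here is a minimal sketch of how one row could be turned into typed Python values. It assumes the files are comma-separated, have no header line, and list the fields in the order (00) to (15); none of this is confirmed here, so check it against the actual files. The names FIELDS and parse_row are hypothetical, and the sample row is assembled from the example values listed above.

```python
import csv
from datetime import datetime

TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_time(text):
    """Convert a timestamp such as '2019/05/02 10:05:00' to a datetime."""
    return datetime.strptime(text, TIME_FORMAT)


# (field name, converter) pairs in column order (00) to (15).
FIELDS = [
    ("start_time", parse_time),
    ("stop_time", parse_time),
    ("trip_duration", int),
    ("start_station_id", int),
    ("start_station_name", str),
    ("start_station_latitude", float),
    ("start_station_longitude", float),
    ("stop_station_id", int),
    ("stop_station_name", str),
    ("stop_station_latitude", float),
    ("stop_station_longitude", float),
    ("bike_id", int),
    ("user_type", str),
    ("birth_year", int),
    ("gender", int),
    ("trip_id", int),
]


def parse_row(line):
    """Parse one CSV line into a dict of typed values (assumed layout)."""
    values = next(csv.reader([line]))
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return {name: convert(value)
            for (name, convert), value in zip(FIELDS, values)}


# Sample row built from the example values in the field descriptions.
sample = ("2019/05/02 10:05:00,2019/05/02 10:10:00,300,150,"
          "E 2 St & Avenue C,40.7208736,-73.98085795,150,"
          "E 2 St & Avenue C,40.7208736,-73.98085795,"
          "33882,Subscriber,1990,2,190")

trip = parse_row(sample)
elapsed = (trip["stop_time"] - trip["start_time"]).total_seconds()
print(trip["trip_duration"], elapsed)
```

For the example values, the difference between stop_time and start_time is 300 seconds, which agrees with the trip_duration example.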
TASKS / EXERCISES.
The tasks / exercises to be completed as part of the assignment are described in the next pages of this PDF document.
• The following exercises are placed in the folder my_code:
1. A01_Part1/A01_Part1.py
2. A01_Part2/A01_Part2.py
3. A01_Part3/A01_Part3.py
4. A01_Part4/my_mapper.py
5. A01_Part4/my_reducer.py
6. A01_Part5/my_mapper.py
7. A01_Part5/my_reducer.py
8. A01_Part6/my_mapper.py
9. A01_Part6/my_reducer.py
Marks are as follows:
o A01_Part1/A01_Part1.py => 16 marks
o A01_Part2/A01_Part2.py => 16 marks
o A01_Part3/A01_Part3.py => 18 marks
o A01_Part4/my_mapper.py => 8 marks
o A01_Part4/my_reducer.py => 8 marks
o A01_Part5/my_mapper.py => 8 marks
o A01_Part5/my_reducer.py => 8 marks
o A01_Part6/my_mapper.py => 9 marks
o A01_Part6/my_reducer.py => 9 marks
Tasks:
o A01_Part1/A01_Part1.py
o A01_Part2/A01_Part2.py
o A01_Part3/A01_Part3.py
Complete the function my_main of the Python program.
Do not modify the name of the function nor the parameters it receives.
o A01_Part4/my_mapper.py
o A01_Part5/my_mapper.py
o A01_Part6/my_mapper.py
Complete the function my_map of the Python program.
Do not modify the name of the function nor the parameters it receives.
o A01_Part4/my_reducer.py
o A01_Part5/my_reducer.py
o A01_Part6/my_reducer.py
Complete the function my_reduce of the Python program.
Do not modify the name of the function nor the parameters it receives.
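For orientation only, the sketch below shows the general map, sort and reduce pattern on a toy question (number of trips per start station). It is not a solution to any of the parts and does not use the signatures of my_map or my_reduce in the provided files, which must be left as they are. The functions example_map, example_reduce and run_job are hypothetical, the sample rows are invented for illustration, and the rows are assumed to be comma-separated with start_station_id at index 3.

```python
from itertools import groupby


def example_map(line):
    """Map phase: emit one (start_station_id, 1) pair per trip row."""
    fields = line.strip().split(",")
    yield fields[3], 1


def example_reduce(key, values):
    """Reduce phase: add up all the counts emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, then sort by key (the shuffle step), then reduce per key."""
    pairs = [pair for line in lines for pair in example_map(line)]
    pairs.sort(key=lambda kv: kv[0])
    return [example_reduce(key, (value for _, value in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])]


# Invented rows for illustration: only the station ID and trip ID vary.
ROW = ("2019/05/02 10:05:00,2019/05/02 10:10:00,300,{station},"
       "E 2 St & Avenue C,40.7208736,-73.98085795,150,"
       "E 2 St & Avenue C,40.7208736,-73.98085795,"
       "33882,Subscriber,1990,2,{trip}")
sample_rows = [ROW.format(station=s, trip=t)
               for s, t in [(150, 190), (200, 191), (150, 192)]]

result = run_job(sample_rows)
print(result)
```

Sorting the intermediate pairs by key before reducing mirrors what a MapReduce framework does between the two phases, so that each reduce call sees all the values for one key together.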