CS-GY 6513 Homework 2: Using Spark to Explore NYC Parking Violations



 

In this assignment, we will analyze different aspects of parking violations in NYC using Spark and Python. You will write PySpark programs, using both RDDs and DataFrames, for 7 different tasks, described later in this document.  

We will use the datasets listed below, which are available on HDFS on the Peel cluster. You should not store additional copies in your HDFS directory. The files can be found at  

 

      /user/CS-GY-6513/parking-violations/parking-violations-header.csv 

      /user/CS-GY-6513/parking-violations/open-violations-header.csv  

The first line of each of these two datasets contains column names separated by commas. 

The schema definition for these datasets can be found at:   

 

/user/CS-GY-6513/parking-violations/mysql-load.sql   

As you work on the different tasks, we suggest that you test and debug your code on a smaller dataset, which you can create yourself from the available data. Smaller datasets for testing your code are also available at: 

 

/user/CS-GY-6513/test_data/open-violations-header.csv 

/user/CS-GY-6513/test_data/parking-violations-header.csv 

 

Your final submission must successfully run on the full datasets using Spark on the Peel cluster.  


Submission: 

You will submit the Spark programs for all tasks in a .zip file named “yournetid.zip” (e.g., “jf1870.zip”). The zip file should include 14 Python files: task1.py, task1-sql.py, ..., task7.py, task7-sql.py.  

*Do not include any input or output files with your submission.  

Notes: 

●       For each task, you will write two versions: one in core Spark using RDDs and one in SparkSQL using DataFrames.  

 

●    Your program should read in the path to the input file on HDFS from the command-line arguments (a short sketch of this follows the example commands below). For task 1, you are guaranteed that parking-violations-header.csv will be the first of the two files passed in. In other words, we will execute your programs with the following commands:

 

             For task 1: 

spark-submit <directory>/task1.py /user/CS-GY-6513/parking-violations/parking-violations-header.csv /user/CS-GY-6513/parking-violations/open-violations-header.csv 

For task 1 - SQL: 

spark-submit <directory>/task1-sql.py /user/CS-GY-6513/parking-violations/parking-violations-header.csv /user/CS-GY-6513/parking-violations/open-violations-header.csv 

 

 

            For task 2, 3: 

spark-submit <directory>/taskx.py /user/CS-GY-6513/parking-violations/open-violations-header.csv 

For task 2-SQL, 3-SQL:

spark-submit <directory>/taskx-sql.py /user/CS-GY-6513/parking-violations/open-violations-header.csv 

 

For all other tasks: 

spark-submit <directory>/taskx.py /user/CS-GY-6513/parking-violations/parking-violations-header.csv 

and for SQL versions: 

spark-submit <directory>/taskx-sql.py /user/CS-GY-6513/parking-violations/parking-violations-header.csv 
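For reference, a minimal sketch of how these arguments might be picked up at the top of task1.py (the variable names here are illustrative, not required):

import sys

# spark-submit passes the HDFS paths through as positional arguments;
# for task 1, parking-violations-header.csv is guaranteed to be sys.argv[1]
parking_path = sys.argv[1]
open_path = sys.argv[2]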

 

●       You should only use the most recent available versions of Python and Spark on Peel (3.7.9 and 2.4.0, respectively).

 

●       Your code should output a directory named “taskx.out” or “taskx-sql.out” to HDFS, i.e., use the Spark RDD function saveAsTextFile(“taskx.out”) or the Spark DataFrame function save(“taskx-sql.out”, format=”text”) rather than Python I/O. (In order for the hw2tester script to work, you must name your output directories “taskx.out” and “taskx-sql.out”, where x is in 1 through 7.)  

 

●       As in Assignment 1, you will be reading from CSV files. To do this in core Spark, reading into an RDD, you would use, e.g., 

import sys 
from csv import reader 

lines = sc.textFile(sys.argv[1], 1) 
lines = lines.mapPartitions(lambda x: reader(x)) 

 

To do this in SparkSQL, reading into a DataFrame, you would use, e.g., 

parking = spark.read.format('csv').options(header='true', inferschema='true').load(sys.argv[1]) 

 

●       For core Spark: You may find that using a final map() stage is helpful for formatting your output correctly.  
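For example, a rough sketch (assuming counts is an RDD of (key, count) pairs produced by one of the tasks):

# turn each (key, count) pair into one tab-separated output line
formatted = counts.map(lambda kv: '%s\t%d' % (kv[0], kv[1]))
formatted.saveAsTextFile('taskx.out')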

●       For SparkSQL: Avoid using “not in” and try to use join operation instead in your query to reduce the running time. 

●       For SparkSQL: You may find that using a final select() which produces a single column using format_string() is helpful for formatting the output and writing to a text file. For example, for Task 1-SQL, if result is the dataframe that contains the query results, you could write:  

result.select(format_string('%d\t%s, %d, %d, %s', result.summons_number, result.plate_id, result.violation_precinct, result.violation_code, date_format(result.issue_date, 'yyyy-MM-dd'))).write.save("task1-sql.out", format="text") 



Assignment:  

=================================================================== 

Task 1 
Write a Spark program that finds all parking violations that have been paid, i.e., that do not occur in open-violations-header.csv.

 

Output: A key-value pair per line, where 

  key = summons_number 
  values = violation_county, registration_state, violation_code, issue_date 

You should separate the key and value by a tab character (‘\t’) and elements within the key/value should be separated by a comma and a space. Your output format should conform to the following example: 

1403770815      BX, NY, 20, 2016-03-26 

1406045901      Q, NJ, 21, 2016-03-15 

 

Task 1-SQL 
Write a SparkSQL program for Task 1. 
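For illustration only, one possible shape of task1-sql.py is a left anti-join on summons_number (this is a sketch, not the reference solution; the column names summons_number, violation_county, registration_state, violation_code, and issue_date are assumptions that should be checked against mysql-load.sql):

import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import format_string, date_format

spark = SparkSession.builder.appName('task1-sql').getOrCreate()
parking = spark.read.format('csv').options(header='true', inferschema='true').load(sys.argv[1])
open_v = spark.read.format('csv').options(header='true', inferschema='true').load(sys.argv[2])

# keep only the summonses that never appear in the open-violations file
paid = parking.join(open_v, on='summons_number', how='left_anti')

paid.select(format_string('%d\t%s, %s, %d, %s', paid.summons_number, paid.violation_county,
            paid.registration_state, paid.violation_code,
            date_format(paid.issue_date, 'yyyy-MM-dd'))).write.save('task1-sql.out', format='text')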

===================================================================

 

Task 2 
Write a Spark program that finds the distribution of the violation types, i.e., for each violation type, the number of violations that belong to that type.

 

Output: A key-value pair per line, where 

  key = violation 

  value = number of violations 

You should separate the key and value by a tab character (‘\t’) and elements within the key/value should be separated by a comma and a space. Your output format should conform to the following example: 

DETACHED TRAILER        123 

  

Task 2-SQL 
Write a SparkSQL program for Task 2.
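For illustration only, a rough core-Spark sketch of task2.py (VIOLATION_COL is a placeholder index for the violation-type column; look up the real position in mysql-load.sql):

import sys
from csv import reader
from pyspark import SparkContext

VIOLATION_COL = 0  # placeholder: index of the violation-type column

sc = SparkContext()
rows = sc.textFile(sys.argv[1], 1).mapPartitions(lambda x: reader(x))
header = rows.first()

counts = (rows.filter(lambda r: r != header)
              .map(lambda r: (r[VIOLATION_COL], 1))
              .reduceByKey(lambda a, b: a + b)
              .map(lambda kv: '%s\t%d' % kv))
counts.saveAsTextFile('task2.out')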

===================================================================

 

Task 3 
Write a Spark program that finds the total and average amount due in open violations for each precinct.

 

Output: A key-value pair per line, where 

  key = precinct 
  value = total, average 

where total and average are rounded to 2 decimal places.  

You should separate the key and value by a tab character (‘\t’) and elements within the key/value should be separated by a comma and a space. Your output format should conform to the following example: 

0          9482469.38, 35.82 

 

Task 3-SQL 
Write a SparkSQL program for Task 3. 
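For illustration only, a rough SparkSQL sketch (assuming open_v is the open-violations DataFrame loaded as in the Notes, and that the precinct and amount-due columns are named violation_precinct and amount_due; verify against mysql-load.sql):

from pyspark.sql.functions import format_string, avg
from pyspark.sql.functions import sum as sql_sum, round as sql_round

result = (open_v.groupBy('violation_precinct')
                .agg(sql_round(sql_sum('amount_due'), 2).alias('total'),
                     sql_round(avg('amount_due'), 2).alias('average')))

result.select(format_string('%s\t%.2f, %.2f', result.violation_precinct,
              result.total, result.average)).write.save('task3-sql.out', format='text')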

=================================================================== 

 

Task 4 
Write a Spark program that computes the total number of violations for vehicles registered in the state of NY and all other vehicles.  

 

Output: 2 key-value pairs with one key-value pair per line.  

You should separate the key and value by a tab character (‘\t’). Your output format should conform to the following example: 

NY      12345 

Other   6789 

  

Task 4-SQL 
Write a SparkSQL program for Task 4. 
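For illustration only, a rough core-Spark sketch (with rows and header built as in the Task 2 sketch; REG_STATE_COL is a placeholder index for the registration_state column):

REG_STATE_COL = 2  # placeholder: index of the registration_state column

counts = (rows.filter(lambda r: r != header)
              .map(lambda r: ('NY' if r[REG_STATE_COL] == 'NY' else 'Other', 1))
              .reduceByKey(lambda a, b: a + b)
              .map(lambda kv: '%s\t%d' % kv))
counts.saveAsTextFile('task4.out')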

=================================================================== 

 

Task 5 
Write a Spark program that finds the vehicle make that has had the greatest number of violations.

 

Output: One key-value pair  

You should separate the key and value by a tab character (‘\t’) and elements within the key/value should be separated by a comma and a space. Your output format should conform to the following example: 

JEEP      138 

 

Task 5-SQL 
Write a SparkSQL program for Task 5. 
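For illustration only, a rough SparkSQL sketch (assuming parking is the parking-violations DataFrame loaded as in the Notes, and that the make column is named vehicle_make):

from pyspark.sql.functions import count, desc, format_string

top = (parking.groupBy('vehicle_make')
              .agg(count('*').alias('n'))
              .orderBy(desc('n'))
              .limit(1))

top.select(format_string('%s\t%d', top.vehicle_make, top.n)).write.save('task5-sql.out', format='text')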

=================================================================== 

 

Task 6 
Write a Spark program that finds the top-10 vehicle makes in terms of total violations.

 

Output: List of 10 key-value pairs, ordered by decreasing number of violations. For items with the same number of violations, order by ascending vehicle_make.  

You should separate the key and value by a tab character (‘\t’) and elements within the key/value should be separated by a comma and a space. Your output format should conform to the following example: 

JEEP      138 

 

Task 6-SQL 
Write a SparkSQL program for Task 6. 
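The aggregation from the Task 5 sketch can be reused; only the ordering and limit change (again, vehicle_make is an assumed column name):

top10 = (parking.groupBy('vehicle_make')
                .agg(count('*').alias('n'))
                .orderBy(desc('n'), 'vehicle_make')   # ties broken by ascending make
                .limit(10))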

=================================================================== 

 

Task 7 
In March 2016, the 5th, 6th, 12th, 13th, 19th, 20th, 26th, and 27th were weekend days (i.e., Sat. and Sun.). 

Write a Spark program that, for each violation county, lists the average number of violations issued per day in that county on weekdays and on weekend days. You may hardcode “8” as the number of weekend days and “23” as the number of weekdays in March 2016.

 

Output: List of key-value pairs, where 

  key = violation_county 
  value = weekend_average, week_average 

where weekend_average and week_average are rounded to 2 decimal places.  

You should separate the key and value by a tab character (‘\t’) and elements within the key/value should be separated by a comma and a space. Your output format should conform to the following example: 

Q       3.25, 5.78 

BX       0.12, 0.17 

 

Task 7-SQL 
Write a SparkSQL program for Task 7. 
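For illustration only, a rough SparkSQL sketch (assuming parking is the March 2016 parking-violations DataFrame loaded as in the Notes, and that the columns are named violation_county and issue_date; in Spark's dayofweek(), 1 = Sunday and 7 = Saturday):

from pyspark.sql.functions import dayofweek, format_string
from pyspark.sql.functions import sum as sql_sum, round as sql_round

# flag each violation as weekend (1) or weekday (0)
flagged = parking.withColumn('is_weekend', dayofweek('issue_date').isin(1, 7).cast('int'))

result = (flagged.groupBy('violation_county')
                 .agg(sql_round(sql_sum(flagged.is_weekend) / 8.0, 2).alias('weekend_avg'),
                      sql_round(sql_sum(1 - flagged.is_weekend) / 23.0, 2).alias('week_avg')))

result.select(format_string('%s\t%.2f, %.2f', result.violation_county,
              result.weekend_avg, result.week_avg)).write.save('task7-sql.out', format='text')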

=================================================================== 

 

Testing your solutions 

We have provided a script to test your solutions on the March 2016 data. On Peel, type 

hfs -get /user/CS-GY-6513/hw2-test/hw2tester.tar.gz 

To unpack the files and give permission to the entire folder, type  

tar -xzvf hw2tester.tar.gz 
chmod -R +x hw2tester 

 

Go to the tester folder. To run the tests, type  

./testall.sh <INPUTPATH> 

where <INPUTPATH> is the full path to your directory containing task1.py, task1-sql.py, task2.py, task2-sql.py, etc. (e.g., “/home/<netid>/assignment2”)  

The report testresults.txt will tell you, for each task, whether you pass or fail. If you fail some task X, you can view the diff between your output and the solution file in results/taskX.diff.  
