1. Overview of the Assignment
In assignment 1, you will work on three tasks. The goal of these tasks is to familiarize you with Spark operation types (e.g., transformations and actions) and to explore a real-world dataset: the Yelp dataset (https://www.yelp.com/dataset).
If you have questions about the assignment, please ask on Piazza so that other students can benefit from the answers as well.
You only need to submit on Vocareum. There is NO NEED to submit on Blackboard.
2. Requirements
2.1 Programming Requirements
a. You must use Python to implement all tasks. You can only use standard Python libraries (i.e., external libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You are required to use only Spark RDD so that you learn how Spark operations work. You will not get any points if you use Spark DataFrame or DataSet.
2.2 Programming Environment
Python 3.6, JDK 1.8, Scala 2.11, and Spark 2.4.4
We will use these library versions to compile and test your code. You will receive no points if we cannot run your code on Vocareum.
On Vocareum, you can call `spark-submit` located at `/home/local/spark/latest/bin/spark-submit`. Do not use the one at `/usr/local/bin/spark-submit`, which is version 2.3.0. We use `--executor-memory 4G --driver-memory 4G` on Vocareum for grading.
2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code!
TAs will combine all the code that can be found on the web (e.g., GitHub) with other students’ code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.
3. Yelp Data
In this assignment, you will explore the Yelp dataset. You can find the data on Vocareum under resource/asnlib/publidata/. You will work on two files, business.json and test_review.json, which are subsets of the original Yelp dataset. The submission report you get from Vocareum is based on these subsets. For grading, we will use the files from the original Yelp dataset, which are SIGNIFICANTLY larger (e.g., review.json can be 5GB). Make sure your code also works well on large datasets.
4. Tasks
4.1 Task1: Data Exploration
You will work on test_review.json, which contains the review information from users, and write a program to automatically answer the following questions:
A. The total number of reviews
B. The number of reviews in 2018
C. The number of distinct users who wrote reviews
D. The top 10 users who wrote the largest numbers of reviews and the number of reviews they wrote
E. The number of distinct businesses that have been reviewed (0.5 point)
F. The top 10 businesses that had the largest numbers of reviews and the number of reviews they had
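The counting questions above can be sketched in plain Python before moving the logic to an RDD. This is only an illustration under assumptions: the sample records are made up, and the field names (`user_id`, `business_id`, `date`) and the date layout starting with the year are assumed from the public Yelp review schema. The comments name the RDD operations that would play the same role.

```python
import json

# Three made-up lines in the shape assumed for test_review.json
# (one JSON object per line; field names are an assumption).
SAMPLE_LINES = [
    '{"review_id": "r1", "user_id": "u1", "business_id": "b1", "date": "2018-03-01 10:00:00"}',
    '{"review_id": "r2", "user_id": "u2", "business_id": "b1", "date": "2017-11-20 08:30:00"}',
    '{"review_id": "r3", "user_id": "u1", "business_id": "b2", "date": "2018-07-15 19:45:00"}',
]


def parse_review(line):
    """Turn one line of the file into a dict."""
    return json.loads(line)


def written_in(review, year):
    """True if the review's date string starts with the given year."""
    return review["date"][:4] == str(year)


# With Spark, the list below would be sc.textFile(path).map(parse_review).
reviews = [parse_review(line) for line in SAMPLE_LINES]

total_reviews = len(reviews)                                    # RDD: count()
reviews_2018 = sum(1 for r in reviews if written_in(r, 2018))   # RDD: filter(...).count()
distinct_users = len({r["user_id"] for r in reviews})           # RDD: map(...).distinct().count()

print(total_reviews, reviews_2018, distinct_users)
```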
Input format: (we will use the following command to execute your code)
Python:
spark-submit --executor-memory 4G --driver-memory 4G task1.py <review_filepath> <output_filepath>
Scala:
spark-submit --class task1 --executor-memory 4G --driver-memory 4G hw1.jar <review_filepath> <output_filepath>
Output format:
IMPORTANT: Please strictly follow the output format since your code will be graded automatically.
a. The output for Questions A/B/C/E will be a number. The output for Questions D/F will be a list sorted by the number of reviews in descending order. If two user_ids/business_ids have the same number of reviews, sort them in alphabetical order.
b. You need to write the results to a file in JSON format. You must use exactly the same tags (see the red boxes in Figure 2) for answering each question.
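The ordering rule for Questions D/F (count descending, then id alphabetical) can be expressed as a single sort key. The snippet below is a minimal sketch on made-up (id, count) pairs; `top_n` is a hypothetical helper name. The same key function could be passed to PySpark's `RDD.takeOrdered(10, key=...)` or `RDD.sortBy(...)`.

```python
def top_n(counts, n=10):
    """Return the n (id, count) pairs with the largest counts.

    Ties on the count are broken by the id in alphabetical order.
    """
    # Negating the count gives descending order on counts while the
    # id stays in ascending (alphabetical) order.
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample data: (user_id, number_of_reviews).
sample = [("u3", 5), ("u1", 7), ("u2", 5), ("u4", 1)]
print(top_n(sample, 3))  # [('u1', 7), ('u2', 5), ('u3', 5)]
```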
4.2 Task2: Partition
Since processing large volumes of data requires performance decisions, properly partitioning the data for processing is imperative.
In this task, you will show the number of partitions for the RDD used for Task 1 Question F and the number of items per partition.
Then you need to use a customized partition function to improve the performance of map and reduce tasks. A time duration (for executing Task 1 Question F) comparison between the default partition and the customized partition (RDD built using the partition function) should also be shown in your results.
Hint:
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation. A partition function that places records with the same key in the same partition up front reduces the data that later stages must shuffle, which can improve performance considerably.
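As one possible illustration of the hint, the sketch below defines a deterministic key-to-integer function and simulates how keys would be spread over partitions. The names `business_partitioner` and `items_per_partition` are hypothetical, the keys are made up, and CRC32 is just one assumed choice of hash. In PySpark, such a function could be passed as `partitionFunc` to `RDD.partitionBy(numPartitions, partitionFunc)`, which takes the result modulo the number of partitions; `getNumPartitions()` and `glom().map(len).collect()` report the partition count and the items per partition.

```python
import zlib


def business_partitioner(business_id):
    """Map a key to a non-negative integer, identically in every process.

    CRC32 is used here because Python's built-in hash() of a string can
    differ between processes unless PYTHONHASHSEED is fixed.
    """
    return zlib.crc32(business_id.encode("utf-8"))


def items_per_partition(keys, n_partition):
    """Simulate the partition sizes produced by the function above."""
    sizes = [0] * n_partition
    for key in keys:
        sizes[business_partitioner(key) % n_partition] += 1
    return sizes


# Made-up business ids; equal ids always land in the same partition.
keys = ["b1", "b2", "b1", "b3", "b2", "b1"]
sizes = items_per_partition(keys, 3)
print(sizes)
```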
Input format: (we will use the following command to execute your code)
Python:
spark-submit --executor-memory 4G --driver-memory 4G task2.py <review_filepath> <output_filepath> <n_partition>
Scala:
spark-submit --class task2 --executor-memory 4G --driver-memory 4G hw1.jar <review_filepath> <output_filepath> <n_partition>
Output format:
A. The output for the number of partitions and the execution time will each be a number. The output for the number of items per partition will be a list of numbers.
B. You need to write the results in a JSON file. You must use exactly the same tags.
Figure 3: JSON output structure for task2
4.3 Task3: Exploration on Multiple Datasets
In task3, you are asked to explore two datasets together containing review information (test_review.json) and business information (business.json) and write a program to answer the following questions:
A. What are the average stars for each city? (DO NOT use the stars information in the business file) (1 point)
B. You are required to compare the execution time of two ways of printing the top 10 cities with the highest stars. You should store the execution times (measured from the start of loading the file) in the JSON file with the tags “m1” and “m2”.
Method1: Collect all the data, sort in python, and then print the first 10 cities
Method2: Sort in Spark, take the first 10 cities, and then print these 10 cities
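The two methods can be mimicked in plain Python to show the shape of the comparison. This is only a sketch on made-up averages: `heapq.nsmallest` stands in for the Spark-side sort-and-take (for example `RDD.sortBy(order_key).take(10)` or `RDD.takeOrdered(10, key=order_key)`), and in the real task each timer would start before the input files are loaded. Only the tags “m1” and “m2” come from the assignment text; any other tag required by the figure is not shown here.

```python
import heapq
import json
import time


def order_key(city_stars):
    """Sort by stars descending, then by city name alphabetically."""
    city, stars = city_stars
    return (-stars, city)


# Made-up (city, average_stars) pairs standing in for the real result.
averages = [("Phoenix", 3.5), ("Las Vegas", 4.0), ("Toronto", 3.5), ("Tempe", 4.2)]

# Method 1: bring everything to the driver, sort it all, slice the first 10.
start = time.time()
m1_top = sorted(averages, key=order_key)[:10]
m1 = time.time() - start

# Method 2 stand-in: ask only for the 10 smallest items under the sort key.
start = time.time()
m2_top = heapq.nsmallest(10, averages, key=order_key)
m2 = time.time() - start

result = json.dumps({"m1": m1, "m2": m2})
print(m1_top)
print(result)
```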
Input format: (we will use the following command to execute your code)
Python:
spark-submit --executor-memory 4G --driver-memory 4G task3.py <review_filepath> <business_filepath> <output_filepath_question_a> <output_filepath_question_b>
Scala:
spark-submit --class task3 --executor-memory 4G --driver-memory 4G hw1.jar <review_file> <business_file> <output_filepath_question_a> <output_filepath_question_b>
Output format:
a. You need to write the results for Question A to a file. The header (first line) of the file is “city,stars”. The output should be sorted by the average stars in descending order. If two cities have the same stars, sort the cities in alphabetical order. (see Figure 3 left)
b. You also need to write the answer for Question B in a JSON file. You must use exactly the same tags for the task.
Figure 3: Question A output file structure (left) and JSON output structure (right) for task3
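For Question A, the per-city average and the “city,stars” file layout can be sketched as follows. The (city, stars) pairs are made up and stand for the result of joining reviews with businesses on business_id; `add_pairs`, `average_by_city`, and `to_csv_lines` are hypothetical helper names. With an RDD, `add_pairs` is the kind of function one might pass to `reduceByKey` after mapping each record to `(city, (stars, 1))`.

```python
def add_pairs(a, b):
    """Combine two (sum_of_stars, review_count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def average_by_city(city_star_pairs):
    """Accumulate (sum, count) per city, then divide to get the average."""
    totals = {}
    for city, stars in city_star_pairs:
        totals[city] = add_pairs(totals.get(city, (0.0, 0)), (stars, 1))
    return {city: s / n for city, (s, n) in totals.items()}


def to_csv_lines(averages):
    """Header line first, then rows by stars descending and city ascending."""
    rows = sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["city,stars"] + ["{},{}".format(city, stars) for city, stars in rows]


# Made-up joined records: (city, stars of one review).
pairs = [("Tempe", 4.0), ("Phoenix", 3.0), ("Tempe", 5.0), ("Mesa", 4.5)]
lines = to_csv_lines(average_by_city(pairs))
print("\n".join(lines))
```

Carrying the sum and the count together keeps the combine step associative, which a plain running average would not be.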