In Assignment 1, you will complete three tasks. The goal of these tasks is to help you become familiar with Spark operations (e.g., transformations and actions) and MapReduce.
1. Requirements
2.1 Programming Requirements
a. You must use Python to implement all tasks. You may only use standard Python libraries (i.e., external libraries such as NumPy or pandas are not allowed). There is a 10% bonus for a Scala implementation. You can get the bonus only if both the Python and Scala implementations are correct.
b. You are required to use only Spark RDD, i.e., you will receive no points if you use Spark DataFrame or DataSet.
2.2 Programming Environment
2. Yelp Data
In this assignment, you are provided with two datasets (i.e., reviews and businesses) extracted from the Yelp dataset for developing your assignment.[1] You can access and download the datasets either under the directory on Vocareum: resource/asnlib/publicdata/ or in the Google Drive: https://drive.google.com/drive/folders/1dnVCZazaR84UkvhHAwoHyFqxmFF01IHa?usp=sharing
We generated these datasets by random sampling. The given datasets are only for your testing. During the grading period, we will use different sampled subsets of the Yelp datasets.
3. Tasks
You need to submit the following files on Vocareum (all lowercase):
A. Python scripts: task1.py, task2.py, task3.py
B. [OPTIONAL] Scala scripts: task1.scala, task2.scala, task3.scala; Jar package: hw1.jar
3.1 Task1: Data Exploration
4.1.1 Task description
You will explore the review dataset and write a program to answer the following questions:
A. The total number of reviews
B. The number of reviews in a given year, y
C. The number of distinct users who have written the reviews
D. Top m users with the largest number of reviews, together with their review counts
E. Top n frequent words in the review text. The words should be in lower case. The following punctuation, i.e., “(”, “[”, “,”, “.”, “!”, “?”, “:”, “;”, “]”, “)”, and the given stopwords are excluded.
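The cleaning step for Question E could be factored into a small helper. The sketch below is illustrative only (function and variable names are not required by the assignment); such a helper could be applied per review inside an RDD `flatMap`.

```python
# Punctuation to strip, as listed in Question E.
PUNCT = ["(", "[", ",", ".", "!", "?", ":", ";", "]", ")"]

def clean_words(text, stopwords):
    """Lower-case the text, replace the listed punctuation with spaces,
    split on whitespace, and drop stopwords."""
    text = text.lower()
    for p in PUNCT:
        text = text.replace(p, " ")
    return [w for w in text.split() if w not in stopwords]

# With Spark, this might be used as (field names assume the Yelp schema):
#   words = reviews_rdd.flatMap(lambda r: clean_words(r["text"], stopwords))
```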
Scala: $ spark-submit --class task1 hw1.jar <input_file> <output_file> <stopwords> <y> <m> <n>
Params:
input_file – the input file (the review dataset)
output_file – the output file containing your answers
stopwords – the file containing the stopwords that should be removed for Question E
y/m/n – see 4.1.1
4.1.3 Output format:
You must write the results in JSON format using exactly the same tags for each question (see an example in Figure 2). The answer for A/B/C is a number. The answer for D is a list of pairs [user, count]. The answer for E is a list of frequent words. All answers should be sorted by count in descending order. If two users/words have the same count, sort them in alphabetical order.
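The tie-breaking rule (count descending, then alphabetical) can be expressed as a single sort key. A minimal sketch, with an illustrative function name:

```python
def sort_pairs(pairs):
    """Sort [item, count] pairs by count descending,
    breaking ties alphabetically on the item."""
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

print(sort_pairs([["bob", 3], ["amy", 5], ["ann", 3]]))
# → [['amy', 5], ['ann', 3], ['bob', 3]]
```

The same key function works for Questions D and E (for E, keep only the words after sorting).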
Figure 2: An example output for task1 in JSON format
3.2 Task2: Exploration on Multiple Datasets
4.2.1 Task description
In task2, you will explore the two datasets together (i.e., review and business) and write a program to compute the average stars for each business category and output the top n categories with the highest average stars. The business categories should be extracted from the “categories” tag in the business file and split by comma (you also need to remove leading and trailing spaces from the extracted categories). You need to implement a version without Spark and compare it with a version using Spark in terms of execution time.
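The no-Spark version boils down to a join on business id followed by a grouped average. The sketch below is a minimal illustration operating on already-parsed records (file I/O omitted); the field names "business_id", "stars", and "categories" are assumed from the Yelp schema.

```python
from collections import defaultdict

def top_categories(reviews, businesses, n):
    """Top n categories by average review stars.
    reviews: dicts with "business_id" and "stars";
    businesses: dicts with "business_id" and "categories"
    (a comma-separated string)."""
    # Map each business to its cleaned category list.
    cats = {}
    for b in businesses:
        if b.get("categories"):
            cats[b["business_id"]] = [c.strip() for c in b["categories"].split(",")]
    # Accumulate (sum, count) of stars per category.
    totals = defaultdict(lambda: [0.0, 0])
    for r in reviews:
        for c in cats.get(r["business_id"], []):
            totals[c][0] += r["stars"]
            totals[c][1] += 1
    # Average, then sort by stars descending with alphabetical tie-break.
    avgs = [[c, s / k] for c, (s, k) in totals.items()]
    avgs.sort(key=lambda p: (-p[1], p[0]))
    return avgs[:n]
```

In the Spark version the same logic would typically be expressed with a `join` on business id followed by `flatMap`, `aggregateByKey` (or `reduceByKey`), and `sortBy`.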
Scala: $ spark-submit --class task2 hw1.jar <review_file> <business_file> <output_file> <if_spark> <n>
Params:
review_file – the input file (the review dataset)
business_file – the input file (the business dataset)
output_file – the output file containing your answers
if_spark – either “spark” or “no_spark”
n – top n categories with the highest average stars (see 4.2.1)
4.2.3 Output format:
You must write the results in JSON format using exactly the same tags (see an example in Figure 3). The answer is a list of pairs [category, stars], sorted by stars in descending order. If two categories have the same value, sort them in alphabetical order.
Figure 3: An example output for task2 in JSON format
3.3 Task3: Partition
4.3.1 Task description
In this task, you will learn how partitions work in an RDD. You need to compute the businesses that have more than n reviews in the review file. At the same time, you need to show the number of partitions of the RDD and the number of items per partition, with either the default or a customized partition function (1 pt). You should design a customized partition function to improve computational efficiency, i.e., to reduce execution time.
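With Spark, `partitionBy` on a pair RDD accepts any plain function from key to partition index, and `glom()` exposes the items in each partition. The idea can be illustrated without Spark; the sketch below uses an illustrative deterministic string hash (in practice you might use Python's built-in `hash` or `pyspark.rdd.portable_hash` instead).

```python
def business_partitioner(business_id, n_partitions):
    """A customized partition function: map a business id to a
    partition index. Deterministic character-sum hash, illustrative only."""
    return sum(ord(ch) for ch in business_id) % n_partitions

def items_per_partition(keys, n_partitions):
    """Count how many keys land in each partition — what
    rdd.glom().map(len).collect() would report in Spark."""
    counts = [0] * n_partitions
    for k in keys:
        counts[business_partitioner(k, n_partitions)] += 1
    return counts
```

In Spark the corresponding call might look like `rdd.partitionBy(n_partitions, lambda k: business_partitioner(k, n_partitions))`; hashing on the business id keeps all reviews of a business in one partition, so the per-business counts need no further shuffle.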
Scala: $ spark-submit --class task3 hw1.jar <input_file> <output_file> <partition_type> <n_partitions> <n>
Params:
input_file – the input file (the review dataset)
output_file – the output file containing your answers
partition_type – the partition function, either “default” or “customized”
n_partitions – the number of partitions (only effective for the customized partition function)
n – the threshold number of reviews (see 4.3.1)
4.3.3 Output format:
You must write the results in JSON format using exactly the same tags (see an example in Figure 4). The answer for the number of partitions is a number. The answer for the number of items per partition is a list of numbers. The answer for the result is a list of pairs [business, count] (no need to sort).
Figure 4: An example output for task3 in JSON format