INF553 Assignment 1: Spark Installation, Scala Installation and Python Configuration
In this assignment, students will complete three tasks. The goal of these tasks is to let students get familiar with Spark and perform data analysis using Spark. In this description, the first part explains how to configure the environment and the datasets, the second part describes the three tasks in detail, and the third part covers the files students should submit and the grading criteria.
Spark Installation

Spark can be downloaded from the official website (refer to: link).
Please use Spark 2.3.1 with Hadoop 2.7 for this assignment. The interface of the Spark official website is shown in the following figure.
Scala Installation

You can use IntelliJ if you prefer an IDE for creating and debugging projects; install the Scala/SBT plugins for IntelliJ. You can refer to the tutorial "Setting UP Spark 2.0 environment on intellij community edition".
Python Configuration

You need to add the paths of your Spark (path/to/your/Spark) and Python (path/to/your/Spark/python) folders to the interpreter's environment variables, named SPARK_HOME and PYTHONPATH respectively.
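For illustration, here is a minimal sketch of the same setup done from inside Python, assuming Spark is unpacked at the hypothetical location /path/to/your/Spark (the py4j archive name under python/lib varies with the Spark download):

    import glob
    import os
    import sys

    # Hypothetical install location; replace with your actual Spark directory.
    SPARK_HOME = "/path/to/your/Spark"

    os.environ["SPARK_HOME"] = SPARK_HOME
    # Equivalent to adding these entries to the PYTHONPATH variable:
    sys.path.append(os.path.join(SPARK_HOME, "python"))
    sys.path.extend(glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*.zip")))

    from pyspark import SparkContext  # should import cleanly once the paths are set

Setting the two variables in the interpreter's run configuration achieves the same effect without any code.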
Data

Please download the "Stack Overflow 2018 Developer Survey" data from this link. A detailed introduction to the data can also be found through the link.
You are required to download the dataset that contains two files: survey_results_public.csv and survey_results_schema.csv. The first file contains the survey responses and will be required for this homework. The second file describes the 129 columns of the dataset. In this assignment, only 3 columns of the dataset will be used: Country, Salary, and SalaryType.
Task 1: Students are required to compute the total number of survey responses per country that have provided a salary value; i.e., response entries containing 'NA' or '0' salary values are considered non-useful responses and should be discarded.
Result format:
1. Save the result as one csv file;
2. The first line in the file should contain the keyword ‘Total’ and the total number of survey responses containing a salary;
3. The result is ordered by country in ascending order.
The following snapshot is an example of the result for Task 1:
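For reference, a possible PySpark sketch of Task 1 follows, assuming the input file sits in the working directory. Note that split(",") is a sketch-level simplification: the survey's free-text and quoted salary fields contain embedded commas, so a real solution needs a proper CSV parser.

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "Task1")  # 2 processors, per the Task 2 hints

    lines = sc.textFile("survey_results_public.csv")
    header = lines.first()
    columns = header.split(",")
    country_idx = columns.index("Country")
    salary_idx = columns.index("Salary")

    # Keep only responses that actually provide a salary value.
    rows = (lines.filter(lambda line: line != header)
                 .map(lambda line: line.split(","))
                 .filter(lambda r: r[salary_idx] not in ("NA", "0", "")))

    counts = rows.map(lambda r: (r[country_idx], 1)).reduceByKey(lambda a, b: a + b)
    total = counts.map(lambda kv: kv[1]).sum()
    per_country = counts.sortByKey().collect()  # ascending order by country

The 'Total' line from the result format above would be written out first, followed by the per-country counts.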
Task 2: Since processing large volumes of data requires performance decisions, properly partitioning the data for processing is imperative. In this task, students are required to show the number of partitions for the RDD built in Task 1 and the number of items per partition. Then, students have to use a partition function (driven by the country value) to improve the performance of the map and reduce tasks. A comparison of the time taken by the standard RDD (the one used in Task 1) and by the partitioned RDD (built using the partition function) should also be shown.
Hints for Task 2:
1. When initializing the SparkContext, limit the number of processors to 2;
2. Only 2 partitions should be used.
Result format:
1. Save the result as one csv file.
2. The file should have two lines (one for standard and another for partition). The second and third columns should list the number of items per partition.
3. A separate file should show the total time spent to perform a simple reduce operation for both standard and partition:

    rdd.reduceByKey((a, b) => a + b).collect()
The following snapshot is an example of the result for Task 2:
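One way to sketch Task 2 in PySpark, reusing rows and country_idx from the Task 1 sketch above (glom() collects each partition into a list, which makes counting items per partition straightforward):

    import time

    pairs = rows.map(lambda r: (r[country_idx], 1))

    standard = pairs                    # default partitioning, as in Task 1
    partitioned = pairs.partitionBy(2)  # hash-partitioned on the country key

    def partition_sizes(rdd):
        # Number of items held in each partition.
        return rdd.glom().map(len).collect()

    def timed_reduce(rdd):
        start = time.time()
        rdd.reduceByKey(lambda a, b: a + b).collect()
        return time.time() - start

    print("standard", partition_sizes(standard), timed_reduce(standard))
    print("partition", partition_sizes(partitioned), timed_reduce(partitioned))

Because the partitioned RDD already groups each country's pairs on one partition, the reduce avoids a full shuffle, which is where the time difference comes from.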
Task 3: Students are required to compute annual salary averages per country and to show the min and max salaries.
Hints for Task 3:
1. Some salary values represent weekly or monthly payments. Remember to perform the appropriate transformations to compute the annual salary. The value in the SalaryType column indicates whether the salary amount is annual, weekly, or monthly.
Result format:
1. Save the result as one csv file.
2. The result is ordered by country in ascending order.
3. Columns should present: country, number of salaries, min salary, max salary, and average salary.
The following snapshot is an example of the result for Task 3:
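A possible PySpark sketch for Task 3, reusing rows, columns, country_idx, and salary_idx from the Task 1 sketch; the SalaryType labels and the 52-week/12-month conversion factors below are assumptions that should be checked against the actual data:

    salary_type_idx = columns.index("SalaryType")

    # Assumed labels and conversion factors: 52 weeks or 12 months per year.
    FACTORS = {"Weekly": 52.0, "Monthly": 12.0, "Yearly": 1.0}

    def annual_salary(r):
        amount = float(r[salary_idx].replace(",", ""))  # values like "51,408"
        return amount * FACTORS.get(r[salary_type_idx], 1.0)

    # Per country: (count, min, max, sum) in one pass, then derive the average.
    zero = (0, float("inf"), float("-inf"), 0.0)
    stats = (rows.map(lambda r: (r[country_idx], annual_salary(r)))
                 .aggregateByKey(zero,
                                 lambda a, s: (a[0] + 1, min(a[1], s),
                                               max(a[2], s), a[3] + s),
                                 lambda a, b: (a[0] + b[0], min(a[1], b[1]),
                                               max(a[2], b[2]), a[3] + b[3])))

    # Rows of (country, (count, min, max, average)), ordered by country.
    result = (stats.mapValues(lambda a: (a[0], a[1], a[2], a[3] / a[0]))
                   .sortByKey()
                   .collect())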