FIT5202 Assignment 2A: Building models to predict pedestrian traffic

Required Datasets (available in Moodle):

-       Two data files:

        -       Pedestrian_Counting_System_-_Monthly_counts_per_hour.csv

        -       Pedestrian_Counting_System_-_Sensor_Locations.csv

-       A metadata file is included which contains information about the datasets.

-       These files are available in Moodle under the Assessment 2A data folder.

Information on Dataset
Two data files from the City of Melbourne are provided, which capture the hourly count of pedestrians recorded by the sensors and the corresponding sensor locations. The data is also available on the website https://data.melbourne.vic.gov.au/.

What you need to achieve
The MelbourneGig company requires us to build models that predict whether the hourly pedestrian count will go above the threshold of 2000, and models that predict the count itself. We therefore need binary classification models and regression models.

Use case 1: Predict whether the count will go above 2000 for the hours between 9:00am and midnight (binary classification)

Use case 2: Predict the possible count for the hours between 9:00am and midnight (regression)

●     To build the binary classification models, use the column “Hourly_Count” to create a binary label

●     To build the regression models, use the column “Hourly_Count” as your label  

Architecture
The overall architecture of the assignment setup is represented by the following figure. Part A of the assignment consists of preparing the data, performing data exploration, extracting features, and building and persisting the machine learning models.

Fig 1: Overall architecture for assignment 2

In both parts, for the data pre-processing and the machine learning processes, you are required to implement the solutions using the PySpark SQL / MLlib / ML packages. For the data visualisations, excessive usage of pandas for data processing is discouraged. Please follow the steps to document the processes and write the code in the Jupyter Notebook.

Getting Started 
●     Download the datasets from Moodle.

●     Create an Assignment-2A.ipynb file in Jupyter Notebook to write your solution for processing the data.

You will be using Python 3+ and PySpark 3.0+ for this assignment.             


1. Data Loading and exploration
In this section, you will need to load the given datasets into PySpark DataFrames and use DataFrame functions to process the data. Excessive usage of Spark SQL or pandas is discouraged. For plotting, different visualisation packages can be used, but you need to include instructions to install any additional packages and ensure the installation succeeds in the provided VM setup.

 1.1 Data Loading 
1.    Write the code to get a SparkSession. For creating the SparkSession, you need to use a SparkConf object to configure the Spark app with a proper application name, to use UTC as the session timezone, and to run locally with as many working processors as local cores on your machine[1].

 2.    Write code to define the data schema for both the pedestrian count CSV file and the sensor location CSV file, following the data types suggested in the metadata file[2], with the exception of the “location” column

a. Use StringType for the “location” column

 3. Using the predefined schemas, write code to load the pedestrian count CSV file into a DataFrame, and load the sensor location CSV file into another DataFrame. Print the schema of both DataFrames after loading.

 4.    Write code to create an additional column “above_threshold” in the pedestrian count DataFrame to indicate whether the hourly count is above 2000 or below. Use label 0 for counts below 2000, and label 1 for counts above or equal to 2000 (a brief sketch of steps 1 to 4 follows this list)
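A minimal sketch of steps 1 to 4 is shown below. It assumes the CSV file names listed above; column names such as “ID” and “sensor_id” are illustrative placeholders, and the schema fields other than “Date_Time”, “Hourly_Count” and “location” are abbreviated and should be completed from the metadata file.

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Step 1: configure the Spark app - proper name, UTC session timezone, all local cores
conf = SparkConf() \
    .setAppName("FIT5202 Assignment 2A") \
    .setMaster("local[*]") \
    .set("spark.sql.session.timeZone", "UTC")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Step 2: schemas following the metadata file (only a few fields shown here)
count_schema = StructType([
    StructField("ID", IntegerType()),
    StructField("Date_Time", StringType()),
    # ... remaining columns as per the metadata file ...
    StructField("Hourly_Count", IntegerType()),
])
sensor_schema = StructType([
    StructField("sensor_id", IntegerType()),
    # ... remaining columns as per the metadata file ...
    StructField("location", StringType()),   # StringType as required
])

# Step 3: load the CSV files with the predefined schemas and print both schemas
count_df = spark.read.csv(
    "Pedestrian_Counting_System_-_Monthly_counts_per_hour.csv",
    header=True, schema=count_schema)
sensor_df = spark.read.csv(
    "Pedestrian_Counting_System_-_Sensor_Locations.csv",
    header=True, schema=sensor_schema)
count_df.printSchema()
sensor_df.printSchema()

# Step 4: binary label - 1 if Hourly_Count >= 2000, otherwise 0
count_df = count_df.withColumn(
    "above_threshold",
    F.when(F.col("Hourly_Count") >= 2000, 1).otherwise(0))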

1.2 Exploring the data  
 
1.    For the pedestrian count DataFrame, write code to show the basic statistics (including count, mean, stddev, min, max, 25th percentile, 50th percentile, 75th percentile) for each numeric column, except for the columns “above_threshold” and “Date_Time” (a brief sketch covering tasks 1 to 3 follows this list)

2.    Write code to show the count of above-threshold and below-threshold based on the column “above_threshold”

○     Do you see any class imbalance? Describe what you observe and discuss how it could impact classification

3.    Write code to display a histogram to show the distribution of the hourly counts with log-scale for the frequency axis, and a line-plot to show the trend of the average daily count change by month  

○     Describe what you observe from the plots.

4.    Explore the data provided and write code to present two plots[3] worth presenting to the MelbourneGig company; describe your plots and discuss the findings from the plots

○     Hint - 1: you can use basic plots (e.g. histograms, line charts, scatter plots) to show the relationship between a column and the label, or more advanced plots like correlation plots; 2: if your data is too large for plotting, consider sampling before plotting

○     150 words max for each plot’s description and discussion

○     Please do not repeat the plots in task 1.2.3.

                        ○    Please only use the provided data for visualisation
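A minimal sketch of tasks 1 to 3 is shown below, assuming the count DataFrame from section 1.1 is named count_df and that matplotlib is available in the VM (install it otherwise); the plotting portion is only one possible approach.

# Task 1: basic statistics for the numeric columns, excluding above_threshold and Date_Time
numeric_cols = [c for c, t in count_df.dtypes
                if t in ("int", "bigint", "double")
                and c not in ("above_threshold", "Date_Time")]
count_df.select(numeric_cols).summary(
    "count", "mean", "stddev", "min", "25%", "50%", "75%", "max").show()

# Task 2: class balance of the binary label
count_df.groupBy("above_threshold").count().show()

# Task 3 (partial): histogram of hourly counts with a log-scaled frequency axis
import matplotlib.pyplot as plt
hourly = [row["Hourly_Count"] for row in count_df.select("Hourly_Count").collect()]
plt.hist(hourly, bins=50)
plt.yscale("log")
plt.xlabel("Hourly_Count")
plt.ylabel("Frequency (log scale)")
plt.show()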

2. Feature extraction and ML training 
In this section, you will need to use PySpark DataFrame functions and ML packages for data preparation, model building and evaluation. Using other ML packages such as scikit-learn would receive zero marks. Excessive usage of Spark SQL is discouraged.

 2.1 Discuss the feature selection and prepare the feature columns 
 
1.    Considering the data exploration from 1.2 and the nature of time-series data, we would be performing a one-step time-series prediction, meaning that the model’s prediction for the next hour’s count would be based on the previous pedestrian count(s)[4]. The prediction is only needed for the hours between 9:00am and midnight. Which columns are you planning to use as features? Discuss the reasons for selecting them and how you create/transform them[5]

                        ○    400 words max for the discussion

                        ○    Please only use the provided data for model building

○ Hint - things to consider include whether to create more feature columns, whether to remove some columns, and whether to use insights from the data exploration, domain knowledge or statistical models

 2.    Write code to create the columns based on your discussion above (a brief sketch of one possible lagged feature follows)
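A minimal sketch of one possible lagged feature is shown below. It assumes columns named Sensor_ID, Time and Date_Time as in the metadata file, and a hypothetical timestamp format; adjust the names, the format and the feature set to your own discussion.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Parse Date_Time to a timestamp (the format string is an assumption - check the actual data)
count_df = count_df.withColumn(
    "ts", F.to_timestamp("Date_Time", "MMMM dd, yyyy hh:mm:ss a"))

# Previous hour's count for the same sensor (one-step lag)
w = Window.partitionBy("Sensor_ID").orderBy("ts")
count_df = count_df.withColumn("prev_hour_count",
                               F.lag("Hourly_Count", 1).over(w))

# Keep only the required prediction window (9:00am to midnight) and rows with a valid lag
count_df = count_df.filter((F.col("Time") >= 9) &
                           F.col("prev_hour_count").isNotNull())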

 2.2 Preparing Spark ML Transformers/Estimators for features, labels and models 
1. Write code to create Transformers/Estimators for transforming/assembling the columns you selected above in 2.1, and create ML model Estimators for the Decision Tree and Gradient Boosted Tree models for each use case (a brief sketch covering tasks 1 and 2 follows this list)

                        ○    Please DO NOT fit/transform the data yet
 
2. Write code to include the above Transformers/Estimators into pipelines

                        ○    A maximum of two pipelines can be created for each use case

                        ○    Please DO NOT fit/transform the data yet

3. For the Decision Tree classification model you have created, explain the purposes of the hyperparameters maxDepth and maxBins, and how they impact the model in theory and in this use case
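A minimal sketch of tasks 1 and 2 is shown below. The feature column names are placeholders standing in for the columns chosen in 2.1 (add StringIndexer/OneHotEncoder stages if you keep categorical columns); nothing is fitted here, as required.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier
from pyspark.ml.regression import DecisionTreeRegressor, GBTRegressor

# Assemble the selected feature columns into a single vector
assembler = VectorAssembler(
    inputCols=["prev_hour_count", "Time"],   # placeholder feature names
    outputCol="features")

# Use case 1: classification estimators
dt_clf = DecisionTreeClassifier(featuresCol="features", labelCol="above_threshold",
                                maxDepth=5, maxBins=32)
gbt_clf = GBTClassifier(featuresCol="features", labelCol="above_threshold")

# Use case 2: regression estimators
dt_reg = DecisionTreeRegressor(featuresCol="features", labelCol="Hourly_Count")
gbt_reg = GBTRegressor(featuresCol="features", labelCol="Hourly_Count")

# Pipelines (not fitted yet)
dt_clf_pipeline = Pipeline(stages=[assembler, dt_clf])
gbt_clf_pipeline = Pipeline(stages=[assembler, gbt_clf])
dt_reg_pipeline = Pipeline(stages=[assembler, dt_reg])
gbt_reg_pipeline = Pipeline(stages=[assembler, gbt_reg])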

 2.3 Preparing the training data and testing data 
 
1. Write code to split the data for training and testing purposes - use the data between 2014 and 2018 (including 2018) for training and the data in 2019 for testing[6]; then cache the training and testing data (a brief sketch follows)

○ Note: From task 2.1.1, the model training and the prediction are only needed for the hours between 9:00am and midnight.
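A minimal sketch of the split is shown below, assuming a Year column as named in the metadata file; the 9:00am-to-midnight filter from section 2.1 should be applied here if it has not been applied earlier.

from pyspark.sql import functions as F

# Training: 2014-2018 inclusive; testing: 2019
train_df = count_df.filter((F.col("Year") >= 2014) & (F.col("Year") <= 2018))
test_df = count_df.filter(F.col("Year") == 2019)

train_df.cache()
test_df.cache()
print(train_df.count(), test_df.count())   # an action to materialise the cache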

 2.4 Training and evaluating models

Use case 1
 1.    For use case 1, write code to use the corresponding ML Pipelines to train the models on the training data from 2.3, and then use the trained models to perform predictions on the testing data from 2.3[7] (a brief sketch covering tasks 1 to 3 follows the use case 1 items)

2.    For both models’ results in use case 1, write code to display the count of each combination of above-threshold/below-threshold label and prediction label in formats like the screenshot below. Compute the AUC, accuracy, recall and precision for the above-threshold/below-threshold label from each model’s testing result using the PySpark MLlib/ML APIs

○     Discuss which metric is more proper for measuring the model performance on predicting above-threshold events, in order to give the performers good recommendations while reducing the chance of falsely recommending a location.

  ○    Discuss which is the better model, and persist the better model. 

3.    For the Decision Tree classification model in use case 1, write code to print out the leaf node splitting criteria and the top-3 features with their corresponding feature importances. Describe the result in a way that could be understood by your potential users (e.g. street art performers)
 

4.    How can the prediction for use case 1 be improved? Propose at least two suggestions, elaborate on why each could improve the models, and also briefly explain how to implement each with code snippets (no need for a full implementation)

○     Hint - your suggestions should assume that model training is run on a Spark cluster with the data in either Spark RDD or DataFrame format; you can also suggest using additional packages which are compatible with Spark.

○     600 words max for the discussion
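A minimal sketch of tasks 1 to 3 for use case 1 is shown below, using the pipelines from 2.2 and the cached data from 2.3. The evaluators are from the PySpark ML APIs; which model is "better" and the save path are placeholders to be replaced by your own result.

from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# Task 1: fit the classification pipelines and predict on the testing data
dt_clf_model = dt_clf_pipeline.fit(train_df)
gbt_clf_model = gbt_clf_pipeline.fit(train_df)
dt_pred = dt_clf_model.transform(test_df)
gbt_pred = gbt_clf_model.transform(test_df)

# Task 2: count of each label/prediction combination, then AUC, accuracy,
# recall and precision (shown for the Decision Tree; repeat for the GBT)
dt_pred.groupBy("above_threshold", "prediction").count().show()
auc = BinaryClassificationEvaluator(labelCol="above_threshold",
                                    metricName="areaUnderROC").evaluate(dt_pred)
mc = MulticlassClassificationEvaluator(labelCol="above_threshold",
                                       predictionCol="prediction")
accuracy = mc.evaluate(dt_pred, {mc.metricName: "accuracy"})
recall = mc.evaluate(dt_pred, {mc.metricName: "recallByLabel", mc.metricLabel: 1.0})
precision = mc.evaluate(dt_pred, {mc.metricName: "precisionByLabel", mc.metricLabel: 1.0})
print(auc, accuracy, recall, precision)

# Persist the better model (placeholder choice and path)
gbt_clf_model.write().overwrite().save("models/use_case_1_best")

# Task 3: splitting criteria and top-3 feature importances of the Decision Tree
dt_stage = dt_clf_model.stages[-1]          # the DecisionTreeClassificationModel stage
print(dt_stage.toDebugString)
top3 = sorted(zip(assembler.getInputCols(), dt_stage.featureImportances.toArray()),
              key=lambda x: x[1], reverse=True)[:3]
print(top3)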

Use case 2

5.    For use case 2, write code to use the corresponding ML Pipelines to train the models on the cached training data from 2.3, and then use the trained models to perform predictions on the testing data from 2.3[8] (a brief sketch covering tasks 5 and 6 follows)

6.    For both models’ results in use case 2, compute the RMSE and R-squared

○     Discuss which is the better model, and persist the better model.
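A minimal sketch of tasks 5 and 6 is shown below, again with a placeholder choice of the better model and save path.

from pyspark.ml.evaluation import RegressionEvaluator

# Task 5: fit the regression pipelines on the cached training data
dt_reg_model = dt_reg_pipeline.fit(train_df)
gbt_reg_model = gbt_reg_pipeline.fit(train_df)

# Task 6: RMSE and R-squared for both models on the testing data
evaluator = RegressionEvaluator(labelCol="Hourly_Count", predictionCol="prediction")
for name, model in [("Decision Tree", dt_reg_model), ("GBT", gbt_reg_model)]:
    pred = model.transform(test_df)
    rmse = evaluator.evaluate(pred, {evaluator.metricName: "rmse"})
    r2 = evaluator.evaluate(pred, {evaluator.metricName: "r2"})
    print(name, "RMSE:", rmse, "R-squared:", r2)

# Persist the better regression model (placeholder choice and path)
gbt_reg_model.write().overwrite().save("models/use_case_2_best")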
        

3. Knowledge sharing  
In addition to building the machine learning models, the IT manager from MelbourneGig would like to learn more about parallel processing. You are expected to combine the theory from the lecture with observations from the Spark UI or the Spark source code to explain the ideas of data parallelism and result parallelism, using KMeans clustering as an example

3.1 How many jobs are observed when training the KMeans clustering model following the code below? Provide a screenshot from the Spark UI for running a simple KMeans model training on the provided data[9]

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Sample customer data: gender, age, annual income and spending score
customer_df = spark.createDataFrame([
    (0, 19, 15, 39),
    (0, 21, 15, 81),
    (1, 20, 16, 6),
    (1, 23, 16, 77),
    (1, 31, 17, 40),
    (1, 22, 17, 76),
    (1, 35, 18, 6),
    (1, 23, 18, 94),
    (0, 64, 19, 3),
    (1, 30, 19, 72),
    (0, 67, 19, 14),
    (1, 35, 19, 99),
    (1, 58, 20, 15)],
    ['gender', 'age', 'annual_income', 'spending_score'])

# Assemble the four columns into a single feature vector and fit KMeans with k=4
assembler = VectorAssembler(
    inputCols=['gender', 'age', 'annual_income', 'spending_score'],
    outputCol='features')
kmeans = KMeans(k=4).fit(assembler.transform(customer_df))

3.2 Combining the parallelism theory from the lecture, the Spark source code, and the Spark UI, explain whether data parallelism or result parallelism is adopted in the implementation of KMeans clustering in Spark

●     300 words max for the discussion

●     Hint - you can also refer to the Spark source code on GitHub: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
