● Tidy the data by transforming it from the current wide format into a long format.
● Compare hourly pedestrian count distributions at locations Melbourne.Central and Southern.Cross.Station using box plots.
● Compare hourly pedestrian count distributions at The.Arts.Centre and Southbank.
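The reshaping and comparison steps above can be sketched in base R as follows. This is only an illustration on invented data: the data frame `wide`, its `Hour` column and all count values are assumptions, and only the two location column names are taken from the exercise. The real file may be laid out differently.

```r
# Toy wide-format data: one row per hourly record, one column per location.
# The counts are randomly generated, not real pedestrian data.
set.seed(1)
wide <- data.frame(
  Hour = rep(0:23, times = 2),
  The.Arts.Centre = rpois(48, lambda = 300),
  Southbank = rpois(48, lambda = 500)
)

# stack() returns two columns: 'values' (the counts) and 'ind'
# (a factor holding the name of the column each value came from).
stacked <- stack(wide[, c("The.Arts.Centre", "Southbank")])

# Long format: one row per (hour, location) observation.
long <- data.frame(
  Hour = rep(wide$Hour, times = 2),
  Count = stacked$values,
  Location = stacked$ind
)

# One box per location, drawn directly from the long data.
boxplot(Count ~ Location, data = long,
        ylab = "Hourly pedestrian count", main = "Toy data")
```

Packages such as tidyr offer an alternative route to the same long format; base R is used here only to keep the sketch free of dependencies.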
Tutorial Activities
1. Slide 45 lists three models for data analytics: KDD, SEMMA and CRISP-DM. Describe each model and outline its origin, then identify the main similarities and differences between the models. You can use these Wikipedia pages as a starting point.
http://en.wikipedia.org/wiki/Data_mining http://en.wikipedia.org/wiki/SEMMA
http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
http://link.springer.com/article/10.1023%2FA%3A1021564703268 )
(a) Simplify the taxonomy by making groups of errors you think are closely related.
(b) Choose 10 specific error types and give an example of each.
3. Briefly read "Tidy Data" (http://www.jstatsoft.org/v59/i10/paper) and summarize the main principles of tidy data.
(b) Over the same time period, investigate whether the customers who spend the most in total visit the store more often than those who spend the least. To do this, create two groups of 10 customers: those with the highest total spend and those with the lowest. You can then compare the number of visits made by each customer in the two groups. Hint: you might want to use the “length” function to count the number of visits made by each customer.
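One possible way to build the two groups is sketched below. It assumes each row of the data is a single store visit; the data frame `visits`, its columns `Customer` and `Spend`, and all values are hypothetical stand-ins for the real data set.

```r
# Toy transaction data: one row per store visit (invented values).
set.seed(42)
ids <- sprintf("C%02d", 1:40)
visits <- data.frame(
  Customer = rep(ids, times = sample(2:20, 40, replace = TRUE))
)
visits$Spend <- round(runif(nrow(visits), min = 5, max = 200), 2)

# Total spend and number of visits per customer.
# length() simply counts the rows belonging to each customer.
total_spend <- tapply(visits$Spend, visits$Customer, sum)
n_visits <- tapply(visits$Spend, visits$Customer, length)

# Rank customers by total spend, then take the 10 highest and 10 lowest.
ranked <- names(total_spend)[order(total_spend, decreasing = TRUE)]
top10 <- head(ranked, 10)
bottom10 <- tail(ranked, 10)

# Compare the visit counts of the two groups numerically and visually.
summary(as.vector(n_visits[top10]))
summary(as.vector(n_visits[bottom10]))
boxplot(list(Highest = n_visits[top10], Lowest = n_visits[bottom10]),
        ylab = "Number of visits")
```

With only 10 customers per group, any difference seen in the box plot should be treated as indicative rather than conclusive.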
6. The data file “govhackelectricitytimeofusedataset.csv” has been created from the .txt file originally available as part of the Australian Government’s data resources. See link at: https://data.gov.au/dataset/sample-household-electricity-time-of-use-data. The file contains the smart meter records for a number of households, recorded at 30-minute intervals over varying periods of time. The first few rows of the csv file are below.
The columns of interest are “Customer_Key” (meter), “End Datetime”, and “General SupplyKWH” (power used in each 30-minute interval).
Using the 30-minute general supply, calculate the daily supply for each meter for every day on which data is available. Because the number of records per day is unreliable, you will also need to count the number of daily observations for each (day, meter) pair. You should then discard any (day, meter) readings that do not have the complete number of observations.
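The aggregate, count and filter steps described above might look like the following sketch. The data frame `readings` and its columns `meter`, `end_time` and `kwh` are hypothetical stand-ins for the real columns, and the values are invented. Two further assumptions are made: a complete day has 48 half-hour readings, and a reading belongs to the day in which its interval started, so the reading ending at midnight counts towards the previous day.

```r
# Toy smart-meter data (invented): two meters, two days, 30-minute readings.
times <- seq(as.POSIXct("2013-01-01 00:30:00", tz = "UTC"),
             by = "30 min", length.out = 96)
readings <- data.frame(
  meter = rep(c("A", "B"), each = 96),
  end_time = rep(times, times = 2),
  kwh = 0.25
)
# Drop three of meter B's readings to create an incomplete day.
readings <- readings[-(100:102), ]

# Assign each reading to the day its interval started (an assumption):
# subtract 30 minutes from the end time before extracting the date.
readings$day <- as.Date(readings$end_time - 30 * 60, tz = "UTC")

# Daily total and number of observations for each (day, meter) pair.
daily <- aggregate(kwh ~ day + meter, data = readings, FUN = sum)
counts <- aggregate(kwh ~ day + meter, data = readings, FUN = length)
daily$n_obs <- counts$kwh

# Keep only the complete days.
complete <- daily[daily$n_obs == 48, ]
complete
```

Both calls to `aggregate` use the same formula and data, so their rows come back in the same order and the counts can be attached by position. Merging on `day` and `meter` would be a more defensive alternative.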
7. Analyse the Anscombe data set (anscombe). This data set is part of the base R installation and consists of 4 pairs of x, y variables (x1, y1 through x4, y4), each with 11 observations.
(a) Using summary statistics and correlation, describe the main similarities and differences between the pairs.
(b) Now, using some visual analysis, describe the similarities and differences between the pairs.
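A minimal starting point for both parts is sketched below. The choice of statistics (mean, variance, correlation) and the 2 x 2 panel layout are illustrative assumptions, not the only reasonable approach; the object name `stats` is made up for this sketch.

```r
data(anscombe)

# (a) Summary statistics and correlation for each of the four (x, y) pairs.
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y), cor_xy = cor(x, y))
})
colnames(stats) <- paste0("pair", 1:4)
round(stats, 2)

# (b) Scatter plots with a fitted least-squares line, one panel per pair.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Pair", i))
  abline(lm(y ~ x))
}
par(op)
```

Comparing the table from part (a) with the four panels from part (b) is the point of the exercise.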