FIT1043 Assignment 3 Solution

1) Please hand in a PDF file containing your answers to all the questions, numbered correspondingly.
2) Your report should include the following:
● Screenshots/images of the outputs/graphs you generate, to justify your answers to all the questions.
3) You need to explain what each part of a command does for all your answers. For instance, if the code you use is ‘unzip tutorial_data.zip’, you need to explain that the command uncompresses the zip file.
4) Please don’t include the questions in your submission (doing so incurs a 5% penalty).
NOTE: Two data sets for this assignment are in the Google shared drive:
https://drive.google.com/drive/folders/1Ala0KnoHlgeXFpxVwr7OJaOzSR7xxHSN?usp=sharing
Both are large, so your best bet is to download them while in the lab/studio and do the assignment there. You will need a Linux machine, a Mac terminal, or Cygwin on a Windows machine.
Assignment Tasks:
Download the file dataset_TIST2015.tar, which contains user check-in data from Foursquare (https://foursquare.com/).
8) Background: How would you select venues from Europe? Consider the structure of the data presented in the readme file. Check-ins are indexed by a Venue ID, and the venues themselves are described in a separate file, the POI file. You can select European venues from the POI file in (at least) two ways: select items in a latitude/longitude bounding box, or select items by country code. Don’t worry too much about the exact boundaries (including or excluding Turkey, Ukraine, etc. is OK either way).
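The two selection approaches could be sketched in the shell roughly as follows. This is only an illustration under stated assumptions: it assumes the POI file is tab-separated with columns venue ID, latitude, longitude, category and country code (check the readme for the actual layout), and the bounding-box limits, the short country list and the function names are all made up for the example.

```shell
# Assumed POI layout (verify against the readme): tab-separated columns
#   1 = venue ID, 2 = latitude, 3 = longitude, 4 = category, 5 = country code

# Option 1: keep rows whose coordinates fall inside a rough bounding box.
# The limits are illustrative only, not an official definition of Europe.
select_by_bbox() {
    awk -F'\t' -v lat_min=35 -v lat_max=72 -v lon_min=-25 -v lon_max=45 \
        '$2 >= lat_min && $2 <= lat_max && $3 >= lon_min && $3 <= lon_max' "$1"
}

# Option 2: keep rows whose country code appears in a chosen list.
# The list is a deliberately short sample; extend it as needed.
select_by_country() {
    awk -F'\t' 'BEGIN { n = split("GB FR DE IT ES", c, " ")
                        for (i = 1; i <= n; i++) eu[c[i]] = 1 }
                $5 in eu' "$1"
}
```

For example, with an assumed file name, `select_by_country dataset_TIST2015_POIs.txt | cut -f5 | sort | uniq -c | sort -rn` would list the number of selected venues per country code, largest first. The bounding box is simpler to write but also catches non-European locations inside the box, while the country list is exact but has to be compiled by hand.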
B. Which country has the most venues and which has the fewest, and how many venues does each have?
C. Which country has the most Seafood restaurants?
D. What is the most common class of restaurant in Europe (measured by number of venues)?
In this task you are working with the Twitter_Data_1.gz data file. Please decompress the file and answer the following questions.
(Note: If the term appears twice in a tweet, count it as two occurrences.)
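Counting every occurrence rather than every matching line matters here: `grep -c` reports the number of matching lines, whereas `grep -o` prints each match on its own line so that `wc -l` counts occurrences. A minimal sketch, assuming one tweet per line; the function name is hypothetical, and treating the match as case-insensitive (`-i`) is an assumption you may need to drop.

```shell
# count_occurrences TERM FILE.gz
# Prints the total number of matches of TERM in the decompressed file,
# counting every occurrence on a line, not just the number of matching lines.
#   gzip -dc  : decompress to standard output, leaving the .gz file untouched
#   grep -o   : print each match on its own line
#   grep -i   : ignore case (an assumption; remove for exact-case matching)
#   grep -F   : treat TERM as a fixed string, not a regular expression
#   wc -l     : count the lines, i.e. the matches
#   tr -d ' ' : strip the padding some wc implementations add
count_occurrences() {
    gzip -dc < "$2" | grep -o -i -F -- "$1" | wc -l | tr -d ' '
}
```

It would be called as `count_occurrences 'some term' Twitter_Data_1.gz`, where the term is whatever the question asks for.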
To answer this question, you will need to extract the timestamps for all tweets referring to Donald Trump. You will then need to read them into R and generate a histogram. [Hint: To read the data into R, first generate a file containing only the timestamp column as text. Then read the file into R as a CSV.] R will not recognise the strings as timestamps automatically, so you’ll need to convert them from text values using the strptime() function. Instructions on how to use the function are available here:
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html). [Note: the histogram should be plotted in the next question (Q3).]
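The first step of the hint, producing a file that contains only the timestamp column, might look like the sketch below. It assumes the decompressed file is tab-separated with one tweet per line; the timestamp column index is passed as a parameter because the actual layout is not stated here, and the function name is hypothetical.

```shell
# extract_timestamps PATTERN COLUMN FILE.gz
# Prints the timestamp field of every line that matches PATTERN.
#   gzip -dc : decompress to standard output
#   grep -i  : keep lines matching PATTERN, ignoring case
#   cut -f   : keep only the given tab-separated field (1-based index)
# Note: the pattern is matched against the whole line, so it can also hit
# fields other than the tweet text (for example a user name).
extract_timestamps() {
    gzip -dc < "$3" | grep -i -- "$1" | cut -f "$2"
}
```

For example, `extract_timestamps 'trump' 2 Twitter_Data_1.gz > trump_times.txt` writes one timestamp per line, where column 2 is purely an assumed position. The resulting file can then be read into R and converted with strptime() as described above.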
sufficiently high to see the variation over time well.]
[Hint: You’ll need to count the number of tweets by each unique author in the Twitter file, giving a file with two columns, “user” and “twitter count”, in the Bash shell. Then load them into R. This is a large file, so you can also just isolate the counts, then sort and count them to get a summary statistics file with columns “twitter count” and “number of users”.]
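Both files described in the hint can be produced with the usual sort and uniq -c idiom. The sketch below assumes a tab-separated file with one tweet per line and user names without spaces; the user column index is a parameter because the real layout is not given here, and both function names are hypothetical.

```shell
# tweets_per_user COLUMN FILE.gz
# Prints "user<TAB>twitter count", one line per unique author.
#   cut -f   : keep only the user column
#   sort     : bring identical user names together (uniq needs sorted input)
#   uniq -c  : collapse duplicates and prefix each with its count
#   awk      : swap the two fields and separate them with a tab
tweets_per_user() {
    gzip -dc < "$2" | cut -f "$1" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# count_summary
# Reads "user<TAB>twitter count" lines on standard input and prints
# "twitter count<TAB>number of users", ordered by twitter count.
count_summary() {
    cut -f 2 | sort -n | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

A possible use, with column 3 as an assumed position: `tweets_per_user 3 Twitter_Data_1.gz > user_counts.txt`, followed by `count_summary < user_counts.txt > count_summary.txt`. The second file is far smaller than the first, which makes it easier to load into R.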
Good Luck!
