Starting from:

$25

TDT4225 -  Assignment 3 - Very Large -Distributed Data Volumes -Solved

 Introduction 
 In this assignment you’ll be working in groups of (max) 3 students to solve some  database tasks using the MongoDB database ( version 4.2.15)  and coding in Python.

 Most groups have access to a virtual machine at IDI’s cluster, running Ubuntu. This  machine will have MongoDB already installed and set up for you to use.

 This assignment will look at an open dataset of trajectories, and your task is to create  collections, clean and insert data, and fetch the data to answer some questions. Your  Python program will handle this functionality. The program is inspired by the social  media workout application  Strava , where users can track activities like running,  walking, biking etc and post them online with stats about their workout.

 Along with this assignment sheet, you have been given several files:

●     Dataset - Modified dataset based on Geolife GPS Trajectory dataset (please use  this dataset, instead of the one from the website. The dataset from the website  contains some files that may give you an error).

●     DbConnector.py -  a Python class that connects to MySQL on the virtual  machine.

●     example.py -  a Python class with examples on how to create tables, insert,  fetch and delete data in MySQL.

●     requirements.txt -  a file containing some pip packages that should be used  in this assignment.

●     labeled_ids.txt -  a file containing the ID of all users who have labeled their  data. This is found in the  dataset.zip  file.

●     User-Guide-1.3.pdf   -  an official user guide to the Geolife dataset. This is  found in the  dataset.zip  file.

 This assignment is very similar to assignment 2 where you used MySQL. The dataset is  the same, and the tasks are the same. The switch from an SQL-database to a  NoSQL-database is the real challenge here.

 It is  strongly  recommended that you read through and understand the whole  assignment sheet before you start solving the tasks. There are also some tips at the  bottom of this assignment, which could be very handy! 

 Dataset - Geolife GPS Trajectory dataset 
 Geolife GPS Trajectory dataset  is an open source dataset collected by Microsoft from 2007-2011          . This dataset recorded a broad range of users’ outdoor movements,  including not only life routines like going home and going to work but also some  entertainments and sports activities, such as shopping, sightseeing, dining, hiking, and  cycling. The user data is mainly from Beijing, China, but also from the US and Europe.

 The dataset is provided in the  dataset.zip  and has some differences from the one  online, for the purpose of this assignment. It contains 182 users that have tracked 18 669             activities in total. Each activity contains a number of GPS-points, which we call  TrackPoints. In total there are over 24 million trackpoints in this dataset.

 In the  dataset.zip  you will find the official user guide for the dataset. Be sure to  read through this before you start the task. Here is also a short summary of the  dataset,  understanding this is crucial  for completing the assignment:

 The folder named Data contains all the 182 users, labeled from “000” to “181”. Each  user has a Trajectory-folder where each activity is stored as a .plt file. Each .plt file  contains several trackpoints.

 Line 1…6 in the .plt files are useless in this dataset, and can be ignored. Trackpoints are  described in following lines, one for each line:

 Field 1: Latitude in decimal degrees.

 Field 2: Longitude in decimal degrees.

 Field 3: All set to 0 for this dataset, i.e. you don’t need this field.

 Field 4: Altitude in feet (-777 if not valid).

 Field 5: Date - number of days (with fractional part) that have passed since  12/30/1899.

 Field 6: Date as a string.

 Field 7: Time as a string.

 Note that field 5 and field 6&7 represent the same date/time in this dataset. You may  use either of them.

 Example:

 39.906631, 116.385564, 0, 492, 40097.5864583333, 2009-10-11, 14:04:30

 39.906554, 116.385625, 0, 492, 40097.5865162037, 2009-10-11, 14:04:35

 Some users have also labeled their data with transportation modes. Possible  transportation modes are:  walk, bike, bus, taxi, car, subway, train, airplane, boat, run and  motorcycle . We have provided you with the ids of the users that have labeled their data  in  labeled_ids.txt , which will be used for the tasks. The labels contain start time      and end time (both date and time) along with the transportation mode for each activity.

 Example:

 Start Time                               End Time                                 Transportation Mode 

 2008/04/02 11:24:21               2008/04/02 11:50:45            bus

 2008/04/03 01:07:03               2008/04/03 11:31:55            train

 MongoDB - NoSQL 
 A quick note on MongoDB and NoSQL. There is some terminology that is different  from the regular relational-database. MongoDB stores data in documents. Documents  are JSON documents based on the JSON specification.  An example of a JSON document would be as follows:

 

 Notice that documents are not just key-value pairs but can include arrays and  subdocuments. The data itself can be different data types like geospatial, decimal,  string etc. This does not have to be pre-defined, neither does two documents in the  same collection have to include the same fields.

 A  database  in MongoDB is the container for  collections . A  collection  is a container for  documents . This grouping is similar to relational databases and is pictured below:

 

 Source: Robert Walters , ‘Getting Started with Python and MongoDB’,  2017.

 Url:  https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb 

 Collections 
 As mentioned earlier, collections in MongoDB are not as strict as in relational  databases. How you decide to organize your collections and documents is up to you.  The data we are interested in is the same as from Assignment 2 and is displayed in the  figure below. Notice that foreign keys are not really a thing in MongoDB, but it is  optional how you want to reference things in each document.

 Because MongoDB offers a lot of ways to handle this, we encourage you to investigate  what solution that might suit the purpose of this assignment best.

 Here’s three great articles that we recommend reading before making a decision:

 MongoDB Schema Design - Part 1 (5      min reading time)

 MongoDB Schema Design - Part 2 (5      min reading time)

 MongoDB Schema Design - Part 3 (5      min reading time)

 Remember what you did in Assignment 2 and what approach that would fit the tasks.

 User 

 _id - string  has_labels - boolean
 Activity 

 _id - int

 transportation_mode - string  start_date_time - datetime  end_date_time - datetime
 TrackPoint 

 _id  - int  lat - double  lon - double  altitude - int  date_days - double  date_time - datetime
 Optional reference  to Activity
 Optional reference to User  and/or TrackPoint
 Optional reference to

 Activity
 Notice how the id is written as “_id”. In MongoDB, there will always be a “_id” as an  ObjectID if the field is not specified (read more  here) . E.g., creating a document like  this:

 Person = {

 id: 1,

 name: ‘Name Namesen’

 }

 would result in a Person-document like this when inserted to the database:

 Person = {

 _id: ObjectID(‘Some12byteNumber’)  id: 1,

 name: ‘Name Namesen’

 }

 You have to manually override the “_id”-field if you want to set a defined ID (e.g. ‘014’  for Users).

 Datetime should be consistent throughout your collections, e.g. in the format  YYYY-MM-DD HH:MM:SS.

 Setup database 
 These steps will guide you through how to connect to the virtual machine from your  computer and create a MongoDB-database user with privileges.

1.      First, you need to use NTNU’s VPN to access the virtual machines. Log onto  NTNU-VPN with your Feide user (check out  this link  if you have not done this  before).

2.      Open your terminal and enter the following command:

 ssh your_username@tdt4225-xx.idi.ntnu.no  where

 “your_username”  is the Feide-username and  “xx”  is your group number for this  project (01, 02, 03… etc).

( If this message pops up, enter  yes :

 The authenticity of host 'tdt4225-xx.idi.ntnu.no ( xxx.xxx.xxx.xxx)' can't be established. 

 ECDSA key fingerprint is … 

 Are you sure you want to continue connecting (yes/no)?  yes )  

 You will then be asked to enter your password, use the Feide-password here.

 If the password is correct, you have now successfully logged in!

3.    When using MongoDB, we have to have a server running in one terminal  window. First, type  sudo mkdir -p /data/db , then type       sudo mongod -- repair ( both only have to be done the first time you start your server). If  prompted to provide a password, type in your Feide password. Once that is  completed type  sudo mongod --bind_ip_all . If successful, you have now     a running MongoDB-server (the bind_ip_all flag opens for connections from any  IP-address, but don’t worry, we’ll use authentication). This server must be  running whenever you are working with the task. Only one of the group  members has to run the server at a time.

 N.B.  if you get error messages running  sudo mongod --repair  or  sudo  mongod --bind_ip_all  it is likely that an instance of mongod is already  running and that you get a problem with unavailable port(s). In this case, try  the following chain of commands: 

 sudo service mongod restart  sudo systemctl stop mongod  sudo mongod --repair  sudo mongod --bind_ip_all 

 Continue to step 4 

4.    Open a new terminal window and ssh into the virtual machine. Do not close the  server-window you have running. To open a client-window for MongoDB type  sudo mongo . Now, you have your MongoDB-server running in one terminal    window, and a MongoDB-client running in the other. If you check your server,  the client-login should’ve been logged there.

5.    Now we want to create an admin MongoDB-user. The MongoDB-client accepts  shell commands written in JavaScript.

 Start by creating an admin-user by typing the following commands:

                >  use admin , this chooses the admin database      

>  db.createUser({  user: ‘your_username’,  pwd: ‘your_password’, 

                                roles: [‘root’]      

 }) , this creates a user with the root-role. Change your_username and   your_password with desired names. The command can be written in one line, it               is spread across several lines here for readability. Here is an example where  your_username=adminuser and pwd=admin123:

>  db.createUser({  user: ‘adminuser’, 

 pwd: ‘admin123’,  roles: [‘root’] 

 }) 

 Now type >  show users;  and find your newly created admin user.

6.      Now let’s create the user and the database that the group will use during this  assignment. To create a database you simply type >  use my_db;  where  my_db is your chosen name. If the database does not exist, it will create it for you. Now do the same command as in Step 5, but with a read/write access-role:      

>  db.createUser({  user: ‘your_username’,  pwd: ‘your_password’,  roles: [‘readWrite’] 

})  

 Now, your user will have read and write access to the my_db database that you  created. Type >  show users;  and find your newly created user, and check  that the role is correct and attached to the correct database (my_db not admin).  You can also type , but the database will not show until you have created  collections and inserted data into it. Remember to run “use my_db” before  creating the user, so that the user gets access to the correct database.

7.      You can use this client window to create collections etc with the shell          commands, but from now on the Python program will be our MongoDB-client.  Remember to keep the MongoDB-server running.

 Setup Python 
We will be using a         MongoDB connector to Python (PyMongo)  in this assignment. If you for some reason would like to complete the assignment in another language, you are            allowed to do so. ( NB! We will  not  provide support or documentation for other languages,  so we strongly encourage you to use Python.) 

 With the files provided in this assignment, you will find a requirements.txt,     DbConnector.py    and  example.py , among others.     

1.       To set up the required pip-packages for this assignment, simply run             pip install -r requirements.txt  while you are in the directory with  the requirements-file. This will install the connector, along with  tabulate ( used  to print pretty tables in python) and  haversine ( used to calculate distance  between two coordinates).

2.      Now, open  DbConnector.py  and look at the code. There will be a TODO there  to set up the database correctly with the settings from the database setup.

 Update the values accordingly:

 Host =  tdt4225-xx.idi.ntnu.no, where xx is your group number

 Database =  the name you provided for your database in step 6 of the database  setup (my_db from the example)

 User =  the username provided in step 6, (not the admin user)

 Password =  the password provided in step 6

3.      Open  example.py , the file has a main method that creates a connection to the      database from DbConnector, creates a table named “Person”, inserts some  names in the database, displays the data, and then drops the table.  Try to run the code. If everything is set up correctly, you will establish a  connection to the database-server and insert/delete data. You can use this file  for inspiration when solving the tasks in this assignment.

 Tasks 
 The tasks are divided into three parts: Part 1 will focus on cleaning and inserting the  data into defined tables. Part 2 will focus on writing queries to the database to gain  knowledge of the dataset. Part 3 will focus on writing a report where you discuss your  answers.

 We recommend looking at the  documentation  for the MongoDB-Python connector  before you start coding.

 Part 1
 In this task you’ll clean and insert the Geolife dataset into your own MongoDB  database, to  be able to solve the questions in Part 2. In the section “Geolife GPS  Trajectory dataset“ above, you’ll find all the information you need about the dataset  you are handed, and in the section “Collection” you’ll find some nice articles for how to  structure the dataset in your MongoDB database.  Please  discuss in your report  how  you structure your collections and documents.

 Write a Python program that does the following:

1.      Connects to the MongoDB server on your Ubuntu virtual machine.

2.      Creates and defines the collections User, Activity and TrackPoint

3.      Inserts the data from the Geolife dataset into your MongoDB database

 ○  Here, we require a bit of data cleaning and integration. Study the dataset  closely  to understand which data goes where.

 ○  Iterating through the dataset in the directories can be done using  os.walk  method, but you are free to use other methods if you find them better for  your solution.

 ○  When matching transportation_mode from the  labels.txt  files to the  activities, we only consider exact matches on starttime  and end time, i.e.  if you find a match in start time, but not in end time, or vice versa, it  should not be included. Additionally, there are some labeled activities in  labels.txt  that will not have a match amongst the activities.

 ○  Important!  Sometimes it is wise to limit the dataset we’re working with. 24           million TrackPoints is a lot. Therefore we only want you to insert activities that have          fewer than or exactly 2500 trackpoints  in them, i.e.  when inserting activities into the database, check that the size of the  .plt-files do not exceed 2500 lines (excluding the headers, of course).

 When you insert TrackPoints, the same rule applies.

 ○  Inserting the trackpoints may take a while! Instead of inserting one document of trackpoints at a time, find out if there is a way to insert    bulks of data instead.

 Part 2
 Some of the tasks can be answered using MongoDB-queries only, while some might  require both queries and Python code to manipulate the data correctly.  Use  pprint  to  pretty-print the data output from MongoDB.

 Answer the following questions by writing a Python program using MongoDB-queries

( like in  example.py ):  

1.      How many users, activities and trackpoints are there in the dataset (after it is  inserted into the database).

2.      Find the average, minimum and maximum number of activities per user.

3.      Find the top 10 users with the highest number of activities.  

4.      Find the number of users that have started the activity in one day and ended  the activity the next day. 

5.      Find activities that are registered multiple times. You should find the query  even if you get zero results. 

6.      An infected person has been at position (lat, lon) (39.97548, 116.33031) at  time ‘2008-08-24 15:38:00’.  Find the user_id(s) which have been close to this  person in time and space (pandemic  tracking). Close is defined as the same  minute (60 seconds) and space (100 meters). (This is a simplification of the 

 “unsolvable” problem given i exercise 2). 

7.      Find all users that have never taken a taxi.     

8.      Find all types of transportation modes and count how many distinct users that  have used the different transportation modes. Do not count the rows where the  transportation mode is null .

9.      a) Find the year and month with the most activities.  

 b)  Which user had the most activities this year and month, and how many  recorded hours do they have? Do they have more hours recorded than the user  with the second most activities?

10.  Find the total distance (in km)  walked  in 2008, by user with id=112.

11.  Find the top 20 users who have gained the most altitude  meters .

1.      Output should be a table with (id, total meters gained per user).

2.      Remember that some altitude-values are invalid

3.      Tip:  (tpn.altitude-tpn-1.altitude),  tpn.altitude >tpn-1.altitude

12.  Find all users who have invalid activities, and the number of invalid activities per  user  

 1.  An invalid activity is defined as an activity with consecutive trackpoints  where the timestamps deviate with at least 5 minutes.  

 Part 3
 Write a short report (see  report-template.docx  in the

 tdt4225-assignment2.zip)   where you will display and discuss your results from  the tasks. Include screenshots from both part 1 (showing top 10 rows from all of your  tables is sufficient) and part 2 (all the results to each task).

 Hand in both the report (as PDF) and your code to BlackBoard within  Oct 22, at 16:00 . 

 Tips 
●     Using the MongoDB terminal on your Ubuntu machine could be useful for  checking simple queries without having to use Python and the connector.

○     Be sure to have a server running,  sudo mongod --bind_ip_all 

 ○  Open the MongoDB client terminal by typing  sudo mongo , log in with       Feide-password and type use name_of_db     to do this.

●     Using dictionaries/hashmaps are faster for lookups than lists/arrays in your  Python code

●     When extracting transportation_mode from the  labels.txt  files, remember that you only have to consider the user-ids found in the     labeled_ids.txt     , as    they are the only users where transportation_mode may not    be null.

●     Start_time and end_time for activities can be found by looking at the date and  time for the first and last trackpoint in each .plt-file

●     Use some time to consider the different choices for how to store your data.

○     Embedded documents, one-way referencing  or  to -way referencing.

●     Remember to insert  data with  the right  type. Insert integer as int,  date as           some dateformat, etc..      

○     MongoDB support e.g ISODate

●     As stated, using  Haversine  for calculating distance is recommended  ●  Comparison  between SQL and  MongoDB.

●     Comparison  between SQL-functions and MongoBD-functions.

●     Aggregation  and the  pipeline stages for this  are worth checking out.

●     Optional - if you want to visualize the data in the dataset, you can get inspired  here .

More products