$30
Purpose
The purpose of this assignment is for you to:
• Increase your proficiency in C programming, your dexterity with dynamic memory allocation and your understanding of more advanced data structures (K-D trees).
• Increase your understanding of how computational complexity can affect the performance of an algorithm by conducting orderly experiments with your program and comparing the results of your experimentation with theory.
• Increase your proficiency in using UNIX utilities.
Background
Interactive navigation tools like Google Maps and GPS navigation systems are commonplace, but how do these systems actually work? Spatial datasets can be very large: OpenStreetMap, an opensource global mapping dataset, contains over 10 million points of interest. Various algorithms and data structures have been developed to allow users to quickly search and navigate these large spatial datasets.
Your task
In this assignment, you will create a K-D tree to support interactive map functionality for the City of Melbourne Census of Land Use and Employment (CLUE) dataset. A user will be able to query locations to find nearby businesses.
In Assignment 1, you wrote code to read the census data from a file, insert the records as nodes in a linked list, and allow users to search for records by trading name. In this assignment, you should modify your code to insert records into a K-D tree and allow users to search by x,y coordinates. You can use your own Assignment 1 code for this assignment or the sample solution we have provided.
Dataset
This assignment uses the same dataset as Assignment 1, which is a subset of the Business Establishment Trading Name and Industry Classification 2018 dataset, accessed from:
https://data.melbourne.vic.gov.au/Business/Business-establishment-trading-nameand-industry-c/vesm-c7r2
The <x coordinate> and <y coordinate> columns should be used as the 2-D key to store and query records. The other columns can be treated as the associated <data>.
Deliverable 1 - Source code
Stage 1 - What’s here? (7 marks)
In stage 1, you will implement the basic functionality for an interactive map that allows a user to click on locations and retrieve data about the nearest point of interest. Instead of clicks, your code will accept (x,y) pairs from stdin, find the closest business establishment to that location in the dataset, and output the information about that establishment to a file.
Your Makefile should produce an executable program called map1. This program should take two command line arguments: (1) the name of the data file used to build the tree, and (2) the name of an output file.
Your map1 program should:
• Construct a K-D tree to store the information contained in the data file specified in the command line argument. Each record (row) should be stored in a separate Node.
• Handle duplicates (businesses located at exactly the same x,y coordinates) by chaining them together in a linked list connected to a single Node in the K-D tree. Exact duplicate locations should not be added as separate Nodes in the tree.
• Accept locations queries from stdin and search the tree for the location nearest to the query point. The record(s) of the business(s) at this location should be printed to the output file. If there are multiple businesses at this location, all of them must be included in the output.
• In addition to outputting the record(s) to the output file, the number of key comparisons performed during the search should be written to stdout.
For testing, it may be convenient to create a file of keys to be searched, one per line, and redirect the input from this file. Use the UNIX operator < to redirect input from a file.
Example input
• map1 datafile outputfile then type in queries; or
• map1 datafile outputfile < queryfile
Queries should be entered as x,y pairs separated by a space: x y
Example output
This is an example of what might be output to the file after two queries:
144.959522 -37.800095 −− > Census year: 2018 || Block ID: 240 || Property ID: 104466 || Base property ID: 104466 || CLUE small area: Carlton || Trading Name: The Bio 21 Cluster || Industry (ANZSIC4) code: 6910 || Industry (ANZSIC4) description: Scientific Research Services || x coordinate: 144.9593
|| y coordinate: -37.8002 || Location: (-37.80023252, 144.9592806) ||
144.959522 -37.800095 −− > Census year: 2018 || Block ID: 240 || Property ID: 104466 || Base property ID: 104466 || CLUE small area: Carlton || Trading Name: The University of Melbourne || Industry
(ANZSIC4) code: 8102 || Industry (ANZSIC4) description: Higher Education || x coordinate: 144.9593
|| y coordinate: -37.8002 || Location: (-37.80023252, 144.9592806) ||
144.959522 -37.800095 −− > Census year: 2018 || Block ID: 240 || Property ID: 104466 || Base property ID: 104466 || CLUE small area: Carlton || Trading Name: The Co-Op Bookstore || Industry (ANZSIC4) code: 4244 || Industry (ANZSIC4) description: Newspaper and Book Retailing || x coordinate: 144.9593
|| y coordinate: -37.8002 || Location: (-37.80023252, 144.9592806) ||
144.959522 -37.800095 −− > Census year: 2018 || Block ID: 240 || Property ID: 104466 || Base property ID: 104466 || CLUE small area: Carlton || Trading Name: Baretto Cafe || Industry (ANZSIC4) code: 4511 || Industry (ANZSIC4) description: Cafes and Restaurants || x coordinate: 144.9593 || y coordinate: -37.8002 || Location: (-37.80023252, 144.9592806) ||
0 0 −− > Census year: 2018 || Block ID: 571 || Property ID: 602254 || Base property ID: 602254 ||
CLUE small area: Kensington || Trading Name: Shine Australia || Industry (ANZSIC4) code: 5511 || Industry
(ANZSIC4) description: Motion Picture and Video Production || x coordinate: 144.9081 || y coordinate:
-37.7851 || Location: (-37.78505447, 144.9081466) ||
This is an example of what might be output to stdout:
144.959522 -37.800095 −− > 4815
0 0 −− > 359
Note that the key is output to both the file and to stdout, for identification purposes. Also note that the number of comparisons is only output at the end of the search, so there is only one number for key comparisons per key, even when multiple records have been located for that key.
The format need not be exactly as above; variations in whitespace/tabs are permitted. The number of comparisons above has been made up, do not take it as an example of a correct execution!
Stage 2 - Radius search
In stage 2, you will code a function that allows the user to find all of the business establishments within some distance of a query point. Your code will accept (x,y,radius) triplets from stdin, find all business establishments within the requested radius of the x,y point, and output the information about those establishments to a file.
Your Makefile should produce an executable program called map2. This program should take two command line arguments: (1) the name of the data file used to build the tree, and (2) the name of an output file.
Your map2 program should:
• Construct a K-D tree to store the information contained in the data file specified in the command line argument, exactly as in Stage 1. Note that you can (and should!) reuse your code from Stage 1 to do this step.
• Accept x,y,radius queries from stdin and search the tree for all locations within the requested radius of the x,y point. These records should be printed to the output file. When there are multiple businesses at the same location, all of these records should be included in the output.
• If no business establishments are located with the query radius, your code must output the word NOTFOUND.
• In addition to outputting the above data to the output file, the number of key comparisons performed during the search should be written to stdout.
Example input
• map2 datafile outputfile then type in queries; or
• map2 datafile outputfile < queryfile
Queries should be entered as x,y,radius triplets separated by spaces: x y r
Example output
This is an example of what might be output to the file after two queries:
144.967058 -37.817313 0.0005 −− > Census year: 2018 || Block ID: 15 || Property ID: 109260
|| Base property ID: 109260 || CLUE small area: Melbourne (CBD) || Trading Name: Hungry Jack’s Pty Ltd
|| Industry (ANZSIC4) code: 4511 || Industry (ANZSIC4) description: Cafes and Restaurants || x coordinate:
144.9668 || y coordinate: -37.8171 || Location: (-37.81711586, 144.9668418) ||
144.967058 -37.817313 0.0005 −− > Census year: 2018 || Block ID: 15 || Property ID: 109258
|| Base property ID: 109258 || CLUE small area: Melbourne (CBD) || Trading Name: McDonalds || Industry
(ANZSIC4) code: 4511 || Industry (ANZSIC4) description: Cafes and Restaurants || x coordinate: 144.9669
|| y coordinate: -37.8172 || Location: (-37.81724484, 144.9669126) ||
144.967058 -37.817313 0.0005 −− > Census year: 2018 || Block ID: 15 || Property ID: 104015
|| Base property ID: 104015 || CLUE small area: Melbourne (CBD) || Trading Name: Dangerfield || Industry
(ANZSIC4) code: 4251 || Industry (ANZSIC4) description: Clothing Retailing || x coordinate: 144.9668
|| y coordinate: -37.8174 || Location: (-37.81741866, 144.9668092) ||
144.967058 -37.817313 0.0005 −− > Census year: 2018 || Block ID: 15 || Property ID: 109257
|| Base property ID: 109257 || CLUE small area: Melbourne (CBD) || Trading Name: Young & Jacksons ||
Industry (ANZSIC4) code: 4520 || Industry (ANZSIC4) description: Pubs, Taverns and Bars || x coordinate:
144.967 || y coordinate: -37.8174 || Location: (-37.81735593, 144.967023) ||
144.967058 -37.817313 0.0005 −− > Census year: 2018 || Block ID: 15 || Property ID: 109259
|| Base property ID: 109259 || CLUE small area: Melbourne (CBD) || Trading Name: Souvenir Collection
|| Industry (ANZSIC4) code: 4310 || Industry (ANZSIC4) description: Non-Store Retailing || x coordinate:
144.9669 || y coordinate: -37.8172 || Location: (-37.8171602, 144.9668989) ||
144.963089 -37.799092 0.0005 −− > NOTFOUND
This is an example of what might be output to stdout:
144.967058 -37.817313 0.0005 −− > 678
144.963089 -37.799092 0.0005 −− > 1356
Note that the key is output to both the file and to stdout, for identification purposes. Also note that the number of comparisons is only output at the end of the search, so there is only one number for key comparisons per key, even when multiple records have been located for that key.
The format need not be exactly as above; variations in whitespace/tabs are permitted. The number of comparisons above has been made up, do not take it as an example of a correct execution!
• Discussion
Notes on K-D Trees
K-D trees are an extension of binary search trees for k-dimensional keys. When searching, each layer of the tree checks a different dimension of the key. In the 2-D case, the root of the tree should consider only the first element of the key (x) and split left or right depending on whether the first element of the search key is less than or greater than the first element of the root’s key. The second layer of nodes should consider only the second element of the key (y). The third layer should consider the first element, etc. To illustrate, consider how the key (6,0) would be inserted into the K-D tree shown below:
The root node compares the first elements of the keys and since 6 > 3, it directs the search right. The next node compares the second elements of the keys and since 0 < 2, it directs the search left. The new node (6,0) is then inserted as a left child.
Note that it is possible for a key to match an existing node on either x or y but not both. These partial matches are not duplicates and should be inserted as new nodes in the tree. For example, suppose the next node inserted was (2,8). This would go left from the root because 2 < 3 and be compared to (1,8). The second element of the keys match: 8 = 8, but the keys are not exactly the same (2,8) 6= (1,8). So (2,8) could be inserted as either a left or right child of (1,8) (for consistency with our testing code, please put equal value keys to the right).
Searching for the nearest point to a query
To find the nearest node to a query point, you will need to search through the tree, keeping track of the closest match found so far, and the squared distance d to that closest match. Start at the root of the K-D tree. At each node, check the distance between the current node and the query location
D = p(nodex− queryx)2+(nodey − queryy)2. If this distance is lower than the current d, replace the current “closest” match with the current node and set d = D. Then compare the current node’s key to the query. (As in insertion, only one element of the node’s key should be compared to the query – for example, at the root node, compare x values only, and in the first layer nodes compare y values only.) Proceed down one or both branches of the tree depending on the distance between the current node’s key and the query:
• If the current node’s key is > d from the query, proceed down either the left or right branch of the tree depending on whether the query is less or greater than the current node’s key.
• If the current node’s key is ≤d from the query, proceed down both branches of the tree.
These two cases are illustrated in the images below. In this example, M is a node of the K-D tree which splits the data along the x dimension (indicated by the black line) and Q is the query location. d is the distance to the current “closest” match. In the first case (a), the x distance between M and Q is larger than d, so no points left of M could be closer to Q than the current “closest” match. There is no need to search the left children of M. In the second case, the x distance between M and Q is less than d, so there is a chance that there could be a point left of M that is closer than the current match. In order to guarantee that we find the closest possible point to Q, it is necessary to search both the left and right children of M.
(a) Distance between M and Q is (b) Distance between M and Q is greater than d less than d
Searching for points in a radius
Searching for points within a radius of a query is similar to searching for the nearest point. Start at the root of the K-D tree, and at each node, check the distance between the current node and the query location. If this distance is within the requested radius, output the information related to this node and proceed down both branches to look for further matches. If the current node is outside the requested radius, check the distance between the current node’s key and the query coordinates as above, and proceed down the left, right, or both branches of the tree accordingly.
Conventions and recommendations
For easier testing and debugging, we ask that you follow these conventions:
• The root of the tree should check the x coordinate of the key.
• Equal keys (meaning, keys that match the current node on the portion of the key it checks, but differ from the current node on the other value) should be grouped with the keys greater than the current node, so each node splits keys into values < and ≥ the node’s key.
The x and y coordinates should be stored as double type.
Math functions like sqrt() and pow() can be found in the <math.h> library.
Until you are confident your code is working, you might want to test the radius search function with small radius values (e.g., around 0.0005).