$25
You have been contracted by the Department of Fire and Emergency Services to develop an algorithm they can use to place temporary fire stations in the case of extensive bushfires. They would provide you with a map of hotspots affected by a bushfire and a number of temporary fire stations they are able to support at that time.
The temporary fire stations will each be responsible several hotspots, with no double-handling of hotspots (i.e., each hotspot will be handled by only one temporary fire station).
The fire department have two important considerations:
• On one hand, they want to ensure that hotspots handled by different fire stations are as far from each other as possible. This is important as fire can easily spread and in general it is not possible to fully synchronise the efforts of all neighbouring fire stations.
• On the other hand, it is also important that hotspots handled by the same fire station are as close together as possible so that fire brigade can move from one hotspot to another.
Therefore your program should suggest the number of fire stations to maximize the ratio of distances between hotspots handled by different fire stations and those handled by the same station.
Your task is to solve this problem with an efficient algorithm for solving a clustering problem.
CLUSTERING BACKGROUND
The clustering problem is defined as follows:
Given a set of items, 𝑎1, 𝑎2, … , 𝑎𝑛, and a set of distances between those items,
𝑑𝑖𝑗 = 𝑑(𝑎𝑖, 𝑎𝑗) where 𝑖, 𝑗
then find the optimal clustering of the points.
We explain/define the important concepts as follows.
• Distance between two items is a measure of dissimilarity between the two items. If the items are points in a two-dimensional plane, distance can be defined as the Euclidean distance.
• Inter-clustering distance (InterCD) is the minimum distance between any two items belonging to different clusters. Formally:
InterCD = min{𝑑(𝑎𝑖, 𝑎𝑗) | 𝑖, 𝑗 and 𝑎𝑖 is in a different cluster to 𝑎𝑗}
• Similarly, intra-clustering distance (IntraCD) is the maximum distance between any two items belonging to the same cluster. Formally:
IntraCD = max{𝑑(𝑎𝑖, 𝑎𝑗) | 𝑖, 𝑗 ∈ [1, 𝑛] and 𝑎𝑖 is in the same cluster as 𝑎𝑗}
• A clustering is optimal if it maximizes the inter-clustering distance for the given number k of clusters. (Note that, in general, there are many different ways to define an optimal clustering, and we are using this particular one).
• The centroid of the cluster is the “average” point in the cluster. The x-coordinate of the centroid is the average of the 𝑥-coordinates of all the points in the cluster; similarly, the 𝑦-coordinate of the centroid is the average of the 𝑦-coordinates of all the points in the cluster.
TASKS
Task 1 – Individual and pairs assignment: Use Kruskal’s algorithm for finding a minimum spanning tree to find an optimal clustering (clustering that maximizes the inter-clustering distance) of the given set of n points in a two-dimensional plane, for the given number k of clusters.
Task 2 – Pairs assignment only: Among the optimal clusterings obtained by Kruskal’s algorithm for different values of 𝑘 (where 2 ≤ 𝑘 ≤ 𝑛 − 1), find the one that maximizes the ratio InterCD between
IntraCD the inter-clustering distance and intra-clustering distance.
INPUT
1. Your program must be executable from the command line. It must accept input as command line arguments (a.k.a., parameters), and write output to standard output (e.g., cout in C++, System.out in Java). Errors should be printed to standard error (e.g., cerr in C++, System.error in Java).
2. Your program executable (or your main Java class) must be called kcluster. The first command line arguments it accepts must be a path to the input file, and the second must be the number of temporary fire stations.
For example, to run the program using a file named “input.txt” in the current folder, and 2 temporary fire stations, we would run:
kcluster ./input.txt 2 or
java kcluster ./input.txt 2
3. If you are working in pairs, then the program must allow for the 2nd argument to be the word “auto”. In this case the program will automatically calculate number of clusters so as to maximize the inter-clustering to intra-clustering distance ratio.
4. The input file will be a comma separated value file, with the first column being the ID of the hotspot, the 2nd column is the 𝑥-coordinate of the hotspot, and the 3rd column is the 𝑦-coordinate of the hotspot. The ID will be a unique integer, and the 𝑥- and 𝑦-coordinates will be floating point numbers.
For example, an input file might look like:
1,1.0,1.0
2,2.0,2.0
3,3.0,5.0
4,7.0,8.0
5,8.0,7.0
5. The program should handle obvious error conditions (e.g., incorrectly formatted or empty input files, negative numbers of temporary fire stations, etc) by printing an error message to standard error and immediately stopping execution.
OUTPUT
The output must look like the following example.
INDIVIDUAL ASSIGNMENT:
Suppose the program is run for an input file named “input.txt”, requesting two stations by running java kcluster ./input.txt 2 then the output would be:
Hello and welcome to Kruskal’s Clustering!
The weighted graph of hotspots:
0 1.41 4.47 9.22 9.22
1.41 0 3.16 7.81 7.81
4.47 3.16 0 5 5.39
9.22 7.81 5 0 1.41
9.22 7.81 5.39 1.41 0
There are 5 hotspots.
You have requested 2 temporary fire stations.
Station 1:
Coordinates: (2.00, 2.67)
Hotspots: {1,2,3}
Station 2:
Coordinates: (7.50, 7.50)
Hotspots: {4,5}
Inter-clustering distance: 5.00
Thank you for using Kruskal’s Clustering. Bye.
PAIRS ASSIGNMENT:
The assignment should look the same as the individual assignment if the number of requested temporary fire stations is a number.
In addition: if the number of fire stations is given as auto, then the output should be as below. For example, suppose the program is run for an input file named “input.txt”, requesting automatic station calculation by running java kcluster ./input.txt auto then the output would be:
There are 5 hotspots.
Automatically calculating number of temporary fire stations.
The weighted graph of hotspots:
0 1.41 4.47 9.22 9.22
1.41 0 3.16 7.81 7.81
4.47 3.16 0 5 5.39
9.22 7.81 5 0 1.41
9.22 7.81 5.39 1.41 0
Number of stations: 3
Station 1:
Coordinates: (1.50, 1.50)
Hotspots: {1,2}
Station 2:
Coordinates: (3.00,5.00)
Hotspots: {3}
Station 3:
Coordinates: (7.50, 7.50)
Hotspots: {4,5}
InterCD/IntraCD = 2.24
Thank you for using Kruskal’s Clustering. Bye.