
 Data Science (ITE4005) 
 Programming Assignment #3


1. Environment
- OS: Windows, Mac OS, or Linux
- Languages: C, C++, C#, Java, or Python (any version is OK)

 

2. Goal: Perform clustering on a given data set by using DBSCAN. 
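
To make the goal concrete, here is a minimal sketch of one common formulation of DBSCAN in Python. It is not the reference solution. It assumes Euclidean distance, a brute-force neighbor search, and that a point counts as its own neighbor. The names `dbscan` and `region_query` and the use of -1 as the noise label are illustrative choices, not part of the assignment.

```python
import math
from collections import deque

NOISE = -1          # label for points that belong to no cluster
UNVISITED = None    # label for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Cluster 2-D points; returns one label per point (-1 means noise)."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE           # may later be relabeled as a border point
            continue
        labels[i] = cluster_id          # i is a core point: start a new cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster_id += 1
    return labels
```

The brute-force search makes this quadratic in the number of points. A spatial index would be the usual way to speed it up if the input files turn out to be large.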

 

3. Requirements
The program must meet the following requirements:

- Execution file name: clustering.exe
  - Execute the program with four arguments: input data file name, n, Eps, and MinPts
    - Three input data files will be provided: ‘input1.txt’, ‘input2.txt’, ‘input3.txt’
    - n: number of clusters for the corresponding input data
    - Eps: maximum radius of the neighborhood
    - MinPts: minimum number of points in an Eps-neighborhood of a given point
    - We suggest the following parameters (n, Eps, MinPts) for each input data file:
      - For ‘input1.txt’, n=8, Eps=15, MinPts=22
      - For ‘input2.txt’, n=5, Eps=2, MinPts=7
      - For ‘input3.txt’, n=4, Eps=5, MinPts=5
  - Example:
    - Input data file name = ‘input1.txt’, n = 8, Eps = 15, MinPts = 22
- File format for an input data file

[object_id_1]\t[x_coordinate]\t[y_coordinate]\n 

[object_id_2]\t[x_coordinate]\t[y_coordinate]\n 

[object_id_3]\t[x_coordinate]\t[y_coordinate]\n

[object_id_4]\t[x_coordinate]\t[y_coordinate]\n 

... 

  - Row: information about one object
    - [object_id_i]: identifier of the i-th object
    - [x_coordinate], [y_coordinate]: the location of the corresponding object in 2-dimensional space
  - Example:

Figure 1. An example of input data.
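
The four command-line arguments and the tab-separated input format described above could be handled roughly as follows. This is a sketch under assumptions: the helper names `parse_args` and `read_objects` are hypothetical, Eps is parsed as a float, object ids are kept as strings, and rows are split on any whitespace so that stray spaces do not break parsing.

```python
def parse_args(argv):
    """Parse [input file, n, Eps, MinPts] from a list of argument strings."""
    if len(argv) != 4:
        raise ValueError("usage: clustering <input file> <n> <Eps> <MinPts>")
    path, n, eps, min_pts = argv
    return path, int(n), float(eps), int(min_pts)


def read_objects(path):
    """Read rows of object_id, x, y; returns (ids, points)."""
    ids, points = [], []
    with open(path) as f:
        for line in f:
            fields = line.split()       # assumption: split on any whitespace
            if len(fields) != 3:
                continue                # skip blank or malformed rows
            ids.append(fields[0])
            points.append((float(fields[1]), float(fields[2])))
    return ids, points
```

In a script, `parse_args(sys.argv[1:])` would be called first, and the returned path passed to `read_objects`.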

- Output files
  - You must write n output files for each input data file
    - (Optional) If your algorithm finds m clusters for an input data file and m is greater than n (n = the number of clusters given), you can remove (m-n) clusters based on the number of objects in each cluster. For example, you can remove the (m-n) smallest clusters.
    - You can remove outliers. In other words, you don't need to assign an outlier to any cluster.
  - File format for the output of ‘input#.txt’
    - ‘input#_cluster_0.txt’

[object_id]\n

[object_id]\n

...

    - ‘input#_cluster_1.txt’

[object_id]\n

[object_id]\n

...

    - ‘input#_cluster_n-1.txt’

[object_id]\n

[object_id]\n

...

  - ‘input#_cluster_i.txt’ should contain all the ids belonging to cluster i that were obtained by your algorithm
  - Output files must follow the naming scheme above
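
The output rules above (drop outliers, keep at most n clusters by removing the smallest, one id per line, fixed file names) can be sketched like this. The function name `write_clusters`, its `out_dir` parameter, and the convention that label -1 marks an outlier are assumptions made for illustration.

```python
import os
from collections import defaultdict


def write_clusters(input_path, ids, labels, n, out_dir="."):
    """Keep the n largest clusters and write one id per line to each file."""
    clusters = defaultdict(list)
    for obj_id, label in zip(ids, labels):
        if label != -1:                 # assumption: -1 marks an outlier
            clusters[label].append(obj_id)
    # Sort by size, largest first, and drop the (m - n) smallest clusters.
    kept = sorted(clusters.values(), key=len, reverse=True)[:n]
    stem = os.path.splitext(os.path.basename(input_path))[0]  # e.g. 'input1'
    paths = []
    for i, members in enumerate(kept):
        path = os.path.join(out_dir, f"{stem}_cluster_{i}.txt")
        with open(path, "w") as f:
            f.writelines(f"{obj_id}\n" for obj_id in members)
        paths.append(path)
    return paths
```

Note that if the algorithm finds fewer than n clusters, this sketch writes fewer than n files. Whether empty files should be created in that case is not specified in the assignment.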

4. Rubric
- The following figure shows the clustering result for each input data file

  

- Test method
  - For testing, we will use a measure similar to Kendall's tau. Please refer to the following Wikipedia page:
    (http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient)
    - Example
      - Correct answer: [object_id_1] and [object_id_2] are contained in different clusters
      - Your answer
        - [object_id_1] and [object_id_2] are contained in the same cluster → INCORRECT
        - [object_id_1] and [object_id_2] are contained in different clusters → CORRECT
  - The final score will be computed as follows:

$$\text{Score} = \frac{\text{the number of correct pairs}}{\text{the number of all possible pairs}}$$
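
The pairwise measure can be sketched as below: a pair of objects counts as correct when both labelings agree on whether the two objects share a cluster. This is only an illustration of the formula, not the actual testing program. The function name `pair_score` is hypothetical, and because the handout does not say how outliers are treated, the sketch simply assumes every object has a label in both dictionaries.

```python
from itertools import combinations


def pair_score(truth, predicted):
    """Fraction of object pairs on which two labelings agree (same vs. different)."""
    ids = sorted(truth)
    correct = total = 0
    for a, b in combinations(ids, 2):
        same_truth = truth[a] == truth[b]
        same_pred = predicted[a] == predicted[b]
        correct += same_truth == same_pred   # True counts as 1
        total += 1
    return correct / total if total else 1.0


truth = {1: 0, 2: 0, 3: 1, 4: 1}
predicted = {1: 0, 2: 0, 3: 0, 4: 1}
print(pair_score(truth, predicted))  # 3 of 6 pairs agree -> 0.5
```

This loop visits every pair, so it is quadratic in the number of objects. The same quantity is commonly known as the Rand index.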

5. Submission
- Please submit the program files and the report to GitLab
  - Report
    - File format must be *.docx, *.doc, *.hwp, *.pdf, or *.odt.
    - Guideline
      - Summary of your algorithm
      - Detailed description of your code (for each function)
      - Instructions for compiling your source code on the TA's computer (e.g. screenshot) (Important!!)
      - Any other specification of your implementation and testing
  - Program and code
    - An executable file
      - In either of the following two cases, please submit alternative files (e.g., .py file, makefile):
        1. You cannot meet the requirements (.exe file) of the programming assignment due to your computing environment (e.g. Mac OS or Linux)
        2. You are using Python to implement your program
      - You MUST SUBMIT instructions for compiling your source code. If the TAs read your instructions but cannot compile your program, you will get a penalty. Please write the instructions carefully.
    - All source files

 

6. Testing program  
- Please put the following files in the same directory: the testing program, your output files, the given input files, and the attached answer files (~ideal.txt)

- Execute the testing program with one argument (input file name)

- Check your score for the input file
  - If you implement your DBSCAN algorithm successfully and use the parameters given above, you should get scores similar to the following for each input data file:
    - For ‘input1.txt’, Score=99
    - For ‘input2.txt’, Score=95
    - For ‘input3.txt’, Score=99

