Data Science (ITE4005)
Programming Assignment #2
1. Environment
l OS: Windows, Mac OS, or Linux
l Languages: C, C++, C#, Java, or Python (any version is ok)
2. Goal: Build a decision tree, and then classify the test set using it
3. Requirements
The program must meet the following requirements:
l Execution file name: dt.exe
l Execute the program with three arguments: training file name, test file name, output file name
n Example:
- Training file name=‘dt_train.txt’, test file name=‘dt_test.txt’, output file name=‘dt_result.txt’
l Dataset
n We provide you with 2 datasets
- Buy_computer: dt_train.txt, dt_test.txt
- Car_evaluation: dt_train1.txt, dt_test1.txt
n Your program must be able to handle both datasets
n We will also evaluate your program on other datasets whose attributes are similar to those of the car_evaluation dataset
l File format for a training set
[attribute_name_1]\t[attribute_name_2]\t … [attribute_name_n]\n
[attribute_1]\t[attribute_2]\t … [attribute_n]\n
[attribute_1]\t[attribute_2]\t … [attribute_n]\n
[attribute_1]\t[attribute_2]\t … [attribute_n]\n
n [attribute_name_1] ~ [attribute_name_n]: n attribute names
n [attribute_1] ~ [attribute_n-1]
- n-1 attribute values of the corresponding tuple
- All the attributes are categorical (not continuous-valued)
n [attribute_n]: a class label that the corresponding tuple belongs to
n Example 1 (dt_train.txt):
Figure 1. An example of the first training set.
n Example 2 (dt_train1.txt):
Figure 2. An example of the second training set.
- Title: car evaluation database
- Attribute values
l Buying: vhigh, high, med, low
l Maint: vhigh, high, med, low
l Doors: 2, 3, 4, 5more
l Persons: 2, 4, more
l Lug_boot: small, med, big
l Safety: low, med, high
- Class labels: unacc, acc, good, vgood
- Number of instances: training set - 1,382; test set - 346
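The training-set layout described above (a tab-separated header row, then one tuple per line with the class label in the last column) could be read along the following lines. This is only a minimal sketch, not required code: the function name `read_table` and the header name `car_evaluation` for the class column are assumptions made for illustration, and the two sample tuples are built from the attribute values listed above rather than copied from the provided files.

```python
import io

def read_table(lines):
    """Split tab-separated lines into a header and a list of {name: value} dicts."""
    rows = [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    return header, [dict(zip(header, row)) for row in body]

# Illustrative sample only; the class column name is an assumption.
sample = (
    "buying\tmaint\tdoors\tpersons\tlug_boot\tsafety\tcar_evaluation\n"
    "vhigh\tvhigh\t2\t2\tsmall\tlow\tunacc\n"
    "low\tlow\t4\tmore\tbig\thigh\tvgood\n"
)
names, data = read_table(io.StringIO(sample))
target = names[-1]          # the last column holds the class label
print(target, len(data))    # car_evaluation 2
```

The same function can read a test set, since that file only lacks the final class-label column.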
l Attribute selection measure: information gain, gain ratio, or gini index
l File format for a test set
[attribute_name_1]\t[attribute_name_2]\t … [attribute_name_n-1]\n
[attribute_1]\t[attribute_2]\t … [attribute_n-1]\n
[attribute_1]\t[attribute_2]\t … [attribute_n-1]\n
[attribute_1]\t[attribute_2]\t … [attribute_n-1]\n
n The test set does not have [attribute_name_n] (class label)
n Example 1 (dt_test.txt):
Figure 3. An example of the first test set.
n Example 2 (dt_test1.txt):
Figure 4. An example of the second test set.
l Output file format
[attribute_name_1]\t[attribute_name_2]\t … [attribute_name_n]\n
[attribute_1]\t[attribute_2]\t … [attribute_n]\n
[attribute_1]\t[attribute_2]\t … [attribute_n]\n
[attribute_1]\t[attribute_2]\t … [attribute_n]\n
n Output file name: dt_result.txt (for the 1st dataset), dt_result1.txt (for the 2nd dataset)
n You must print the following values:
- [attribute_1] ~ [attribute_n-1]: given attribute values in the test set
- [attribute_n]: a class label predicted by your model for the corresponding tuple
n Please DO NOT CHANGE the order of the tuples in each test set.
- You should print your outputs to match the order of correct answers.
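One way the requirements above might fit together is sketched below: information gain as the attribute selection measure, a recursive tree over categorical attributes, and output lines that keep the test tuples in their original order. This is a minimal sketch under stated assumptions, not a reference solution. All function names, the dictionary-based tree layout, the majority-class fallback for attribute values not seen during training, and the toy rows are illustrative choices. Gain ratio or gini index could be substituted for `information_gain`.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Label entropy minus the expected entropy after splitting on `attribute`."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def build_tree(rows, attributes, target):
    """Return a class label (leaf) or a dict describing an internal node."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    children = {
        value: build_tree([r for r in rows if r[best] == value], remaining, target)
        for value in {r[best] for r in rows}
    }
    # "default" is an assumed fallback for values never seen at this node.
    return {"attribute": best, "children": children, "default": majority}

def classify(tree, row):
    """Walk from the root to a leaf and return the predicted class label."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attribute"]], tree["default"])
    return tree

def predict_lines(tree, test_header, test_rows, target):
    """Build output lines: header plus each test tuple with its predicted label."""
    lines = ["\t".join(test_header + [target])]
    for row in test_rows:  # iterate in file order so tuples are not reordered
        lines.append("\t".join([row[a] for a in test_header] + [classify(tree, row)]))
    return lines

# Toy training rows (illustrative values, not taken from the provided files).
train = [
    {"safety": "low", "persons": "2", "label": "unacc"},
    {"safety": "low", "persons": "4", "label": "unacc"},
    {"safety": "high", "persons": "2", "label": "unacc"},
    {"safety": "high", "persons": "4", "label": "acc"},
]
tree = build_tree(train, ["safety", "persons"], "label")
test_rows = [{"safety": "high", "persons": "4"}, {"safety": "low", "persons": "4"}]
output = predict_lines(tree, ["safety", "persons"], test_rows, "label")
print("\n".join(output))
```

Joining the returned lines with newline characters and writing them to the output file given as the third argument would produce a file in the required format.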
4. Note
l This is a competition project
l The higher the accuracy of your model, the higher your score
n We will first give a base score of at least 70 if (1) you submit your program before the deadline, (2) your program runs correctly without any errors, and (3) all requirements for this project are satisfied.
n We will then assign an additional 0 to 30 points based on your rank.
5. Submission
l Please submit the program files and the report to GitLab
n Report
- File format must be *.docx, *.doc, *.hwp, *.pdf, or *.odt.
- Guideline
ü Summary of your algorithm
ü Detailed description of your code (for each function)
ü Instructions for compiling your source code on the TA's computer (e.g. screenshot) (Important!!)
ü Any other specification of your implementation and testing
n Program and code
- An executable file
ü If either of the following two cases applies to you, please submit alternative files (e.g., .py file, makefile)
1. You cannot meet the .exe file requirement of the programming assignment because of your computing environment (e.g., Mac OS or Linux)
2. You are using Python to implement your program
ü You MUST SUBMIT instructions for compiling your source code. If the TAs follow your instructions but cannot compile your program, you will get a penalty. Please write the instructions carefully.
- All source files
6. Testing program
l Please put the following files in the same directory: the testing program, your output files (dt_result.txt, dt_result1.txt), and the attached answer files (dt_answer.txt, dt_answer1.txt)
l Execute the testing program with two arguments (answer file name and your output file name)
l Check your score for each input file
n Score: the number of your correct predictions / the number of correct answers
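The score above can also be checked by hand with a few lines of code. The sketch below is not the provided testing program. It assumes that the answer file uses the same layout as the output file (a header row, then one tuple per line with the class label in the last column), and the function name `score` is hypothetical.

```python
def score(answer_lines, output_lines):
    """Return (correct predictions, total answers), comparing the last column row by row."""
    def labels(lines):
        rows = [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]
        return [row[-1] for row in rows[1:]]  # skip the header row

    expected, predicted = labels(answer_lines), labels(output_lines)
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct, len(expected)

# Illustrative lines only; real files would be read with open(...).
answer = ["safety\tlabel\n", "low\tunacc\n", "high\tacc\n", "med\tacc\n"]
result = ["safety\tlabel\n", "low\tunacc\n", "high\tacc\n", "med\tunacc\n"]
correct, total = score(answer, result)
print(f"{correct} / {total}")  # 2 / 3
```

Because the comparison is positional, it only gives a meaningful result when the output tuples are in the same order as the answer file.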