CSE601-Project 1 Dimensionality Reduction & Association Analysis

Part 1: Dimensionality Reduction 

 

Dataset Description: 

In this part, you are expected to conduct dimensionality reduction on three biomedical data files (pca_a.txt, pca_b.txt, pca_c.txt), which can be found on Piazza. 

 

In each file, each row represents the record of a patient/sample; the last column is the disease name, and the remaining columns are features. Note that your code should be able to handle the data with different numbers of rows/columns.
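The row layout described above (features first, disease name last) can be read without hard-coding the number of rows or columns. The following is a minimal sketch only: the function name `parse_records` is hypothetical, and the tab separator is an assumption, since the handout does not state the delimiter.

```python
def parse_records(lines, sep="\t"):
    """Split each row into numeric features and a trailing disease label.

    Assumption: fields are separated by `sep` (tab by default); the handout
    does not specify the delimiter, so adjust it to match the actual files.
    """
    features, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        # Everything except the last column is a feature value.
        features.append([float(value) for value in fields[:-1]])
        # The last column is the disease name.
        labels.append(fields[-1])
    return features, labels


# Usage with in-memory rows; a real run would pass an open file instead,
# e.g. `with open("pca_a.txt") as f: X, y = parse_records(f)`.
X, y = parse_records(["1.0\t2.5\tAML\n", "\n", "3\t4\tALL\n"])
```

Because the column count is taken from each row rather than fixed in advance, the same function should work for all three files.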

 

Required Tasks: 

Please take the following steps:

1. You can use your preferred programming language(s). You need to implement the PCA algorithm by yourself. Applying existing package(s) to conduct PCA directly will not receive any credit. If you are not sure whether it is OK to use a certain function, please post your question on Piazza.
2. Implement PCA and then run it on the three data files (pca_a.txt, pca_b.txt, pca_c.txt) to get the two-dimensional data points. For each dataset, draw the data points with a scatter plot, and color them according to their disease names.
3. Apply existing packages to run the SVD and t-SNE algorithms (you do not need to implement them yourself) and get the two-dimensional data points. Visualize the data points of the two algorithms on the three datasets in the same way as the visualization of PCA results in step 2.
4. Prepare your submission. Create a folder named PCA. In the folder you should include:
   - Report: A pdf file named as pdf. The report should contain:
     - Nine scatter plots from three datasets and three algorithms. Label them properly by the dataset name and algorithm name in each plot.
     - A brief description of the flow of your PCA implementation, and a discussion of the results obtained by the different algorithms.
   - A folder named Code, which contains all code used in this part. Inside the folder, please include a README file describing how to run your code.
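One possible flow for the PCA and SVD steps in the tasks above is sketched here, assuming NumPy is an acceptable helper for the linear algebra (covariance and eigendecomposition) while PCA itself is assembled by hand. Whether these helper calls are allowed is a question for Piazza. The names `pca_2d` and `svd_2d` are hypothetical, and applying SVD to the uncentered matrix is an assumption, since the handout does not say whether to center first.

```python
import numpy as np


def pca_2d(X, n_components=2):
    """Project rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    # 1. Center each feature at zero mean.
    centered = X - X.mean(axis=0)
    # 2. Sample covariance matrix of the features (columns are variables).
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    # 3. Eigendecomposition; eigh suits symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 5. Project the centered data onto those directions.
    return centered @ components


def svd_2d(X, n_components=2):
    """Reduce X with a library SVD (assumption: no centering applied)."""
    X = np.asarray(X, dtype=float)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    # Scale the left singular vectors by their singular values.
    return U[:, :n_components] * S[:n_components]


# Toy data lying on a line, so the second principal component carries no variance.
points = [[0, 0], [1, 2], [2, 4], [3, 6]]
reduced = pca_2d(points)
```

For t-SNE, a package such as scikit-learn offers `sklearn.manifold.TSNE`; its `fit_transform` method returns the embedded points. Each two-column result can then be drawn as a scatter plot colored by disease name, for example with matplotlib.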
 

 

Part 2: Association Analysis 

 

Dataset Description: 

The dataset contains gene expressions (association-rule-test-data.txt) and can be found on Piazza. Each row stands for a patient/sample. The last column is the disease name. The remaining columns are gene expressions with the binary values Up or Down. For example, the row “Down Down Down Up … AML” can be interpreted as “G1_Down G2_Down G3_Down G4_Up … AML”, where AML is a disease name.
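The interpretation above, where each value is tagged with its column position, can be sketched as follows. The helper name `row_to_items` is hypothetical, and treating the disease name as an ordinary item in the same set is an assumption made so that rules can mention diseases.

```python
def row_to_items(fields):
    """Turn one row such as ["Down", "Up", "AML"] into a set of items."""
    *genes, disease = fields
    # Gene columns are numbered from 1: G1_Down, G2_Up, ...
    items = {f"G{index}_{value}" for index, value in enumerate(genes, start=1)}
    # Assumption: the disease name is kept as an item alongside the genes.
    items.add(disease)
    return frozenset(items)


example = row_to_items(["Down", "Down", "Down", "Up", "AML"])
```

A frozenset is used so that the rows can later serve as transactions and be compared with subset tests.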

 

Required Tasks:

Implement the Apriori algorithm to find all frequent itemsets. Report the number of frequent itemsets for support of 30%, 40%, 50%, 60%, and 70%, respectively.
Please see Template.pdf for details.

You should not directly call any existing function or package that implements Apriori.  Apriori algorithm should be implemented by yourself.  If you are not sure about whether it is OK to use a certain function, please post your question on Piazza.  
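As an illustration of the level-wise idea behind Apriori, here is a minimal sketch, not a reference solution. It assumes transactions are sets of item strings and that "frequent" means support greater than or equal to the threshold; the handout does not state whether the comparison is strict. The function name `apriori` and its signature are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset meeting min_support (a fraction)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k for c, k in counts.items() if k / n >= min_support}

    items = {item for t in transactions for item in t}
    level = keep_frequent({frozenset([item]) for item in items})
    frequent = {}
    size = 1
    while level:
        frequent.update(level)
        size += 1
        prev = set(level)
        # Join step: merge frequent itemsets whose union has exactly `size` items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, size - 1))
        }
        level = keep_frequent(candidates)
    return frequent


toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
frequent_half = apriori(toy, 0.5)  # three single items and three pairs
```

The number of frequent itemsets for a given support is then `len(apriori(transactions, support))`, which can be evaluated at 0.3, 0.4, 0.5, 0.6 and 0.7.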

Generate association rules based on the templates. The following are templates:

Template 1: {RULE|HEAD|BODY} HAS ({ANY|NUMBER|NONE}) OF (ITEM1, ITEM2, ..., ITEMn)
Template 2: SizeOf({HEAD|BODY|RULE}) ≥ NUMBER.
Template 3: Any combined templates using AND or OR. For example:
BODY HAS (1) OF (Disease) AND HEAD HAS (NONE) OF (Disease)

Below is an example illustrating RULE, BODY and HEAD in the templates: Assume we obtain a RULE {G1_Up, G3_Down} → {G4_Down, G34_Up}. {G1_Up, G3_Down} is HEAD and {G4_Down, G34_Up} is BODY. 
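The templates can be read as filters over a list of generated rules. The sketch below is one hedged interpretation: a rule is stored as a `(head, body)` pair of frozensets following the HEAD → BODY orientation of the example above, and NUMBER in Template 1 is assumed to mean "exactly that many of the listed items". The function names `rule_part`, `template1` and `template2` are hypothetical; Template.pdf remains the authority on the exact semantics.

```python
def rule_part(rule, which):
    """Select the HEAD, the BODY, or the whole RULE (union of both sides)."""
    head, body = rule
    return {"HEAD": head, "BODY": body, "RULE": head | body}[which]


def template1(rules, which, quantity, items):
    """{RULE|HEAD|BODY} HAS ({ANY|NUMBER|NONE}) OF (items)."""
    wanted = set(items)
    selected = []
    for rule in rules:
        hits = len(rule_part(rule, which) & wanted)
        if quantity == "ANY":
            matches = hits >= 1
        elif quantity == "NONE":
            matches = hits == 0
        else:
            # Assumption: NUMBER means exactly that many listed items appear.
            matches = hits == int(quantity)
        if matches:
            selected.append(rule)
    return selected


def template2(rules, which, number):
    """SizeOf({HEAD|BODY|RULE}) >= number."""
    return [rule for rule in rules if len(rule_part(rule, which)) >= number]


rules = [
    (frozenset({"G1_Up", "G3_Down"}), frozenset({"G4_Down", "G34_Up"})),
    (frozenset({"G1_Up"}), frozenset({"AML"})),
]
# Template 3 style query: BODY HAS (1) OF (AML) AND HEAD HAS (NONE) OF (AML)
left = template1(rules, "BODY", 1, ["AML"])
right = template1(rules, "HEAD", "NONE", ["AML"])
both = [rule for rule in left if rule in right]
```

Combining with OR would take the union of the two result lists instead of their intersection.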
