Exam Guidelines
• This is an open book examination.
• You must pack all solutions together into one zip file named itsc studentID final.zip.
An example of the directory structure is shown in Fig. 1.
– Your program and report must be based on your own effort; students may not collaborate with anyone.
Figure 1: An example of the directory structure to be submitted.
Datasets
The data sets for Q1, Q2, Q5, Q6, and Q7 can be downloaded via the link or Canvas. The data set for Q3 can be downloaded via the link.
Grading
Your score will be given based on:
1. Correctness, completeness and clarity in your report.
2. Code must be runnable, and code comments are necessary; missing the necessary comments will result in a score deduction.
3. Evaluation performance (if any).
4. You can get a bonus if you propose a novel approach to any question other than Q7.
Notes
Please read the following notes carefully.
1. Please read Exam Guidelines carefully.
2. Please work hard and arrange your time reasonably.
3. In principle, Python is the preferred programming language.
4. The computation for some questions is very heavy; students may use a cloud computing platform such as Azure.
5. If your code or answers refer to any blog, GitHub repository, paper, and so on, please provide the links in the corresponding QX report.pdf.
6. If you have any question that Google cannot answer, please send an email to {hlicg,lwangcg,sdiaa}@connect.ust.hk. Questions from other channels will not be answered.
1 Dimensionality Reduction on Data Feature (10 Points)
In real-world problems, some data have high-dimensional features that contain noise and impede the performance of machine learning models. In this question, for simplicity, we give a small toy dataset, and you are required to design two dimensionality reduction algorithms to reduce the feature dimension.
You are required to code your dimensionality reduction algorithms manually, but you are allowed to use any existing package for the classification or clustering model used to test the effect of your algorithms. For example, you can feed the features obtained by the different dimensionality reduction algorithms to the same classification or clustering model and compare the performance.
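For reference only, below is a minimal sketch of one possible from-scratch dimensionality reduction algorithm (PCA via eigendecomposition of the covariance matrix); it assumes the data is already loaded into NumPy arrays X (3686 × 400) and y, and the target dimension k is a placeholder you should choose yourself.

```python
import numpy as np

def pca_fit_transform(X, k):
    """Hand-rolled PCA: project X (n_samples, n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)            # center each feature
    cov = np.cov(X_centered, rowvar=False)     # (400, 400) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh because the covariance matrix is symmetric
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    components = eigvecs[:, top]               # (400, k) projection matrix
    return X_centered @ components             # (n_samples, k) reduced features

# Example comparison (assumes X and y are loaded):
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import cross_val_score
# Z = pca_fit_transform(X, k=50)
# print(cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=5).mean())
```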
1.1 Data Description
The dataset contains 3686 points, each with a 400-dimensional feature vector. In the dataset, X holds the multi-dimensional point data and y holds the labels (0 or 1).
1.2 Submission
1. Pack all code files in folder Q1 code.
2. You should write the pseudo code of your dimensionality reduction algorithms in the report Q1 report.pdf and describe them in detail.
3. You should write the experiment settings in your report Q1 report.pdf. For example, if you split the dataset, you should state it clearly.
4. You should compare the performance of your algorithms and analyze them in the report Q1 report.pdf.
5. Please state clearly how to run your program in Q1 readme.md.
1.3 Notes
1. If your pseudo code includes the classification model or the clustering model, you can use a function to denote it directly instead of writing it out in detail.
2. How many dimensions to reduce to is up to you.
2 Imbalanced Data Classification (15 Points)
In this task, you are required to model classifiers for 5 imbalanced data sets.
2.1 Data Description
You are provided with 5 imbalanced training data sets as follows:
1. Bi-class Datasets v train.csv and p train.csv are data sets with binary classes (e.g., positive, negative).
2. Multi-class Datasets y train.csv, e train.csv and a train.csv are data sets with multiple classes.
2.2 Submission
1. Pack all code files in folder Q2 code.
2. Predict the test data and put your prediction results in folder Q2 output.
3. Please report the algorithm you used in detail, the experiment settings, and the training accuracy in your report Q2 report.pdf.
4. Please state clearly how to run your program in Q2 readme.md.
2.3 Notes
1. You are allowed to use any existing package for this question. But please clearly state your reference link in Q2 report.pdf.
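As an illustration only (not the required method), a common baseline for imbalanced data is to reweight the classes inside an off-the-shelf classifier. The sketch below assumes the label is the last column of the binary-class file (the underscore in the file name and the column layout are assumptions; adapt them to the actual CSV schema).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assumed layout: features in all but the last column, label in the last column.
df = pd.read_csv("v_train.csv")                      # placeholder file name
X, y = df.iloc[:, :-1], df.iloc[:, -1]

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_val, clf.predict(X_val)))
```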
3 Fake Video Detection (15 Points)
3.1 Data description
1. You can download the whole dataset from here.
2. Each sub-folder denotes one category of videos. For example, the videos in the ./ApplyEyeMakeup folder all have the label 'ApplyEyeMakeup'. These videos are all real rather than synthesised.
3.2 Submission
1. Please write down your algorithm’s principles and details in Q3 report.pdf. If your code refers to blogs, GitHub codes, papers or anything else, please cite them clearly.
2. Please put all your code of this question in folder Q3 code.
3. You need to put your fake video detection results in Q3 output.csv. You can choose arbitrary fake and real videos online for validation. The output format should look like the following (see the sketch after this list for one way to produce it):
The Name of Video        Is This Video Real?
Diving Bird.avi          False
Obama Presenting.avi     True
4. Please state clearly how to run your program in Q3 readme.md.
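A minimal sketch of producing the output file in the format shown above, assuming the predictions are already available as (video name, is-real) pairs; the underscore in the output file name is an assumption, so follow the naming in Fig. 1.

```python
import csv

# Hypothetical predictions: (video file name, whether the video is real).
results = [("Diving Bird.avi", False),
           ("Obama Presenting.avi", True)]

with open("Q3_output.csv", "w", newline="") as f:   # adjust the file name if needed
    writer = csv.writer(f)
    writer.writerow(["The Name of Video", "Is This Video Real?"])
    for name, is_real in results:
        writer.writerow([name, is_real])
```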
3.3 Notes
1. You can use any algorithm that you know. You can NOT directly use complete models that others have already trained to do classification without any detailed process.
2. We will test your algorithm on a separate dataset and use the resulting performance as the grading criterion.
3. The term fake video means that the video does not originate from the natural world; instead, it is generated by some algorithm from noise or from some distribution.
4 Stock Prediction (15 Points)
4.1 Data Description
1. You are free to collect arbitrary data and information from the internet, such as stock prices, tweets, newspapers, etc.; one possible way to collect price data is sketched after this list.
2. Your data collection method is also not restricted.
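For example (an option, not a requirement), daily stock prices can be collected with the yfinance package; the ticker, date range, and output file name below are placeholders.

```python
import yfinance as yf

# Download daily OHLCV data for a placeholder ticker and date range.
prices = yf.download("AAPL", start="2020-01-01", end="2021-12-31")

# A naive prediction target: whether the closing price rises the next day.
prices["up_next_day"] = (prices["Close"].shift(-1) > prices["Close"]).astype(int)
prices.to_csv("aapl_daily.csv")
print(prices.tail())
```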
4.2 Submission
1. Please write down your algorithm’s principles and details in Q4 report.pdf. If your code refers to blogs, GitHub codes, papers or anything else, please cite them clearly.
2. Please put all your codes of this question in Q4 code folder.
3. Please state clearly how to run your program in Q4 readme.md.
4.3 Notes
1. You can use any algorithm that you know. You can NOT directly use complete models that others have already trained to do sentiment analysis without any detailed process.
5 Frequent Itemset Mining (15 Points)
You are required to write programs to mine the frequent itemsets (min sup = 100) in the provided data. Based on the frequent itemsets you mined, please write programs to mine the closed frequent itemsets and the maximal frequent itemsets.
5.1 Data Description
The benchmark data set is provided by the IBM Almaden Quest research group; it contains 1,000 items and 100,000 transactions. For simplicity, each number uniquely identifies an item.
5.2 Task Description
1. Task 1 (5 points) Mine the frequent itemsets (min sup = 100).
2. Task 2 (5 points) Mine the closed frequent itemsets.
3. Task 3 (5 points) Mine the maximal frequent itemsets.
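To make Tasks 2 and 3 concrete, the sketch below derives the closed and maximal itemsets from an already-mined mapping of frequent itemsets to their support counts; the mining itself (e.g., Apriori or FP-growth) is up to you, and the example supports are made up.

```python
def closed_and_maximal(freq):
    """freq: dict mapping frozenset(itemset) -> absolute support count, for all frequent itemsets."""
    closed, maximal = [], []
    for itemset, support in freq.items():
        supersets = [s for s in freq if itemset < s]      # proper frequent supersets
        # Closed: no proper superset has the same support.
        if all(freq[s] < support for s in supersets):
            closed.append(itemset)
        # Maximal: no proper superset is frequent at all.
        if not supersets:
            maximal.append(itemset)
    return closed, maximal

# Tiny example with made-up supports:
freq = {frozenset({1}): 300, frozenset({2}): 250,
        frozenset({1, 2}): 250, frozenset({1, 2, 3}): 120}
print(closed_and_maximal(freq))
```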
5.3 Submission
1. Pack all code files in folder Q5 code.
2. Following the sample submission, put your outputs in folder Q5 output.
3. Describe the algorithm you used for Task 1 and the running time of each task in Q5 report.pdf.
4. Please state clearly how to run your program in Q5 readme.md.
5.4 Notes
1. You are allowed to use any existing package for this question. But please clearly state your reference link in Q5 report.pdf.
6 Community Detection (10 Points)
In many social networks, several different relations usually exist among the same set of users. In this task, given several datasets, you are required to detect the communities among 419 Twitter users. The ground truth consists of five communities.
6.1 Data Description
The 419 users are referenced by their unique Twitter IDs in data ids.mtx. You are provided with three different sources of these users' social networks:
1. follow/followedby This folder records “follow” information between users.
2. mention/mentionedby This folder records “mention” information between users.
3. retweet/retweetedby This folder records "retweet" information between users.
6.2 Task Description
1. Task 1 (10 points) Detect 5 communities between 419 users.
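One possible package-based approach (an illustration, not the required method) is to merge the three relations into a single symmetric affinity matrix and run spectral clustering with 5 clusters; the .mtx file names and the output file name below are placeholders.

```python
import numpy as np
from scipy.io import mmread
from sklearn.cluster import SpectralClustering

# Placeholder file names; substitute the actual .mtx files from the three folders.
relations = ["follow.mtx", "mention.mtx", "retweet.mtx"]

# Sum the relations into one affinity matrix over the 419 users, then symmetrize.
A = sum(mmread(name).toarray() for name in relations)
A = A + A.T

labels = SpectralClustering(n_clusters=5, affinity="precomputed",
                            random_state=0).fit_predict(A)
np.savetxt("Q6_communities.txt", labels, fmt="%d")   # one community label per user
```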
6.3 Submission
1. Pack all code files in folder Q6 code.
2. Following the sample submission, put your outputs in folder Q6 output.
3. Describe in detail the algorithm you are using in Q6 report.pdf.
4. Please state clearly how to run your program in Q6 readme.md.
6.4 Notes
1. You are allowed to use any existing package for this question. But please clearly state your reference link in Q6 report.pdf.
2. Do NOT use any external data.
7 Matrix Factorization in Graph Representation and Recommendation System (20 Points)
7.1 Preliminary
Matrix Factorization (MF) techniques (e.g., Probabilistic Matrix Factorization and Non-Negative Matrix Factorization) have become the crux of many real-world scenarios, including graph representation and recommendation systems (RecSys), because they are powerful models for finding the hidden properties behind the data. More specifically, Non-Negative Matrix Factorization (NNMF) is a group of models in multivariate analysis and linear algebra in which a matrix $A \in \mathbb{R}^{|B| \times |C|}$ is decomposed into $B \in \mathbb{R}^{|B| \times d}$ and $C \in \mathbb{R}^{|C| \times d}$ according to:
$$B, C = \arg\min_{B', C'} \left\| A - B' C'^{\top} \right\|_F^2 , \qquad (1)$$

where $\| \cdot \|_F$ denotes the Frobenius norm.
User-Movie RecSys. Given a collection of users $\mathcal{U}$ and a collection of movies $\mathcal{V}$, let $X \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{V}|}$ denote the users' rating history, where $x_{ij} = 0$ if the $i$-th user $u_i$ has not rated the $j$-th movie $v_j$, and otherwise $x_{ij} \in (0, 5]$ is the rating of $u_i$ for the movie $v_j$. Following Eq. 1, you can represent the users and movies as $U \in \mathbb{R}^{|\mathcal{U}| \times d}$ and $V \in \mathbb{R}^{|\mathcal{V}| \times d}$ respectively, by minimizing the loss $L_r = \left\| X - U V^{\top} \right\|_F^2$.
User-Movie RecSys with Movie Tags. In real-world applications, it is important to handle the cold-start issue in RecSys, since new movies come out every day. The above user-movie RecSys cannot handle this issue well, because no user has seen a new movie before. A variant of MF alleviates the cold-start issue by incorporating the information of movie tags. Given a collection of movie tags $\mathcal{T}$ and a collection of movies $\mathcal{V}$, let $Y \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{V}|}$ denote the movie tag information, where $y_{ij} = 1$ if movie $v_j$ is tagged with $t_i$, and $y_{ij} = 0$ otherwise. You can then represent the users, movies, and tags as $U \in \mathbb{R}^{|\mathcal{U}| \times d}$, $V \in \mathbb{R}^{|\mathcal{V}| \times d}$, and $T \in \mathbb{R}^{|\mathcal{T}| \times d}$ respectively, by minimizing the loss $L_t = \left\| X - U V^{\top} \right\|_F^2 + \left\| Y - T V^{\top} \right\|_F^2$ as:

$$U, V, T = \arg\min_{U', V', T'} \left\| X - U' V'^{\top} \right\|_F^2 + \left\| Y - T' V'^{\top} \right\|_F^2 .$$
7.2 Task Description
You are required to code three different matrix factorization techniques as introduced above.
1. Task 1 (5 points) Given the graph data graph.npy, please represent the nodes $V$ by minimizing $L_g$.
2. Task 2 (5 points) Given the rating data rating.csv, please represent the users $U$ and movies $V$ by minimizing $L_r$.
3. Task 3 (10 points) Given the rating data rating.csv and the movie data movie.csv, please represent the users $U$, movies $V$, and tags $T$ by minimizing $L_t$.
7.3 Submission
1. Pack all code files in folder Q7 code.
2. Please report the experiment settings (e.g., learning rate), your final losses ($L_g$, $L_r$, and $L_t$), and the corresponding learning curves (X-axis: iteration; Y-axis: loss) for the three tasks in Q7 report.pdf.
3. Please state clearly how to run your program in Q7 readme.md.
7.4 Notes
1. You are allowed to use any existing package for this question. But pleaseclearly state your reference link in Q7 report.pdf.
2. Hint: learning in Eq. 1 is usually more stable if, in each subproblem, $U$ or $V$ is updated by gradient descent so that $L$ decreases slightly, for example:
• While not converged:
– Minimize $L$ over $U$ by gradient descent with $V$ fixed.
– Minimize $L$ over $V$ by gradient descent with $U$ fixed.
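A minimal NumPy sketch of this alternating scheme for the user-movie loss $L_r$ is shown below; the latent dimension, learning rate, and iteration count are arbitrary placeholders, and the random matrix at the bottom only stands in for the real rating data from rating.csv.

```python
import numpy as np

def factorize(X, d=16, lr=1e-3, iters=500, seed=0):
    """Alternating gradient descent for L_r = ||X - U V^T||_F^2."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(X.shape[0], d))
    V = rng.normal(scale=0.1, size=(X.shape[1], d))
    losses = []
    for _ in range(iters):
        E = X - U @ V.T
        U += lr * (E @ V)      # step along -dL/dU = 2 E V (constant factor folded into lr)
        E = X - U @ V.T
        V += lr * (E.T @ U)    # step along -dL/dV = 2 E^T U
        losses.append(np.linalg.norm(X - U @ V.T, "fro") ** 2)
    return U, V, losses

# Stand-in data; replace with the matrix built from rating.csv.
X = np.random.default_rng(1).uniform(0, 5, size=(100, 50))
U, V, losses = factorize(X)
print(losses[0], losses[-1])   # the loss should decrease over iterations
```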