Intelligent Systems: Mining sequential data

1 Introduction and Overview
In real life, sequential data is common, appearing in text sequences, time series, nucleic acid sequences, etc. This seminar tests your ability to model and draw conclusions from sequential data. Submit the assignment as two files: an R Markdown file and a PDF file generated from it. Create the R Markdown file in RStudio (File → New File → R Markdown) and write both the code and the text of your answers in it (the echo = T option must be enabled for each code block). R Markdown files can easily be exported to PDF using the "Knit" button in RStudio.

2 Task overview
Documents are collections of ordered strings, that is, string sequences. In this assignment, you will inspect, transform and learn from this type of input. You are provided with a pre-split collection of labeled documents (train.tsv, test.tsv) related to the detection of fake news[1]. The task is to i) pre-process the documents, ii) construct features, iii) model the documents, and iv) evaluate the models. The documents are snippets of social media text and are labeled as either fake (0) or true (1).
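A minimal sketch of loading such a file in R follows. The assignment does not state the column names or their order, so the `text` and `label` columns and the small temporary file used here are assumptions standing in for train.tsv.

```r
# Hypothetical stand-in for train.tsv: the real column names and order
# are not given in the assignment, so "text" and "label" are assumptions.
demo_path <- tempfile(fileext = ".tsv")
writeLines(c("text\tlabel",
             "Breaking: miracle cure found!!! http://t.co/abc\t0",
             "City council approves new budget\t1"),
           demo_path)

# quote = "" keeps apostrophes and quotation marks inside the text from
# being interpreted as field quoting, which matters for social media snippets.
train <- read.delim(demo_path, quote = "", stringsAsFactors = FALSE)

str(train)          # inspect the column types
table(train$label)  # class balance: 0 = fake, 1 = true
```

For the real data, `demo_path` would be replaced by the path to train.tsv, and the same call repeated for test.tsv.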

2.1 Pre-processing
The provided data is realistic in the sense that it can be noisy. Your first task is to clean the documents of possible noise, such as URL links and stray symbols. Then compute some basic statistics (e.g., term frequencies) and visualize them. What do you observe?

(15%)
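The cleaning and counting steps described in this section could be sketched in base R as follows. The toy documents and the `clean_text` helper are illustrative assumptions, and the cleaning rules (lowercasing, dropping URLs, HTML entities and non-letters) are one possible choice rather than a prescribed one.

```r
# Toy documents standing in for the text column of train.tsv (assumption).
docs <- c("BREAKING!!! Cure found: https://example.com/x #health",
          "Officials confirm the cure is not approved &amp; warn users",
          "Read more at www.example.org ... the cure works???")

# Hypothetical helper: one possible set of cleaning rules.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("&[a-z]+;", " ", x)                # HTML entities such as &amp;
  x <- gsub("[^a-z\\s]", " ", x, perl = TRUE)  # digits, punctuation, symbols
  x <- gsub("\\s+", " ", x, perl = TRUE)       # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_text(docs)
tokens  <- unlist(strsplit(cleaned, " ", fixed = TRUE))
tokens  <- tokens[nzchar(tokens)]

# Term frequencies, most frequent first.
tf <- sort(table(tokens), decreasing = TRUE)
head(tf, 10)

# Bar chart of the most frequent terms.
barplot(head(tf, 10), las = 2, ylab = "Frequency",
        main = "Most frequent terms after cleaning")
```

On real data, the frequency table is typically dominated by function words such as "the", which is one observation that can motivate stop-word removal.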

2.2 Feature construction
Raw texts are not the best form of input for many machine learning algorithms. Your task is to convert the documents into feature matrices suitable for learning. Devise a feature construction procedure that outputs a real-valued matrix (rows = documents, columns = features)[2]. Note that you need to transform both the train and the test instances. Discuss your choices and describe the final feature space you will use for learning.
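One possible feature construction procedure is a TF-IDF weighted bag of words. The sketch below is an illustration under assumptions, not the required method: the toy documents, the `tokenize` and `count_matrix` helpers and the smoothed IDF formula are all choices made for this example. The vocabulary and the IDF weights are estimated on the training documents only and then reused for the test documents, so both matrices share the same columns.

```r
# Toy cleaned documents (assumption: output of the pre-processing step).
train_docs <- c("cure found health", "officials confirm cure", "budget approved")
test_docs  <- c("cure approved", "unseen words only")

tokenize <- function(x) strsplit(x, " ", fixed = TRUE)

# The vocabulary comes from the training documents only.
vocab <- sort(unique(unlist(tokenize(train_docs))))

# Count matrix: rows = documents, columns = vocabulary terms.
# Tokens outside the vocabulary are silently dropped.
count_matrix <- function(docs, vocab) {
  m <- t(vapply(tokenize(docs),
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
  colnames(m) <- vocab
  m
}

train_counts <- count_matrix(train_docs, vocab)
test_counts  <- count_matrix(test_docs, vocab)

# Smoothed inverse document frequency, estimated on the training set only.
idf <- log((nrow(train_counts) + 1) / (colSums(train_counts > 0) + 1)) + 1

# TF-IDF: scale each column by its IDF weight.
X_train <- sweep(train_counts, 2, idf, `*`)
X_test  <- sweep(test_counts,  2, idf, `*`)

dim(X_train)  # documents x features
dim(X_test)   # same columns as X_train
```

A test document made up entirely of unseen words becomes an all-zero row, which is worth keeping in mind when choosing the vocabulary size.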


2.3 Modeling
Use at least three machine learning classifiers and one ensemble method. Train the models on the training data (train.tsv) and produce the predictions for the test instances (test.tsv). Be very careful not to use any of the test set instances during learning (no data leaks allowed!).
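The workflow above can be sketched in base R. Everything here is an assumption made for illustration: the synthetic feature matrices stand in for the real train and test features, and the three classifiers (logistic regression, nearest centroid, and a hand-written k-nearest neighbours) with a majority-vote ensemble are one possible selection. All fitted quantities are computed from the training set only, and the test labels are never touched during learning.

```r
set.seed(1)

# Synthetic stand-ins for the real train/test feature matrices (assumption).
make_data <- function(n) {
  y <- rbinom(n, 1, 0.5)
  X <- cbind(f1 = rnorm(n, mean = y), f2 = rnorm(n, mean = -y))
  list(X = X, y = y)
}
n_train <- 200; n_test <- 50
train <- make_data(n_train)
test  <- make_data(n_test)

# Classifier 1: logistic regression (stats::glm), fitted on the training set.
fit_glm  <- glm(y ~ ., data = data.frame(train$X, y = train$y),
                family = binomial)
pred_glm <- as.integer(predict(fit_glm, newdata = data.frame(test$X),
                               type = "response") > 0.5)

# Classifier 2: nearest centroid, centroids computed on the training set only.
centroids <- rbind(colMeans(train$X[train$y == 0, , drop = FALSE]),
                   colMeans(train$X[train$y == 1, , drop = FALSE]))
sq_dist <- function(X, center) rowSums(sweep(X, 2, center)^2)
pred_nc <- as.integer(sq_dist(test$X, centroids[2, ]) <
                      sq_dist(test$X, centroids[1, ]))

# Classifier 3: k-nearest neighbours (k = 5), written out in base R.
knn_predict <- function(X_train, y_train, X_new, k = 5) {
  apply(X_new, 1, function(x) {
    d <- rowSums(sweep(X_train, 2, x)^2)
    as.integer(mean(y_train[order(d)[seq_len(k)]]) > 0.5)
  })
}
pred_knn <- knn_predict(train$X, train$y, test$X)

# Ensemble: majority vote over the three classifiers.
votes <- cbind(pred_glm, pred_nc, pred_knn)
pred_ensemble <- as.integer(rowSums(votes) >= 2)
```

With an odd number of voters and an odd k, neither the ensemble nor the k-nearest neighbours step can tie. In an actual submission, packaged implementations of these or other classifiers would be a natural replacement for the hand-written ones.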
