DATA8001 -  Assignment 2 - Data ETL - Solved

There are 2,000 news articles in the data/R00000000_data.zip file (replace R00000000 with your student ID throughout this assignment). Unzip the news articles into the data/files folder.

Each news article can be viewed in Notepad and is in the format:

<REPORTER>Student Name</REPORTER> 

<DATE>News Article Date</DATE> 

<CATEGORY>News Article Category</CATEGORY> 

<HEADLINE>News Article Headline</HEADLINE> 

<ARTICLE>News article Text</ARTICLE> 

Create a single dataframe containing the 2,000 news articles with the headings [news_category, news_headline, news_article] and save the dataframe as data/R00000000_processed.csv (replacing R00000000 with your student ID).
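The tagged format shown above can be parsed with a short regex-based extractor. A minimal sketch, assuming the unzipped articles are plain-text `*.txt` files in data/files (the real extension may differ); `parse_article` and `etl` are illustrative names, and `pandas.DataFrame(rows).to_csv(out_csv, index=False)` is the drop-in replacement for the stdlib CSV writing shown here:

```python
import csv
import re
from pathlib import Path

# Pull the text between a pair of tags, e.g. <CATEGORY>...</CATEGORY>.
# re.DOTALL lets the article body span multiple lines.
TAG_RE = "<{tag}>(.*?)</{tag}>"

def parse_article(text):
    """Extract the three required fields from one tagged article."""
    row = {}
    for field, tag in [("news_category", "CATEGORY"),
                       ("news_headline", "HEADLINE"),
                       ("news_article", "ARTICLE")]:
        match = re.search(TAG_RE.format(tag=tag), text, re.DOTALL)
        row[field] = match.group(1).strip() if match else ""
    return row

def etl(files_dir, out_csv):
    """Parse every article in files_dir and write the combined CSV."""
    rows = [parse_article(p.read_text(encoding="utf-8"))
            for p in sorted(Path(files_dir).glob("*.txt"))]
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["news_category", "news_headline", "news_article"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```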

All code required to reproduce the data ETL process should be placed in the Python library file (at the bottom, where indicated): lib/R00000000_util.py, and must be callable from the Jupyter Notebook: R00000000_A2_Notebook.ipynb.


Data Modelling – 20%
Create 3 multi-class classification models to classify news article categories, using the sample data provided to train & test your models. For each model, briefly explain in your report why you selected it and report its accuracy (overall & per individual class) on your data. Also provide, in the report, your recommendations for the best models and settings based on your research.

In your report, explain your choice of text pre-processing technique (e.g., bag of words, TF-IDF, etc.) for each model, and also describe which text preparation methods you employed (e.g., lowercasing, stemming, etc.).
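As a reference point for the write-up: TF-IDF weights each term by its frequency in a document, discounted by how many documents contain it. A minimal stdlib sketch of the idea (in practice sklearn's `TfidfVectorizer` handles this, along with lowercasing and custom tokenisers for stemming):

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each term in each document as tf * idf, after lowercasing.
    A bare-bones illustration of what TfidfVectorizer automates."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenised for term in set(doc))
    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        scores.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores
```

Note how a term that appears in every document (e.g., "the") scores zero, while rarer, more discriminative terms score higher.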

For each model, use some form of model parameter optimisation (e.g., grid search, partial grid search etc.) to determine the best model parameters and ensure the models are not overfitted (i.e., they generalise to unseen data).
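The idea behind grid search is small enough to sketch directly (sklearn's `GridSearchCV` adds cross-validation and parallelism on top of this; `score_fn` here stands in for a cross-validated scoring routine, which is what guards against overfitting):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively try every parameter combination and keep the best.
    score_fn should return a cross-validated score so the chosen
    parameters generalise to unseen data."""
    keys = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```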

For each model, show the model classification report and confusion matrix in your Jupyter notebook.
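sklearn.metrics provides `classification_report` and `confusion_matrix` directly; the underlying computation is simple enough to sketch with the stdlib (`per_class_accuracy` here is per-class recall, one way to read the "individual class" accuracy asked for above):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def per_class_accuracy(y_true, y_pred, labels):
    """Per-class recall: correct predictions / true instances of the class."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    return {label: (row[i] / sum(row) if sum(row) else 0.0)
            for i, (label, row) in enumerate(zip(labels, matrix))}
```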

Split your dataset into a training set (80%) and a test set (20%) using the seed (random_state=8001).  
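With sklearn this is `train_test_split(X, y, test_size=0.2, random_state=8001)`; a stdlib sketch of the same seeded 80/20 split:

```python
import random

def train_test_split(rows, test_size=0.2, random_state=8001):
    """Shuffle with a fixed seed so the 80/20 split is reproducible,
    then slice off the last 20% as the test set."""
    shuffled = rows[:]  # don't mutate the caller's list
    random.Random(random_state).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_size))
    return shuffled[:cut], shuffled[cut:]
```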

Using the Python class provided in lib/R00000000_util.py, save the model objects to the model folder as: [model/R00000000_model_1.pkl, model/R00000000_model_2.pkl, model/R00000000_model_3.pkl].

All code required to reproduce the modelling process should be placed in the Python library file: lib/R00000000_util.py, and must be callable from the Jupyter Notebook: R00000000_A2_Notebook.ipynb.

The pickled model files should be loaded and called from the Jupyter Notebook and must be able to process unseen test data, including any transformations required to make the models work. The models can be called from the Jupyter Notebook as:

R00000000_model, news_category = util.load_run_model(model_id=model_id, student_id=STUDENT_ID, news_headline=news_headline, news_article=news_article) 


Report & Questions (15%)
Write a report (max 2 pages) outlining the steps taken to complete the assignment. Identify any areas you feel are worth mentioning during the ETL, visualisation, or modelling steps, including any insights developed.

Answer 2 exam-type questions (max 300 words each). Note: due to the "open-book" nature of this assignment, a clean, concise, and well-thought-out answer expressing your "own" viewpoint is expected; this is not a "cut-and-paste" exercise.
