CSE351 - Data Science - Course Projects

Project #1: Movie Revenue Prediction

The film industry is booming and revenues are growing. Many factors affect the revenue of a film. In this project, you will explore which features can help predict revenue.

Datasets: 

The “movie.zip” file contains the datasets to be used for this project and a file describing the various columns in the data. You must split the dataset yourself into training, testing, and cross-validation data (when required). Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.
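
One possible way to do the split is sketched below, using only the Python standard library. The function name `split_rows`, the 60/20/20 proportions, and the fixed seed are assumptions made for illustration; the assignment does not prescribe them.

```python
import random


def split_rows(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows reproducibly and split them into train / validation / test lists."""
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains becomes the test set
    return train, val, test


rows = list(range(100))  # stand-in for 100 records; real rows would be dicts or tuples
train, val, test = split_rows(rows)
print(len(train), len(val), len(test))
```

Keeping the seed fixed means the same rows land in the test set on every run, which makes results comparable across models.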

 

EDA (10 points): 

Get familiar with the dataset and decide what features and observations will be useful. Make good use of visualizations. Specific tasks may include but are not limited to: 

●     Clean the dataset and remove outliers before any data analysis. Explain what you did.

●     Some of the columns contain lists and dictionaries. Extract the information you need and reformat it.

●     Count the number of movies released by day of week, month, and year. Do you observe any patterns?

●     How do movie genre trends shift over time in the dataset?

●     Which features are most strongly and most weakly correlated with movie revenue?

●     You can also integrate external datasets into your revenue prediction analysis to improve it.
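
The list/dictionary extraction and the release-day count can be sketched as follows. This assumes the nested columns are stored as Python-literal strings and that release dates have already been parsed into `date` objects; the column names `genres` and `release_date` and the sample rows are hypothetical, so check them against the column description file.

```python
import ast
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def parse_names(cell):
    """Return the 'name' values from a stringified list of dicts; [] if unparsable."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        items = ast.literal_eval(cell)  # safe parser for Python literals only
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [d["name"] for d in items if isinstance(d, dict) and "name" in d]


# Hypothetical rows; the real column names and formats may differ.
movies = [
    {"genres": "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}]",
     "release_date": date(2015, 5, 1)},
    {"genres": "[{'id': 18, 'name': 'Drama'}]", "release_date": date(2016, 10, 21)},
    {"genres": "", "release_date": date(2017, 3, 3)},
]

# Count how often each genre appears across all movies.
genre_counts = Counter(g for m in movies for g in parse_names(m["genres"]))
# Count releases per day of week (date.weekday() returns 0 for Monday).
weekday_counts = Counter(WEEKDAYS[m["release_date"].weekday()] for m in movies)
print(genre_counts, weekday_counts)
```

The same counting pattern works for month and year by swapping the key, for example `m["release_date"].month`.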

 

Modeling and Question Answering (10 points): 

Extract the features you think are necessary in predicting the movie revenue. Build three models, train them on the training set, and predict the revenue on the test set (after dropping the revenue column in the test set). Explain how each model works (briefly introduce the machine learning algorithms behind them). Evaluate the performance of each model based on the original outcome in the test set. If your predictions are not so accurate, what do you think is the reason? Report your accuracy using metrics such as Residual Standard Error (RSE). Split the data further to include a cross validation set. Did this improve your model’s performance on the test set?  
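
For reference, a minimal sketch of the RSE computation, assuming the common definition RSE = sqrt(RSS / (n - p - 1)), where RSS is the residual sum of squares, n the number of observations, and p the number of predictors. The function name and the sample values are made up for illustration.

```python
import math


def residual_standard_error(y_true, y_pred, n_predictors):
    """RSE = sqrt(RSS / (n - p - 1)) for n observations and p predictors."""
    n = len(y_true)
    if len(y_pred) != n:
        raise ValueError("y_true and y_pred must have the same length")
    dof = n - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(rss / dof)


# Invented values: residuals are -2, 2, -3, 1, 0, so RSS = 18 and dof = 2.
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 33.0, 39.0, 50.0]
rse = residual_standard_error(actual, predicted, n_predictors=2)
print(rse)  # 3.0
```

RSE is expressed in the same units as revenue, so it reads as a typical size of the prediction error.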

 

Project Report (10 points): 

You are required to document your project; the documentation can be included in the notebook itself. Don't forget to include each team member's contribution in the documentation. Include visualizations to support your points. You should also prepare a PowerPoint presentation, which can help you during the demo. 

 

Demo (5 points): 

Sign up for a Zoom session with the mentor to present your project. All the team members should be present during the demo. Be prepared to answer questions related to your work. You should present your findings for the project, and you should also be able to run your code.  


Project #2: Titanic - Who will survive? (1-2 People)
Titanic, a British passenger liner, sank in the North Atlantic Ocean on 15 April 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking at the time one of the deadliest of a single ship and the deadliest peacetime sinking of a superliner or cruise ship to date. This project lets us gain insight into what it took to survive such a catastrophe: was it pure luck, or something else?

 

Datasets: 

The “tantanic.zip” file contains the dataset to be used for this project. The dataset is already split into training and testing sets.

Here are the definitions of some of the variables:

Survival - 0 = No, 1 = Yes

pclass (Ticket class) - 1 = 1st, 2 = 2nd, 3 = 3rd

Sibsp - # of siblings / spouses aboard the Titanic 

Parch - # of parents / children aboard the Titanic 

Embarked - port of embarkation - C = Cherbourg, Q = Queenstown, S = Southampton 

 

EDA (10 points): 

Get familiar with the dataset and decide what features and observations will be useful. Make good use of visualizations. Specific tasks may include but are not limited to: 

●     Clean the dataset and remove outliers before any data analysis. Explain what you did.

●     Explore the socio-economic status of the passengers. Is there any relationship between socio-economic status and other features, such as age, gender, or number of family members on board?

●     Explore the distribution of survivors and victims in relation to age, gender, socio-economic class, etc.

●     What features seem to be the most important ones? Perform a correlation analysis before your prediction task.

●     How can you extract information from the non-numerical features?
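
One common way to use a non-numerical feature is one-hot encoding, sketched here for the Embarked variable defined above. The helper name `one_hot`, the `is_` column prefix, and the sample values are assumptions for illustration; other encodings may suit other columns better.

```python
def one_hot(values, categories):
    """Map each value to a dict of 0/1 indicator columns, one per category."""
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Invented sample; None stands for a missing port of embarkation.
embarked = ["S", "C", "Q", "S", None]
encoded = one_hot(embarked, categories=["C", "Q", "S"])
for row in encoded:
    print(row)
```

A missing value becomes a row of all zeros, which keeps it distinguishable from the three known ports without inventing a value for it.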

 

Modeling and Question Answering (10 points): 

Build three models, train them on the training set, and predict the outcome on the test set (after dropping the survival column in the test set). Explain how each model works (briefly introduce the machine learning algorithms behind them). Evaluate the performance of each model based on the original outcome in the test set. If your predictions are not very accurate, what do you think is the reason? Use other evaluation metrics to evaluate your models (Precision, Recall, F-score). Split the data further to include a cross-validation set. Did this improve your model’s performance on the test set?  
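
The three metrics named above can be computed from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels with 1 = survived as the positive class and uses the F1 variant of the F-score; the function name and the sample labels are invented.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```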

 

Project Report (10 points): 

You are required to document your project; the documentation can be included in the notebook itself. Don't forget to include each team member's contribution in the documentation. Include visualizations to support your points. You should also prepare a PowerPoint presentation, which can help you during the demo. 

 

Demo (5 points): 

Sign up for a Zoom session with the mentor to present your project. All the team members should be present during the demo. Be prepared to answer questions related to your work. You should present your findings for the project, and you should also be able to run your code.  

 

Project #3: Fatal Force in the US (2-3 people)
In the United States, the use of deadly force by police has been a high-profile and contentious issue. About 1,000 people are shot and killed by US police each year. The ever-growing argument is that the US has a flawed law enforcement system that costs too many innocent civilians their lives. In this project, we will analyze one of America’s hottest political topics, which encompasses issues ranging from institutional racism to the role of law enforcement personnel in society. 

 

Datasets: The “fatal_force.zip” file contains six datasets to use for this project. “police_killings_train.csv” and “police_killings_test.csv” are mandatory datasets with self-explanatory data fields. The other four files “share_race_by_city.csv”, “income.csv”, “poverty.csv”, and “education.csv” are optional datasets you can use to perform analysis and to add features to your models. 

 

EDA (10 points): 

Get familiar with the dataset and decide what features and observations will be useful. Make good use of visualizations. Specific tasks may include but are not limited to: 

●        Clean and merge the datasets. Explain what you did.

●        Which state has the most fatal police shootings? Which city is the most dangerous?

●        What is the most common way of being armed?

●        What is the age distribution of the victims? Compare age distribution of different races.

●        Compare the total number of people killed per race. Then compare the number of people killed per race as a proportion of each race's population. What difference do you observe?
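
The difference between raw totals and population-adjusted figures can be sketched as below. The group labels and all numbers are invented toy values, not real statistics; real population figures would have to come from an external source or from the optional datasets.

```python
def rate_per_million(counts, population):
    """Scale each group's count by its population, expressed per one million people."""
    return {
        group: counts[group] * 1_000_000 / population[group]
        for group in counts
        if population.get(group)  # skip groups with missing or zero population
    }


# Invented toy numbers for illustration only; not real statistics.
killed = {"group_a": 50, "group_b": 25}
population = {"group_a": 10_000_000, "group_b": 2_000_000}
rates = rate_per_million(killed, population)
print(rates)  # {'group_a': 5.0, 'group_b': 12.5}
```

In this toy case the group with the smaller raw count has the higher rate, which is the kind of reversal the comparison is meant to surface.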

 

Modeling and Question Answering (10 points): 

Apply three machine learning algorithms to explore whether it is possible to predict the race of a victim based on other features. Train your models on the training set, and make predictions for the test set with the “race” column dropped. Evaluate the accuracy of your predictions. If your predictions are not very accurate, what do you think is the reason? 
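
When judging whether accuracy is good or poor, it can help to compare against a trivial baseline that always predicts the most frequent class in the training set. The sketch below is one way to do that; the function names and the placeholder labels "x", "y", "z" are assumptions and do not reflect the actual values in the “race” column.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, n_test):
    """Predict the most frequent training label for every test row."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_test


# Placeholder labels; the real class names come from the dataset.
y_train = ["x", "x", "y", "x", "z", "x"]
y_test = ["x", "y", "x", "z"]
baseline_acc = accuracy(y_test, majority_baseline(y_train, len(y_test)))
print(baseline_acc)  # 0.5
```

A model whose accuracy is close to this baseline has learned little beyond the class frequencies, which often points to class imbalance or weak features.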

 

Project Report (10 points): 

You are required to document your project; the documentation can be included in the notebook itself. Don't forget to include each team member's contribution in the documentation. Include visualizations to support your points. You should also prepare a PowerPoint presentation, which can help you during the demo. 

 

Demo (5 points): 

Sign up for a Zoom session with the mentor to present your project. All the team members should be present during the demo. Be prepared to answer questions related to your work. You should present your findings for the project, and you should also be able to run your code.  


 

 

Project #4: What makes people in a country happy? (2-3 people)
The World Happiness Report is a landmark survey of the state of global happiness that ranks countries by how happy their citizens perceive themselves to be. The report gains global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. This project allows us to gain insight into the state of happiness in the world today. 

 

Datasets: 

The “world_happiness.zip” file on Blackboard contains happiness data for different countries from year 2015 to year 2019. We will treat data of year 2015 to year 2018 as the training set, and year 2019 data as the test set. Description of the data fields can be found on the FAQ page of World Happiness Report: https://worldhappiness.report/faq/ 

 

EDA (10 points): 

Get familiar with the dataset and decide what features and observations will be useful. Make good use of visualizations. Specific tasks may include but are not limited to: 

●        Merge and clean the data. Explain what you did.

●        What are the central tendencies of happiness score over the years? Did they increase or decrease?

●        Which countries have stable rankings over the years? Which countries improved their rankings?

●        Visualize the relationship between happiness score and other features such as GDP, social support, freedom, etc.

●        Find out what features contribute to happiness. If you were the president of a country, what would you do to make citizens happier?
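
A simple first look at which features move together with the happiness score is the Pearson correlation coefficient, sketched below without external libraries. The feature names and all values are invented for illustration, and correlation alone does not show that a feature causes happiness.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Invented values for four countries; real data comes from the yearly files.
score = [7.5, 6.0, 5.0, 4.0]
features = {
    "gdp": [1.4, 1.2, 0.9, 0.5],
    "generosity": [0.2, 0.3, 0.1, 0.25],
}

# Order features from strongest to weakest absolute correlation with the score.
ranked = sorted(features, key=lambda name: abs(pearson(features[name], score)), reverse=True)
print(ranked)
```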

 

Modeling and Question Answering (10 points): 

The happiness rankings in the datasets are determined by happiness scores only. Now we want to predict the ranking using a machine learning approach. Build three models based on data from year 2015 to year 2018.  Explain how each model works (briefly introduce the machine learning algorithms behind them). Predict the happiness ranking for the year 2019 (drop the “overall rank” and “score” columns first). Compare your rankings to the original rankings in “2019.csv”. How does each model perform? Invent your own formula to calculate happiness score using features of your choice. 
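
One way to compare predicted and original rankings is sketched below: convert predicted scores to ranks, then measure agreement with Spearman's rank correlation. The formula used, 1 - 6·Σd² / (n(n² - 1)), is valid only when there are no tied ranks; the function names and sample values are assumptions for illustration.

```python
def ranks_from_scores(scores):
    """Rank 1 = highest score. Ties are broken by order of appearance (a simplification)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_no_ties(rank_a, rank_b):
    """Spearman's rho via the squared-difference formula; assumes no tied ranks."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented example with four countries.
predicted_scores = [7.1, 5.2, 6.4, 4.9]
actual_ranks = [2, 3, 1, 4]
predicted_ranks = ranks_from_scores(predicted_scores)  # [1, 3, 2, 4]
rho = spearman_no_ties(predicted_ranks, actual_ranks)
print(predicted_ranks, rho)
```

A value of rho near 1 means the predicted ordering closely matches the original one, even if individual scores are off.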

 

Project Report (10 points): 

You are required to document your project; the documentation can be included in the notebook itself. Don't forget to include each team member's contribution in the documentation. Include visualizations to support your points. You should also prepare a PowerPoint presentation, which can help you during the demo. 

 

Demo (5 points): 

Sign up for a Zoom session with the mentor to present your project. All the team members should be present during the demo. Be prepared to answer questions related to your work. You should present your findings for the project, and you should also be able to run your code.  

 
Project #5: Can we predict whether a Hotel Booking will be cancelled? 

When it comes to hotel bookings, customers have a variety of options and deals, and they often cancel bookings for a range of reasons. Given hotel booking data for two major hotels, can we predict whether a customer will cancel a booking? We will explore the main concepts of EDA and classification modeling in this project. 

 

Datasets: The “Hotel_Bookings.zip” file contains the dataset to be used for this project and a file describing the various columns in the data. You must split the dataset yourself into training, testing, and cross-validation data (when required). 

 

EDA (10 points): 

Get familiar with the dataset and decide what features and observations will be useful. Make good use of visualizations. Specific tasks may include but are not limited to: 

●        Which country saw the most hotel bookings according to the data?

●        What is the distribution like for both hotels with respect to price of a room per night?

●        Which months are the busiest for both hotels? Which months have the highest per-night prices?

●        Which months see the most cancellations for both hotels?

●        Examine distributions of bookings vs market segment.

●        Which room type was most commonly booked? Most commonly cancelled?

●        What percentage of the data recorded cancellations for each hotel?
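
The per-hotel cancellation percentage is a group-and-count calculation, sketched below in plain Python. The “is_canceled” column name comes from the modeling task; the `hotel` key, the hotel labels, and the sample rows are assumptions for illustration. The same function works for months or room types by passing a different key.

```python
from collections import defaultdict


def cancellation_rate_by(rows, key):
    """Percentage of cancelled bookings for each distinct value of `key`."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        cancelled[row[key]] += int(row["is_canceled"])
    return {group: 100.0 * cancelled[group] / totals[group] for group in totals}


# Invented bookings; "hotel" is an assumed column name.
bookings = [
    {"hotel": "hotel_a", "is_canceled": 1},
    {"hotel": "hotel_a", "is_canceled": 0},
    {"hotel": "hotel_a", "is_canceled": 1},
    {"hotel": "hotel_a", "is_canceled": 0},
    {"hotel": "hotel_b", "is_canceled": 0},
    {"hotel": "hotel_b", "is_canceled": 0},
    {"hotel": "hotel_b", "is_canceled": 1},
    {"hotel": "hotel_b", "is_canceled": 0},
]
rates = cancellation_rate_by(bookings, "hotel")
print(rates)  # {'hotel_a': 50.0, 'hotel_b': 25.0}
```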

 

Modeling and Question Answering (10 points): 

Apply three machine learning algorithms to predict whether or not a customer will cancel a booking. Train your models on the training set, and make predictions for the test set with the “is_canceled” and “reservation_status” columns dropped. Evaluate the accuracy of your predictions. If your predictions are not so accurate, what do you think is the reason? Use other evaluation metrics to evaluate your models (Precision, Recall, F-score). Split the data further to include a cross validation set. Did this improve your model’s performance on the test set?  
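
If a single validation split gives unstable results, k-fold cross-validation is a common alternative: each row is used for validation exactly once. The sketch below only generates the index sets and assumes the rows were shuffled beforehand; the function name and the choice of 3 folds over 10 rows are illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Averaging a metric over the folds gives a steadier estimate than one split, while the held-out test set stays untouched until the final evaluation.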

 

Project Report (10 points): 

You are required to document your project; the documentation can be included in the notebook itself. Don't forget to include each team member's contribution in the documentation. Include visualizations to support your points. You should also prepare a PowerPoint presentation, which can help you during the demo. 

 

Demo (5 points): 

Sign up for a Zoom session with the mentor to present your project. All the team members should be present during the demo. Be prepared to answer questions related to your work. You should present your findings for the project, and you should also be able to run your code.  


 

 

Project #6: Covid-19 in Germany Analysis (2-3 people)
Coronavirus disease 2019 (COVID-19) is a contagious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease has spread worldwide, leading to an ongoing pandemic. In this project, you will explore COVID-19 cases in Germany.

 

Datasets:

The “Covid-germany.zip” file contains the dataset to be used for this project, and the column names should be self-explanatory. You must split the dataset yourself into training, testing, and cross-validation data (when required). 

Here is additional data that might be helpful for this project:

●        https://github.com/GoogleCloudPlatform/covid-19-open-data

 

EDA (10 points): 

Get familiar with the dataset and decide what features and observations will be useful. Make good use of visualizations. Specific tasks may include but are not limited to: 

●        Clean the dataset and remove outliers before any data merging and analysis. Explain what you did.

●        What is the COVID-19 case trend in Germany, and how does it differ across states/counties? Which state/county has the highest/lowest rate of increase?

●        What is the COVID-19 death rate trend in Germany, and how does it differ across states/counties? Which state/county has the highest/lowest rate of increase?

●        Which age/gender group has the most positive COVID-19 cases?

●        Which age/gender group has the most COVID-19 deaths?

●        What contributes to the spread of COVID-19 cases in Germany? (Additional datasets will probably be helpful.)
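
Trend questions like these usually come down to a few small calculations, sketched below. The sketch assumes you have built a cumulative daily case series per region, which may require aggregating the raw data first; the function names and all numbers are invented.

```python
def daily_new(cumulative):
    """New cases per day from a cumulative series."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def growth_rates(cumulative):
    """Day-over-day relative growth; None where the previous value is 0."""
    return [
        (later - earlier) / earlier if earlier else None
        for earlier, later in zip(cumulative, cumulative[1:])
    ]


def case_fatality_rate(deaths, cases):
    """Deaths divided by confirmed cases; 0.0 when there are no cases."""
    return deaths / cases if cases else 0.0


# Invented cumulative case counts for one region over four days.
cum_cases = [100, 110, 132, 132]
print(daily_new(cum_cases))           # [10, 22, 0]
print(growth_rates(cum_cases))
print(case_fatality_rate(3, 120))     # 0.025
```

Computing these per state or county and then comparing the series side by side answers the "highest/lowest rate of increase" questions directly.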

 

Modeling and Question Answering (10 points): 

Apply three machine learning algorithms to explore whether it is possible to predict whether a COVID-19 patient will survive. Train your models on the training set, and make predictions for the test set with the “death” column dropped. Evaluate the accuracy of your predictions. If your predictions are not very accurate, what do you think is the reason? Use other evaluation metrics to evaluate your models (Precision, Recall, F-score). Split the data further to include a cross-validation set. Did this improve your model’s performance on the test set?  

 

Project Report (10 points): 

You are required to document your project; the documentation can be included in the notebook itself. Don't forget to include each team member's contribution in the documentation. Include visualizations to support your points. You should also prepare a PowerPoint presentation, which can help you during the demo. 

 

Demo (5 points): 

Sign up for a Zoom session with the mentor to present your project. All the team members should be present during the demo. Be prepared to answer questions related to your work. You should present your findings for the project, and you should also be able to run your code.  

