STA4102 Assignment 3: SQLite or SAS (with PROC SQL) Solution

This work is to be done individually and submitted through Webcourses.
There are two datasets, GDP.csv and GEP.csv. The files contain country names and numerical data, and they come from the World Bank DataBank. GDP is the gross domestic product for each country, and GEP is the Global Economic Prospect for each country over a set of years, covering past values and predictions into the future. The headers have been removed. For GDP.csv they are: Country code, Country name, gdp
For GEP.csv they are:
Country Name, CountryCode, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
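Because the files have no header row, the column names have to be supplied at load time. The following is a minimal sketch of one way to do that with Python's standard `sqlite3` and `csv` modules. The table name, the column names, the `y` prefix on the year columns, and the helper `load_headerless_csv` are all assumptions made for illustration, and the two sample rows are made-up placeholders rather than World Bank figures. The sketch also assumes every row has exactly as many fields as there are declared columns, so the GEP column list may need extending if the real file holds more years than those listed.

```python
import csv
import io
import sqlite3

# Hypothetical (name, type) pairs; the CSV files themselves carry no header row.
GDP_COLUMNS = [("country_code", "TEXT"), ("country_name", "TEXT"), ("gdp", "REAL")]
# Year columns get a "y" prefix because bare numbers make awkward SQL identifiers.
GEP_COLUMNS = [("country_name", "TEXT"), ("country_code", "TEXT")] + [
    (f"y{year}", "REAL") for year in range(2001, 2012)
]


def load_headerless_csv(conn, table, columns, handle):
    """Create `table` with the given columns and insert every CSV row from `handle`."""
    column_defs = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')
    placeholders = ", ".join("?" for _ in columns)
    rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    # REAL columns convert numeric-looking text to numbers on insert.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Made-up placeholder rows standing in for open("GDP.csv", newline="").
sample_gdp = io.StringIO("AAA,Country A,100.0\nBBB,Country B,300.0\n")
conn = sqlite3.connect(":memory:")
loaded = load_headerless_csv(conn, "gdp", GDP_COLUMNS, sample_gdp)
```

Declaring the numeric columns as `REAL` matters here: if the values were stored as text, a later comparison against `AVG(gdp)` would not compare them numerically.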
Question 1 (20 pnts)
a) Load the data from the CSVs into SQLite, with each CSV going into a new table. You can handle the column names as you see fit. (4 pnts)
b) Produce a histogram of the GDP values in the dataset after extracting the values from the database. (1 pnt)
c) Use SQL to select from GEP the countries (United States, USA), (Greece, GRC), (China, CHN), (United Kingdom, UK), (Argentina, ARG) and print the mean GEP for each. Then plot the GEP values over the years (5 countries will produce 5 lines). (2 pnts)
d) Select the countries which have an above average GDP using SQL. (1 pnt)
e) Select all the countries whose country name starts with the letter ‘G’. (1 pnt)
f) Join the 2 tables together using SQL. (3 pnts)
g) Using the data from the joined table as training data, you will try to predict the GEP values from the file GEPsupplementRecent.csv, which now has a recent addition of GEP information. Using a subset of the rows for training and the remaining rows for testing, calculate the MSE or RMSE for your model predictions. Use 2 different models (e.g. linear regression and random forests) and then compare the MSE or RMSE values. (5 pnts)
h) If you remove the countries with below-average GDP, how does the RMSE or MSE change, and what does that say about the predictive quality of the models? Discuss. (3 pnts)
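The SQL parts of the question (the per-country mean, the above-average filter, the name prefix match, and the join) can be sketched against a toy database as follows. This is an illustration under stated assumptions, not a reference solution: the table and column names are hypothetical, only two year columns are included for brevity, and the rows are made-up placeholders rather than World Bank figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gdp (country_code TEXT, country_name TEXT, gdp REAL);
    CREATE TABLE gep (country_name TEXT, country_code TEXT, y2001 REAL, y2002 REAL);
    -- Made-up placeholder rows, not World Bank figures.
    INSERT INTO gdp VALUES ('AAA', 'Alphaland', 10.0),
                           ('BBB', 'Betaland', 30.0),
                           ('GGG', 'Gammaland', 50.0);
    INSERT INTO gep VALUES ('Alphaland', 'AAA', 1.0, 3.0),
                           ('Gammaland', 'GGG', 2.0, 4.0);
""")

# Part (c): mean across the year columns for a chosen set of country codes.
codes = ["AAA", "GGG"]
marks = ", ".join("?" for _ in codes)
mean_gep = conn.execute(
    f"SELECT country_name, (y2001 + y2002) / 2.0 FROM gep "
    f"WHERE country_code IN ({marks}) ORDER BY country_name",
    codes,
).fetchall()

# Part (d): countries whose GDP exceeds the overall average (scalar subquery).
above_avg = conn.execute(
    "SELECT country_name FROM gdp "
    "WHERE gdp > (SELECT AVG(gdp) FROM gdp) ORDER BY country_name"
).fetchall()

# Part (e): names starting with G. SQLite's LIKE ignores ASCII case by default.
starts_with_g = conn.execute(
    "SELECT country_name FROM gdp WHERE country_name LIKE 'G%' ORDER BY country_name"
).fetchall()

# Join on the shared country code. An inner join drops countries that are
# missing from either table.
joined = conn.execute(
    "SELECT g.country_code, g.country_name, g.gdp, p.y2001, p.y2002 "
    "FROM gdp AS g JOIN gep AS p ON g.country_code = p.country_code "
    "ORDER BY g.country_code"
).fetchall()
```

Each result is a list of tuples that can be handed to a plotting or modelling library. Note that a row with a missing year value would yield a NULL mean under this row-wise sum, so real data may need `COALESCE` or a reshaped (long-format) table instead.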
