Starting from:

$25

CS210 - Homework Assignment 3 - EU Cities Temperaturs - German Credit - Google Playst - Solved

Problem 1: EU Cities Temperatures Dataset (55 points)
Given a CSV data file as represented by the sample file EuCitiesTemperatures.csv   

(https://rutgers.instructure.com/courses/133454/files/20252138/download?download_frd=1)  (213 records), load it into a Pandas DataFrame and perform the following tasks on it.

Important: Your code should be applicable to any extension of this sample, so make sure you don't hardcode anything that applies only to the values in this dataset.

Preprocessing/Analysis (28 pts)
1.  [9 pts] Fill in the missing latitude and longitude values by calculating the average for that country.

Round the average to 2 decimal places.

2.  [9 pts] Find out the subset of cities that lie between latitudes 40 to 60 (both inclusive) and longitudes 15 to 30 (both inclusive). Find out which countries have the maximum number of cities in this geographical band. (More than one country could have the maximum number of values.)

3.  [10 pts] Fill in the missing temperature values by the average temperature value of the similar region type. A region type would be a combinaton of whether it is in EU (yes/no) and whether it has a coastline (yes/no).



For example, if we have a missing temperature value for Bergen, Norway, which is not in the EU but lies on the coast, we will fill it with the average temperature of cities with EU='no' and coastline='yes')

Visualization (27 pts)
For all plots, make sure to label the axes, and set appropriate tick labels.

1.  [6 pts] Plot a bar chart for the number of cities belonging to each of the regions described in Preprocessing/Analysis #3 above.

2.  [7 pts] Plot a scatter plot of latitude (y-axis) v/s longitude (x-axis) values to get a map-like visual of the cities under consideration. All the cities in the same country should have the same color.

3.  [6 pts] The population column contains values unique to each country. So two cities of the same country will show the same population value. Plot a histogram of the number of countries belonging to each population group: split the population values into 5 bins (groups).

4.  [8 pts] Plot subplots (2, 2), with proper titles, one each for the region types described in Preprocessing/Analysis #3 above.

Each subplot should be a scatter plot of Latitude (y-axis) vs. City (x-axis), where the color of the plot points should be based on the temperature values: ‘red’ for temperatures above 10, ‘blue’ for temperatures below 6 and ‘orange for temperatures between 6 and 10 (both inclusive). For each subplot, set xticks to an array of numbers from 0 to n-1 (both inclusive), where n is the total number of cities in each region type. This represents each city as a number between 0 and n-1.

 

Problem 2: German Credit Dataset (72 points)
Given a CSV data file as represented by the sample file GermanCredit.csv   

(https://rutgers.instructure.com/courses/133454/files/20252574/download?download_frd=1)  (1000 records), load it into a Pandas DataFrame, and perform the following tasks on it.

Important: Your code should be applicable to any extension of this sample, so make sure you don't hardcode anything that applies only to the values in this sample.

Preprocessing (31 pts)
1. [8 pts] Drop the 3 columns that contribute the least to the dataset. These would be the columns with the highest number of non-zero 'none' values. Break ties by going left to right in columns. (Your code should be generalizable to drop n columns, but for the rest of the analysis, you can call your code for n=3.)

2. [4 pts] Certain values in some of the columns contain unnecessary apostrophes (‘). Remove the apostrophes.

3.  [5 pts] The checking_status column has values in 4 categories: 'no checking', '<0', '0<=X<200', and '>=200'. Change these to 'No Checking', 'Low', 'Medium', and 'High' respectively.

4.  [5 pts] The savings_status column has values in 4 categories: 'no known savings', '<100', '100<=X<500', '500<=X<1000', and '>=1000'. Change these to 'No Savings', 'Low', 'Medium', 'High', and 'High' respectively. (Yes, the last two are both 'High').

5.  [4 pts] Change class column values from 'good' to '1' and 'bad' to '0'

6.  [5 pts] Change the employment column value 'unemployed' to 'Unemployed', and for the others, change to 'Amateur', 'Professional', 'Experienced' and 'Expert', depending on year range.

Analysis (17 pts)
For the following tasks, do preprocessing or changing of data types in the data frame as required.

1.  [5 pts] Often we need to find correlations between categorical attributes, i.e. attributes that have values that fall in one of several categories, such as "yes"/"no" for attr1, or "low","medium","high" for attr2.

One such correlation is to find counts in combinations of categorial values across attributes, as in how many instances are "yes" for attr1 and "low" for attr2. A good way to find such counts is to use

the Pandas crosstab  (https://pandas.pydata.org/pandasdocs/stable/reference/api/pandas.crosstab.html)  function. Do this for the following two counts.

a.  [3 pts] Get the count of each category of foreign workers (yes and no) for each class of credit (good and bad).

b.  [2 pts] Similarly, get the count of each category of employment for each category of saving_status.

 

2.  [4 pts] Find the average credit_amount of single males that have 4<=X<7 years of employment. You can leave the raw result as is, no need for rounding.

3.  [4 pts] Find the average credit duration for each of the job types. You can leave the raw result as is, no need for rounding.

4.  [4 pts] For the purpose 'education', what is the most common checking_status and savings_status?

Your code should print:

 

Visualization (24 pts)


 

1.  [9 pts] Plot subplots of two histograms: one with savings_status on the x-axis and personal_status as different colors, and another with checking_status on the x-axis and personal_status as different colors.

2.  [9 pts] For people having credit_amount more than 4000, plot a bar graph which maps property_magnitude (x-axis) to the average customer age for that magnitude (y-axis).

3.  [6 pts] For people with a "High" savings_status and age above 40, use subplots to plot the following pie charts:

a.  Personal status

b.  Credit history

c.  Job

 

Problem 3: Google Playstore Apps Dataset (63 points)
Given an Excel data file as represented by the sample file GooglePlaystore.xlsx   

(https://rutgers.instructure.com/courses/133454/files/20252996/download?download_frd=1)  (10K records),

load it into a Pandas DataFrame (use the Pandas read_excel method), and perform the following tasks on it.

Important: Your code should be applicable to any extension of this sample, so make sure you don't hardcode anything that applies only to the values in this sample.

Preprocessing (28 pts)
 

1.  [3 pts] Often there are outliers which do not match the overall data type. There is one record in this data where the "Reviews" has value "3.0M" which does not match the rest of the data. Remove that record.

2.  [4 pts] Remove rows where any of the columns has the value "Varies with device".

3.  [5 pts] The values in the Android version column should be floats. Strip the trailing non-numeric characters from all values (ie. the words " and up"), so the result is a number. If there are multiple decimal places (eg. "x.y.z"), keep only the first two parts (eg "x.y"). For example, the value "4.1 and up" should be changed to "4.1". The value "4.5.6 and up" should be changed to "4.5". The value "5.6.7" should be changed to "5.6".

4.  [5 pts] The "Installs" column must have integer values. For values that have commas, remove the

commas. For values that have a '+' at the end, remove the '+'. Keep only those rows that have an integer value after these edits.

5.  [5 pts] For missing rating values, if the number of reviews is less than 100 and installations is less than 50000, remove the row. Else, fill the missing value with the average value (rounded to 2 decimal places) for the Category of that row.

6.  [6 pts] Preprocess the Size column to convert the "M" (millions) and "K" (thousands) values into integers. For instance, 8.7M should be converted to 8700000 and 2.4K should be converted to 2400.

Analysis (19 pts)
For the following tasks, do preprocessing or changing of data types in the data frame as required.

1.  [4 pts] Describe (use DataFrame describe method) the category wise rating statistics. In other words, for each category, describe the statistics (count, mean, etc.) for ratings in that category.

2.  [11 pts] Extract all "Free" apps from the master data frame. Then write a function that, given a numeric column e.g 'Rating'), will create and return a dataframe for the top 3 free applications in each category based on that column. Call the function on each of these columns:

a.  Rating (gives top 3 most highly rated applications in each category)

b.  Installs (gives top 3 most installed applications in each category)

c.  Reviews (gives top 3 most reviewed applications in each category)

You don't need to do anything explicit to break ties.

Each of the returned dataframes have Category and App for the first two columns, and one of Rating (for a.), Installs (for b.), and Reviews (for c.) as the third column.

3.  [4 pts] Find the average, maximum and minimum price of the paid applications.

 

Visualization (16 pts)
1.  [9 pts] In the genre column, break the string of genres into a list. For example, ‘Art & Design; Creativity’ should be [‘Art & Design’, Creativity’].

Count the number of applications per genre and display it using a pie chart. Hint: Read about DataFrame.explode()

2.  [7 pts] Display a box plot of ratings for "Business" and "Education" categories. The boxplots should be in the same plot.

 

More products