Assignment Overview:
Pollution affects everyone, and standards of living, health, and environmental consciousness are a concern of many. Pollution data analysis is therefore of key interest to policy makers, scientists, and activists, among others. Analyzing pollution over a period of time gives a better understanding of trends and pollution effects, which in turn influences policy making, manufacturers, and investment decisions. In this project you are given a dataset documented by the U.S. EPA [1] containing daily records of pollution within the U.S. from 2000 until 2016. Four of the major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide, and Ozone) are documented. Given this data, we will create a program to perform a simple analysis of the data. We will determine the trends in pollutants, times, and locations. The details of the project can be found below.
Dataset:
Download pollution_tiny.csv and pollution_small.csv to access the dataset. The data is organized as follows [2]:
• Record ID: a unique record identifier
• State Code: The code allocated by the US EPA to each state
• County Code: The code of counties in a specific state allocated by the US EPA
• Site Num: The site number in a specific county allocated by the US EPA
• Address: Address of the monitoring site
• State: State of the monitoring site
• County: County of the monitoring site
• City: City of the monitoring site
• Date Local: Date of monitoring
The four pollutants (NO2, O3, SO2 and CO) each have 5 specific columns. For example, for NO2:
• NO2 Units : The unit type measured for NO2
• NO2 Mean : The arithmetic mean of concentration of NO2 within a given day
• NO2 1st Max Value : The maximum value obtained for NO2 concentration in a given day
• NO2 1st Max Hour : The hour when the maximum NO2 concentration was recorded in a given day
• NO2 AQI : The calculated air quality index of NO2 within a given day
Limitations: It is important to note that this dataset does not provide data for every year consistently for a given state. This greatly impacts the accuracy of our results. However, implementing the project still provides good practice in using dictionaries over a large dataset.
Functions:
The following are the functions that you are required to implement:
open_file() : This function takes no parameters. Its purpose is to prompt for a file name and attempt to open that file. If the file cannot be opened, re-prompt. Return the file pointer once the file has been opened successfully.
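The re-prompt behavior described for open_file() could be sketched roughly as below. This is a minimal sketch, not the required solution: the prompt text is taken from the sample runs later in this handout, while the error message is an assumption, since the handout does not specify one.

```python
def open_file():
    """Prompt for a file name until a file opens; return the file pointer."""
    while True:
        name = input("Input a file name: ")
        try:
            return open(name, "r")
        except FileNotFoundError:
            # Assumed wording: the handout does not give the error text.
            print("Unable to open file. Please try again.")
```

Returning from inside the loop ends the re-prompting as soon as one open() call succeeds.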
read_file( fp ) : This function takes the file pointer received from open_file() as a parameter and returns a dictionary which contains all data. Use csv reader to read the file (see notes below). Please note that you must use dictionaries for this project. You may organize your data as follows, where the states are keys and all other data is stored within a list. Because there are multiple records for any given state, each state has multiple record values, represented by r1-r3 in the example below, which are lists in their own right. The example below shows Michigan with 3 records; the contents of each record are listed by title, as shown for r1. Note that we do not record the first four values in a line (Record ID, State Code, County Code, Site Num) or the Units values in the records you return.
{Michigan: [ r1, r2, r3 ] }
r1 = [city, date, no2mean, o3mean, so2mean, comean]
Ignore two types of records:
(i) Ignore records with a blank AQI value for any pollutant: the entire record is considered to be invalid.
(ii) Ignore duplicate records: records that have the same city and date. Keep the first such record in the file and discard any subsequent duplicates. Assume that duplicate records will be in adjacent lines in the file—otherwise your program will run too long and timeout in Mimir. See notes below.
Convert all mean values to floats. Do not round any values.
Important: all mean values should be in parts per billion so if the units are parts per million you need to multiply the value by 1000.
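The blank-AQI rule and the unit conversion above might be handled with two small helpers like the following. Both helper names are hypothetical, and the sketch assumes the Units column contains the word "million" for parts-per-million values (for example "Parts per million"); check the actual strings in your data file.

```python
def has_blank_aqi(aqi_texts):
    """Return True if any of the AQI strings is blank (record is invalid)."""
    return any(text.strip() == "" for text in aqi_texts)


def to_ppb(mean_text, units_text):
    """Convert a mean read from the file to a float in parts per billion."""
    value = float(mean_text)
    # Assumption: ppm units are spelled with the word "million".
    if "million" in units_text.lower():
        value *= 1000
    return value
```

No rounding is applied, so float artifacts such as 25.666999999999998 in the expected test data are reproduced as-is.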
total_years( D, state ) : This function takes the dictionary returned from the read_file() function and a state name, which you should prompt the user to input in the main function. It calculates the total pollution for each of the 4 pollutants for the given state over the course of 16 years. For each year, sum the daily pollution mean values (e.g. NO2 mean) to get that year's total for each pollutant. This function should return a list containing the yearly total of the daily means for each pollutant, as well as the maximum and minimum pollution values. The list is ordered by year. We will use the list for plotting, so structure it as follows (where zeros represent the values of NO2, O3, SO2, and CO respectively, and we have 16 lists to represent the totals for each year):
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0,
0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0,
0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
You need to find the maximum and minimum total pollution values over all types of pollutants, which will result in one max value and one min value.
In summary, you return a tuple with three items: (list, float, float), specifically (one list (containing 16 lists of totals), one max value, one min value).
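The yearly summation could be sketched as follows. The helper name yearly_totals and its first_year and n_years parameters are hypothetical and are used here only so the sketch takes no position on the exact list length; set them to whatever your test cases expect. The sketch also assumes dates look like '3/31/2000', as in the test data in this handout.

```python
def yearly_totals(records, first_year, n_years):
    """Sum the four daily means per year; return (totals, max, min)."""
    totals = [[0, 0, 0, 0] for _ in range(n_years)]
    for city, date, *means in records:
        year = int(date.split("/")[-1])  # assumes month/day/year dates
        index = year - first_year
        if 0 <= index < n_years:
            for i, value in enumerate(means):
                totals[index][i] += value
    # One max and one min over every pollutant and every year.
    flat = [value for row in totals for value in row]
    return totals, max(flat), min(flat)


# Michigan records from Function Test 2, with a shortened 3-year window.
records = [['Detroit', '3/31/2000', 34.708333, 0.002, 4.916667, 0.9],
           ['Detroit', '4/1/2000', 37.666667, 0.01975, 9.333333, 0.325],
           ['Detroit', '4/2/2000', 25.833333, 0.025667, 3.958333, 0.275]]
totals, maxval, minval = yearly_totals(records, first_year=2000, n_years=3)
```

Years with no records keep their initial zeros, which is why the minimum in the sample tests is 0.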
cities( D, state, year ) : This function takes the dictionary returned from the read_file() function, a state name, and a year (int), which you should prompt the user to input in the main function. That is, the parameters are a dictionary, a string and an int. This function finds the total pollution per city for the given state and year. For the specified state, extract the state's cities and the corresponding pollution data for each city (the mean pollution values are summed, as in the previous function, total_years). Return a dictionary that maps each city to its pollution data. Your dictionary must be structured as follows: {City: [NO2,O3,SO2,CO] } That is, the key is a city and the value is a list of summed pollution data for the city.
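One possible way to accumulate per-city sums in a dictionary is sketched below. It is an illustration under the assumption that records have the layout [city, date, no2mean, o3mean, so2mean, comean] with month/day/year dates, not the required implementation.

```python
def cities(D, state, year):
    """Return {city: [NO2, O3, SO2, CO]} sums for one state and year."""
    result = {}
    for city, date, *means in D[state]:
        if int(date.split("/")[-1]) != year:
            continue  # record belongs to a different year
        # Create the city's running sums the first time the city is seen.
        sums = result.setdefault(city, [0, 0, 0, 0])
        for i, value in enumerate(means):
            sums[i] += value
    return result


# Data from Function Test 4.
D = {'Michigan': [['Detroit', '3/31/2000', 34.708333, 0.002, 4.916667, 0.9],
                  ['Detroit', '4/1/2000', 37.666667, 0.01975, 9.333333, 0.325],
                  ['Detroit', '4/2/2000', 25.833333, 0.025667, 3.958333, 0.275]],
     'Maine': [['Presque Isle', '1/1/2006', 2.808696, 0.030667, 2.943478, 0.2],
               ['Presque Isle', '1/2/2006', 3.556522, 0.025375, 1.713043, 0.2]]}
D_cities = cities(D, 'Michigan', 2000)
```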
months( D, state, year ) : This function takes the dictionary returned from the read_file() function, a state name (string), and a year (int), which you should prompt the user to input in main. This function finds the top 5 months with the greatest total pollution for each pollutant. You must return 4 sorted lists, where each list holds the 5 largest monthly totals for one pollutant, largest first. The order of the lists is NO2, O3, SO2, CO. For sorting you may find itemgetter useful, as you did in Project 7, and remember to sort largest to smallest. An easy way to do this function is to sort and then slice off the top five.
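The "sort and then slice off the top five" idea might look like the sketch below. It assumes month/day/year dates and keeps twelve running totals per pollutant; since only the totals are returned, plain sorted() is used here instead of itemgetter, which is a design choice of this sketch rather than a requirement.

```python
def months(D, state, year):
    """Return four lists of the 5 largest monthly totals, largest first."""
    # One row of 12 month totals for each of NO2, O3, SO2, CO.
    monthly = [[0] * 12 for _ in range(4)]
    for city, date, *means in D[state]:
        month, _day, record_year = (int(part) for part in date.split("/"))
        if record_year != year:
            continue
        for i, value in enumerate(means):
            monthly[i][month - 1] += value
    # Sort each pollutant's totals largest to smallest, then keep the top 5.
    return tuple(sorted(row, reverse=True)[:5] for row in monthly)


# Michigan records from Function Test 1 (values already in ppb).
D = {'Michigan': [['Detroit', '3/31/2000', 34.708333, 2.0, 4.916667, 900.0],
                  ['Detroit', '4/1/2000', 37.666667, 19.75, 9.333333, 325.0],
                  ['Detroit', '4/2/2000', 25.833333, 25.666999999999998, 3.958333, 275.0]]}
top_months = months(D, 'Michigan', 2000)
```

Months without data contribute zeros, which matches the 0.00 rows in the sample "Top Months" output.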
display( totals_list, maxval, minval, D_cities, top_months ) : This function displays the values calculated in the functions described above.
plot_years( list, minval, maxval ) : This function is provided for you and will plot the total average concentrations for each pollutant for a given state over 16 years. You should pass the list, minvalue and maxvalue returned from the total_years() function as parameters.
main(): Within this function you should call open_file() and read_file() to set up your data. Then, you should continuously carry out the following actions until the user enters quit/Quit.
• Prompt for a state and year in that order (‘quit’ to quit)
• Calculate total_years()
• Find top cities by cities()
• Find top months with months()
• Display results
• Prompt (yes/no) to generate a graph with plot_years()
Be sure to check whether the state name is valid, i.e. it is in the dictionary. If it is not in the dictionary, output an error message and re-prompt until the user enters a valid state.
The user must be able to enter ‘quit’ at either the state or the year prompt. Furthermore, if ‘quit’ is entered as a state, the program must stop without unnecessarily prompting for a year.
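The prompting rules above could be isolated in a helper such as the one sketched here. The helper name prompt_state_and_year and its ask parameter are hypothetical (ask defaults to input and exists only so the logic can be exercised without a keyboard); the prompt and error strings are copied from the sample runs in this handout.

```python
def prompt_state_and_year(D, ask=input):
    """Return (state, year), or None if the user quits at either prompt."""
    while True:
        state = ask("Enter a state ('quit' to quit): ")
        if state.lower() == "quit":
            return None  # stop without asking for a year
        if state in D:
            break
        print("Invalid state.")
    year = ask("Enter a year ('quit' to quit): ")
    if year.lower() == "quit":
        return None
    return state, int(year)
```

A main() loop could call this helper repeatedly and stop as soon as it returns None.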
Notes:
• Use the csv reader. Using the csv reader each line read from the file will be a list of items. That is, the split() has already been done. We need to use the csv reader because the data file is messy so simply reading strings and splitting them on commas doesn’t work.
o import csv # place this at the top of your file
o reader = csv.reader(fp) # using the file pointer fp from open_file()
o header = next(reader,None) # how to skip a line
o for line_list in reader: # how you loop through the file
• How to check for duplicate cities and dates in read_file() assuming that they only occur in adjacent lines in the file. Create two variables to hold the previous values of city and date. Let’s name them previous_city and previous_date.
o previous_city, previous_date = "","" # init before loop
o if #some Boolean using previous_city and previous_date
o # after using previous_city and previous_date set them to current city and date
o previous_city = current_city
o previous_date = current_date
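The two notes above can be combined into one small sketch. The sample rows below are made up and have only three columns, so the column positions used here do not match the real data files; the point is only to show that csv.reader keeps a quoted field containing a comma in one piece, and how the previous_city and previous_date variables skip adjacent duplicates.

```python
import csv
import io

# Made-up sample (header + 4 records); the real files have many more columns.
SAMPLE = '''Address,City,Date Local
"1 Main St, Suite 2",Detroit,3/31/2000
"1 Main St, Suite 2",Detroit,3/31/2000
"1 Main St, Suite 2",Detroit,4/1/2000
"9 Elm St",Lansing,4/1/2000
'''


def unique_rows(fp):
    """Return the rows of fp, skipping adjacent city/date duplicates."""
    reader = csv.reader(fp)
    next(reader, None)  # skip the header line
    previous_city, previous_date = "", ""  # init before loop
    kept = []
    for line_list in reader:
        current_city, current_date = line_list[1], line_list[2]
        is_duplicate = (current_city == previous_city
                        and current_date == previous_date)
        # Remember this line's values for the next comparison.
        previous_city, previous_date = current_city, current_date
        if is_duplicate:
            continue
        kept.append(line_list)
    return kept


rows = unique_rows(io.StringIO(SAMPLE))
```

Splitting the first record on commas would break the address into two items, which is why the plain split() approach does not work on this file.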
Examples & Test Cases:
Function Test 1: read_file()
import csv
fp = open("pollution_tiny.csv")
D_student = read_file(fp)
D_instructor = {'Michigan': [['Detroit', '3/31/2000', 34.708333, 2.0,
4.916667, 900.0], ['Detroit', '4/1/2000', 37.666667, 19.75, 9.333333, 325.0],
['Detroit', '4/2/2000', 25.833333, 25.666999999999998, 3.958333, 275.0]],
'Maine': [['Presque Isle', '1/1/2006', 2.808696, 30.667, 2.943478, 200.0],
['Presque Isle', '1/2/2006', 3.556522, 25.375, 1.713043, 200.0]]}
Function Test 2: total_years()
D = {'Michigan': [['Detroit', '3/31/2000', 34.708333, 0.002, 4.916667, 0.9],
['Detroit', '4/1/2000', 37.666667, 0.01975, 9.333333, 0.325], ['Detroit', '4/2/2000', 25.833333, 0.025667, 3.958333, 0.275]], 'Maine': [['Presque Isle', '1/1/2006', 2.808696, 0.030667, 2.943478, 0.2], ['Presque Isle',
'1/2/2006', 3.556522, 0.025375, 1.713043, 0.2]]}
T_student = total_years(D,'Michigan')
T_instructor= ([[98.208333, 0.047417, 18.208333, 1.5], [0, 0, 0, 0], [0, 0,
0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0,
0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 98.208333, 0)
Function Test 3: total_years() reading from file
fp = open("pollution_small.csv")
D = read_file(fp)
T_student = total_years(D,'Michigan')
T_instructor = ([[4082.771702, 4206.028000000003, 975.7437869999992,
50522.54300000002], [3348.641588000001, 4125.926, 796.8429829999999,
58442.695999999996], [6414.392806000004, 10074.466999999995, 1209.747425,
130065.49799999999], [6058.463629, 9388.806999999993, 1041.455613,
138250.002], [4996.965389000002, 8887.057000000008, 824.6742960000004,
122338.31200000002], [5108.685554999998, 9692.338000000005, 885.188937,
106651.459], [4408.862753, 9565.044000000009, 675.2983520000001,
111601.37800000004], [8.125, 21.0, 2.333333, 229.167], [0, 0, 0, 0], [0, 0,
0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
[0, 0, 0, 0], [0, 0, 0, 0]], 138250.002, 0)
Function Test 4: cities()
D = {'Michigan': [['Detroit', '3/31/2000', 34.708333, 0.002, 4.916667, 0.9],
['Detroit', '4/1/2000', 37.666667, 0.01975, 9.333333, 0.325], ['Detroit', '4/2/2000', 25.833333, 0.025667, 3.958333, 0.275]], 'Maine': [['Presque Isle', '1/1/2006', 2.808696, 0.030667, 2.943478, 0.2], ['Presque Isle',
'1/2/2006', 3.556522, 0.025375, 1.713043, 0.2]]}
D_student = cities(D,'Michigan',2000)
D_instructor= {'Detroit': [98.208333, 0.047417, 18.208333, 1.5]}
Function Test 5: months()
fp = open("pollution_small.csv")
D = read_file(fp)
T_student = months(D,'Michigan',2005)
T_instructor = ([1007.4459379999998, 910.981578, 898.3775449999999, 886.9802479999998, 867.4222229999997], [2099.1289999999995,
1862.3470000000002, 1733.1389999999997, 1471.5560000000005, 1360.181], [208.13328399999997, 179.59141600000007, 160.52126900000005,
142.36984600000002, 110.60325999999998], [21283.334999999992, 19275.002,
19261.305, 18566.390000000003, 17852.926])
Test 1
Input a file name: pollution_small.csv
Enter a state ('quit' to quit): Michigan
Enter a year ('quit' to quit): 2006
Max and Min pollution
Minval Maxval
0.00 138250.00
Pollution totals by year
Year NO2 O3 SO2 CO
2001 4082.77 4206.03 975.74 50522.54
2002 3348.64 4125.93 796.84 58442.70
2003 6414.39 10074.47 1209.75 130065.50
2004 6058.46 9388.81 1041.46 138250.00
2005 4996.97 8887.06 824.67 122338.31
2006 5108.69 9692.34 885.19 106651.46
2007 4408.86 9565.04 675.30 111601.38
2008 8.12 21.00 2.33 229.17
Pollution by city
City NO2 O3 SO2 CO
Grand Rapids 2154.55 5455.61 144.00 64253.27
Detroit 2254.31 4109.43 531.30 47348.11
Top Months
NO2 O3 SO2 CO
851.79 2163.11 186.99 20743.06
776.95 1857.75 135.16 19833.33
763.79 1780.62 101.77 19662.50
723.59 1400.58 95.73 19175.00
712.19 1342.45 92.24 16266.67
Do you want to plot (yes/no)? no
Enter a state ('quit' to quit): quit
Test 2 (Error check 1)
Input a file name: pollution_tiny.csv
Enter a state ('quit' to quit): xxx
Invalid state.
Enter a state ('quit' to quit): Vermont
Invalid state.
Enter a state ('quit' to quit): Michigan
Enter a year ('quit' to quit): quit
Test 3 (Error check 2)
Input a file name: pollution_tiny.csv
Enter a state ('quit' to quit): quit
Test 4 (Error check 3)
Input a file name: pollution_tiny.csv
Enter a state ('quit' to quit): Michigan
Enter a year ('quit' to quit): 2000
Max and Min pollution
Minval Maxval
0.00 1500.00
Pollution totals by year
Year NO2 O3 SO2 CO
2001 98.21 47.42 18.21 1500.00
Pollution by city
City NO2 O3 SO2 CO
Detroit 98.21 47.42 18.21 1500.00
Top Months
NO2 O3 SO2 CO
63.50 45.42 13.29 900.00
34.71 2.00 4.92 600.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
Do you want to plot (yes/no)? no
Enter a state ('quit' to quit): Idaho
Invalid state.
Enter a state ('quit' to quit): Maine
Enter a year ('quit' to quit): quit
Test 5: blind test
Test 6 (plot: not on Mimir)
Input a file name: pollution_small.csv
Enter a state ('quit' to quit): Michigan
Enter a year ('quit' to quit): 2007
Max and Min pollution
Minval Maxval
0.00 138250.00
Pollution totals by year
Year NO2 O3 SO2 CO
2001 4082.77 4206.03 975.74 50522.54
2002 3348.64 4125.93 796.84 58442.70
2003 6414.39 10074.47 1209.75 130065.50
2004 6058.46 9388.81 1041.46 138250.00
2005 4996.97 8887.06 824.67 122338.31
2006 5108.69 9692.34 885.19 106651.46
2007 4408.86 9565.04 675.30 111601.38
2008 8.12 21.00 2.33 229.17
Pollution by city
City NO2 O3 SO2 CO
Grand Rapids 8.12 21.00 2.33 229.17
Top Months
NO2 O3 SO2 CO
8.12 21.00 2.33 229.17
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
Do you want to plot (yes/no)? yes
Enter a state ('quit' to quit): quit
Grading Rubric
General Requirements:
( 5 pts) Coding Standard 1-9 (descriptive comments, function headers, etc...)
Implementation:
( 2 pts) open_file function (no Mimir test)
( 8 pts) read_file function
( 4 pts) total_years function
( 4 pts) cities function
( 4 pts) months function
( 7 pts) Pass Test1
( 2 pts) Pass Test2
( 2 pts) Pass Test3
( 2 pts) Pass Test4
( 5 pts) Pass Test5 (Blind Test)
( 5 pts) Pass Test6 (Plotting – No Mimir test)
References:
[1] https://www.epa.gov/
[2] https://www.kaggle.com/sogun3/uspollution