Listing 1 shows a sample submission skeleton that you can use as a starting point for this assignment.
Listing 1: Sample Submission Skeleton
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing

path = "./data/"

# Helpful functions

# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Encode text values to a single dummy variable. The new columns (which do not
# replace the old) will have a 1 at every location where the original column
# (name) matches each of the target_values. One column is added for each
# target value.
def encode_text_single_dummy(df, name, target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x) == str(tv) else 0 for x in l]
        name2 = "{}-{}".format(name, tv)
        df[name2] = l


# Encode text values to indexes (i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_


# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd


# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)


# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)


# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column. Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    else:
        # Regression
        return df[result].values.astype(np.float32), df[[target]].values.astype(np.float32)


# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()


# Remove all rows where the specified column is +/- sd standard deviations
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)


# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    if data_low is None:
        data_low = min(df[name])
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
        * (normalized_high - normalized_low) + normalized_low


# Solution

def encode_toy_dataset(filename):
    df = pd.read_csv(filename, na_values=['NA', '?'])
    encode_numeric_zscore(df, 'length')
    encode_numeric_zscore(df, 'width')
    encode_numeric_zscore(df, 'height')
    encode_text_dummy(df, 'metal')
    encode_text_dummy(df, 'shape')
    return df


# Encode the toy dataset
def question1():
    print()
    print("***Question 1***")

    path = "./data/"

    filename_read = os.path.join(path, "toy1.csv")
    filename_write = os.path.join(path, "submit-jheaton-prog2q1.csv")
    df = encode_toy_dataset(filename_read)  # You just have to implement encode_toy_dataset above
    df.to_csv(filename_write, index=False)
    print("Wrote {} lines.".format(len(df)))


# Model the toy dataset, no cross validation
def question2():
    print()
    print("***Question 2***")


def question3():
    print()
    print("***Question 3***")

    # Z-Score encode these using the mean/sd from the dataset (you got this in question 2)
    testDF = pd.DataFrame([
        {'length': 1, 'width': 2, 'height': 3},
        {'length': 3, 'width': 2, 'height': 5},
        {'length': 4, 'width': 1, 'height': 3}
    ])


def question4():
    print()
    print("***Question 4***")


def question5():
    print()
    print("***Question 5***")


question1()
question2()
question3()
question4()
question5()
Listing 2 shows what the output from this assignment should look like. Your numbers might differ slightly from mine. Every question except 2 also generates an output CSV file. For your submission, include your Jupyter notebook and the generated CSV files that the questions specify. Name each output CSV file something such as submit-jheaton-prog2q1.csv. Submit a ZIP file that contains your Jupyter notebook and 4 CSV files to Blackboard. This will be 5 files total.
Listing 2: Expected Output
***Question 1***
Wrote 10001 lines.
***Question 2***
Epoch 00144: early stopping
Final score (RMSE): 75.46247100830078
***Question 3***
length: (5.5258474152584744, 2.8609014041584113)
width: (5.5340465953404658, 2.8598366585224158)
height: (5.5337466253374661, 2.8719829476156122)
height length width
0 -0.882205 -1.581907 -1.235659
1 -0.185856 -0.882861 -1.235659
2 -0.882205 -0.533338 -1.585338
***Question 4***
Fold #1
Epoch 00060: early stopping
Fold score (RMSE): 0.21216803789138794
Fold #2
Epoch 00061: early stopping
Fold score (RMSE): 0.14340682327747345
Fold #3
Epoch 00028: early stopping
Fold score (RMSE): 0.3336745500564575
Fold #4
Epoch 00058: early stopping
Fold score (RMSE): 0.2133668214082718
Fold #5
Epoch 00077: early stopping
Fold score (RMSE): 0.1796143352985382
Final, out of sample score (RMSE): 0.22570167481899261
***Question 5***
Fold #1
Epoch 00182: early stopping
Fold score: 0.3625
Fold #2
Epoch 00425: early stopping
Fold score: 0.9875
Fold #3
Epoch 00169: early stopping
Fold score: 0.975
Fold #4
Epoch 00111: early stopping
Fold score: 0.8987341772151899
Fold #5
Epoch 00203: early stopping
Fold score: 0.8227848101265823
Final, out of sample score: 0.8090452261306532
Question 1
Use the dataset found here for this question: [click for toy dataset].
Encode the toy1.csv dataset. Generate dummy variables for shape and metal. Encode height, width and length as z-scores. Include, but do not encode, the weight. If this encoding is performed in a function named encode_toy_dataset, you will have an easier time reusing the code from question 1 in question 2.
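As a rough illustration of this encoding, the sketch below applies dummy variables and z-scores to a tiny hand-made frame. The column names follow the assignment, but the rows are invented purely for demonstration; the real work must read toy1.csv instead:

```python
import pandas as pd

# Hypothetical miniature stand-in for toy1.csv (rows are made up).
df = pd.DataFrame({
    "shape": ["box", "sphere", "box"],
    "metal": ["tin", "lead", "tin"],
    "height": [3.0, 5.0, 3.0],
    "width": [2.0, 2.0, 1.0],
    "length": [1.0, 3.0, 4.0],
    "weight": [10.0, 20.0, 15.0],
})

# Dummy-encode shape and metal, replacing the original text columns.
for col in ["shape", "metal"]:
    dummies = pd.get_dummies(df[col], prefix=col, prefix_sep="-")
    df = pd.concat([df.drop(columns=col), dummies], axis=1)

# Z-score encode the numeric dimensions; weight is left untouched.
for col in ["height", "width", "length"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

print(df.columns.tolist())
```

After encoding, each z-scored column has mean 0, and each text value has become its own 0/1 column such as shape-box.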
Write the output to a CSV file that you will submit with this assignment. The CSV file will look similar to Listing 3.
Listing 3: Question 1 Output Sample
Question 2
Use the dataset found here for this question: [click for toy dataset].
Use the encoded dataset from question 1 and train a neural network to predict weight. Use 25% of the data for validation and 75% for training; make sure you shuffle the data. Report the RMSE error for the validation set. No CSV file need be generated for this question.
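The shuffle/split/score mechanics can be sketched as below. This is only a minimal stand-in: the data is synthetic and a closed-form least-squares fit takes the place of the neural network, so only the 75/25 split and RMSE computation carry over to the real solution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the encoded toy dataset: x features, y = weight.
x = rng.normal(size=(100, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Shuffle, then hold out 25% of the rows for validation.
idx = rng.permutation(len(x))
split = int(len(x) * 0.75)
train, valid = idx[:split], idx[split:]

# A linear least-squares fit stands in for the neural network here.
coef, *_ = np.linalg.lstsq(x[train], y[train], rcond=None)
pred = x[valid] @ coef

# RMSE on the held-out validation rows only.
rmse = float(np.sqrt(np.mean((pred - y[valid]) ** 2)))
print("Final score (RMSE): {}".format(rmse))
```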
Question 3
Use the dataset found here for this question: [click for toy dataset].
Using the toy1.csv dataset, calculate and report the mean and standard deviation for height, width and length. Calculate the z-scores for the dataframe given by Listing 4. Make sure that you use the means and standard deviations you reported for this question. Write the results to a CSV file.
Listing 4: Question 3 Input Data
testDF = pd.DataFrame([
    {'length': 1, 'width': 2, 'height': 3},
    {'length': 3, 'width': 2, 'height': 5},
    {'length': 4, 'width': 1, 'height': 3}
])
...
Your resulting CSV file should look almost exactly like Listing 5.
Listing 5: Question 3 Output Sample
height,length,width
-0.8822049883269626,-1.5819074849494659,-1.2356589865858818
-0.18585564084337075,-0.8828608931337095,-1.2356589865858818
-0.8822049883269626,-0.5333375972258314,-1.5853375067165896
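The key step in this question is reusing the full dataset's statistics rather than testDF's own. A minimal sketch, with an invented four-row frame standing in for toy1.csv (the real solution must compute the statistics from the actual file):

```python
import pandas as pd

# Hypothetical stand-in for toy1.csv; in the assignment the real file
# supplies the means and standard deviations reported to the grader.
full = pd.DataFrame({
    "length": [2.0, 4.0, 6.0, 8.0],
    "width": [1.0, 3.0, 5.0, 7.0],
    "height": [2.0, 3.0, 4.0, 5.0],
})

testDF = pd.DataFrame([
    {'length': 1, 'width': 2, 'height': 3},
    {'length': 3, 'width': 2, 'height': 5},
    {'length': 4, 'width': 1, 'height': 3}
])

# Z-score the test rows with the *dataset's* mean/sd, not testDF's own.
for col in ["length", "width", "height"]:
    mean, sd = full[col].mean(), full[col].std()
    print("{}: ({}, {})".format(col, mean, sd))
    testDF[col] = (testDF[col] - mean) / sd
```

Passing precomputed mean and sd is exactly what the optional arguments of encode_numeric_zscore in Listing 1 are for.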
Question 4
Use the dataset found here for this question: [click for iris dataset].
Usually the iris.csv dataset is used to classify the species. Not this time! Use the fields species, sepal-l, sepal-w, and petal-l to predict petal-w. Use a 5-fold cross validation and report ONLY out-of-sample predictions to a CSV file. Make sure to shuffle the data. Your generated CSV file should look similar to Listing 6. Encode each of the inputs in a way that makes sense (e.g. dummies, z-scores).
Listing 6: Question 4 Output Sample
sepal_l,sepal_w,petal_l,petal_w,species-Iris-setosa,species-Iris-versicolor,species-Iris-virginica,ideal,predict
30995914214417364,-0.5903951331558184,0.5336208818725668,1.2,0.0,1.0,0.0,1.2,1.444551944732666
-0.1730940663922016,1.7038864723719687,-1.1658086782311483,-0.3,1.0,0.0,0.0,0.3,0.
...
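The out-of-sample bookkeeping that 5-fold cross validation requires can be sketched as follows. Synthetic data and a least-squares fit stand in for the iris features and the neural network; what carries over is that every row is predicted exactly once, by a model that never saw it during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the encoded iris features and petal-w target.
x = rng.normal(size=(50, 4))
y = x @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(scale=0.05, size=50)

# Shuffle, then partition the row indices into 5 folds.
idx = rng.permutation(len(x))
folds = np.array_split(idx, 5)
oof = np.empty(len(x))  # out-of-fold predictions, one per row

for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # A least-squares fit stands in for the neural network here.
    coef, *_ = np.linalg.lstsq(x[train_idx], y[train_idx], rcond=None)
    oof[test_idx] = x[test_idx] @ coef
    fold_rmse = np.sqrt(np.mean((oof[test_idx] - y[test_idx]) ** 2))
    print("Fold #{}".format(i + 1))
    print("Fold score (RMSE): {}".format(fold_rmse))

# Every prediction in oof is out of sample, so this is the honest score.
final_rmse = float(np.sqrt(np.mean((oof - y) ** 2)))
print("Final, out of sample score (RMSE): {}".format(final_rmse))
```

The oof array (alongside the original columns) is what gets written to the submitted CSV as the predict column.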
Question 5
Use the dataset found here for this question: [click for auto mpg dataset].
Usually the auto-mpg.csv dataset is used to regress the mpg. Not this time! Use the fields to predict how many cylinders the car has. Treat this as a classification problem, where there is a class for each number of cylinders. Use a 5-fold cross validation and report ONLY out-of-sample predictions to a CSV file. Make sure to shuffle the data. Your generated CSV file should look similar to Listing 7. Encode each of the inputs in a way that makes sense (e.g. dummies, z-scores). Report the final out of sample accuracy score.
Listing 7: Question 5 Output Sample
mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,ideal,predict
-0.7055506566787514,8,1.0892327311042995,0.6722714619460141,6300768256149949,-1.2938698102195594,70,-0.7142457922976494,chevrolet chevelle malibu,8,8
-1.0893794720944747,8,1.5016242793620063,1.5879594901955474,-0.8532590135498572,-1.4751810504376373,70,-0.7142457922976494,buick skylark 320,8,8
-0.7055506566787514,8,1.1947282434492943,1.19552176380289,-0.5497784722839334,-1.6564922906557151,70,-0.7142457922976494,plymouth satellite,8,8
...
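The same cross-validation loop as in question 4 applies here, except the fold score is accuracy rather than RMSE. A minimal sketch with invented, well-separated data, where a nearest-centroid classifier stands in for the neural network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for encoded auto-mpg features with a cylinder class.
classes = np.array([4, 6, 8])
y = rng.choice(classes, size=60)
centers = {4: [-2.0, 0.0], 6: [0.0, 2.0], 8: [2.0, 0.0]}
x = np.array([centers[c] for c in y]) + rng.normal(scale=0.3, size=(60, 2))

# Shuffle and partition into 5 folds, as in the regression case.
idx = rng.permutation(len(x))
folds = np.array_split(idx, 5)
oof = np.empty(len(x), dtype=int)  # out-of-fold predicted class per row

for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Class centroids computed on the training folds only.
    cents = {c: x[train_idx][y[train_idx] == c].mean(axis=0) for c in classes}
    for t in test_idx:
        dists = [np.linalg.norm(x[t] - cents[c]) for c in classes]
        oof[t] = classes[int(np.argmin(dists))]
    acc = np.mean(oof[test_idx] == y[test_idx])
    print("Fold #{}".format(i + 1))
    print("Fold score: {}".format(acc))

# Accuracy over predictions that were all made out of sample.
final_acc = float(np.mean(oof == y))
print("Final, out of sample score: {}".format(final_acc))
```

In the submitted CSV, y corresponds to the ideal column and oof to the predict column.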