Upload a PDF to Canvas.
Questions 2 to 10 will be graded automatically and must be answered according to the instructions.
Question 2: Types of Attributes
(a) Number of courses registered by a student in a given semester.
(b) Speed of a car (in miles per hour).
(c) Decibel as a measure of sound intensity.
(d) Hurricane intensity according to the Saffir-Simpson Hurricane Scale.
(e) Social security number.
Question 3: Types of Attributes
Classify the following attributes as: 1) discrete or continuous, 2) qualitative or quantitative, 3) nominal, ordinal, interval, or ratio. Choose the most comprehensive attribute type. If the attribute is both interval and ratio, choose ratio.
The answer to each subquestion is a list of three strings.
(b) Movie ratings provided by users (1-star, 2-star, 3-star, or 4-star).
(c) Mood level of a blogger (cheerful, calm, relaxed, bored, sad, angry, or frustrated).
(d) Average number of hours a user spent on the Internet in a week.
(e) IP address of a machine.
(f) Richter scale (in terms of energy release during an earthquake).
(g) Salary above the median salary of all employees in an organization.
Question 4: State the type of each attribute given below before and after we have performed the following transformation. The answer to each subquestion is a list of two strings. Each attribute type is one of "nominal", "ratio", "ordinal", or "interval".
(a) Hair color of a person is mapped to the following values: black = 0, brown = 1, red = 2, blonde = 3, grey = 4, white = 5.
(b) Grade of a student (from 0 to 100) is mapped to the following scale: A = 4.0, A- = 3.5, B = 3.0, B- = 2.5, C = 2.0, C- = 1.5, D = 1.0, D- = 0.5, E = 0.0.
(c) Age of a person is discretized to the following scale: Age < 12, 12 ≤ Age < 21, 21 ≤ Age < 45, 45 ≤ Age < 65, Age ≥ 65.
(d) Annual income of a person is discretized to the following scale: Income < $20K, $20K ≤ Income < $60K, $60K ≤ Income < $120K, $120K ≤ Income < $250K, Income ≥ $250K.
(e) Height of a person is changed from meters to feet.
(f) Height of a person is changed from meters to (Short, Medium, Tall).
(g) Height of a person is changed from feet to number of inches above 4 feet.
(h) Weight of a person is standardized by subtracting the mean weight of all people and dividing by the standard deviation.
Question 5: Data Preprocessing
Consider the following dataset that contains the age and gender information for 9 users who visited a given website. The answer to each subquestion is a dictionary with keys "bin1", "bin2", and "bin3". Each value is a list of integers.
(b) Repeat the previous question using the equal frequency approach.
(c) Repeat question (a) using a supervised discretization approach (with Gender as class attribute). Specifically, choose the bins in such a way that their members are as pure as possible (i.e., belonging to the same class).
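The two unsupervised approaches named above can be sketched as follows. This is a minimal illustration, not the grader's reference implementation: the function names, the tie handling at bin edges, and the sample ages are all assumptions, and the ages are hypothetical rather than the dataset from the question.

```python
def equal_width_bins(values, k):
    """Split values into k bins of equal interval width over [min, max].

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value lands exactly on the last edge, so clamp it
        # into the last bin instead of creating bin k+1.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split sorted values into k bins holding (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def as_answer(bins):
    """Format bins as a dictionary with keys "bin1", "bin2", ..."""
    return {f"bin{i + 1}": b for i, b in enumerate(bins)}


# Hypothetical ages, NOT the dataset from the question.
ages = [10, 12, 15, 18, 30, 35, 40, 60, 70]
width_bins = equal_width_bins(ages, 3)
freq_bins = equal_frequency_bins(ages, 3)
print(as_answer(width_bins))
print(as_answer(freq_bins))
```

The supervised variant in (c) is not covered by this sketch, since it depends on the class labels in the dataset.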
Question 6:
Consider an attribute X of a data set that takes the values {x1, x2, ..., x9} (sorted in increasing order of magnitude). We apply two methods (equal interval width and equal frequency) to discretize the attribute into three bins. The bins obtained are shown below:
Equal Width: {x1,x2,x3}, {x4,x5,x6,x7,x8}, {x9}
Equal Frequency: {x1,x2,x3}, {x4,x5,x6}, {x7,x8,x9}
The answer to each subquestion is a dictionary with two keys: 'equal width' and 'equal freq'. The value of each key is a list of two elements: a string and an integer. The string is either 'Change' or 'No change'. The integer is one of 1, ..., 10, chosen according to the list below:
1. The transformation leads to an inversion of the original order of values.
2. The distance between xi and xi+1 does not change uniformly.
3. The average value x̄ becomes the smallest value post-transformation.
4. The relative ordering of points changes.
5. The transformation causes negative values to become positive and vice versa.
6. The transformation results in all values becoming equal.
7. The distance between xi and xi+1 changes uniformly.
8. The standard deviation σX becomes zero after the transformation.
9. No change in the relative ordering of points.
10. The maximum and minimum values of X get swapped after the transformation.
Subquestions to answer:
(a) X → X − x̄ (i.e., if the attribute values are centered).
(b) X → (X − x̄)/σX (i.e., if the attribute values are standardized).
(c) X → exp((X − x̄)/σX) (i.e., if the values are standardized and exponentiated).
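As a rough aid for experimenting with these transformations, the sketch below applies centering, standardization, and standardization followed by exponentiation to a list of numbers. The function names and sample values are hypothetical, and using the population standard deviation (`statistics.pstdev`) is an assumption; the question does not say which estimator is intended.

```python
import math
import statistics


def center(xs):
    """X -> X - mean(X)."""
    m = statistics.mean(xs)
    return [x - m for x in xs]


def standardize(xs):
    """X -> (X - mean(X)) / std(X), using the population std (assumption)."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return [(x - m) / s for x in xs]


def standardize_exp(xs):
    """X -> exp((X - mean(X)) / std(X))."""
    return [math.exp(z) for z in standardize(xs)]


# Hypothetical values sorted in increasing order, NOT taken from the question.
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
centered = center(xs)
standardized = standardize(xs)
exponentiated = standardize_exp(xs)
print(centered)
print(standardized)
print(exponentiated)
```

Re-running a discretization routine on each transformed list and comparing the bins with those of the untransformed values is one way to check an answer empirically.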
Question 7:
An e-commerce company is interested in identifying the highest spending customers at its online store using association rule mining. One of the rules identified is:
21 ≤ Age < 45 AND NumberOfVisits > 50 → AmountSpent > $500,
where the Age attribute was discretized into 5 bins, NumberOfVisits was discretized into 8 bins, and AmountSpent was discretized into 8 bins. The confidence of an association rule (A,B) → C is defined as
Confidence((A,B) → C) = P(C|A,B) = P(A,B,C) / P(A,B),
where P(C|A,B) is the conditional probability of C given A and B, P(A,B,C) is the joint probability of A, B, and C, and P(A,B) is the joint probability of A and B. The probabilities are empirically estimated based on their relative frequencies in the data. For example, P(AmountSpent > $500) is given by the proportion of online users who visited the store and spent more than $500.
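The relative-frequency estimate described above can be sketched in a few lines. The `confidence` helper, the record layout (age, number of visits, amount spent), and the sample records are all hypothetical and only illustrate the definition.

```python
def confidence(records, antecedent, consequent):
    """Estimate P(consequent | antecedent) from relative frequencies.

    Equivalent to count(A, B, C) / count(A, B), since the common
    denominator (the total number of records) cancels out.
    """
    matching = [r for r in records if antecedent(r)]
    if not matching:
        raise ValueError("no record satisfies the antecedent")
    both = sum(1 for r in matching if consequent(r))
    return both / len(matching)


# Hypothetical records: (age, number_of_visits, amount_spent).
records = [
    (25, 60, 700),
    (30, 55, 300),
    (40, 80, 900),
    (50, 70, 1000),
    (22, 10, 600),
    (35, 51, 450),
]

conf = confidence(
    records,
    antecedent=lambda r: 21 <= r[0] < 45 and r[1] > 50,
    consequent=lambda r: r[2] > 500,
)
print(conf)
```

Changing the bin boundaries in the two lambdas is a quick way to see how narrowing a bin affects both the numerator and the denominator of the estimate.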
The answer to each of the first two subquestions is a string: 'increase/decrease', 'non-decreasing', or 'non-increasing'. The answer to the third subquestion is a list of tuples (or lists) of two integers each: the boundaries of each bin. If a boundary of a bin is at either negative or positive infinity, replace the corresponding integer by '-infinity' or 'infinity'.
(a) Suppose we increase the number of bins for the Age attribute from 5 to 6, so that the discretized Age in the rule becomes 21 ≤ Age < 30 instead of 21 ≤ Age < 45. Will the confidence of the rule be non-increasing, non-decreasing, stay the same, or could it go either way (increase/decrease)?
(b) Suppose we increase the number of bins for the AmountSpent attribute from 8 to 10, so that the right-hand side of the rule becomes $500 < AmountSpent < $1000. Will the confidence of the rule be non-increasing, non-decreasing, stay the same, or could it go either way (increase/decrease)?
(c) Suppose the values of the NumberOfVisits attribute are distributed according to a Poisson distribution with a mean equal to 4. If we discretize the attribute into 4 bins using the equal frequency approach, what are the bin values after discretization? Hint: you need to refer to the cumulative distribution table for the Poisson distribution to answer the question.
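If a printed cumulative distribution table is not at hand, one can be generated directly from the Poisson probability mass function P(X = k) = e^(-λ) λ^k / k!. The sketch below does only that; the function name and the cutoff of k = 10 are arbitrary choices, and choosing the bin boundaries from the table is left to the reader.

```python
import math


def poisson_cdf_table(mean, max_k):
    """Return [(k, P(X <= k))] for k = 0..max_k under Poisson(mean)."""
    table = []
    cumulative = 0.0
    for k in range(max_k + 1):
        # Add the probability mass at k to the running total.
        cumulative += math.exp(-mean) * mean ** k / math.factorial(k)
        table.append((k, cumulative))
    return table


table = poisson_cdf_table(4, 10)
for k, p in table:
    print(f"P(X <= {k}) = {p:.4f}")
```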
Question 8: Measures of Similarity and Dissimilarity
Consider the following binary vectors:
x1 = (1,1,1,1,1)
x2 = (1,1,1,0,0)
y1 = (0,0,0,0,0)
y2 = (0,0,0,1,1)
y3 = (0,1,0,1,1)
The answer to each subquestion should either be a list of two points or the string 'equally similar'. The two points are taken from 'x1', 'x2', 'y1', 'y2', 'y3'.
(a) According to the Jaccard coefficient, which pair of vectors, (x1,x2) or (y1,y2), is more similar?
(b) According to the simple matching coefficient, which pair of vectors, (x1,x2) or (y1,y2), is more similar?
(c) According to simple Euclidean distance, which pair of vectors, (x1,x2) or (y1,y2), is more similar?
(d) According to simple Euclidean distance, which pair of vectors, (x1,y1) or (x2,y3), is more similar?
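For reference, the three measures used in this question can be sketched as below, using their standard definitions for binary vectors. The function names and the vectors `u` and `v` are hypothetical (they are deliberately not the vectors from the question), and returning `None` when both vectors are all zeros is an assumption, since the Jaccard coefficient is 0/0 in that case.

```python
import math


def smc(a, b):
    """Simple matching coefficient: matching positions / all positions."""
    matches = sum(1 for p, q in zip(a, b) if p == q)
    return matches / len(a)


def jaccard(a, b):
    """Jaccard coefficient: 1-1 matches / positions that are not 0-0."""
    f11 = sum(1 for p, q in zip(a, b) if p == 1 and q == 1)
    not_00 = sum(1 for p, q in zip(a, b) if p == 1 or q == 1)
    if not_00 == 0:
        return None  # undefined (0/0); this convention is an assumption
    return f11 / not_00


def euclidean(a, b):
    """Euclidean distance; smaller values mean more similar vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical vectors, NOT the ones from the question.
u = (1, 0, 1, 1, 0, 0)
v = (1, 1, 0, 1, 0, 0)
print(smc(u, v), jaccard(u, v), euclidean(u, v))
```

Note that SMC and Jaccard are similarities (larger means more similar), while the Euclidean measure is a distance (smaller means more similar).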
Question 9: Which similarity or distance measure is most effective for each of the domains given below:
The answer to each subquestion should be a string, one of: ’SMC’, ’Jaccard’, ’Euclidean’, ’Cosine Similarity’
(a) Which measure, Jaccard or Simple Matching Coefficient, is most appropriate for comparing the similarity of the answers provided by students in an exam? Assume that the answers to all the questions in the exam are either True or False.
(b) Which measure, Jaccard or Simple Matching Coefficient, is most appropriate for comparing the similarity of the locations visited by tourists at an amusement park? Assume the location information is stored as binary yes/no attributes (yes means a location was visited by the tourist and no means a location has not been visited).
(c) Which measure, Euclidean distance or correlation coefficient, is most appropriate for comparing two flows in network traffic? For each flow, we record information about the number of packets transmitted, number of bytes transferred, number of acknowledgments sent, and duration of the session.
(e) Which measure, Euclidean distance or cosine similarity, is most appropriate for comparing the similarity of items bought by customers at a grocery store? Assume each customer is represented by a 0/1 binary vector of items (where a 1 means the customer had previously bought the item).
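Cosine similarity and the correlation coefficient, which appear in the subquestions above, are closely related: the Pearson correlation of two vectors equals the cosine similarity of their mean-centered versions. The sketch below uses that identity; the function names and sample vectors are hypothetical, and both functions assume non-zero (and, for correlation, non-constant) vectors.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine of the centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [p - mean_a for p in a]
    centered_b = [q - mean_b for q in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical vectors for illustration only.
a = (1, 2, 3, 4)
b = (2, 4, 6, 8)      # a scaled by 2
c = (11, 12, 13, 14)  # a shifted by 10
print(cosine_similarity(a, b), cosine_similarity(a, c))
print(pearson_correlation(a, b), pearson_correlation(a, c))
```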
Question 10: Ten True/False questions. Please answer True or False, and provide a one-sentence justification of your choice. The answer to each subquestion is a boolean, either True or False.
(a) Noise is not a problem with count data. Explain.
(b) For any two sets of real values, such as two vectors of size n > 0, the correlation is a value between -1 and 1. Explain.
(c) For reducing the size of a daily time series, it would be better to sample than aggregate since sampling is a simpler process. Explain.
(d) Noise and outliers are sometimes the same. Explain.
(e) If an object is an outlier, then it is noise.
(f) A binary attribute with values 0 or 1 is also an asymmetric binary attribute.
(g) If vectors of counts have a cosine measure of 1, the objects are identical. Explain.
(h) Discrete variables cannot be ratio.
(i) Quantitative variables are continuous.
(j) Converting ordinal variables to asymmetric binary variables does not lose any information.