About The Data
We will be working with a simulated data set related to social media sites. The data are stored in several files:
Profiles.csv: Information about the users with some fields from their profiles.
Connections.csv: Information about which users are connected to other users.
Registrations.csv: Information about the history of each user’s account registrations (logins) over time.
Header: The first row of each data file contains the column names, and each subsequent row contains one observation. Here is a selection of 20 lines from each data file:
Here is a brief description of each variable across the three files:
Profiles Variables:
· id: A unique identifying string for each user.
· density: The type of area the user lives in, with categories of Urban, Suburban, and Rural areas.
· gender: female (F) or male (M).
· has_profile_photo: 1 if yes, 0 if no.
· num_photos: This is the number of photos the user has uploaded to the site.
· date_created: This is the date that the user first joined the site.
Connections Variables:
· id: A unique identifying string for each user.
· connection_id: This is the identifier of another user that the user listed under id is connected to.
This site uses one-way connections. A user can connect to a second user’s profile without requiring that the second user reciprocally connect to the first one. So, for any row in the Connections data, the user labeled with id is following the user labeled with connection_id. In some cases, pairs of users mutually follow each other, but this is by no means required. A mutual connection appears as two rows, one for each direction. Each connection for a single user is recorded in a separate row.
Registrations Variables:
· id: A unique identifying string for each user.
· registration.time: This is the date and time that a user registered by logging in to the site. Each registration for a user is recorded in a separate row.
Question 1: Classifying Connections
How often do users mutually follow each other, and how often are the connections one-way? For this investigation, a two-way connection requires two one-way connections (two rows of data) but counts only once. Therefore, the number of overall connections (total one-way plus total two-way) will be less than the number of rows in the Connections file. With this in mind, answer the following question.
What percentage of all connections are one-way connections, and what percentage of all connections are two-way connections?
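The counting rule above can be sketched on a handful of made-up rows. This is only an illustration in plain Python, not the assignment's solution: the edge list, the function name classify_connections, and the assumption that the data contain no duplicate rows or self-connections are all hypothetical.

```python
# Made-up directed rows: (id, connection_id) means id follows connection_id.
# A set is used on the assumption that no row appears twice.
edges = {("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("D", "C"), ("B", "D")}

def classify_connections(edges):
    """Count one-way and two-way connections among directed rows."""
    # A row belongs to a two-way connection when its reverse row also exists.
    mutual_rows = sum(1 for (a, b) in edges if (b, a) in edges)
    two_way = mutual_rows // 2          # each mutual pair spans two rows, counts once
    one_way = len(edges) - mutual_rows  # rows with no reverse row
    total = one_way + two_way
    return {
        "one_way": one_way,
        "two_way": two_way,
        "pct_one_way": round(100 * one_way / total, 1),
        "pct_two_way": round(100 * two_way / total, 1),
    }

result = classify_connections(edges)
```

Here the six rows collapse to four connections (two one-way, two two-way), which shows why the connection total is smaller than the row count.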
Question 2: Recommending Connections
Which connections should we recommend to the user with id CLKcSSSC? One way is to find the unconnected users who are connected to users that user CLKcSSSC is also connected to. Create a table of all the users who satisfy all of the following criteria:
have at least 30 connections in common with user CLKcSSSC’s connections, and
are not already connected with user CLKcSSSC.
The list should show the ids of the recommended users and the number of common connections they have with user CLKcSSSC. Sort the list in decreasing order of common connections. Make sure not to include CLKcSSSC on the list of recommendations!
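One possible reading of these criteria, sketched on toy data: "connected" is taken to mean that the target follows the other user, and a common connection is a user that both follow. That reading, the function name recommend, the user ids, and the threshold of 2 (instead of 30) are assumptions made for the example.

```python
from collections import Counter

# Made-up rows: (id, connection_id) means id follows connection_id.
edges = [
    ("T", "A"), ("T", "B"), ("T", "C"),
    ("X", "A"), ("X", "B"), ("X", "C"),
    ("Y", "A"), ("Y", "B"),
    ("Z", "A"),
    ("B", "A"), ("B", "C"),
]

def recommend(edges, target, min_common):
    """Rank users the target does not follow by shared followed users."""
    followed = {dst for src, dst in edges if src == target}
    # For every other user, count how many of the target's connections they follow.
    common = Counter(
        src for src, dst in edges if dst in followed and src != target
    )
    rows = [
        (user, n)
        for user, n in common.items()
        if n >= min_common and user not in followed
    ]
    # Decreasing count; ties are broken by id so the order is stable.
    return sorted(rows, key=lambda row: (-row[1], row[0]))

recommendations = recommend(edges, target="T", min_common=2)
```

User B shares two connections with T but is dropped because T already follows B, and T itself never enters the count.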
Question 3: Influential Connections
In social networks, some users are considered influential. They tend to have more connections, and their content can be widely viewed and shared. For our purposes, we will define the influential users as those who:
Have at least 200 photos, and
Have at least 150 connections.
Among all users (both influential and not so influential), how many users are connected to at least 250 influential users?
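A small sketch of the two-step filter, with thresholds shrunk so the toy data stay readable. It assumes "connections" means the number of users someone follows (rows where they appear under id); if the intended meaning is followers, the out-degree count would be swapped for an in-degree count. All names and values are hypothetical.

```python
from collections import Counter

# Made-up profile and connection data.
num_photos = {"A": 5, "B": 1, "C": 7, "D": 0}
edges = [
    ("A", "B"), ("A", "C"), ("C", "A"), ("C", "B"),
    ("D", "A"), ("D", "C"), ("B", "A"),
]

def count_users_following_influential(
    edges, num_photos, min_photos, min_connections, min_influential
):
    """Return the influential set and how many users follow enough of them."""
    out_degree = Counter(src for src, _ in edges)
    influential = {
        user
        for user, photos in num_photos.items()
        if photos >= min_photos and out_degree[user] >= min_connections
    }
    # Per user, count followed accounts that are influential.
    followed_influential = Counter(
        src for src, dst in edges if dst in influential
    )
    qualifying = sum(
        1 for n in followed_influential.values() if n >= min_influential
    )
    return influential, qualifying

influential, n_users = count_users_following_influential(
    edges, num_photos, min_photos=5, min_connections=2, min_influential=2
)
```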
Question 4: Early Utilizers
Starting from the time when the account for each user was created, what percentage of all users logged in at least 35 times during the first 7 days? Round your answer to 1 decimal place, e.g. 84.2%.
Hints: Within the lubridate library, you can use the function days to add a specified number of days to the registration times. The first week ends before (less than) the user’s first registration time plus 7 days. The registration that occurred when the account was created counts toward the overall total for this period.
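The window described in the hints can be illustrated with Python's datetime module on made-up timestamps. Following the hints, the window starts at each user's first registration and excludes the instant exactly 7 days later. The threshold of 3 logins, the function name, and the choice to use users who appear in the registrations as the denominator are assumptions for this sketch.

```python
from datetime import datetime, timedelta

# Made-up registrations: (id, registration time).
registrations = [
    ("A", datetime(2017, 1, 1, 8, 0, 0)),
    ("A", datetime(2017, 1, 3, 9, 30, 0)),
    ("A", datetime(2017, 1, 8, 7, 59, 59)),  # one second inside A's first week
    ("A", datetime(2017, 1, 8, 8, 0, 0)),    # exactly 7 days later: excluded
    ("B", datetime(2017, 2, 1, 12, 0, 0)),
    ("B", datetime(2017, 3, 1, 12, 0, 0)),
]

def pct_early_utilizers(registrations, min_logins, window_days=7):
    """Percentage of users with at least min_logins in their first window."""
    first = {}
    for user, t in registrations:
        first[user] = min(t, first.get(user, t))
    counts = dict.fromkeys(first, 0)
    for user, t in registrations:
        # Strictly less than: the window's right end-point is not included.
        if t < first[user] + timedelta(days=window_days):
            counts[user] += 1
    hits = sum(1 for n in counts.values() if n >= min_logins)
    return round(100 * hits / len(first), 1)

pct = pct_early_utilizers(registrations, min_logins=3)
```

The first registration itself satisfies the comparison, so it counts toward the total, as the hint requires.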
Question 5: Imbalanced Connections
What percentage of users have at least 100 more followers than the number of users that they are following? Round the answer to 1 decimal place, e.g. 84.2%.
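A minimal sketch of the follower/following comparison on toy rows. A user's followers are the rows where they appear under connection_id, and the users they follow are the rows where they appear under id. The data, the gap of 2 (instead of 100), and the explicit users set used as the denominator are hypothetical.

```python
from collections import Counter

# Made-up rows: (id, connection_id) means id follows connection_id.
edges = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "A"), ("A", "B")]
users = {"A", "B", "C", "D"}

def pct_imbalanced(edges, users, min_gap):
    """Percentage of users whose followers exceed following by min_gap."""
    following = Counter(src for src, _ in edges)
    followers = Counter(dst for _, dst in edges)
    # Counter returns 0 for users missing from either column.
    hits = sum(1 for u in users if followers[u] - following[u] >= min_gap)
    return round(100 * hits / len(users), 1)

pct = pct_imbalanced(edges, users, min_gap=2)
```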
Question 6: Active Users
What percentage of unique users in the sample were active (with at least 1 registration) between 00:00:00 of January 1st, 2017 and 23:59:59 on January 7th, 2017? Round the percentage to 1 decimal place, e.g. 84.2%.
Hint: For any given date in character format (e.g. "1999-07-01"), you can calculate a date in the future with the as.Date function: as.Date("1999-07-01") + 3 would result in "1999-07-04".
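The same date arithmetic and the one-week filter can be sketched in Python on made-up timestamps. Because times are recorded to the second, "up to 23:59:59 on January 7th" is written here as "strictly before January 8th". The user ids and the choice of all_users as the denominator are assumptions.

```python
from datetime import date, datetime, timedelta

# Analogue of the hint: adding 3 days to 1999-07-01.
shifted = date(1999, 7, 1) + timedelta(days=3)

start = datetime(2017, 1, 1)
end = start + timedelta(days=7)  # 2017-01-08 00:00:00, exclusive upper bound

# Made-up registrations: (id, registration time).
registrations = [
    ("A", datetime(2017, 1, 7, 23, 59, 59)),   # last second of the window
    ("B", datetime(2017, 1, 8, 0, 0, 0)),      # just outside
    ("C", datetime(2016, 12, 31, 23, 59, 59)), # just before
    ("C", datetime(2017, 1, 1, 0, 0, 0)),      # first second of the window
]
all_users = {"A", "B", "C", "D"}

# A set keeps each user once, however many registrations fall in the window.
active = {user for user, t in registrations if start <= t < end}
pct_active = round(100 * len(active) / len(all_users), 1)
```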
Question 7: Burning the Midnight Oil
Across all days, what percentage of all registrations occur between the hours of 00:00:00 and 05:59:59, inclusive of both endpoints? Round your answer to 1 decimal place, e.g. 84.2%. Hint: Use the hour() function to classify the time of day.
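Classifying by hour, as the hint suggests, reduces the two end-points to a test on the hour component: every time from 00:00:00 through 05:59:59 has an hour between 0 and 5. A short Python sketch on made-up timestamps:

```python
from datetime import datetime

# Made-up registration times.
times = [
    datetime(2017, 1, 1, 0, 0, 0),    # included: start of the range
    datetime(2017, 1, 1, 5, 59, 59),  # included: end of the range
    datetime(2017, 1, 1, 6, 0, 0),    # excluded
    datetime(2017, 1, 2, 23, 15, 0),  # excluded
]

night = sum(1 for t in times if 0 <= t.hour <= 5)
pct_night = round(100 * night / len(times), 1)
```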
Question 8: Retention Rates
What percentage of users were retained at 183 days (half a year)? To answer this question, we will use a 7-day window. Any user who had at least one registration in the period that was at least 183 days and less than 190 days from their first registration would be considered retained. Round your answer to 1 decimal place, e.g. 84.2%.
Note: The evaluation window would begin at exactly 183 days after the first registration. This period lasts for 7 days. This window would include the left end-point but not the right end-point. The registration times are listed in the data set rounded to the nearest second. If the user had at least 1 registration during this window, the user would be considered retained at 183 days (approximately 6 months).
Hint: You may use the days() function to add time to a user’s initial registration time.
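The half-open window in the note can be sketched as follows on made-up data. The function name, the user ids, and the common starting time t0 are hypothetical; the comparison includes the left end-point and excludes the right one, as described.

```python
from datetime import datetime, timedelta

t0 = datetime(2017, 1, 1, 0, 0, 0)

# Made-up registrations: (id, registration time).
registrations = [
    ("A", t0), ("A", t0 + timedelta(days=183)),            # left end-point: retained
    ("B", t0), ("B", t0 + timedelta(days=190)),            # right end-point: not retained
    ("C", t0), ("C", t0 + timedelta(days=189, hours=23)),  # inside: retained
    ("D", t0),                                             # never returns
]

def retained_users(registrations, start_days=183, window_days=7):
    """Return first-registration times and the set of retained users."""
    first = {}
    for user, t in registrations:
        first[user] = min(t, first.get(user, t))
    retained = set()
    for user, t in registrations:
        lo = first[user] + timedelta(days=start_days)
        hi = lo + timedelta(days=window_days)
        if lo <= t < hi:
            retained.add(user)
    return first, retained

first, retained = retained_users(registrations)
pct_retained = round(100 * len(retained) / len(first), 1)
```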
Question 9: False Positive Rates
In the previous question, we estimated the rate of retention at 6 months using a 7-day window for evaluation. What is the rate of false positives for the 7-day window? In other words, what percentage of users who were considered not retained at 6 months using a 7-day window later had a registration? Round the results to 2 decimal places, e.g. 84.23%.
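A sketch of this rate as the question defines it, on made-up data. It assumes "later had a registration" means a registration at or after the end of the window (190 days or more after the first registration); that reading, the function name, and the toy values are assumptions.

```python
from datetime import datetime, timedelta

t0 = datetime(2017, 1, 1, 0, 0, 0)

# Made-up registrations: (id, registration time).
registrations = [
    ("A", t0), ("A", t0 + timedelta(days=185)),  # retained by the 7-day window
    ("B", t0), ("B", t0 + timedelta(days=200)),  # not retained, but returns later
    ("C", t0),                                   # not retained, never returns
    ("D", t0), ("D", t0 + timedelta(days=100)),  # not retained, no later activity
]

def later_return_rate(registrations, start_days=183, window_days=7):
    """Percentage of not-retained users with a registration after the window."""
    first = {}
    for user, t in registrations:
        first[user] = min(t, first.get(user, t))
    retained, later = set(), set()
    for user, t in registrations:
        lo = first[user] + timedelta(days=start_days)
        hi = lo + timedelta(days=window_days)
        if lo <= t < hi:
            retained.add(user)
        elif t >= hi:
            later.add(user)
    not_retained = set(first) - retained
    returned = not_retained & later
    return round(100 * len(returned) / len(not_retained), 2)

rate = later_return_rate(registrations)
```

The denominator is the group classified as not retained, not all users, so the sketch would need a guard if every user were retained.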
Question 10: Modeling Retention
Build a logistic regression model for retention at 6 months. Classify users as retained at 6 months if they have any account registrations at times at least 183 days after their account was created. Include the following variables:
density
age_group
gender
num_photos (categories: 0-24, 25-49, 50-99, 100-249, 250-499, 500+) (current status)
average daily registrations in the first week. (To simplify matters, let this be the total number of registrations in the first week divided by 7, regardless of whether the user’s retention truly lasted 7 days or not.)
number of connections the user currently has
number of users currently connected to this user
Display the odds ratios, confidence intervals for the odds ratios, and p-values for the coefficients, rounded to 3 digits. Then briefly comment on the results.
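The reporting step can be illustrated separately from the model fit. Given a coefficient estimate and its standard error on the log-odds scale, the odds ratio is the exponential of the estimate, and a Wald-type 95% interval exponentiates the estimate plus or minus 1.96 standard errors. The sketch below uses invented coefficient values and hypothetical term names; it does not fit a model, and Wald intervals may differ slightly from the intervals some statistical software reports by default.

```python
import math

def odds_ratio_table(coefs, z_crit=1.96):
    """Convert {term: (estimate, std_error)} on the log-odds scale to rows."""
    rows = []
    for term, (beta, se) in coefs.items():
        z = beta / se
        # Two-sided p-value under a standard normal reference distribution.
        p_value = math.erfc(abs(z) / math.sqrt(2))
        rows.append({
            "term": term,
            "odds_ratio": round(math.exp(beta), 3),
            "ci_low": round(math.exp(beta - z_crit * se), 3),
            "ci_high": round(math.exp(beta + z_crit * se), 3),
            "p_value": round(p_value, 3),
        })
    return rows

# Invented estimates, for illustration only.
example = {
    "no_effect_term": (0.0, 1.0),
    "doubling_term": (math.log(2), 0.1),
}
table = odds_ratio_table(example)
```

An odds ratio of 1 with an interval that spans 1 suggests no detectable association, while an odds ratio of 2 with a narrow interval above 1 suggests roughly doubled odds of retention for that term.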