Assignment – SQL and R
Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database of your choosing. Load the information from the SQL database into an R dataframe.
This is by design a very open-ended assignment. In general, there’s no need here to ask “Can I…?” questions about your proposed approach. A variety of reasonable approaches are acceptable. You could for example access the SQL data directly from R, or you could create an intermediate .CSV file. I should be able to generate the SQL table(s) and data from your provided code—if you use a graphical user interface to create and populate tables, it should have a mechanism to generate corresponding SQL code.
This assignment does not need to be 100% reproducible. You can (and should) blank out your SQL password if your solution requires it; otherwise, full credit requires that your code is “reproducible,” with the assumption that I have the same database server and R software.
Handling missing data is a foundational skill when working with SQL or R. To receive full credit, you should demonstrate a reasonable approach for handling missing data. After all, how likely is it that all five of your friends have seen all six movies?
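One possible way to represent unseen movies is sketched below. It is an illustration under stated assumptions, not a required approach: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for whichever SQL server you choose, and the rater names, movie titles, and scores are made up. The SQL statements are the part that carries over; an unseen movie is stored as NULL, and the aggregate functions AVG and COUNT(column) skip NULLs.

```python
import sqlite3

# Hypothetical survey results; None marks a movie the rater has not seen.
RATINGS = [
    ("Ana", "Movie A", 5), ("Ana", "Movie B", None),
    ("Ben", "Movie A", 3), ("Ben", "Movie B", 4),
    ("Cy", "Movie A", None), ("Cy", "Movie B", 2),
]


def load_ratings(conn):
    """Create and populate a single ratings table; None is stored as NULL."""
    conn.execute(
        "CREATE TABLE ratings ("
        " rater TEXT NOT NULL,"
        " movie TEXT NOT NULL,"
        " rating INTEGER CHECK (rating BETWEEN 1 AND 5),"  # NULL is allowed
        " PRIMARY KEY (rater, movie))"
    )
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", RATINGS)


def movie_summary(conn):
    """Per movie: mean rating, number of ratings given, number of raters asked."""
    return conn.execute(
        "SELECT movie, AVG(rating), COUNT(rating), COUNT(*)"
        " FROM ratings GROUP BY movie ORDER BY movie"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_ratings(conn)
summary = movie_summary(conn)
for movie, avg, n_rated, n_asked in summary:
    print(f"{movie}: mean {avg:.2f} from {n_rated} of {n_asked} raters")
```

Reporting the number of ratings next to each mean makes it visible how much data sits behind every average, which matters when several raters have skipped a movie.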
You’re encouraged to optionally find other ways to make your solution better. For example, consider incorporating one or more of the following suggestions into your solution:
• Use survey software to gather the information.
• Are you able to use a password without having to share the password with people who are viewing your code? There are a lot of interesting approaches that you can uncover with a little bit of research.
• While it’s acceptable to create a single SQL table, can you create a normalized set of tables that corresponds to the relationship between your movie viewing friends and the movies being rated?
• Is there any benefit in standardizing ratings? How might you approach this?
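The suggestions above about passwords, normalized tables, and standardized ratings could be combined roughly as follows. This is only a sketch under assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for your own server, the environment variable name MOVIE_DB_PASSWORD is hypothetical, the data are invented, and a per-rater z-score is just one of several ways to standardize.

```python
import os
import sqlite3
from statistics import mean, pstdev

# Read the password from the environment so it never appears in the code.
# MOVIE_DB_PASSWORD is a hypothetical name; SQLite needs no password, so the
# value is only looked up here to show the pattern.
password = os.environ.get("MOVIE_DB_PASSWORD", "")

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE raters (rater_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE ratings (
    rater_id INTEGER NOT NULL REFERENCES raters(rater_id),
    movie_id INTEGER NOT NULL REFERENCES movies(movie_id),
    rating   INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    PRIMARY KEY (rater_id, movie_id)
);
INSERT INTO raters VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO movies VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
-- An unseen movie simply has no row: Ben has not rated Movie C.
INSERT INTO ratings VALUES (1, 1, 5), (1, 2, 3), (1, 3, 4), (2, 1, 2), (2, 2, 4);
""")

# Join the three tables back into one flat result, one row per rating.
rows = conn.execute(
    "SELECT r.name, m.title, t.rating FROM ratings t"
    " JOIN raters r ON r.rater_id = t.rater_id"
    " JOIN movies m ON m.movie_id = t.movie_id"
    " ORDER BY r.name, m.title"
).fetchall()


def standardize(rows):
    """Return (name, title, z) where z is computed within each rater."""
    by_rater = {}
    for name, _, rating in rows:
        by_rater.setdefault(name, []).append(rating)
    out = []
    for name, title, rating in rows:
        mu = mean(by_rater[name])
        sd = pstdev(by_rater[name])
        # A rater who gives every movie the same score has sd == 0.
        z = (rating - mu) / sd if sd > 0 else 0.0
        out.append((name, title, z))
    return out


z_scores = standardize(rows)
for name, title, z in z_scores:
    print(f"{name:4s} {title}: z = {z:+.2f}")
```

Standardizing within each rater puts a habitually generous rater and a habitually harsh one on a common scale, at the cost of discarding the information that one of them may simply have liked the movies more.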
You should post any code (e.g. SQL and R Markdown) in a GitHub repository, and provide a link in your assignment submission. For this assignment, you are not required to post your code to rpubs.com.
You may work in a small group on this assignment. If you work in a group, each group member should indicate who they worked with, and all group members should individually submit their week 2 assignment.
Please start early, and do work that you would want to include in a “presentations portfolio” that you might share in a job interview with a potential employer! You are encouraged to share thoughts and to ask and answer clarifying questions in this week’s “R and SQL” forum.
(Optional) Reading related to this assignment
• James Le, “The 4 Recommendation Engines That Can Predict Your Movie Tastes”, May 1, 2018. https://towardsdatascience.com/the-4-recommendation-engines-that-can-predict-your-movie-tastes-109dc4e10c52 This is a nice backgrounder on movie recommendation engines. We’ll learn more about recommender systems later in the course.
• Steve Blank, “The Customer Development Process. 2 Minutes to See Why”, Jul 29, 2014. https://www.youtube.com/watch?v=xr2zFXblSRM&t=27s In this [<3 minute] YouTube video “lean startup” founder Steve Blank talks about the importance of getting out of the building to talk to customers. I’d encourage you to adopt this “builder mentality” in your own data science work whenever it’s practical, by collecting data yourself, whether it’s related to a “business experiment” or a “scientific experiment.”