CSC343 Phase 1



Here are some criteria that should influence your choice of dataset.

•    Pick data that will be easy to retrieve.

In Phase 2, you will retrieve the data. For now, just start thinking about how you will do that. You could get your data by writing code that uses an API to retrieve information from an online site, but this is beyond what is expected of you. It is fine to use data that has already been assembled and is in a format that you know how to work with, such as csv or json, or even just a formatted text file. (The sketch following this list of criteria shows one way a csv file might be loaded.)

•    Pick a dataset that is easy to interpret.

Sometimes a dataset comes with a "data dictionary" that explains the format and meaning of the data. Other times, the format and meaning are documented poorly or not at all. It is possible to infer quite a bit about a dataset's meaning: for example, by reading the contents of one table, it may appear that a certain attribute is a reference to something else, or that an attribute is a key. Once the data is loaded, you can test these sorts of hypotheses by running queries that look for exceptions (the sketch following this list includes two such queries). This whole process can raise further uncertainties and end up being very laborious, so I strongly recommend picking a dataset that is easy to interpret.

•    Pick a dataset that is rich enough.

It will take a bit of effort to find a dataset that has enough in it to do something interesting with. Here are some requirements for your dataset:

–    It must be open data. (Try Googling “open data xxx” where xxx is a location or a topic.)

–    It does not have to be large in quantity (number of rows), but shouldn’t be so small that you could answer your investigative questions just by looking at the data.

–    It must be rich in structure: The final schema must have at least 4 tables and at least 3 referential integrity constraints. (You won’t know whether or not your schema will have this many until you at least sketch out the schema.) Don’t impose 4 tables on the data just to meet this criterion; division into tables should be for a reason, such as dealing with missing values or avoiding redundancy.

If you find you don't have enough structure, think about whether there is some other data you could connect to what you have. For example, if you have dates, could you find another dataset that records the weather by date, and might the weather be relevant? Maybe you'll find that there is a relationship between the weather and the other data that you have. Or if you have postal codes, can you find out average household income or population density or education level by postal code? Use your imagination to identify other relevant data. Note that you are welcome to use multiple sources of data.
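To make the first two criteria concrete, here is a minimal sketch, assuming PostgreSQL; the dataset of restaurant inspections, the table and column names, and the file name are all invented for illustration. It shows one way a csv file could be loaded in Phase 2, followed by two queries that test interpretation hypotheses of the kind described above.

-- Hypothetical staging tables for two csv files about restaurant inspections.
CREATE TABLE Establishment (
    establishmentID INT PRIMARY KEY,
    name TEXT
);

CREATE TABLE Inspection (
    inspectionID INT,
    establishmentID INT,
    inspectionDate DATE,
    outcome TEXT
);

-- In psql, \copy loads a csv file from the client machine, e.g.:
--   \copy Inspection FROM 'inspections.csv' WITH (FORMAT csv, HEADER true);

-- Hypothesis: inspectionID is a key of Inspection.
-- Any row returned is a counterexample (a duplicated ID).
SELECT inspectionID, count(*)
FROM Inspection
GROUP BY inspectionID
HAVING count(*) > 1;

-- Hypothesis: every Inspection row refers to a known Establishment.
-- Any row returned is an inspection whose establishmentID matches nothing.
SELECT i.inspectionID, i.establishmentID
FROM Inspection i LEFT JOIN Establishment e
     ON i.establishmentID = e.establishmentID
WHERE e.establishmentID IS NULL;

If either query returns rows, the hypothesis is wrong, and your interpretation of the data (and eventually your schema) will need to change accordingly.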

Define your investigative questions
Now come up with 3 specific questions that you would like to answer using your dataset(s). Each question should be specific, not just a general area to explore. But it should also be somewhat open-ended – we want the results of your first query to naturally lead to follow-up questions. In the end, each investigative question will be addressed by a series of queries.

We want you to dig deeply into the data. Ideally, the answers you find would be of interest to someone whose work involves this data.

Design your schema
Then, design a relational schema for your domain, written using relational notation (like the schema in the Relational Algebra worksheet). There are many possible schemas for any interesting dataset, so you will have to make design choices. Later on, we’ll learn a formal design process. For now, use your common sense and follow as many of these general principles as you can:

•    Avoid redundancy. That is, avoid a schema that results in the same information being repeated. For instance, suppose the movies database we have discussed in class had a table like this:

mID  title              director  year  length  actor      nationality  role
1    Shining            Kubrick   1980  146     Nicholson  American     Jack Torrance
6    American Graffiti  Lucas     1973  110     Ford       American     Bob Falfa
5    Star Wars IV       Lucas     1977  126     Ford       American     Han Solo
5    Star Wars IV       Lucas     1977  126     Fisher     American     Princess Leia Organa
8    Star Wars V        Lucas     1980  126     Ford       American     Han Solo
There is a lot of redundancy in this table. (Can you identify it?) It can be avoided by splitting the data into tables, as in the schema on the worksheet. (A SQL sketch following this list of principles shows one possible split.)

•    Avoid designing your schema in such a way that there are attributes that won't always have a value, either because the data doesn't exist or is just missing in the dataset. For example, in a relation about students, we wouldn't want to have an attribute that identifies their spouse, since many or most students are not married; for them, no spouse exists. We would instead put that information in a different table, with a row only for those students who do have a spouse. Or suppose we have customers whom we identify by their email address, and some of them have told us their phone number. We wouldn't want to include a column for phone number, since we often have no value for it (even if a value actually exists). Again, we would instead put that information in a different table, with a row only for those customers whose phone number is known.

•    Use constraints to prevent data that is clearly nonsensical from being included in your database. Don’t forget this! Last year, quite a few students did.

•    Define a key for every relation.
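The following sketch illustrates how several of these principles can show up in SQL, using the movie data above and the customer phone number example. It is only one possible design, written assuming PostgreSQL; the worksheet schema and the specific constraint choices may differ.

-- One possible way to split the wide movie table above, so that facts about a
-- movie and facts about an actor are each recorded only once.
CREATE TABLE Movie (
    mID INT PRIMARY KEY,                     -- a key for every relation
    title TEXT NOT NULL,
    director TEXT NOT NULL,
    year INT NOT NULL CHECK (year >= 1888),  -- reject clearly nonsensical years
    length INT NOT NULL CHECK (length > 0)   -- running time in minutes
);

CREATE TABLE Actor (
    aID INT PRIMARY KEY,
    name TEXT NOT NULL,
    nationality TEXT NOT NULL
);

-- A row means: this actor played this role in this movie.
CREATE TABLE CastRole (
    mID INT REFERENCES Movie(mID),           -- referential integrity constraints
    aID INT REFERENCES Actor(aID),
    role TEXT NOT NULL,
    PRIMARY KEY (mID, aID, role)
);

-- Information that is often missing goes in its own table rather than in a
-- nullable column: a row exists only for customers whose phone number is known.
CREATE TABLE Customer (
    email TEXT PRIMARY KEY
);

CREATE TABLE CustomerPhone (
    email TEXT PRIMARY KEY REFERENCES Customer(email),
    phone TEXT NOT NULL
);

Splitting the movie data this way records each title, director, and nationality once; the CastRole table then says who played what in which film without repeating any of those facts.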

You may find there is tension between some of these principles. Where that occurs, use your judgment to make a trade-off. Keep a written record of important trade-offs and other decisions made; you will use that in your Justification of your schema, and later in your presentation.

Your dataset may come organized into csv files (or some other format) that translate directly into a relational schema that is already a good design. That is fine, but you must still justify why it is a good design.

Document your schema
There are 3 parts to documenting your schema:

1.    Comments: Include a comment for every relation that explains exactly what a row in the relation means in your domain. I will provide a schema from an old assignment as a good example of this. Pick good attribute names that will make it easy to understand what each attribute is.

2.    Data dictionary: In the example schema, where needed, we elaborate on the attributes in the relation comments. For the project schema, you will instead explain your attributes by defining a data dictionary, in tabular form, for each relation. For each attribute, it must include: the attribute name, a description of what it represents in your domain, its data type (you can give this in plain English and later translate it into a SQL type), whether or not a value will always be known (this should always be "yes" if you structure the schema well), a default value if one exists, and the allowable values. Figure 1 gives an example based on the Submission table from the example schema. Use the columns and content shown in figure 1, and include one of these for each relation in your schema.

Your data dictionary will not only record your understanding of the data, but will help you think about the kinds of constraints that you will need when you implement the schema in SQL. (The sketch following figure 1 shows one possible translation.)

3.    Justification: Provide a justification of why you divided the data into tables in the way that you did. If you translated the structure of the dataset directly without change, explain why the result is a good design. If you split things up that were together, or put things together that were apart, explain why. If you had to invent a key to uniquely identify items, explain this also. Include anything else that your TA will need in order to understand your choices.

Figure 1: Example of a data dictionary

Submission

Attribute   Description                                                   Type        Required   Default
sID         The ID of a file submission on MarkUS                         INT         Yes
fileName    The name of the file that was submitted                       TEXT        Yes
userName    The user name of the user who submitted the file              TEXT        Yes
gID         The group ID of the group for which the file was submitted    INT         Yes
when        The date and time when the file was submitted                 TIMESTAMP   Yes

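To show how a data dictionary feeds into the SQL implementation, here is a minimal sketch of the Submission relation from figure 1, assuming PostgreSQL and assuming that sID on its own identifies a submission; the actual example schema may make different choices.

-- A possible translation of the figure 1 data dictionary into SQL.
CREATE TABLE Submission (
    sID INT PRIMARY KEY,           -- assuming sID identifies a submission
    fileName TEXT NOT NULL,        -- "Required: Yes" becomes NOT NULL
    userName TEXT NOT NULL,
    gID INT NOT NULL,
    "when" TIMESTAMP NOT NULL      -- WHEN is a reserved word, so it must be quoted
);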