Before the lab you should re-read the relevant lecture slides and their accompanying examples.
Set up for the lab by creating a new directory called lab08 and changing to this directory.
$ mkdir lab08
$ cd lab08
There are some provided files for this lab which you can fetch with this command:
$ 2041 fetch lab08
If you're not working at CSE, you can download the provided files as a zip file or a tar file.
In these exercises you will work with a dataset containing song lyrics.
This dataset contains the lyrics of the songs of 10 well-known artists.
$ unzip lyrics.zip
Archive:  lyrics.zip
   creating: lyrics/
  inflating: lyrics/David_Bowie.txt
  inflating: lyrics/Adele.txt
  inflating: lyrics/Metallica.txt
  inflating: lyrics/Rage_Against_The_Machine.txt
  inflating: lyrics/Taylor_Swift.txt
  inflating: lyrics/Keith_Urban.txt
  inflating: lyrics/Ed_Sheeran.txt
  inflating: lyrics/Justin_Bieber.txt
  inflating: lyrics/Rihanna.txt
  inflating: lyrics/Leonard_Cohen.txt
  inflating: song0.txt
  inflating: song1.txt
  inflating: song2.txt
  inflating: song3.txt
  inflating: song4.txt
The dataset also contains lyrics from 5 songs where we don't know the artists.
Each is from one of the artists in the dataset but they are not from a song in the dataset.
As a first step in this analysis, write a Python script total_words.py which counts the total number of words in its stdin.
For the purposes of this program (and the following programs) we will define a word to be a maximal, non-empty, contiguous sequence of alphabetic characters ( [a-zA-Z] ).
Any characters other than [a-zA-Z] separate words.
So for example the phrase " The soul's desire " contains 4 words: ("The", "soul", "s", "desire")
$ ./total_words.py < lyrics/Justin_Bieber.txt
46589 words
$ ./total_words.py < lyrics/Metallica.txt
38096 words
$ ./total_words.py < lyrics/Rihanna.txt
53157 words
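One possible sketch of total_words.py, using the word definition above (a regular expression matching maximal runs of alphabetic characters):

```python
#!/usr/bin/env python3
# Possible sketch of total_words.py: count the words on stdin.
import re
import sys

def count_words(text):
    # A word is a maximal, non-empty run of alphabetic characters [a-zA-Z];
    # every other character acts as a separator.
    return len(re.findall(r"[a-zA-Z]+", text))

if __name__ == "__main__":
    print(count_words(sys.stdin.read()), "words")
```

Note that `re.findall(r"[a-zA-Z]+", "The soul's desire")` yields `["The", "soul", "s", "desire"]`, matching the 4-word example above.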
When you think your program is working, you can use autotest to run some simple automated tests:
$ 2041 autotest total_words
When you are finished working on this exercise, you must submit your work by running give :
$ give cs2041 lab08_total_words total_words.py
Write a Python script count_word.py that counts the number of times a specified word is found in its stdin.
The word you should count will be specified as a command line argument.
Your program should ignore the case of words.
$ ./count_word.py death < lyrics/Metallica.txt
death occurred 69 times
$ ./count_word.py death < lyrics/Justin_Bieber.txt
death occurred 0 times
$ ./count_word.py love < lyrics/Ed_Sheeran.txt
love occurred 218 times
$ ./count_word.py love < lyrics/Rage_Against_The_Machine.txt
love occurred 4 times
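A possible sketch of count_word.py, reusing the same word-splitting approach and lower-casing both the text and the target to ignore case:

```python
#!/usr/bin/env python3
# Possible sketch of count_word.py: count case-insensitive occurrences
# of the word given as the first command line argument in stdin.
import re
import sys

def count_word(word, text):
    # Lower-case both sides so matching is case-insensitive.
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return words.count(word.lower())

if __name__ == "__main__" and len(sys.argv) > 1:
    target = sys.argv[1]
    print(f"{target} occurred {count_word(target, sys.stdin.read())} times")
```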
When you think your program is working, you can use autotest to run some simple automated tests:
$ 2041 autotest count_word
When you are finished working on this exercise, you must submit your work by running give :
$ give cs2041 lab08_count_word count_word.py
Write a Python script frequency.py that prints the frequency with which each artist uses a word specified as an argument.
So if Justin Bieber uses the word "love" 493 times in the 46589 words of his songs, then its frequency is 493/46589 ≈ 0.010582.
When you think your program is working, you can use autotest to run some simple automated tests:
$ 2041 autotest frequency
When you are finished working on this exercise, you must submit your work by running give :
$ give cs2041 lab08_frequency frequency.py
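A possible sketch of frequency.py. It assumes the artist files live in the lyrics/ directory unpacked earlier; the exact output format is a guess, since the expected output is not shown here:

```python
#!/usr/bin/env python3
# Possible sketch of frequency.py: for the word given as an argument,
# print each artist's usage frequency (occurrences / total words).
# Assumes lyrics for each artist are in lyrics/<Artist_Name>.txt.
import glob
import re
import sys

def frequency(word, text):
    # Frequency = count of the word / total number of words.
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return words.count(word.lower()) / len(words)

if __name__ == "__main__" and len(sys.argv) > 1:
    word = sys.argv[1]
    for path in sorted(glob.glob("lyrics/*.txt")):
        artist = path.split("/")[-1].removesuffix(".txt").replace("_", " ")
        with open(path, encoding="utf-8") as f:
            print(artist, frequency(word, f.read()))
```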
Given that David Bowie uses:
the word "truth" with frequency 0.000146727
the word "is" with frequency 0.005898407
the word "beauty" with frequency 0.000264108
we can estimate the probability of Bowie writing the phrase "truth is beauty" as:
0.000146727 * 0.005898407 * 0.000264108 = 2.28573738067596e-10
We could similarly estimate probabilities for each of the other 9 artists and then determine which of the 10 artists is most likely to sing "truth is beauty" (it's Leonard Cohen).
A sidenote: we are actually making a large simplifying assumption in calculating this probability: that each word is chosen independently of the surrounding words.
It is often called the bag of words model.
Multiplying many small probabilities together quickly produces a number too small to represent as a floating-point value, so the result underflows to zero.
A common solution to this underflow is instead to work with the log of the numbers.
So instead we will calculate the log of the probability of the phrase, by adding the logs of the probabilities of each word.
For example, you would calculate the log-probability of Bowie singing the phrase "truth is beauty" like this:
log(0.000146727) + log(0.005898407) + log(0.000264108) = -22.1991622527613
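Both calculations above can be checked with Python's math module:

```python
# Check the product and log-sum above for Bowie's "truth is beauty".
import math

freqs = [0.000146727, 0.005898407, 0.000264108]  # truth, is, beauty

probability = math.prod(freqs)                    # direct product (tiny!)
log_probability = sum(math.log(f) for f in freqs)  # sum of logs instead

print(probability)      # ≈ 2.2857e-10
print(log_probability)  # ≈ -22.1992
```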
Log-probabilities can be used directly to determine the most likely artist, as the artist with the highest log-probability will also have the highest probability.
Another problem is that we might be given a word that an artist has not used in the dataset we have: its estimated probability would be 0, and log(0) is undefined.
You should avoid this when estimating probabilities by adding 1 to the count of occurrences of each word.
So for example we'd estimate the probability of Ed Sheeran using the word fear as (0+1)/18205 and the probability of Metallica using the word fear as (39+1)/38082.
This is a simple version of Additive smoothing.
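The smoothed estimate is a one-liner:

```python
def smoothed_probability(count, total_words):
    # Add 1 to the raw count so a word unseen in an artist's lyrics
    # (count == 0) still receives a small non-zero probability,
    # keeping log() well-defined.
    return (count + 1) / total_words
```

With the numbers above, `smoothed_probability(0, 18205)` gives (0+1)/18205 for Ed Sheeran and `smoothed_probability(39, 38082)` gives (39+1)/38082 for Metallica.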
Write a Python script log_probability.py which given a phrase (sequence of words) as arguments, prints the estimated log of the probability that each artist would use this phrase.
When you think your program is working, you can use autotest to run some simple automated tests:
$ 2041 autotest log_probability
When you are finished working on this exercise, you must submit your work by running give :
$ give cs2041 lab08_log_probability log_probability.py
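A possible sketch of log_probability.py, combining word splitting, add-one smoothing, and the sum of logs. The output format is a guess, as the expected output is not shown here:

```python
#!/usr/bin/env python3
# Possible sketch of log_probability.py: for the phrase given as
# command line arguments, print each artist's estimated log-probability
# of using that phrase. Assumes lyrics are in lyrics/<Artist_Name>.txt.
import glob
import math
import re
import sys
from collections import Counter

def log_probability(phrase_words, artist_words):
    counts = Counter(artist_words)  # Counter gives 0 for unseen words
    total = len(artist_words)
    # Add-one smoothing keeps unseen words away from log(0).
    return sum(math.log((counts[w] + 1) / total) for w in phrase_words)

if __name__ == "__main__" and len(sys.argv) > 1:
    phrase = [w.lower() for w in sys.argv[1:]]
    for path in sorted(glob.glob("lyrics/*.txt")):
        artist = path.split("/")[-1].removesuffix(".txt").replace("_", " ")
        with open(path, encoding="utf-8") as f:
            artist_words = re.findall(r"[a-zA-Z]+", f.read().lower())
        print(artist, log_probability(phrase, artist_words))
```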
Write a Python script identify_artist.py that given 1 or more files (each containing part of a song), prints the most likely artist to have sung those words.
For each file given as argument, you should go through all artists and for each calculate the log-probability that the artist sung those words.
You calculate the log-probability that the artist sung the words in the file by, for each word in the file, calculating the log-probability of that artist using that word, and summing all the log-probabilities.
You should print the artist with the highest log-probability.
Your program should produce exactly this output:
$ ./identify_artist.py song?.txt
song0.txt most resembles the work of Adele (log-probability=-352.4)
song1.txt most resembles the work of Rihanna (log-probability=-254.9)
song2.txt most resembles the work of Ed Sheeran (log-probability=-206.6)
song3.txt most resembles the work of Justin Bieber (log-probability=-1089.8)
song4.txt most resembles the work of Leonard Cohen (log-probability=-493.8)
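A possible sketch of identify_artist.py, putting the earlier pieces together. It assumes the lyrics/<Artist_Name>.txt layout unpacked earlier:

```python
#!/usr/bin/env python3
# Possible sketch of identify_artist.py: for each file given as an
# argument, print the artist with the highest log-probability.
import glob
import math
import re
import sys
from collections import Counter

def words(text):
    return re.findall(r"[a-zA-Z]+", text.lower())

def log_probability(song_words, artist_words):
    counts = Counter(artist_words)
    total = len(artist_words)
    # Add-one smoothing keeps unseen words away from log(0).
    return sum(math.log((counts[w] + 1) / total) for w in song_words)

if __name__ == "__main__":
    artists = {}
    for path in sorted(glob.glob("lyrics/*.txt")):
        name = path.split("/")[-1].removesuffix(".txt").replace("_", " ")
        with open(path, encoding="utf-8") as f:
            artists[name] = words(f.read())
    for filename in sys.argv[1:]:
        with open(filename, encoding="utf-8") as f:
            song = words(f.read())
        scores = {a: log_probability(song, w) for a, w in artists.items()}
        best = max(scores, key=scores.get)
        print(f"{filename} most resembles the work of {best} "
              f"(log-probability={scores[best]:.1f})")
```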
When you think your program is working, you can use autotest to run some simple automated tests:
$ 2041 autotest identify_artist
When you are finished working on this exercise, you must submit your work by running give :
$ give cs2041 lab08_identify_artist identify_artist.py
When you are finished with each exercise make sure you submit your work by running give .
You can run give multiple times. Only your last submission will be marked.
Don't submit any exercises you haven't attempted.
You can check the files you have submitted here.
After automarking is run by the lecturer you can view your results here. The resulting mark will also be available via give's web interface.
Lab Marks
When all components of a lab are automarked you should be able to view the marks via give's web interface or by running this command on a CSE machine:
$ 2041 classrun -sturec
For all enquiries, please email the class account at cs2041@cse.unsw.edu.au
CRICOS Provider 00098G