
CSCI804-Assignment 4 Document Retrieval and Movie Ratings Solved

TASK ONE : Document Retrieval
The field of information retrieval is concerned with finding relevant electronic documents based upon a query. For example, given a group of keywords (the query), a search engine retrieves Web pages (documents) and displays them sorted by relevance to the query. This technology requires a way to compare each document with the query to determine which is most relevant.

A simple way to make this comparison is to compute the binary cosine coefficient. The coefficient is a value between 0 and 1, where 1 indicates that the query is very similar to the document and 0 indicates that the query has no keywords in common with the document. This approach treats each document as a set of words. For example, given the following sample document:

“Chocolate ice cream, chocolate milk, and chocolate bars are delicious”

This document would be parsed into the following set of keywords, with case ignored and punctuation discarded:

{chocolate, ice, cream, milk, and, bars, are, delicious}

An identical process is performed on the query to turn it into a set of keywords.
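The parsing step described above can be sketched roughly as follows. The helper name toKeywordSet is hypothetical, and the sketch assumes that words are separated by whitespace, that punctuation characters are simply deleted (so "cream," becomes "cream"), and that a C++11 compiler is available. Other readings of "punctuation discarded", such as splitting words at punctuation, are equally possible.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper: converts raw text into a set of lowercase keywords.
// Assumption: whitespace separates words; every character that is not a
// letter or digit (punctuation, quotes, ...) is dropped.
std::set<std::string> toKeywordSet(const std::string& text) {
    std::set<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isspace(ch)) {
            // End of a word: store it if anything was collected.
            if (!current.empty()) {
                words.insert(current);
                current.clear();
            }
        } else if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        }
        // Any other character is skipped.
    }
    if (!current.empty()) {
        words.insert(current);
    }
    return words;
}
```

A std::set is one natural choice here because it removes duplicates automatically, so the three occurrences of "chocolate" in the sample document collapse into a single keyword.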

Once we have a query Q represented as a set of words and a document D represented as a set of words, the similarity (relevance) between Q and D is computed by:

$$\text{relevance} = \frac{|Q \cap D|}{\sqrt{|Q|}\,\sqrt{|D|}}$$

where |Q| and |D| are the numbers of words in Q and D respectively, and |Q∩D| is the number of words that appear in both Q and D (the size of the intersection of Q and D).
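As an illustration, the binary cosine coefficient could be computed along these lines. The function name relevance and the decision to return 0 for an empty query or document are assumptions made for this sketch (C++11), not requirements stated in the task.

```cpp
#include <cmath>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: binary cosine coefficient between two keyword sets,
// i.e. |Q ∩ D| / (sqrt(|Q|) * sqrt(|D|)).
// Assumption: an empty query or document yields a relevance of 0.
double relevance(const std::set<std::string>& query,
                 const std::set<std::string>& doc) {
    if (query.empty() || doc.empty()) {
        return 0.0;
    }
    // Count the query words that also occur in the document.
    std::size_t common = 0;
    for (const std::string& word : query) {
        if (doc.count(word) > 0) {
            ++common;
        }
    }
    return static_cast<double>(common) /
           (std::sqrt(static_cast<double>(query.size())) *
            std::sqrt(static_cast<double>(doc.size())));
}
```

For the sample document above (8 keywords) and a query {chocolate, milk}, both query words occur in the document, so the relevance is 2 / (√2 · √8) = 0.5.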

Select appropriate STL containers and write a program that takes a set of keywords (any number of words) that represent a query. The program should then compare the query to all the document files (whose names end with the extension .txt) listed in the file called listofdocs.txt and output the relevance and the documents in descending order of relevance. If a document contains more than 10 words, output only the first 10 words of the document followed by the symbol “…”.
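One possible way to handle the ordering and the 10-word preview is sketched below. The struct ScoredDoc and the helpers preview and sortByRelevance are hypothetical names, and the sketch assumes a C++11 compiler and that "words" in the preview means whitespace-separated tokens of the original text.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one document and its computed relevance.
struct ScoredDoc {
    std::string name;
    double relevance;
    std::string text;
};

// Returns the first maxWords whitespace-separated words of text,
// followed by " …" only if the text contains more words than that.
std::string preview(const std::string& text, std::size_t maxWords = 10) {
    std::istringstream in(text);
    std::string word, out;
    std::size_t count = 0;
    while (in >> word) {
        if (count == maxWords) {
            out += " …";
            break;
        }
        if (count > 0) {
            out += ' ';
        }
        out += word;
        ++count;
    }
    return out;
}

// Orders documents from most to least relevant; stable_sort keeps the
// original order of documents that have equal relevance.
void sortByRelevance(std::vector<ScoredDoc>& docs) {
    std::stable_sort(docs.begin(), docs.end(),
                     [](const ScoredDoc& a, const ScoredDoc& b) {
                         return a.relevance > b.relevance;
                     });
}
```

A std::multimap keyed on relevance with std::greater as the comparator would be an alternative to sorting a vector; either fits the "appropriate STL containers" wording.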

For this task you should submit DocRetrieval.cpp. Your code must compile on Banshee with the instruction

$ g++ DocRetrieval.cpp -o DocRetrieval 

and should run as

$ ./DocRetrieval keywords1 keywords2 keywords3 … 

For example, if the listofdocs.txt lists four documents that are in the same directory as the program:

Kyle01.txt 

Kyle02.txt 

Kyle03.txt 

Kyle04.txt 

Note: check the text files for their contents.  

Run the program as follows:

$ ./DocRetrieval kyle radio 2Day girl 

The output would look like:

(Kyle04.txt - 32.44%) THE radio network Austereo has pulled the top-rating 2Day FM … 

(Kyle03.txt - 23.15%) THE top-rating radio station 2Day FM and its owner, Austereo … 

(Kyle01.txt - 8.98%) The Ten Network has dumped embattled host Kyle Sandilands as … 

(Kyle02.txt - 0.00%) Word around the traps yesterday was that Monday night's televisual … 
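The layout of the sample output suggests that the relevance is printed as a percentage with two decimal places. A minimal sketch of formatting one result line under that assumption follows; formatResult is a hypothetical helper name, and the exact spacing is inferred from the sample rather than specified.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: builds one output line such as
// "(Kyle02.txt - 0.00%) Word around ...".
// Assumption: relevance is in [0, 1] and is shown as a percentage
// with two digits after the decimal point.
std::string formatResult(const std::string& fileName,
                         double relevance,
                         const std::string& previewText) {
    std::ostringstream out;
    out << '(' << fileName << " - "
        << std::fixed << std::setprecision(2) << relevance * 100.0
        << "%) " << previewText;
    return out.str();
}
```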

 

TASK TWO : Movie Ratings
You have collected files of movie ratings where each movie is rated from 1 (bad) to 5 (excellent).  The first line of each file is a number that identifies how many ratings are in the file.  Each rating then consists of two lines:  the name of the movie followed by the numeric rating from 1 to 5. Here is a sample rating file with four unique movies and seven ratings:

 

File: ratings.txt 

---------------------- 

7
Harry Potter and the Order of the Phoenix
4
Harry Potter and the Order of the Phoenix
5
The Bourne Ultimatum
3
Harry Potter and the Order of the Phoenix
4
The Bourne Ultimatum
4
Wall-E
4
Glitter
1

-------------------------------- 

 

Choose a proper STL container and write a program that reads multiple files in this format, calculates the average rating for each movie, and outputs the average along with the number of reviews. Here is the desired output for the sample data:

 

Glitter: 1 review, average of 1/5 

Harry Potter and the Order of the Phoenix: 3 reviews, average of 4.3/5 

The Bourne Ultimatum: 2 reviews, average of 3.5/5 

Wall-E: 1 review, average of 4/5 
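The report above could be produced with something like the following sketch. RatingTable and printAverages are hypothetical names; the table is assumed to map each title to a (review count, rating sum) pair. A std::map keeps titles in sorted order, which matches the alphabetical order of the desired output, and a precision of two significant digits in the default floating-point format prints 4.3, 3.5, 4 and 1 as shown.

```cpp
#include <iomanip>
#include <map>
#include <ostream>
#include <string>
#include <utility>

// Hypothetical table: movie title -> (number of reviews, sum of ratings).
typedef std::map<std::string, std::pair<int, int> > RatingTable;

// Prints one line per movie, e.g. "Glitter: 1 review, average of 1/5".
// Assumption: out uses the default floating-point format (not std::fixed),
// so setprecision(2) means two significant digits.
void printAverages(const RatingTable& table, std::ostream& out) {
    for (RatingTable::const_iterator it = table.begin();
         it != table.end(); ++it) {
        const int reviews = it->second.first;
        if (reviews == 0) {
            continue;  // avoid dividing by zero
        }
        const double average =
            static_cast<double>(it->second.second) / reviews;
        out << it->first << ": " << reviews
            << (reviews == 1 ? " review" : " reviews")
            << ", average of " << std::setprecision(2) << average
            << "/5\n";
    }
}
```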

 

