
CS60092-Assignment 2 Ranked Retrieval for Free Text Queries IR Solved

This assignment is on building a tf-idf based ranked retrieval system to answer free-text queries. Python is highly recommended, since libraries such as nltk make steps like stop word removal and lemmatization much easier. If you use another language, you will most likely have to write these modules yourself, and they may not perform as well as the nltk library in Python.

•    Find your dataset and other required information.

–   Dataset: The dataset contains 1000 text files from the same dataset you used for the last assignment.

–   Static Quality Score: A Python list containing the static quality score of each of the 1000 documents (see Section 7.1.4 of the textbook by Manning), i.e. a mapping where the key is a document and the value is its g(d) value.

–   Leaders: A Python list containing the indices of 30 leader documents (see Section 7.1.6 of the textbook by Manning).

•    Remove stop words and punctuation marks, convert everything to lowercase, and perform lemmatization to generate tokens from each document (use the nltk library in Python).
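The preprocessing step above can be sketched as follows. This is a minimal illustration rather than the required implementation: the tiny `STOP_WORDS` set and the identity default for `lemmatize` are placeholders chosen so the snippet runs without extra downloads. In the assignment itself one would presumably pass `nltk.corpus.stopwords.words("english")` and `nltk.stem.WordNetLemmatizer().lemmatize` instead.

```python
import string

# Tiny illustrative stop list (an assumption); in practice use
# nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def preprocess(text, stop_words=STOP_WORDS, lemmatize=lambda w: w):
    """Lowercase, strip punctuation, drop stop words, then lemmatize.

    `lemmatize` is any callable str -> str; the default leaves words
    unchanged so the sketch has no external dependencies.
    """
    # Map every punctuation character to a space so "end.Next" still splits.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [lemmatize(w) for w in words if w not in stop_words]


tokens = preprocess("The cats are sitting, quietly, in the garden.")
print(tokens)
```

The same function can be applied to both documents and queries, so that both end up in the same token space.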

•    Tasks

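–   Let $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$, where $\text{tf}_{t,d} = \log_{10}(1 + \widetilde{\text{tf}}_{t,d})$ and $\text{idf}_t = \log_{10}(N/\text{df}_t)$. Here $\widetilde{\text{tf}}_{t,d}$ denotes the number of times term t appears in document d, $\text{df}_t$ is the number of documents containing t, and N is the total number of documents.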

–   Build InvertedPositionalIndex, that is, a Python dictionary with $(t, \text{idf}_t)$ as keys and lists of $(d, \text{tf}_{t,d})$ pairs as postings (where t is a term and d is a document).

–   Build ChampionListLocal, that is, a Python dictionary that contains a list for each term, holding the indices of the top 50 documents with the highest $\text{tf}_{t,d}$ values.

–   Build ChampionListGlobal, that is, a Python dictionary that contains a list for each term, holding the indices of the top 50 documents with the highest $g(d) + \text{tf-idf}_{t,d}$ values.
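The three structures in the tasks above can be sketched roughly as follows. The function names `build_index` and `champion_lists`, the input shape `{doc_id: tokens}`, and the representation of g(d) as a dictionary `g` are assumptions made for illustration; only the weighting formulas and the top-50 cutoff come from the task description.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs: {doc_id: list of tokens} -> {(term, idf): [(doc_id, tf), ...]}."""
    n_docs = len(docs)
    raw = defaultdict(dict)  # term -> {doc_id: raw count}
    for doc_id, tokens in docs.items():
        for term, count in Counter(tokens).items():
            raw[term][doc_id] = count
    index = {}
    for term, counts in raw.items():
        idf = math.log10(n_docs / len(counts))  # idf_t = log10(N / df_t)
        # tf_{t,d} = log10(1 + raw count)
        index[(term, idf)] = [(d, math.log10(1 + c)) for d, c in sorted(counts.items())]
    return index


def champion_lists(index, g=None, k=50):
    """Top-k documents per term.

    With g=None documents are ranked by tf (local list); otherwise by
    g(d) + tf * idf (global list), where g maps doc_id -> static score.
    """
    champions = {}
    for (term, idf), postings in index.items():
        if g is None:
            key = lambda p: p[1]
        else:
            key = lambda p, idf=idf: g[p[0]] + p[1] * idf
        ranked = sorted(postings, key=key, reverse=True)
        champions[term] = [d for d, _ in ranked[:k]]
    return champions


# Toy corpus and made-up static scores, only to show the call pattern.
docs = {0: ["cat", "cat", "dog"], 1: ["dog"], 2: ["cat", "bird"]}
g = {0: 0.1, 1: 0.9, 2: 0.5}
index = build_index(docs)
local = champion_lists(index, k=1)
global_ = champion_lists(index, g=g, k=1)
print(local["cat"], global_["cat"])
```

On this toy corpus the local list for "cat" favours document 0 (highest term frequency), while the global list favours document 2 because its larger static score outweighs the tf-idf difference.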

•    Answering free text queries: The queries to be answered are free-text queries. First remove stop words and punctuation marks, convert the text to lowercase, and apply lemmatization to the query text. Let the query resulting from this preprocessing step be Q. Then find the top 10 relevant documents according to each of the following scoring schemes.

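–   tf-idf score(Q, d) = $\dfrac{\vec{V}(Q) \cdot \vec{V}(d)}{|\vec{V}(Q)|\,|\vec{V}(d)|}$, where $\vec{V}(Q)(t) = \text{idf}_t$ if $t \in Q$ and 0 otherwise, $\vec{V}(d)(t) = \text{tf-idf}_{t,d}$, and $|x|$ denotes the Euclidean norm.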

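–   Local Champion List Score(Q, d) = $\dfrac{\vec{V}(Q) \cdot \vec{V}(d)}{|\vec{V}(Q)|\,|\vec{V}(d)|}$, where $\vec{V}(Q)(t) = \text{idf}_t$ if $t \in Q$ and 0 otherwise, $\vec{V}(d)(t) = \text{tf-idf}_{t,d}$, and $|x|$ denotes the Euclidean norm. Only documents in $A = \{d \mid d \in \text{LocalChampionList}(t),\ t \in Q\}$ are scored.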

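–   Global Champion List Score(Q, d) = $\dfrac{\vec{V}(Q) \cdot \vec{V}(d)}{|\vec{V}(Q)|\,|\vec{V}(d)|}$, where $\vec{V}(Q)(t) = \text{idf}_t$ if $t \in Q$ and 0 otherwise, $\vec{V}(d)(t) = \text{tf-idf}_{t,d}$, and $|x|$ denotes the Euclidean norm. Only documents in $A = \{d \mid d \in \text{GlobalChampionList}(t),\ t \in Q\}$ are scored.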

–   Cluster Pruning Scheme (see Section 7.1.6 of the textbook by Manning):

*    IndexOfLeaders contains the list of leader file names.

*    Let your query be Q. Define $L(Q) = d$ if tf-idf score(d, Q) = $\max_{d' \in \text{IndexOfLeaders}}$ tf-idf score(d', Q), that is, L(Q) is the leader with the highest score against the query.

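*    Find $\text{Followers}(L) = \{d \mid \text{tf-idf score}(d, L) \geq \text{tf-idf score}(d, \bar{L})\ \ \forall\, \bar{L} \in \text{IndexOfLeaders}\}$, that is, the documents whose closest leader is L.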

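*    Cluster Pruning Score(Q, d) = $\dfrac{\vec{V}(Q) \cdot \vec{V}(d)}{|\vec{V}(Q)|\,|\vec{V}(d)|}$, where $\vec{V}(Q)(t) = \text{idf}_t$ if $t \in Q$ and 0 otherwise, $\vec{V}(d)(t) = \text{tf-idf}_{t,d}$, and $|x|$ denotes the Euclidean norm. Only documents in $A = \{L(Q)\} \cup \text{Followers}(L(Q))$ are scored.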
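The four scoring schemes differ only in which candidate set A is scored, which the following sketch tries to make explicit. It assumes that the score is the cosine similarity between sparse vectors stored as `{term: weight}` dictionaries, that document-to-leader scores use the documents' tf-idf vectors, and that a follower is a document scoring at least as high against L(Q) as against any other leader. All function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, candidates=None, k=10):
    """Rank candidates (all documents if None) by cosine score, best first."""
    ids = doc_vecs if candidates is None else candidates
    scored = [(cosine(query_vec, doc_vecs[d]), d) for d in ids]
    # Sort by descending score, breaking ties by document id.
    return [d for _, d in sorted(scored, key=lambda x: (-x[0], x[1]))[:k]]


def champion_candidates(query_terms, champions):
    """A = union of the champion lists of all query terms."""
    return set().union(*(champions.get(t, []) for t in query_terms))


def cluster_candidates(query_vec, doc_vecs, leaders):
    """A = {L(Q)} plus the followers of L(Q)."""
    best = max(leaders, key=lambda l: cosine(query_vec, doc_vecs[l]))
    followers = {
        d
        for d in doc_vecs
        if d not in leaders
        and all(
            cosine(doc_vecs[d], doc_vecs[best]) >= cosine(doc_vecs[d], doc_vecs[l])
            for l in leaders
        )
    }
    return {best} | followers


# Toy tf-idf vectors, invented purely to show the call pattern.
doc_vecs = {
    0: {"cat": 1.0},
    1: {"cat": 0.8, "dog": 0.2},
    2: {"dog": 1.0},
    3: {"dog": 0.9, "bird": 0.3},
}
query_vec = {"cat": 0.5}  # V(Q)(t) = idf_t for each query term
champions = {"cat": [0, 1], "dog": [2, 3]}

print(top_k(query_vec, doc_vecs, k=2))
print(top_k(query_vec, doc_vecs, champion_candidates(["cat"], champions)))
print(top_k(query_vec, doc_vecs, cluster_candidates(query_vec, doc_vecs, [0, 2])))
```

Passing the local or global champion dictionary to `champion_candidates` gives the two champion list schemes, and passing no candidate set gives the plain tf-idf score over the full collection.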
