Web Intelligence
Users might mention other users in their tweets by using the @ symbol followed by the other user's name, such as @jack in the example. Multiple other users might be mentioned in a single tweet. The mention graph is therefore a directed graph where the vertices are the users and the edges represent the mentions: userA -[mentions]-> userB. The number of times that userA mentions userB can be used as the weight of the edge. You do not need to consider time when creating the mention graph.
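As a minimal sketch, the mentioned aliases can be pulled out of a tweet's text with a regular expression (the pattern and the sample tweet below are assumptions; check how aliases actually appear in the data):

```python
import re

# Assumed pattern: an alias is "@" followed by word characters.
MENTION_RE = re.compile(r"@(\w+)")

tweet = "thanks @jack and @alice for the retweet!"  # made-up example
print(MENTION_RE.findall(tweet))  # ['jack', 'alice']
```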
You need to do the following (marks are allocated to each part out of 100%):
1. Extract the mention graph from the Twitter data (marks 20%):
• parse the .tsv file using Python to get the individual tweets;
• generate the adjacency list for each user based on the alias. A user is adjacent to another user if they are mentioned in a tweet. Keep track of the number of times that some userA mentions userB;
• use the adjacency list to create the mention graph in NetworkX, with the counts as weights on the edges (a sketch of this pipeline follows this list);
• in the jupyter notebook provide information about any challenges that you might have encountered when parsing and cleaning the data and when creating the graph, and how you solved them. Discuss as well any decisions that you might have made.
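One possible shape for this pipeline, as a sketch; the file name tweets.tsv and the column layout (author alias first, tweet text last) are assumptions you will need to adapt to the actual data:

```python
import csv
import re
from collections import defaultdict

import networkx as nx

MENTION_RE = re.compile(r"@(\w+)")
counts = defaultdict(int)  # (userA, userB) -> times userA mentions userB

# Assumed layout: author alias in the first column, tweet text in the last.
with open("tweets.tsv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        author, text = row[0], row[-1]
        for mentioned in MENTION_RE.findall(text):
            if mentioned != author:  # one possible decision: drop self-mentions
                counts[(author, mentioned)] += 1

# Directed mention graph with the counts as edge weights.
G = nx.DiGraph()
for (a, b), w in counts.items():
    G.add_edge(a, b, weight=w)
```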
2. To start analysing the underlying structure of the graph, compute the following statistics for the graph (marks 30%):
• number of nodes and edges;
• indegree and outdegree;
• degree distribution;
• average path length;
• global clustering coefficient.
In the jupyter notebook provide information about any challenges that you might have encountered when computing these statistics, and how you solved them. Is the graph connected? Are there “giant” components[2] in the graph? What can you say about the graph based on the computed statistics?
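A sketch of how these statistics might be computed with NetworkX, assuming G is the mention graph from Part 1. Computing the average path length and the clustering coefficient on an undirected view of the graph is one possible decision, not the only one; justify whichever you take:

```python
from collections import Counter

import networkx as nx

print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())

in_deg = dict(G.in_degree())    # indegree per user
out_deg = dict(G.out_degree())  # outdegree per user

# Degree distribution: how many nodes have each total degree.
degree_dist = Counter(d for _, d in G.degree())

# Average path length is only defined on a connected graph, so one option
# is to compute it on the largest component of the undirected view.
UG = G.to_undirected()
largest = max(nx.connected_components(UG), key=len)
print("avg path length:",
      nx.average_shortest_path_length(UG.subgraph(largest)))

# Global clustering coefficient (transitivity), also on the undirected view.
print("transitivity:", nx.transitivity(UG))
```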
3. Determine the top 10 users (marks 30%): based on the degree, closeness and betweenness centrality measures. You will need to decide how to handle the graph's directionality to facilitate computation. Discuss how you did this.
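A sketch of one way to rank the users, assuming G is the mention graph. Here the centralities are computed on the directed graph as-is; converting to an undirected graph first is another defensible choice, and either way the decision should be discussed:

```python
import networkx as nx

for name, scores in [
    ("degree", nx.degree_centrality(G)),
    ("closeness", nx.closeness_centrality(G)),
    ("betweenness", nx.betweenness_centrality(G)),
]:
    top10 = sorted(scores, key=scores.get, reverse=True)[:10]
    print(name, "->", top10)
```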
4. Visualise the graph (or the most important components) (marks 20%): use some size-color-valuation scheme, whereby nodes with higher centrality (say betweenness) will have a specific color and size. The user’s alias can be used as the node label and the edge width can represent the weight (based on mention counts).
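One possible scheme, as a sketch assuming G is the mention graph and matplotlib is available; the scaling constants are arbitrary choices to tune:

```python
import matplotlib.pyplot as plt
import networkx as nx

bc = nx.betweenness_centrality(G)
pos = nx.spring_layout(G, seed=42)

# Node size and colour scale with betweenness; constants are arbitrary.
nx.draw_networkx_nodes(G, pos,
                       node_size=[3000 * bc[n] + 20 for n in G],
                       node_color=[bc[n] for n in G],
                       cmap=plt.cm.viridis)
# Edge width scales with the mention count stored as the edge weight.
nx.draw_networkx_edges(G, pos,
                       width=[G[u][v]["weight"] for u, v in G.edges()])
nx.draw_networkx_labels(G, pos, font_size=6)  # aliases as node labels
plt.axis("off")
plt.show()
```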
This task is allocated a total of 50 marks.
1.2 Task 2: Information Retrieval
For this task, you are to use the WES dataset available from http://pikes.fbk.eu/ke4ir to build a simple Information Retrieval engine that uses the Vector Space model to find documents related to a user query. You need to use the following files from the dataset:
• The document collection – https://knowledgestore.fbk.eu/files/ke4ir/docs-raw-texts.zip; and
• The set of queries – https://knowledgestore.fbk.eu/files/ke4ir/queries-raw-texts.zip.
Both of these files are being made available on the VLE.
An IR engine consists of TWO parts – the document indexing part, and the querying component.
For the document indexing part, you need to implement these process steps:
1. Parse each document to extract the data in the XML's <raw> tag;
2. Tokenise the documents’ content;
3. Perform case-folding, stop-word removal and stemming;
4. Build the term-by-document matrix containing the TF.IDF weight for each term within each document (a sketch of these steps follows this list).
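A sketch of the indexing pipeline using NLTK. The directory name docs-raw-texts, the .//raw tag lookup and the log-base-10 IDF are assumptions; inspect the unzipped files before relying on them, and note that nltk.download("punkt") and nltk.download("stopwords") are needed once:

```python
import math
import os
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())       # tokenise + case-fold
    return [stemmer.stem(t) for t in tokens
            if t.isalpha() and t not in stop]  # stop-word removal + stemming

docs = {}  # doc_id -> list of terms
for fname in os.listdir("docs-raw-texts"):     # assumed unzipped directory
    tree = ET.parse(os.path.join("docs-raw-texts", fname))
    raw = tree.find(".//raw")                  # assumed location of <raw>
    docs[fname] = preprocess(raw.text if raw is not None else "")

# Term-by-document matrix of TF.IDF weights, stored sparsely.
N = len(docs)
df = Counter(t for terms in docs.values() for t in set(terms))
tfidf = defaultdict(dict)  # term -> {doc_id: weight}
for doc_id, terms in docs.items():
    for t, f in Counter(terms).items():
        tfidf[t][doc_id] = f * math.log10(N / df[t])
```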
For the querying component, you need to implement the following process steps:
• Get a user query – note that it can be set within the notebook, directly into a variable named query;
• Preprocess the user query (tokenisation, case-folding, stop-word removal and stemming);
• Use cosine similarity to calculate the similarity between the preprocessed query and each document in the collection;
• Output the documents as a ranked list, ordered by similarity (a sketch of this component follows this list).
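A sketch of the querying side, reusing preprocess, tfidf, df and N from the indexing sketch above. The cosine similarity is written by hand, as the brief requires, and the example query is made up:

```python
import math
from collections import Counter, defaultdict

query = "british nobel prize winners"  # hypothetical example query

# Preprocess the query exactly like the documents and weight it with TF.IDF.
q_vec = {t: f * math.log10(N / df[t])
         for t, f in Counter(preprocess(query)).items() if t in df}

# Accumulate dot products between the query and every document vector.
scores = defaultdict(float)
for t, qw in q_vec.items():
    for doc_id, dw in tfidf[t].items():
        scores[doc_id] += qw * dw

# Divide by the vector lengths to turn dot products into cosine similarities.
doc_norm = defaultdict(float)
for t in tfidf:
    for doc_id, dw in tfidf[t].items():
        doc_norm[doc_id] += dw * dw
q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
for doc_id in scores:
    denom = q_norm * math.sqrt(doc_norm[doc_id])
    scores[doc_id] = scores[doc_id] / denom if denom else 0.0

# Ranked list, most similar document first.
for doc_id, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{s:.4f}\t{doc_id}")
```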
Note that you can use NLTK or any other library to help in the tokenisation and preprocessing of the text. However, you need to implement your own TF.IDF weighting and Cosine Similarity measures.
[2] A giant component is a connected component that is said to have a significant fraction of the nodes.