COMP90051 Project 1

Pairwise relationships are prevalent in real life: for example, friendships between people, communication links between computers, and pairwise similarity of images. Networks provide a way to represent a group of such relationships. The entities in question are represented as network nodes and the pairwise relations as edges.

In real network data, there are often missing edges between nodes. This can be due to a bug or deficiency in the data collection process, a lack of resources to collect all pairwise relations, or simply uncertainty about those relationships. Analysis performed on incomplete networks with missing edges can bias the final output. For example, if we want to find the shortest path between two cities in a road network but are missing information about the major highways between those cities, then no algorithm will be able to find the actual shortest path.

Furthermore, we might want to predict if an edge will form between two nodes in the future. For example, in disease transmission networks, if health authorities determine a high likelihood of a transmission edge forming between an infected and uninfected person, then the authorities might wish to vaccinate the uninfected person.

Being able to predict and correct for missing edges is therefore an important task.

Your task:

In this project, you will be learning from a training network and trying to predict whether edges exist among test node pairs.

The training network is a partial crawl of the Twitter social network from several years ago. The nodes in the network—Twitter users—have been given randomly assigned IDs, and a directed edge from node A to B represents that user A follows B. The training network is a subgraph of the entire network. Starting from several random seed nodes, we proceeded to obtain the friends of the seeds, then their friends’ friends, and so on for several iterations.

The test data is a list of 2,000 edges, and your task is to predict whether each of those test edges is a real edge in the Twitter network or a fake one. 1,000 of these test edges are real and withheld from the training network, while the other 1,000 do not actually exist.

To make the project fun, we will run it as a Kaggle in-class competition. Your assessment will be partially based on your final ranking in the privately-held competition, partially based on your absolute performance and partially based on your report.

1         Data Format
All data will be available in raw text. The training graph data will be given in a tab-delimited adjacency list format, where each row lists a node followed by its out-neighbours. For example:

1       2

2       3

4       3       5       1

represents the network illustrated in Figure 1.
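For concreteness, the following Python sketch shows one way to read this adjacency-list format into a dictionary mapping each node to the set of nodes it follows. This is only an illustrative sketch, not part of the specification; the function and variable names are our own, and "train.txt" is the training graph file described later in this section.

from collections import defaultdict

def load_graph(path="train.txt"):
    """Map each source node to the set of nodes it follows (its out-neighbours)."""
    out_neighbours = defaultdict(set)
    with open(path) as f:
        for line in f:
            fields = line.strip().split("\t")
            if fields == [""]:          # skip blank lines
                continue
            source, followees = fields[0], fields[1:]
            out_neighbours[source].update(followees)
    return out_neighbours

graph = load_graph()   # for the example above, graph["4"] == {"3", "5", "1"}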

Figure 1: Network diagram for the adjacency list example.

The test edge set is in a (tab-delimited) edge list format, where each row represents an edge (source node, target node). Given this 2,000-row edge list, your implemented algorithm should take the test list as input and return a 2,001-row CSV file that has: (a) in the first row, the string “Id,Prediction”; (b) in all subsequent rows, a consecutive integer ID, a comma, and then a float in the range [0,1]. These floats are your “guesses” or predictions as to whether the corresponding test edge was from the Twitter network or not. Higher predictions correspond to being more confident that the edge is real.

For example, given the test edge set {(3,1),(3,4)}, represented in CSV format as

Id,Source,Sink

1,3,1

2,3,4

if your prediction probabilities are 0.1 for edge (3,1) and 0.99 for edge (3,4), then your output file should be:

Id,Prediction

1,0.1

2,0.99
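
As a rough sketch of the required input/output, the submission file could be produced as follows in Python. Here we assume "test-public.txt" is tab-delimited with a header row, and "score_edge" stands for whatever (hypothetical) scoring function your method provides, returning a probability in [0,1] for a candidate edge.

import csv

def write_submission(score_edge, test_path="test-public.txt", out_path="submission.csv"):
    """Write an "Id,Prediction" CSV file with one prediction per test edge."""
    with open(test_path) as fin, open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["Id", "Prediction"])
        next(fin)                                 # skip the header row
        for line in fin:
            edge_id, source, sink = line.split()  # assumes whitespace/tab-delimited rows
            writer.writerow([edge_id, score_edge(source, sink)])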

The test set will be used to generate an AUC score for your performance; you may submit test predictions multiple times per day (if you wish). During the competition, AUC on a 30% subset of the test set will be used to rank you on the public leaderboard. We will use the complete test set to determine your final AUC and ranking. This split of the test set during and after the competition is used to discourage you from constructing algorithms that overfit to the leaderboard. The training graph “train.txt”, the test edges “test-public.txt”, and a sample submission file “sample.csv” will be available on the Kaggle competition website. In addition to the competition test set, and to prevent overfitting, we encourage you to generate your own test edge sets from the training graph and to evaluate your algorithms on those.
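One simple way to build such a local test set is to sample some existing edges as positives and some random non-edges as negatives, then score both with your model and compute AUC. Below is a minimal Python sketch, assuming "graph" is the out-neighbour dictionary from the loading sketch earlier in this section and "score_edge" is the same hypothetical scoring function; scikit-learn's roc_auc_score is used for the metric. Sampling negatives uniformly at random is only a rough proxy for how the fake test edges were generated, so treat the resulting AUC as a guide rather than a faithful estimate of your leaderboard score.

import random
from sklearn.metrics import roc_auc_score

def local_auc(graph, score_edge, n_pairs=1000, seed=0):
    """Estimate AUC on a self-made test set of real and fake edges."""
    rng = random.Random(seed)
    nodes = list(graph.keys())
    positives, negatives = [], []
    # Positives: edges sampled from the training graph.  For a fairer estimate,
    # remove these edges from the graph before computing your features/scores.
    while len(positives) < n_pairs:
        src = rng.choice(nodes)
        if graph[src]:
            positives.append((src, rng.choice(sorted(graph[src]))))
    # Negatives: random ordered pairs that are not edges in the training graph.
    while len(negatives) < n_pairs:
        src, dst = rng.choice(nodes), rng.choice(nodes)
        if src != dst and dst not in graph[src]:
            negatives.append((src, dst))
    labels = [1] * len(positives) + [0] * len(negatives)
    scores = [score_edge(s, d) for s, d in positives + negatives]
    return roc_auc_score(labels, scores)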

2         Week 1: Links and Check List
Competition link: https://www.kaggle.com/t/5a5da7cccc734eb59ea18d0c35630c84

Team registration: https://goo.gl/forms/YoGdYFcvWwkaqjad2

Random pool: https://goo.gl/forms/JDyukjHqgFtuE0gq2

3         Report
Your report should cover the following:

1.    A brief description of the problem and introduction of any notation that you adopt in the report.

2.    Description of your final approach(es) to link prediction, the motivation and reasoning behind it, and why you think it performed well/not well in the competition.

3.    Any other alternatives you considered and why you chose your final approach over these (this may be in the form of empirical evaluation, but it must support your reasoning - examples like “method A got AUC 0.6 and method B got AUC 0.7, hence we use method B”, with no further explanation, will be marked down).

Your description of the algorithm should be clear and concise. You should write it at a level that a postgraduate student can read and understand without difficulty. If you use any existing algorithms, please do not rewrite the complete description, but provide a summary that shows your understanding and references to the relevant literature. In the report, we will be interested in seeing evidence of your thought processes and reasoning for choosing one algorithm over another.

Dedicate space to describing the features you used and tried, any interesting details about software setup or your experimental pipeline, and any problems you encountered and what you learned. In many cases these issues are at least as important as the learning algorithm, if not more important.
