1. Overview of the Assignment
In this assignment, you will explore the Spark GraphFrames library and implement your own Girvan-Newman algorithm using the Spark framework to detect communities in graphs. You will use the ub_sample_data.csv dataset to find users who have similar tastes in businesses. The goal of this assignment is to help you understand how to use the Girvan-Newman algorithm to detect communities efficiently in a distributed environment.
2.2 Programming Environment
Python 3.6, Scala 2.11, and Spark 2.3.2
3. Datasets
You will continue to use the Yelp dataset. We have generated a sub-dataset, ub_sample_data.csv, from the Yelp review dataset; it contains user_id and business_id. You can download it from Vocareum.
4. Tasks
4.1 Graph Construction
To construct the social network graph, each node represents a user, and there is an edge between two nodes if the number of businesses that both users reviewed is greater than or equal to the filter threshold. For example, suppose user1 reviewed [business1, business2, business3] and user2 reviewed [business2, business3, business4, business5]. They share two businesses (business2 and business3), so if the threshold is 2, there will be an edge between user1 and user2.
If a user node has no edges, we will not include that node in the graph. In this assignment, we use a filter threshold of 7.
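The construction rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the required Spark implementation: the function name build_graph and the (user_id, business_id) tuple input are hypothetical, and a real solution would do the grouping with Spark RDDs.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(reviews, threshold):
    """Build an undirected user graph from (user_id, business_id) pairs.

    Hypothetical helper: returns {user_id: set(neighbor user_ids)}.
    Users without any edge are left out of the result.
    """
    # Collect the distinct businesses each user reviewed.
    businesses_by_user = defaultdict(set)
    for user_id, business_id in reviews:
        businesses_by_user[user_id].add(business_id)

    adjacency = defaultdict(set)
    # Compare every unordered pair of users once.
    for u, v in combinations(sorted(businesses_by_user), 2):
        common = businesses_by_user[u] & businesses_by_user[v]
        if len(common) >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)
```

With the user1/user2 example from this section and a threshold of 2, the two users end up connected; with a threshold of 3 the graph is empty.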
4.2 Task1: Community Detection Based on GraphFrames (2 pts)
In task1, you will explore the Spark GraphFrames library to detect communities in the network graph you constructed in 4.1. The library provides an implementation of the Label Propagation Algorithm (LPA), which was proposed by Raghavan, Albert, and Kumara in 2007. It is an iterative community detection solution whereby information “flows” through the graph based on the underlying edge structure. For the details of the algorithm, you can refer to the paper posted on Piazza. In this task, you do not need to implement the algorithm from scratch; you can call the method provided by the library. The following websites may help you get started with Spark GraphFrames:
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html
4.2.1 Execution Detail
The version of the GraphFrames should be 0.6.0.
For Python:
• In PyCharm, you need to install the package and add the lines below to your code:
pip install graphframes
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11")
• In the terminal, you need to set the “packages” parameter of spark-submit:
--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
For Scala:
• In IntelliJ IDEA, you need to add the library dependencies to your project:
"graphframes" % "graphframes" % "0.6.0-spark2.3-s_2.11"
"org.apache.spark" %% "spark-graphx" % sparkVersion
• In the terminal, you need to assign the parameter “packages” of the spark-submit:
--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
Set the “maxIter” parameter of the LPA method to 5.
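A minimal sketch of how the LPA call might look in Python, assuming the graphframes package is available through the --packages option described above. The helper names to_graphframe_rows and run_lpa are hypothetical, and storing each undirected edge in both directions is an assumption about how to represent the graph, not a requirement stated here.

```python
def to_graphframe_rows(edge_pairs):
    """Turn undirected (user_a, user_b) pairs into vertex and edge rows.

    Assumption: each undirected edge is stored in both directions.
    """
    vertices = sorted({user for pair in edge_pairs for user in pair})
    vertex_rows = [(user,) for user in vertices]
    edge_rows = []
    for a, b in edge_pairs:
        edge_rows.append((a, b))
        edge_rows.append((b, a))
    return vertex_rows, edge_rows


def run_lpa(spark, edge_pairs, max_iter=5):
    """Run GraphFrames label propagation; returns a list of communities.

    `spark` is an existing SparkSession. The import is kept inside the
    function so the module loads even where graphframes is not installed.
    """
    from graphframes import GraphFrame

    vertex_rows, edge_rows = to_graphframe_rows(edge_pairs)
    vertices = spark.createDataFrame(vertex_rows, ["id"])
    edges = spark.createDataFrame(edge_rows, ["src", "dst"])
    graph = GraphFrame(vertices, edges)

    # labelPropagation returns the vertices with an extra "label" column.
    labeled = graph.labelPropagation(maxIter=max_iter)
    return (labeled.rdd
            .map(lambda row: (row["label"], row["id"]))
            .groupByKey()
            .map(lambda kv: sorted(kv[1]))
            .collect())
```

GraphFrames expects the vertex DataFrame to have an "id" column and the edge DataFrame to have "src" and "dst" columns, which is why the rows are shaped this way.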
4.2.2 Output Result
In this task, you need to save your result of communities in a txt file. Each line represents one community and the format is:
‘user_id1’, ‘user_id2’, ‘user_id3’, ‘user_id4’, …
Your result should be sorted first by the size of the communities in ascending order and then by the first user_id in each community in lexicographical order (the user_id is of type string). The user_ids within each community should also be in lexicographical order. If there is only one node in a community, we still regard it as a valid community.
Figure 1: community output file format
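The ordering and line format described above can be sketched as follows. This is a minimal illustration; the function name format_communities is hypothetical, and the use of straight single quotes around each user_id is an assumption based on the format line in this section.

```python
def format_communities(communities):
    """Return output lines for an iterable of communities.

    Each community is an iterable of user_id strings. Communities are
    sorted by size (ascending), then by their smallest user_id.
    """
    # Sort the members of every community lexicographically first.
    ordered = sorted(
        (sorted(members) for members in communities),
        key=lambda members: (len(members), members[0]),
    )
    return [", ".join("'{}'".format(user) for user in members)
            for members in ordered]
```

Writing the file is then a matter of joining the returned lines with newlines and saving them to the community output path.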
4.3 Task2: Community Detection Based on Girvan-Newman Algorithm (6 pts)
In task2, you will implement your own Girvan-Newman algorithm to detect the communities in the network graph. Because your task1 and task2 code will be executed separately, you need to construct the graph again in this task following the rules in section 4.1. You can refer to Chapter 10 of the Mining of Massive Datasets book for the algorithm details.
For task2, you can ONLY use Spark RDD and standard Python or Scala libraries. Remember to delete any code that imports graphframes.
4.3.1 Betweenness Calculation (3 pts)
In this part, you will calculate the betweenness of each edge in the original graph you constructed in 4.1. Then you need to save your result in a txt file. The format of each line is:
(‘user_id1’, ‘user_id2’), betweenness value
Your result should be sorted first by the betweenness values in descending order and then by the first user_id in the tuple in lexicographical order (the user_id is of type string). The two user_ids in each tuple should also be in lexicographical order. You do not need to round your result.
Figure 2: betweenness output file format
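A minimal single-machine sketch of the edge betweenness computation, following the breadth-first credit scheme described in Chapter 10 of Mining of Massive Datasets. The function names and the {node: set(neighbors)} input are hypothetical, and a real solution would distribute the per-root work with Spark RDDs. The straight single quotes in the output lines are an assumption about the expected format.

```python
from collections import defaultdict, deque


def edge_betweenness(adjacency):
    """Return {(user_a, user_b): betweenness} with user_a < user_b."""
    totals = defaultdict(float)
    for root in adjacency:
        # Step 1: BFS from the root, counting shortest paths to each node.
        depth = {root: 0}
        num_paths = defaultdict(float)
        num_paths[root] = 1.0
        parents = defaultdict(list)
        order = []
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in depth:
                    depth[neighbor] = depth[node] + 1
                    queue.append(neighbor)
                if depth[neighbor] == depth[node] + 1:
                    num_paths[neighbor] += num_paths[node]
                    parents[neighbor].append(node)
        # Step 2: pass credit from the deepest nodes back toward the root.
        credit = {node: 1.0 for node in order}
        for node in reversed(order):
            for parent in parents[node]:
                share = credit[node] * num_paths[parent] / num_paths[node]
                credit[parent] += share
                totals[tuple(sorted((parent, node)))] += share
    # Every shortest path was counted from both of its endpoints.
    return {edge: value / 2.0 for edge, value in totals.items()}


def format_betweenness(betweenness):
    """Sort by value (descending), then by first user_id, and format lines."""
    ordered = sorted(betweenness.items(), key=lambda kv: (-kv[1], kv[0][0]))
    return ["('{}', '{}'), {}".format(a, b, value) for (a, b), value in ordered]
```

For a three-node path a-b-c, both edges receive a betweenness of 2.0, since each edge lies on two of the three shortest paths between node pairs.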
4.3.2 Community Detection (3 pts)
You are required to divide the graph into suitable communities that reach the globally highest modularity. The formula of modularity is shown below:
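Assuming the standard modularity definition used with the Girvan-Newman algorithm, the formula can be written as follows, where S is the set of communities and k_i is the degree of node i (S and k_i are symbols assumed here; only “m” and “A” are named in this text):

```latex
Q(G, S) = \frac{1}{2m} \sum_{s \in S} \sum_{i \in s} \sum_{j \in s}
          \left( A_{ij} - \frac{k_i k_j}{2m} \right)
```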
According to the Girvan-Newman algorithm, after removing one edge, you should re-compute the betweenness. The “m” in the formula represents the number of edges in the original graph. The “A” in the formula is the adjacency matrix of the original graph. (Hint: in each removal step, “m” and “A” should not be changed.)
If a community has only one user node, we still regard it as a valid community.
You need to save your result in a txt file. The format is the same as the output file from task1.
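A minimal sketch of scoring one candidate partition, assuming the standard modularity definition. The function names are hypothetical, and taking the degrees k_i from the original graph is an assumption that follows the hint that “m” and “A” stay fixed. The communities are the connected components of the current graph (after some edges have been removed), while the score always uses the original adjacency.

```python
def connected_components(adjacency):
    """Return the connected components of {node: set(neighbors)}."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        members = []
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(members))
    return components


def modularity(communities, original_adjacency):
    """Score a partition using m, A, and degrees of the ORIGINAL graph."""
    # Each undirected edge appears twice in the neighbor sets.
    m = sum(len(nbrs) for nbrs in original_adjacency.values()) / 2.0
    total = 0.0
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1.0 if j in original_adjacency[i] else 0.0
                k_i = len(original_adjacency[i])
                k_j = len(original_adjacency[j])
                total += a_ij - k_i * k_j / (2.0 * m)
    return total / (2.0 * m)
```

In a full solution, these two functions would be called after each edge-removal step, keeping the partition with the highest score seen so far.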
4.4 Execution Format
Execution example:
Python:
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 task1.py <filter threshold> <input_file_path> <community_output_file_path>
spark-submit task2.py <filter threshold> <input_file_path> <betweenness_output_file_path> <community_output_file_path>
Scala:
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 --class task1 hw4.jar <filter threshold> <input_file_path> <community_output_file_path>
spark-submit --class task2 hw4.jar <filter threshold> <input_file_path> <betweenness_output_file_path> <community_output_file_path>
Input parameters:
1. <filter threshold>: the filter threshold to generate edges between user nodes.
2. <input_file_path>: the path to the input file including path, file name and extension.
3. <betweenness_output_file_path>: the path to the betweenness output file including path, file name and extension.
4. <community_output_file_path>: the path to the community output file including path, file name and extension.
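A minimal sketch of reading the four task2 parameters in Python, in the order listed above. The function name parse_task2_args is hypothetical, and converting the threshold to an integer is an assumption.

```python
import sys


def parse_task2_args(argv):
    """Return (threshold, input_path, betweenness_path, community_path)."""
    if len(argv) != 5:
        raise SystemExit(
            "usage: task2.py <filter threshold> <input_file_path> "
            "<betweenness_output_file_path> <community_output_file_path>"
        )
    # Assumption: the filter threshold is a whole number such as 7.
    return int(argv[1]), argv[2], argv[3], argv[4]
```

Inside task2.py this would typically be called as parse_task2_args(sys.argv).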
Execution time:
The overall runtime limit of your task1 (from reading the input file to finishing writing the community output file) is 200 seconds.
The overall runtime limit of your task2 (from reading the input file to finishing writing the community output file) is 250 seconds.
If your runtime exceeds the limit above, you will receive no points for that task.