Part A (Parallel Spatial Join with PostGIS)
We load the points and rectangles datasets into two PostgreSQL tables, points and rectangles, respectively. The points table has a geometry-typed column geom that holds Point geometries. The rectangles table has a geometry-typed column geom that holds Envelope (rectangle) geometries. You are given a Python file Assignment2_Interface.py with an incomplete function parallelJoin, which you need to complete. It takes the following parameters: pointsTable, rectsTable, outputTable, outputPath, and openConnection. The high-level objective is to find the number of points inside each rectangle and save the result to outputTable and outputPath. Perform the following tasks (a sketch of one possible implementation appears after this list):
• Create four partitions/fragments of both pointsTable and rectsTable. Use space partitioning, such that all points or rectangles of a partition lie within the corresponding fragment. The fragments do not need to satisfy the disjointness property.
• Run four parallel threads. Each thread should perform a spatial join between a fragment of pointsTable and a fragment of rectsTable. The purpose of the join is to find the number of points (pointsTable.geom) inside each rectangle, or Envelope (rectsTable.geom), within the corresponding fragment. You must use the ST_Contains function supported by PostGIS.
• Sort each fragment's output in ascending order of point counts, inside the parallel threads.
• Merge the outputs of the four parallel joins into outputTable in ascending order of point counts. You can design the structure of outputTable as you wish, but it must have a column named points_count containing the point counts.
• Write the point counts to outputPath in ascending order. You do not need to write the rectangle coordinates.
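The following is a minimal sketch of one possible parallelJoin implementation. It assumes a psycopg2-style connection, a geom column in both tables, and partitioning of space into four vertical strips along the X axis; the fragment/output table names, the strip-based partitioning scheme, and the use of a single shared connection are illustrative choices, not part of the assignment specification.

    import threading

    def parallelJoin(pointsTable, rectsTable, outputTable, outputPath, openConnection):
        cur = openConnection.cursor()

        # Split the X extent of the rectangles into four strips.
        cur.execute("SELECT MIN(ST_XMin(geom)), MAX(ST_XMax(geom)) FROM " + rectsTable + ";")
        xmin, xmax = cur.fetchone()
        step = (xmax - xmin) / 4.0
        strips = [(xmin + i * step, xmin + (i + 1) * step) for i in range(4)]

        # Fragment i of rectangles: rectangles whose left edge falls in strip i.
        # Fragment i of points: widened to the right edge of those rectangles so that
        # every assigned rectangle is fully covered (fragments may overlap).
        for i, (lo, hi) in enumerate(strips):
            upper = "<=" if i == 3 else "<"
            cur.execute("CREATE TABLE {0}_f{1} AS SELECT * FROM {0} "
                        "WHERE ST_XMin(geom) >= {2} AND ST_XMin(geom) {3} {4};"
                        .format(rectsTable, i, lo, upper, hi))
            cur.execute("SELECT COALESCE(MAX(ST_XMax(geom)), {0}) FROM {1}_f{2};"
                        .format(hi, rectsTable, i))
            right = cur.fetchone()[0]
            cur.execute("CREATE TABLE {0}_f{1} AS SELECT * FROM {0} "
                        "WHERE ST_X(geom) >= {2} AND ST_X(geom) <= {3};"
                        .format(pointsTable, i, lo, right))
        openConnection.commit()

        # Join one pair of fragments per thread, sorting each partial result.
        # Note: psycopg2 serializes queries issued on one shared connection; for
        # real parallelism each thread would open its own connection instead.
        def joinFragment(i):
            c = openConnection.cursor()
            c.execute("CREATE TABLE {0}_f{1} AS "
                      "SELECT COUNT(p.geom) AS points_count "
                      "FROM {2}_f{1} r LEFT JOIN {3}_f{1} p ON ST_Contains(r.geom, p.geom) "
                      "GROUP BY r.geom ORDER BY points_count ASC;"
                      .format(outputTable, i, rectsTable, pointsTable))
            c.close()

        threads = [threading.Thread(target=joinFragment, args=(i,)) for i in range(4)]
        for t in threads: t.start()
        for t in threads: t.join()
        openConnection.commit()

        # Merge the four partial outputs into outputTable, sorted ascending.
        union = " UNION ALL ".join("SELECT points_count FROM {0}_f{1}".format(outputTable, i)
                                   for i in range(4))
        cur.execute("CREATE TABLE {0} AS SELECT points_count FROM ({1}) u "
                    "ORDER BY points_count ASC;".format(outputTable, union))
        openConnection.commit()

        # Write the counts, one per line, to outputPath.
        cur.execute("SELECT points_count FROM {0} ORDER BY points_count ASC;".format(outputTable))
        with open(outputPath, "w") as f:
            for (count,) in cur.fetchall():
                f.write(str(count) + "\n")
        cur.close()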
Setup Instructions for Part A: You should have PostgreSQL >= 13 installed on your machine, with the PostGIS extension installed on top of it.
Naming Conventions to be Followed: You should not change the following naming conventions:
• Database Name: dds_assignment2
• Postgres Username: postgres
• Postgres Password: 12345
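For reference, a connection that follows these conventions might be created as below. This is only a sketch assuming the psycopg2 driver and a local server; the helper name getOpenConnection is illustrative, and the assignment template may already provide an equivalent helper.

    import psycopg2

    def getOpenConnection():
        # Connect to the local server using the required database name and credentials.
        return psycopg2.connect(dbname="dds_assignment2",
                                user="postgres",
                                password="12345",
                                host="localhost")

    # Enable PostGIS once per database (harmless if it is already installed).
    conn = getOpenConnection()
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")
    conn.commit()
    conn.close()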
Part B (Spatial Join with Map Reduce on Apache Sedona)
Similar to Part A, you need to perform a spatial join between the points dataset and the rectangles dataset in order to count the number of points located within each rectangle. This time, you will perform the task with the help of the map and reduce functionality supported by Spark.
Task: You are given an Apache Sedona project named Map-Reduce-Apache-Sedona. Find the Scala file SparkMapReduce.scala and complete the incomplete function runMapReduce(). Instead of using a group-by operation, you must use Spark's map and reduceByKey operations to complete the task. Convert the resulting RDD back to a DataFrame before returning the result. The returned DataFrame should contain the number of points within each rectangle, in ascending order of the count values. You do not need to return the rectangles.
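As a rough illustration of the required pattern, the sketch below assumes that the points and rectangles have already been spatially joined (for example via Sedona's spatial join support) into a DataFrame joinedDf holding one row per (rectangle, contained point) match, with the rectangle serialized as a string in the first column. The object, function, and column names are illustrative; the actual runMapReduce() signature in the template may differ.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object MapReduceSketch {
      // Turn one row per (rectangle, point) match into a sorted DataFrame of
      // per-rectangle point counts, without using a group-by operation.
      def countPointsPerRectangle(spark: SparkSession, joinedDf: DataFrame): DataFrame = {
        import spark.implicits._

        val counts = joinedDf.rdd
          .map(row => (row.getString(0), 1))   // map: emit (rectangle, 1) per matched point
          .reduceByKey(_ + _)                  // reduceByKey: sum the 1s per rectangle
          .map { case (_, count) => count }    // the rectangles themselves need not be returned

        // Convert the RDD back to a DataFrame, with counts in ascending order.
        counts.toDF("count").sort($"count")
      }
    }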
Setting Up Hadoop and Apache Spark Test Environment:
The following setup instructions are specific to the Ubuntu operating system:
• Install Java version >= 1.8 and set up the JAVA_HOME environment variable. If you don't know how to set JAVA_HOME, follow this link: https://askubuntu.com/questions/175514/how-to-set-java-home-for-java. To check whether the setup is done, run echo $JAVA_HOME on your command line; you should see the path.
• Download Hadoop 2.7.7 from https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz and extract it to your desired location. Now set up the HADOOP_HOME environment variable: run sudo nano ~/.bashrc on the command line, add the line export HADOOP_HOME=<path to the hadoop-2.7.7 folder> to the opened file, then save and close it. Run source ~/.bashrc and check the correctness of the setup with the echo $HADOOP_HOME command.
• Download Spark 2.4.7 from https://drive.google.com/file/d/1XlRcJNoGTaf0YvVv8pwZzzjrCqG-IRA/view?usp=sharing and extract it to your desired location. Now set up the SPARK_HOME environment variable: run sudo nano ~/.bashrc on the command line, add the line export SPARK_HOME=<path to the spark-2.4.7-bin-hadoop2.7 folder> to the opened file, then save and close it. Run source ~/.bashrc and check the correctness of the setup with the echo $SPARK_HOME command (an example of the resulting ~/.bashrc lines appears after this list). If you browse to SPARK_HOME, you should see spark-submit under the bin folder, which is required for your testing.
• Go to the project root folder.
• Run sbt clean assembly. You may need to install sbt in order to run this command.
• Find the packaged jar in "./target/scala-2.11/Map-Reduce-Apache-Sedona-assembly-0.1.jar"
• Now you can run the jar on Spark with the spark-submit command. If you have already set up the Hadoop and Spark test environment, you should be able to run spark-submit from your command line. Submit the jar to Spark using "./bin/spark-submit". A pseudo-code example: ./bin/spark-submit PATH_TO_JAR_Map-Reduce-Apache-Sedona-assembly-0.1.jar output mapreduce PATH_TO_points.csv PATH_TO_rectangles.csv
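For reference, the environment variables from the steps above might end up in ~/.bashrc roughly as follows; the paths are placeholders for wherever you installed Java and extracted the two archives:

    # Illustrative ~/.bashrc entries; adjust the paths to your machine.
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_HOME=$HOME/hadoop-2.7.7
    export SPARK_HOME=$HOME/spark-2.4.7-bin-hadoop2.7

    # Reload the shell configuration and verify each variable.
    source ~/.bashrc
    echo $JAVA_HOME $HADOOP_HOME $SPARK_HOME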
How to debug your code in IDE:
• Use IntelliJ IDEA with the Scala plug-in, or any other Scala IDE.
• In some cases, you may need to go to the build.sbt file and change % "provided" to % "compile" in order to debug your code in the IDE (see the snippet after this list).
• Run your code in the IDE.
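For illustration, a Spark dependency line in build.sbt typically looks like the pair below; the exact artifact names and versions used by the template may differ. "provided" keeps Spark out of the assembly jar for spark-submit, while "compile" puts Spark on the classpath so the IDE can run the code directly.

    // Scope used when packaging for spark-submit (Spark is supplied by the cluster):
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7" % "provided"

    // Scope used when debugging inside the IDE (Spark is on the run classpath):
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.7" % "compile"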