Starting from:

$30

INF551-Homework 5 Hadoop MapReduce and Apache Spark Solved

1.   Write a MapReduce program AvgExp.java that implements the following SQL query:

SELECT Continent, avg(LifeExpectancy) FROM country where GNP 10000 group by Continent having count(*) = 5;

               You can take WordCount.java as the template. But note the following:

                              •    You may want to use split function of Java instead of StringTokenizer:

o https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.

lang.String)

•        Replace IntWritable with FloatWritable

•        It is OK that your implementation does not utilize a combiner. 

•        Name your jar file “ae.jar”.

Execution format: hadoop jar ae.jar AvgExp.java <input-hw5 <output-hw5            Ignore the angle brackets.

        Where the <input-hw5 directory stores a single file “country.csv”. Submission: AvgExp.java ae.jar

2.      For each of the following questions, write a Spark program in Python. You can assume that all three csv files, country.csv, city.csv, and countrylanguage.csv (note files names are all lowercase letters), are available in the same directory where you execute the code. Note that you should NOT use Spark DataFrames or Spark SQL for this homework.

a.      Find the 10 most populated countries in a given continent. Return the names of countries and their populations in the descending order of populations. Name your script “pop10.py”.

•        Execution format: spark-submit pop10.py “Asia”

•        This find the top-10 most populated countries in Asia.

•        Sample output:

China, 1277558000.0

India, 1013662000.0

 

...

b.      ] Find names of countries which do not have any languages recorded (in the country language table). Output one line per country. Name your script “no-lang.py”.

c.      Find names of countries which have at least 10 unofficial languages. Order the countries by the descending order of the number of unofficial languages. Name your script “unofficial10”.py.

Implement the SQL query in Question 1. Name your script “avgexp.py”.

More products