Assignment #9
Readings
Read (From the Free Books and Chapters section of our blackboard site):
• Learning Spark, Ch. 8 (pp. 207-234)
• Spark: The Definitive Guide (pp.26-32)
Worth: 5 points + 5 points extra credit
Assignments should be uploaded via the Blackboard portal
Exercise 1) 5 points
Read the article “Real-time stream processing for Big Data” available on the blackboard in the ‘Articles’ section and then answer the following questions:
a) (1.25 points) What is the Kappa architecture and how does it differ from the lambda architecture?
b) (1.25 points) What are the advantages and drawbacks of pure streaming versus micro-batch real-time processing systems?
c) (1.25 points) In a few sentences, describe the data processing pipeline in Storm.
d) (1.25 points) How does Spark streaming shift the Spark batch processing approach to work on real-time data streams?
Exercise 2) 5 points
Refer to the kafka-python documentation from the Free Books and Chapters section of our blackboard site.
Step A – Start an EMR cluster
Start up an EMR/Hadoop cluster as previously, but instead of choosing the “Core Hadoop” configuration, choose the “Spark” configuration (see below). Otherwise, proceed as before.
Step B – Copy the Kafka software to the EMR master node
Download the kafka_2.13-3.0.0.tgz file from the blackboard to your PC/MAC. Use the secure copy (scp) program to move this file to the /home/hadoop directory of the master node. Here is an example of how your command line might look (yours will be somewhat different because your master node DNS name, key-pair and kafka file locations will vary):
3.0.0.tgz hadoop@ec2-3-218-249-33.compute-1.amazonaws.com:/home/hadoop
Step C – Install the Kafka software and start it
Open up a terminal connection to your EMR master node. Over the course of this exercise, you will need to open up three separate terminal connections to your EMR master node. This is the first, which we will call Kafka-Term:
Enter the following command:
tar -xzf kafka_2.13-3.0.0.tgz
Note, this will create a new directory (kafka_2.13-3.0.0) holding the Kafka software release.
Then enter this command:
pip install kafka-python
This installs the kafka-python package.
Now enter the following commands into the terminal:
cd kafka_2.13-3.0.0
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &
Remember to end each line with the ampersand (&). This starts up a zookeeper instance and the kafka server. Lots of messages should appear. You might need to tap the return/enter key after messages appear to see the Linux prompt again.
Just leave this terminal window alone after you enter these commands. As you interact with kafka this terminal will display low level diagnostic messages which you can ignore.
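If the startup messages scroll by too quickly to tell whether both services came up, a quick check from another terminal can help. The following is a minimal sketch, assuming the stock configuration files are unchanged, so that ZooKeeper listens on port 2181 and Kafka on port 9092. The function name port_is_open is made up for this example.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed default ports: 2181 (ZooKeeper) and 9092 (Kafka broker).
    for name, port in [("ZooKeeper", 2181), ("Kafka", 9092)]:
        state = "listening" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

A port that is listening only shows that something accepted the connection; it does not prove the broker is fully ready, so treat this as a rough sanity check.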
Step D – Prepare to run Kafka producers and consumers
Open a second terminal connection to the EMR master node. Going forward we will call this terminal connection: Producer-Term.
Open a third terminal connection to the EMR master node. Going forward we will call this terminal connection: Consumer-Term.
Step E – Create a Kafka topic
In the Producer-Term, enter the following command:
cd kafka_2.13-3.0.0
bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --bootstrap-server localhost:9092 --topic sample
Here we create a new kafka topic called ‘sample’. You can use this command to create a topic with any name you like. Try creating a few more topics.
To list the topics that you created you can enter the following into the Producer-Term (note some default topics already exist):
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
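For creating several practice topics in one go, the create command shown earlier can be assembled in a short Python script. This is only a sketch: the helper create_topic_command and the topic names sample2 and sample3 are invented for illustration, and the script assumes it is run from inside the kafka_2.13-3.0.0 directory.

```python
import os
import subprocess


def create_topic_command(topic, partitions=1, replication=1,
                         server="localhost:9092"):
    """Build the kafka-topics.sh argument list that creates one topic."""
    return [
        "bin/kafka-topics.sh", "--create",
        "--replication-factor", str(replication),
        "--partitions", str(partitions),
        "--bootstrap-server", server,
        "--topic", topic,
    ]


if __name__ == "__main__":
    # Hypothetical extra topic names, chosen only for illustration.
    for topic in ["sample2", "sample3"]:
        command = create_topic_command(topic)
        if os.path.exists(command[0]):
            # check=False: keep going even if a topic already exists.
            subprocess.run(command, check=False)
        else:
            print("Run from the Kafka directory:", " ".join(command))
```

Passing the arguments as a list avoids shell quoting problems if a topic name ever contains unusual characters.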
a)
In the Producer-Term, write a small program, call it ‘put.py’, using the vi text editor or some other way of putting a Python program onto the EMR master node. If you like, you could use a text editor on your PC/MAC to write the program and then scp it over to your EMR master node.
This program should implement a kafka producer that writes three messages to the topic ‘sample’. Recall that you need to convert values and keys to type bytes. The three messages should have keys and values as follows:
Key Value
‘MYID’ Your student id
‘MYNAME’ Your name
‘MYEYECOLOR’ Your eye color (make it up if you can’t remember)
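As a reminder of what the bytes conversion looks like in practice, here is a rough sketch of the general producer pattern with kafka-python. It is one possible shape, not a model answer: the values are placeholders, the names to_bytes, MESSAGES and send_messages are invented for this example, and it assumes the broker from Step C is reachable at localhost:9092.

```python
def to_bytes(text: str) -> bytes:
    """Kafka keys and values are sent as bytes, so encode str as UTF-8."""
    return text.encode("utf-8")


# Placeholder values: substitute your own details.
MESSAGES = [
    ("MYID", "A00000000"),
    ("MYNAME", "Jane Doe"),
    ("MYEYECOLOR", "brown"),
]


def send_messages(topic: str = "sample", servers: str = "localhost:9092") -> None:
    """Send each (key, value) pair in MESSAGES to the topic as bytes."""
    # Imported here so the helper above works without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers)
    for key, value in MESSAGES:
        producer.send(topic, key=to_bytes(key), value=to_bytes(value))
    producer.flush()  # block until buffered messages have been sent
    producer.close()


if __name__ == "__main__":
    try:
        send_messages()
        print(f"Sent {len(MESSAGES)} messages")
    except Exception as exc:  # e.g. kafka-python missing or broker not running
        print(f"Could not send messages: {exc}")
```

The call to flush() matters because send() only queues a message; without it a short script could exit before anything reaches the broker.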
Execute this program in the Producer-Term using the following command line (depending on where your Python program is, you might need to provide a full pathname such as /home/hadoop/someplace/put.py):
python put.py
Submit the program as your answer to ‘part a’ of this exercise.
b)
In the Consumer-Term, write another small program, call it ‘get.py’, using the vi text editor or some other way of putting a Python program onto the EMR master node.
This program should implement a kafka consumer that reads the messages you wrote previously from the topic ‘sample’ and writes them to the terminal.
The output should look something like this:
Key=MYID, Value=’your id number’
Key=MYNAME, Value=’your name’
Key=MYEYECOLOR, Value=’your eye color’
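Since messages arrive as bytes, the consumer has to decode them before printing. The sketch below shows one way this could look with kafka-python, under a few assumptions: the broker is at localhost:9092, every message in the topic has a key, and the names format_record and read_messages are invented for this example. It is an illustration of the pattern, not a model answer.

```python
def format_record(key: bytes, value: bytes) -> str:
    """Decode a record's key and value and render one output line."""
    # Assumes every message was written with a key (key is not None).
    return f"Key={key.decode('utf-8')}, Value={value.decode('utf-8')}"


def read_messages(topic: str = "sample", servers: str = "localhost:9092") -> None:
    """Print every message currently stored in the topic."""
    # Imported here so format_record works without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        auto_offset_reset="earliest",  # start from the oldest stored message
        consumer_timeout_ms=5000,      # stop after 5 s with no new messages
    )
    for record in consumer:
        print(format_record(record.key, record.value))
    consumer.close()


if __name__ == "__main__":
    try:
        read_messages()
    except Exception as exc:  # e.g. kafka-python missing or broker not running
        print(f"Could not read messages: {exc}")
```

Setting auto_offset_reset to "earliest" is what lets the consumer see messages written before it started; leaving out consumer_timeout_ms would make the loop wait for new messages indefinitely.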
Execute this program in the Consumer-Term. Use the command line:
python get.py
Note, if needed you can terminate the program by entering ‘ctrl-c’.
Submit the program and a screenshot of its output as your answer to ‘part b’ of this exercise.
c)
Remember to terminate your EMR cluster!