Starting from:

$30

INF553-Assignment 5 Streaming Solved

In this assignment we’re going to implement some streaming algorithms on Twitter stream.

1.1        Environment Requirements
Python: 2.7 Scala: 2.11 Spark: 2.2.1

There will be 10% bonus if you also Scala for both Task1 and Task2 (i.e. 10 - 11; 9 - 10).

 

2         Task 1: Twitter Streaming
You will use Twitter API of streaming to implement the fixed size sampling method (Reservoir Sampling Algorithm) and use the sampling to track the popular tags on tweets and calculate the average length of tweets.

Detail
In this task, we assume that the memory can only save 100 tweets in memory, so we need to use the fixed size sampling method to only sampling part of the tweets in the streaming.

Below is the Algorithm of Reservoir Sampling you need to implement in the Task 1.



Figure 2: Reservoir Sampling Algorithm

In task 1, you need to have a list(Reservoir) has the capacity limit of 100, which can only save 100 tweets.

When the streaming of the tweets coming, for the first 100 tweets, we can directly save them in the list. After that, for the nth tweet, with probability , keep the nth tweet, else discard it.

If you will keep the nth tweets, it will replace one of the tweet in list, you need to randomly pick one to be replaced.

Mission & Result Sample
After fully save the list, each time when you choose to keep an new tweet and replace one in the list, you need to statistic the hottest 5 tags and the average length of the tweets in the list. And print them out.

The API tweepy can extract the tags and content of tweets, you can find some guide in the documents will be mentioned below. Below is the sample of the output.



Figure 3: Result Sample of Task 1

Set up
•   Creating credentials for Twitter APIs

In order to get tweets from Twitter, register on https://apps.twitter.com/ by clicking on “Create new app” and then fill the form click on “Create your Twitter app.”

Second, go to the newly created app and open the “Keys and Access Tokens” tab. Then click on “Generate my access token.” You will need to set these tokens as arguments in the code.

•   Library dependencies

You can use “tweepy” (python), “spark-streaming-twitter” (Scala) http://docs.tweepy.org/en/v3.5.0/streaming how to.html http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/ and “spark-streaming” for this task.

Execution Example
Following we present examples of how you can run your program with spark submit both when your application is a Java/Scala program or a Python script.

Example of running application with spark-submit

Notice that the argument class of the spark-submit specifies the main class of your application and it is followed by the jar file of the application.

Please use TwitterStreaming as class name



Figure 4: Command Line Format for python



Figure 5: Command Line Format for Scala

3         Task 2: Bloom Filtering Algorithm
You are required to implement the Bloom Filtering algorithm to estimate whether the hashtags in coming tweets have shown up before.

The details of the Bloom Filtering Algorithm can be found in the streaming lecture slide or can found it on Google.

You need to find a set of proper hash functions and the number of hash functions for the Bloom Filtering.

More guide of Spark Stream please refer to this:

https://spark.apache.org/docs/latest/streaming-programming-guide.html

Mission & Result Sample
In this Task, you need to use Bloom Filtering Algorithm to estimate whether the hashtags in the coming tweets have shown up in previous tweets.

You also need to maintain the previous hashtag set in order to calculate the False Positive.

In Spark Streaming, set the batch interval of 10 seconds as below:

ssc = StreamingContext ( sc ,                  10)

Every 10 second, while you get batch data in spark streaming, using Bloom Filtering algorithm to estimate whether the hashtag in the coming tweets have appeared in the previous tweets, and print out the number of the correct estimation and the number of the incorrect estimation, then calculate the false positive and printout.

Hint:         You can set the level of the log to OFF to eliminate the extra message.

sc . setLogLevel ( logLevel=”OFF”)

Execution Example
Following we present examples of how you can run your program with spark submit both when your application is a Java/Scala program or a Python script.

Example of running application with spark-submit

Notice that the argument class of the spark-submit specifies the main class of your application and it is followed by the jar file of the application.

Please use BloomFiltering as class name



Figure 6: Command Line Format for python



Figure 7: Command Line Format for Scala

Description File
Please include the following content in your description file:

1.                   Mention the Spark version and Python version

2.                   Describe how to run your program for both tasks

More products