CS4000 Homework 7: Needle in a Haystack: Hidden Messages, Hadoop Compilation and Testing

Email and other electronic media are often filtered for offensive language using fast but relatively simple means (think finite automata). However, these simple filters are easily tricked. For example, if your filter were looking for the phrase “bad language,” you could trick it by inserting characters between the letters, e.g., by writing “..b..a..d...l..a..n..g..u..a..g..e...”. A human easily recognizes this text as the words “bad language”, but it is harder (though not impossible) for a simple filter to recognize.

For this assignment, you are looking for potential “secret messages” embedded in tweets (both real and simulated). In particular, you are trying to find which Twitter users are posting “secret messages” to their followers. You discovered that, given a string such as “secretMESSAGE”, you can check whether its letters are embedded (in that order) in a tweet in O(n) steps, where n is the number of characters in the tweet. You wrote two Hadoop programs to output the number of times each Twitter user tweeted a message that had the string “secretMESSAGE” embedded somewhere in it. Unfortunately, your little brother Ike got into your account and messed up the programs. The first program won’t compile. The second program runs, but it doesn’t produce the right answer. You’ll have to fix them.
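The O(n) embedding check described above can be sketched as a single left-to-right scan of the tweet. This is only an illustration, not the code from the provided programs: the class name EmbeddedMatcher and the method isEmbedded are hypothetical, and the comparison is assumed to be case-sensitive.

```java
// Hypothetical helper for illustration; not taken from SecretMessage.java.
final class EmbeddedMatcher {
    // Returns true if every character of pattern appears in text in the same
    // order, with any number of other characters in between.
    // Assumption: matching is case-sensitive.
    static boolean isEmbedded(String pattern, String text) {
        int matched = 0; // number of pattern characters found so far
        for (int i = 0; i < text.length() && matched < pattern.length(); i++) {
            if (text.charAt(i) == pattern.charAt(matched)) {
                matched++; // advance to the next pattern character
            }
        }
        return matched == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(isEmbedded("secretMESSAGE", "xxsxexcxrxextxMxExSxSxAxGxEx"));
        System.out.println(isEmbedded("secretMESSAGE", "secret message"));
    }
}
```

Each character of the tweet is examined at most once, which is where the O(n) bound comes from.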

The Data Format
For this assignment, the Twitter data that you will be using has been pre-filtered to remove everything except the Twitter user's screen name (screen_name) and the text of the tweet. Furthermore, tabs and newlines within the tweet have been converted to spaces. So, the data files for this assignment are provided in a line-oriented format, where each line contains the screen name of a Twitter user and the text of a tweet, separated by a tab character (\t).
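As a rough sketch of how one line in this format could be split, the following plain-Java class separates the screen name from the tweet text at the first tab. The class name TweetLine and its fields are hypothetical, and returning null for a line without a tab is an assumption about how malformed input might be handled.

```java
// Hypothetical record of one input line; names are illustrative only.
final class TweetLine {
    final String screenName;
    final String text;

    TweetLine(String screenName, String text) {
        this.screenName = screenName;
        this.text = text;
    }

    // Splits "screen_name<TAB>tweet text" at the first tab character.
    // Assumption: a line with no tab is malformed and yields null.
    static TweetLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return null;
        }
        return new TweetLine(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        TweetLine t = parse("some_user\tjust a sample tweet");
        System.out.println(t.screenName + " -> " + t.text);
    }
}
```

Because tabs inside tweets were converted to spaces, the first tab on a line can be treated as the field separator.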

Part I: Compiling Java Hadoop Programs
Your first program SecretMessage.java looks for the fixed string “secretMESSAGE” in tweets. The provided program does not compile. It also has some warnings.

1.    Fix the compilation errors in the file SecretMessage.java. What lines did they occur on?

2.    Remove the compilation warnings in the file SecretMessage.java. What fixes did you make? Explain.

3.    Run the Hadoop program on the files processed_tweets.txt and random_data.txt. What is the output?

4.    How many blocks are used to store random_data.txt on the Hadoop Distributed File System?
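For the block-count question, the expected number of HDFS blocks can be estimated as the file size divided by the block size, rounded up. The sketch below assumes a block size of 128 MB, which is the default in Hadoop 2 and later but may be configured differently on your cluster; the class name BlockCount and the file size used in main are made up for illustration. The actual count should be confirmed on the cluster, for example with hdfs fsck <path> -files -blocks.

```java
// Hypothetical arithmetic helper; it does not query HDFS.
final class BlockCount {
    // Assumption: 128 MB blocks (the Hadoop 2+ default for dfs.blocksize).
    static final long ASSUMED_BLOCK_SIZE = 128L * 1024 * 1024;

    // Number of blocks needed for a file: ceiling of size / blockSize.
    static long blocksNeeded(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long exampleSize = 300L * 1024 * 1024; // made-up 300 MB file
        System.out.println(blocksNeeded(exampleSize, ASSUMED_BLOCK_SIZE));
    }
}
```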

Part II: Fixing a Broken Hadoop Program
The second program, GeneralMessage.java, is intended to be a general version of the first program. As the first command-line parameter, the user passes in a string that replaces the hard-coded “secretMESSAGE” in the first program. To run this second program, issue the command

yarn jar program.jar GeneralMessage secretMESSAGE Input Output

Here, “secretMESSAGE” could be anything.

The current program does not run correctly. Fix it so that it behaves exactly like the first program when given the string “secretMESSAGE” as the command-line parameter, and so that it also works for other strings.

1.    Fix the program GeneralMessage.java so that it compiles and runs correctly. What lines in the program did you fix? Be specific.

2.    Run the corrected program on the files processed_tweets.txt and random_data.txt and the string “ScaryMess”. How many users have tweets that match that string? Give one of them.

3.    Run the corrected program on the files processed_tweets.txt and random_data.txt and the string “SecretMessage”. How many users have tweets that match that string? Give one of them.
