Starting from:

$30

Big Data Homework 4 -Solved

Folder /home/public/enron contains individual emails of former Enron employees. There is one file per 
email and the name of the employee is listed as the subfolder name. These emails were used in the 
actual trail but the judge decided to release them for public consumption. (I have all of them but I am 
sharing with you only a few of them so that potentially you do not need to write a script to load the 
data. You are encouraged to write a script, but because you have only a few emails you can insert them 
manually.) 
Load in an appropriate hbase model the following attributes of each email: name of the employee 
(obtained from the name of the folder), email address of the sender, date of the email, send to as string, 
and email body as string. 
Recently I read in the newspaper FakeInnovations that entrepreneur Bogus John Enron wants to fund a 
company with the same employees (those jailed will work from the jail). Bogus John needs an hbase 
database that will track all the emails. The management of the company wants to be able to quickly 
query emails for a user, all emails during a time period, and all emails for a given user during a period of 
time. 
You have to perform the following tasks: 
1. Create an hbase database model. 
2. Import all emails in hbase. 
3. Return the bodies of all emails for a user of your choice (as a single text file). 
4. Return the bodies of all emails written during a particular month of your choice (as a single text 
file). 
5. Return the bodies of all emails of a given user during a particular month both of your choice (as 
a single text file). 
Here are 2 options that you can choose from. 
1. Write an hbase script for all of the tasks. This is the route of least effort but also least rewarding. 
2. The hbase on wolf offers restful access. You can either use python or java to perform the task. 
Python: use module starbase: https://github.com/barseghyanartur/starbase or HappyBase 
https://happybase.readthedocs.io/en/latest/ 
The restful interface on wolf for hbase runs on port 20550 and thus you have to initialize the 
connection as c = Connection(port=20550) 
To fetch a single record, use ‘get’ but to fetch a range of records, use ‘scan.’ 

More products