$30
Folder /home/public/enron contains individual emails of former Enron employees. There is one file per
email and the name of the employee is listed as the subfolder name. These emails were used in the
actual trail but the judge decided to release them for public consumption. (I have all of them but I am
sharing with you only a few of them so that potentially you do not need to write a script to load the
data. You are encouraged to write a script, but because you have only a few emails you can insert them
manually.)
Load in an appropriate hbase model the following attributes of each email: name of the employee
(obtained from the name of the folder), email address of the sender, date of the email, send to as string,
and email body as string.
Recently I read in the newspaper FakeInnovations that entrepreneur Bogus John Enron wants to fund a
company with the same employees (those jailed will work from the jail). Bogus John needs an hbase
database that will track all the emails. The management of the company wants to be able to quickly
query emails for a user, all emails during a time period, and all emails for a given user during a period of
time.
You have to perform the following tasks:
1. Create an hbase database model.
2. Import all emails in hbase.
3. Return the bodies of all emails for a user of your choice (as a single text file).
4. Return the bodies of all emails written during a particular month of your choice (as a single text
file).
5. Return the bodies of all emails of a given user during a particular month both of your choice (as
a single text file).
Here are 2 options that you can choose from.
1. Write an hbase script for all of the tasks. This is the route of least effort but also least rewarding.
2. The hbase on wolf offers restful access. You can either use python or java to perform the task.
Python: use module starbase: https://github.com/barseghyanartur/starbase or HappyBase
https://happybase.readthedocs.io/en/latest/
The restful interface on wolf for hbase runs on port 20550 and thus you have to initialize the
connection as c = Connection(port=20550)
To fetch a single record, use ‘get’ but to fetch a range of records, use ‘scan.’