Goal. The main goal of this assignment is to get hands-on experience with a few tools for experimental research in cloud networking and the MapReduce programming paradigm.
1 OpenFlow functionality with Mininet
Experimentation is an important part of networking research. However, large-scale cloud experiments can sometimes be hard to achieve, e.g., due to lack of machines. In this section, you will learn how to use Mininet[1], a relatively new experimental platform that can scale to hundreds or more emulated “nodes” running on a single machine. Mininet takes advantage of Linux support for network namespaces[2] to virtualize the network on a single machine, so that different processes on the same machine can see their own network environments (like network interfaces, ARP tables, routing tables, etc.), distinct from other processes. Combined with the Mininet software, this enables a single machine to emulate a network of switches and hosts. The emulated processes, however, do see the same real/physical file system.
Mininet is designed with OpenFlow[3] in mind. In this exercise, you will gain a basic understanding of OpenFlow and create a custom OpenFlow controller to control your switches. Quite simply, OpenFlow allows for “programmable” network devices, e.g., switches. With Mininet, each switch will connect to the controller specified when the switch is launched. When the switch receives an Ethernet frame, it consults its forwarding table for what to do with the frame. If it cannot determine what to do with the frame, the switch sends the frame (and some extra information such as the input switch port) to the controller, which will then instruct the switch on what to do with the frame. To avoid this extra work on every such frame, the controller can install a new rule/match in the switch’s forwarding table, so that the switch can forward future similar frames without having to contact the controller.
1.1 Prepare the Mininet VM and OpenFlow Controller
1. Install VirtualBox from https://www.virtualbox.org/wiki/Downloads. VMware should also work; adjust your VM configurations accordingly. (It is possible to install Mininet directly on your Linux system, but for simplicity we’ll use the virtual machine here.)
2. Download and unzip the VM with Mininet already installed from http://onlab.vicci.org/mininet-vm/mininet-2.2.1-150420-ubuntu-14.04-server-amd64.zip
You can also download VM images with Mininet from https://github.com/mininet/mininet/releases
3. In VirtualBox, import the “ovf” template just unzipped. For the newly imported machine, go to “Settings” → “Network” and make “Adapter 1” a “NAT” (Network Address Translation). If your VM is allowed to obtain an IP address from your local network, you can alternatively use “Bridged Adapter.”
For more information on networking with VirtualBox, see http://www.virtualbox.org/manual/ch06.html
4. Start the VM
5. Log in with mininet for both username and password
6. Make sure eth0 is up:
(a) run the command: ifconfig eth0
(b) check the inet addr field. If it does not have an IP address, then run the command: sudo dhclient eth0 and repeat step (a).
7. The downloaded image should have POX preinstalled. POX is a platform that allows you to write your own OpenFlow controller in Python. Check the home folder for a directory called “pox”. If it is not there, run: git clone https://github.com/noxrepo/pox
For more information on POX, see https://openflow.stanford.edu/display/ONL/POX+Wiki
8. Install a GUI in the VM:
(a) Install the GUI
sudo apt-get update
sudo apt-get install openbox xinit -y
(b) Start it: startx
(c) To create a new terminal, right-click on the desktop and select “Terminal emulator”
Alternatively, you may use SSH to log in to the VM remotely with GUI (X11) forwarding. With SSH, you will need to enable X-forwarding (e.g., ssh -X on *NIX hosts) when you ssh into the VM. NOTE: this requires that you have an X server running on the host. See a description of how to do this on various platforms at https://github.com/mininet/openflow-tutorial/wiki/Installing-Required-Software. Alternative for some versions of Mac OS X: install the Developer Tools (a free download from the App Store) and open /Applications/Utilities/X11.
1.2 Create a hub in Mininet using POX
In this exercise you will create a Mininet network with 3 hosts connected via a switch. Using POX, you will program the switch to behave like a hub, which simply forwards each incoming packet out every port except the one on which it arrived.
First, you can familiarize yourself with Mininet by following http://mininet.org/walkthrough/.
To start Mininet with the topology we want:
• First clean up the network: sudo mn -c
• Then create a network with the topology we want:
sudo mn --topo single,3 --mac --switch ovsk --controller remote
This will create a network with the following topology:
host h1 ------ switch s1 ---- controller c0
host h2 ------/
host h3 -----/
After you create this network, you will be entering the Mininet console. You can type help in the
console to see a list of commands provided by Mininet. We will later use some of these commands.
Now let’s run POX controller. Create another terminal (right-click on the desktop and select “Terminal emulator”). Go to the directory you installed POX in this new terminal, and then start POX with basic hub function:
pox/pox.py log.level --DEBUG forwarding.hub
The argument log.level --DEBUG enables verbose logging, and forwarding.hub asks POX to start the hub component. It takes up to 15 seconds for switches to connect to the controller. When an OpenFlow switch has connected, POX will print something like:
INFO:openflow.of_01:[00-00-00-00-00-01 1] connected
INFO:forwarding.hub:Hubifying 00-00-00-00-00-01
To verify the hub behavior, we use tcpdump, a common packet analyzer that intercepts and prints packet information. To do this, we first create an xterm (terminal emulator in X Window System) for each host in Mininet and view packets in each. To start an xterm for each host, type the following command in the Mininet console:
xterm h1 h2 h3
You may want to arrange the xterms so that you can see them all on the screen at once. You may need to reduce the terminal height to fit a laptop screen. In the xterms for h1 and h2, run tcpdump to capture and print all the packets:
tcpdump -XX -n -i h1-eth0 and tcpdump -XX -n -i h2-eth0
In the xterm for h3, send a ping to h1: ping -c1 10.0.0.1
The ping packets are going to the controller, which floods each packet out all interfaces but the one it arrived on. Because of this hub behavior, you should see identical ARP and ICMP packets in both xterms running tcpdump.
• [1 point] A.1.1 What will happen if you ping a non-existent host that does not reply to ICMP requests? For example, run the following command in the xterm for h3: ping -c1 10.0.0.9
Now let’s take a look at the hub code at pox/pox/forwarding/hub.py. Make sure to get familiar with the code because many POX API functions used here will help you answer the later questions. We describe several important API functions here, and you can find more information about POX APIs at https://openflow.stanford.edu/display/ONL/POX+Wiki#POXWiki-POXAPIs.
• connection.send() function sends an OpenFlow message to a switch.
When the connection between a switch and the controller is established, the code will invoke the _handle_ConnectionUp() function, which implements the hub logic.
• ofp_flow_mod OpenFlow message
This tells a switch to install a flow entry, which matches some fields of incoming packet headers and executes some actions on matching packets. Important fields include:
– actions: A list of actions that apply to matching packets (e.g., ofp_action_output described below).
– match: An ofp_match object (described below).
– priority: When a packet matches on more than one non-exact flow entry, only the highest-priority entry will be used. Here, higher values mean higher priority.
• ofp_action_output class
This is an action for use with of.ofp_flow_mod. You can use it to assign a switch port that you want to send the packet out of. It can also take “special” port numbers, e.g., we use OFPP_FLOOD to send the packet out all ports but the received one.
• ofp_match class (not used in the hub code, but useful in the assignment)
This object specifies the packet header fields and input port to match on. All fields here are optional, i.e., if you do not specify a field, it becomes a “wildcard” and will match anything. Some important fields in this class:
– dl_src: The data link layer (MAC) source address
– dl_dst: The data link layer (MAC) destination address
– in_port: The packet input switch port
Example: to match packets with source MAC address 00:00:00:00:00:01 in an OpenFlow message msg:
msg.match.dl_src = EthAddr("00:00:00:00:00:01")
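Putting these pieces together, a minimal sketch of installing such a flow entry from a POX controller might look like the following. It assumes the standard pox.openflow.libopenflow_01 import (conventionally aliased as of) used by the bundled POX components; run it inside POX, not as a standalone script.

```python
# Sketch (assuming POX's standard OpenFlow library): install a flow
# entry that floods all traffic from one source MAC address.
import pox.openflow.libopenflow_01 as of
from pox.lib.addresses import EthAddr

def _handle_ConnectionUp(event):
    msg = of.ofp_flow_mod()                          # flow-table modification
    msg.match.dl_src = EthAddr("00:00:00:00:00:01")  # match this source MAC
    msg.actions.append(of.ofp_action_output(port=of.OFPP_FLOOD))  # flood it
    event.connection.send(msg)                       # install on the switch
```

Compare this sketch with pox/pox/forwarding/hub.py, which installs a similar flood rule but without any match fields (i.e., all wildcards).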
1.3 Create a firewall
A firewall is used as a barrier to protect networked computers by blocking the malicious network traffic generated by viruses and worms. In this assignment, you are asked to implement a data link layer firewall to block certain traffic.
To start, you will find a skeleton class file at http://soudeh.net/teaching/cloud/spring_2020/files/a1/firewall.py. This skeleton class currently does not block any traffic; you will need to modify it to add your own logic. To test the firewall, put firewall.py in the pox/pox/misc directory and run the POX controller:
./pox.py log.level --DEBUG forwarding.hub misc.firewall
Note: You may need to use layer 2 MAC learning instead of the hub, i.e., you can replace the command above with:
./pox.py log.level --DEBUG forwarding.l2_learning misc.firewall
After the connection between the controller and the switch is established, we can verify the connectivity between all pairs of hosts by typing pingall in the Mininet console. Note that when ping cannot get through a pair of hosts, you need to wait for the timeout, which takes about 10 seconds.
• [1 point] A.1.2 Modify the firewall (firewall.py) to block traffic with source MAC address 00:00:00:00:00:02 and destination MAC address 00:00:00:00:00:03. To show the result, you can use the command pingall and copy the output to your report. (Hint 1: this only takes a few lines of code. Hint 2: if you do not specify any action in an OpenFlow message, then matching packets will be dropped.)
Note: To get your files off the VM, you can scp or ftp them to some other machine. Or you can install the GUI (instructions in Section 1.1), run "sudo apt-get install firefox", then launch the GUI and use Firefox to upload/email the files off the machine.
2 Playing with MapReduce
When discussing data analytic systems, we will see MapReduce as a paradigm for large-scale data processing. This part of the assignment provides practical experience writing MapReduce jobs.
2.1 Overview
The National Do Not Call Registry[4] is a national database of telephone numbers of individuals who do not want to be contacted by telemarketers. Unfortunately, robocalls and spoofing are on the rise, leading to a record number of complaints in recent years.
In an effort to stop unwanted calls, law enforcement recently seized evidence from a cloud provider that assists businesses in contacting their customers.[5] In the excitement of the raid, however, service metadata about the records and cloud consumers was damaged, leaving only portions of logs recording phone calls.
Your goal is to uncover which cloud consumers are violating the law!
Fragments of the original call logs have been partially pieced together. Each log entry is a single line
that provides the following information:
1. the date and time of a call,
2. the company responsible for the call,
3. the originating phone number for the call,
4. the recipient’s phone number, and
5. the duration of the call (in seconds).
For example, a log entry might indicate that the Acme Corporation used (429) 785-4094 to place a 9-second call to (429) 826-1640 slightly before 6 p.m. on 9 April 2017. Of course, this information alone is insufficient to determine whether the call is legitimate: one must know that 1) the first telephone number has been reported for spam calls or 2) the second telephone number is part of the Do Not Call Registry and reported this specific call as unwanted.
Law enforcement has identified that the following numbers reported unwanted calls during the time frame captured in the call log: (216) 684-9356, (404) 934-5110, (589) 371-5037, and (945) 792-0329. You may assume that all calls that appear in the log and were placed to these numbers are violations of the Do Not Call Registry. That is, no business contacted these numbers for legitimate reasons such as an order confirmation.
As part of your forensics investigation, you must answer the following questions:
• A.2.1 On how many occasions did companies violate the Do Not Call Registry?
• A.2.2 How many numbers should be blocked / marked as spam to reduce the number of unwanted calls?
• A.2.3 Which telephone numbers received the most spam calls?
• A.2.4 Which telephone numbers are responsible for the most spam calls?
• A.2.5 Which hours of the day are spam calls most likely?
Although you technically need not use MapReduce to answer these questions, large-scale data analysis practically necessitates a parallel processing framework.
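All of these questions reduce to a filter-and-count pattern in the map step. The sketch below illustrates this for counting violations per company; the comma-separated field layout (date/time, company, caller, recipient, duration) is an assumption based on the field list above, so adapt the parsing to the actual log format. The reduce step can then be the same summing reducer shown in Appendix B.

```python
# Sketch of a streaming-API map step for counting Do Not Call
# violations per company. The comma-separated field layout below is
# an ASSUMED format; adapt it to the real log entries.

# Numbers that reported unwanted calls (from the assignment text).
REPORTED = {"(216) 684-9356", "(404) 934-5110",
            "(589) 371-5037", "(945) 792-0329"}

def map_line(line):
    """Return 'company<TAB>1' if the call went to a reported number."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 5:
        return None  # skip malformed entries
    _when, company, _caller, recipient, _duration = fields[:5]
    if recipient in REPORTED:
        return "{}\t1".format(company)
    return None

# In an actual map.py, feed sys.stdin through map_line:
#   for line in sys.stdin:
#       out = map_line(line)
#       if out is not None:
#           print(out)
```

Sorting the mapper’s output and summing the counts per company, exactly as in the word-count reducer of Appendix B, then yields the per-company violation counts.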
A.2.1. On how many occasions did companies violate the Do Not Call Registry? For each company that violated the Do Not Call Registry list the number of known unwanted calls placed by that company.
That is, how many times did each company contact one of the numbers that reported unwanted calls?
Order your results lexicographically by company (i.e., alphabetically by company name).
Your answer should resemble the following:
Acme Corporation 4 ...
A.2.2. How many numbers should be blocked / marked as spam to reduce the number of unwanted calls? What is the total number of telephone numbers that should be blocked for each company? These telephone numbers should be all numbers used by the companies guilty of violating the Do Not Call Registry. That is, if the company violated the Do Not Call Registry once, then assume that all its calls should be marked as spam. Order your results lexicographically by company (i.e., alphabetically by company name). Your answer should resemble the following:
Acme Corporation 6411
...
A.2.3. Which telephone numbers received the most spam calls? List the top-3 telephone numbers receiving spam calls and how many spam calls each received. Order your results in decreasing order of the telephone number receiving the most calls.
Your answer should resemble the following:
...
A.2.4. Which telephone numbers are responsible for the most spam calls? List the top-3 telephone numbers placing spam calls and how many calls originated from each. Order your results in decreasing order of the telephone number responsible for the most spam calls. Your answer should resemble the following:
...
A.2.5. Which hours of the day are spam calls most likely? List the top-3 hours that had the most spam calls and how many spam calls were placed in each hour. Order the results in decreasing order of the number of calls.
Your answer should resemble the following:
...
A Hadoop
Apache Hadoop[6] is an open source software framework for big data processing. It has several components, but the two most critical to this assignment are its implementation of MapReduce and the Hadoop Distributed File System (HDFS), which are based on MapReduce [1] and the Google File System (GFS) [2] respectively.
For simplicity in this assignment, we’ll use Hadoop’s streaming application programming interface (API), which allows the use of any executable script to define the map and reduce operations. The streaming API uses the standard input and output streams to pass information among jobs. More specifically, the map operation converts lines of input (text-based and terminated by a line break) into a series of key-value pairs, one per line of output. After these key-value pairs are sorted (automatically by Hadoop), the reduce operation aggregates them to produce a final value for each unique key. By convention, the streaming API uses the first tab character on a line to delimit the key and value.
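Because only the first tab delimits the key, a value may itself contain tabs; splitting with a maximum of one split preserves them (a minimal illustration with a made-up line):

```python
# The streaming API treats only the FIRST tab on a line as the
# key/value delimiter, so split with maxsplit=1.
line = "panama\t1\textra"          # made-up key-value line
key, value = line.split("\t", 1)
print(key)    # -> panama
print(value)  # -> "1<TAB>extra" (the value keeps its own tab)
```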
An advantage of the streaming API is that you can use command line utilities to test your map and reduce operations. For example, the following shell command executes two Python scripts using a (small) local data file:
cat /path/to/data | python map.py | sort | python reduce.py
where cat prints the specified data files on standard out, map.py defines the map operation, sort sorts the script’s output in ascending order, and reduce.py defines the reduce operation. Of course, none of these steps are parallelized in this case, but Hadoop will perform the various operations in parallel when processing multiple data files.
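To see why the sort stage matters, the pipeline can also be simulated in pure Python with toy map and reduce steps for word counting (the example Appendix B develops in full). The function names here are illustrative, not part of any Hadoop API:

```python
# Toy in-process stand-in for: cat data | map.py | sort | reduce.py
def map_step(text):
    # emit "token<TAB>1" for every whitespace-separated token
    return ["{}\t1".format(tok) for tok in text.split()]

def reduce_step(sorted_pairs):
    # sum counts per key, relying on identical keys being adjacent
    result, previous, total = [], None, 0
    for line in sorted_pairs:
        token, count = line.split("\t", 1)
        if token == previous:
            total += int(count)
        else:
            if previous is not None:
                result.append((previous, total))
            previous, total = token, int(count)
    if previous is not None:
        result.append((previous, total))
    return result

pairs = sorted(map_step("a man a plan a canal panama"))  # the `sort` stage
print(reduce_step(pairs))
# -> [('a', 3), ('canal', 1), ('man', 1), ('panama', 1), ('plan', 1)]
```

Dropping the sorted() call breaks the grouping (the three "a" pairs are no longer adjacent), which is exactly why Hadoop sorts between the map and reduce phases.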
B MapReduce
Writing a MapReduce job using the streaming API is straightforward. The canonical MapReduce example is counting words so we’ll use it to illustrate the process.
As previously mentioned, the map operation reads input from standard in and outputs a series of key-value pairs, one per line. The following Python code implements this operation for counting words:
#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()  # remove leading and trailing whitespace

    # split line using whitespace as delimiters
    tokens = line.split()

    # iterate over tokens
    for token in tokens:
        print("{token}\t{count}".format(token=token, count=1))

That’s it! This Python code outputs a stream of tokens with a ‘1’ to indicate that each token was encountered once in the line of text. (If the same token appears multiple times, then it will be listed multiple times.) For example, the input
a man a plan a canal panama

becomes

a i|1
man i|1
a i|1
plan i|1
a i|1
canal i|1
panama i|1

where i| indicates a tab character (i.e., \t).
The reduce operation reads the key-value pairs and aggregates them to produce a final value for each key. Of course, its input must be sorted to produce the correct results. The following Python code implements this operation for counting words:
#!/usr/bin/env python

import sys

def emit(token, count):
    print('{token}\t{count}'.format(token=token, count=count))

previous = None

for line in sys.stdin:
    line = line.strip()  # remove leading and trailing whitespace

    token, count = line.split('\t', 1)  # split key-value pair
    try:
        count = int(count)
    except ValueError:
        continue

    if previous == token:
        total = total + count
    else:
        if previous:
            emit(previous, total)
        previous = token
        total = count

if previous:
    emit(previous, total)
In essence, this script simply checks whether the prior token is the same as the current token, incrementing the running total when they match and outputting the total when they differ.
Try writing these scripts and testing them as follows:
echo "a man a plan a canal panama" | ./map.py | sort | ./reduce.py
You should see the following result:
a i|3
canal i|1
man i|1
panama i|1
plan i|1
(Note: Both scripts must be executable to invoke them in this fashion.)