CS4830 Lab 3-Dataflow Solved

Code for Demo shown in class:


import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions

# Configure the job to run on Cloud Dataflow.
options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = '<projectid>'
google_cloud_options.job_name = '<job name>'
google_cloud_options.region = 'us-central1'
# temp_location must be a Cloud Storage path (gs://...), not a region name.
google_cloud_options.temp_location = '<temp_location>'
options.view_as(StandardOptions).runner = 'DataflowRunner'

# Read the input file and write its lines to the output path.
with beam.Pipeline(options=options) as p:
    lines = (
        p
        | 'Read' >> beam.io.ReadFromText('gs://iitmbd/out.txt')
        | 'Write' >> beam.io.WriteToText('<output_path>')
    )

 

We encourage you to go through the Cloud Dataflow Model documentation before starting the assignment. It introduces some of the transforms and reducers required in the assignment.


1. Write Python code that uses Dataflow to count the lines of the file in the IITMBD bucket (gs://iitmbd/out.txt), and provide a screenshot of the file generated in your bucket.


2. Write Python code that uses Dataflow to compute the average number of words per line of the file in the IITMBD bucket (gs://iitmbd/out.txt), and provide a screenshot of the file generated in your bucket.
 

3. Provide a screenshot of the execution graph that Dataflow creates for the pipeline object in each of questions 1 and 2.



4. Explain the pipeline used in the first two questions. What issues did you face while trying to make the code work for the first two questions, and how did you resolve them?



5. [Bonus] Trigger a Dataflow job using Google Cloud Functions (GCF) for any one of the first two questions.
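
One possible way to approach questions 1 and 2 is sketched below. It is not the official solution: the names count_words, local_stats, and build_pipeline and the output path arguments are hypothetical, introduced only for illustration. The sketch assumes the Beam combiners Count.Globally and Mean.Globally are an acceptable way to do the reduction, and it imports Beam inside build_pipeline so that the plain-Python helpers can be checked on a small sample without Beam installed.

```python
def count_words(line):
    """Return the number of whitespace-separated words in one line."""
    return len(line.split())


def local_stats(lines):
    """Plain-Python reference: (line count, mean words per line).

    Handy for sanity-checking the Dataflow output on a small sample.
    """
    lines = list(lines)
    total_words = sum(count_words(line) for line in lines)
    mean = total_words / len(lines) if lines else float('nan')
    return len(lines), mean


def build_pipeline(p, input_path, count_output, mean_output):
    """Attach both branches to an existing beam.Pipeline object p.

    input_path, count_output and mean_output are caller-supplied paths
    (hypothetical parameter names, not part of the assignment text).
    """
    # Imported here so the helpers above work without Beam installed.
    import apache_beam as beam

    lines = p | 'Read' >> beam.io.ReadFromText(input_path)

    # Question 1: count every element (line) in the PCollection.
    (lines
     | 'CountLines' >> beam.combiners.Count.Globally()
     | 'WriteCount' >> beam.io.WriteToText(count_output))

    # Question 2: map each line to its word count, then average globally.
    (lines
     | 'WordsPerLine' >> beam.Map(count_words)
     | 'MeanWords' >> beam.combiners.Mean.Globally()
     | 'WriteMean' >> beam.io.WriteToText(mean_output))
```

To use it, call build_pipeline(p, 'gs://iitmbd/out.txt', ...) inside a `with beam.Pipeline(options=options) as p:` block, passing two output paths in your own bucket. Because both branches share the same 'Read' step, the execution graph for this sketch shows one source fanning out into two sinks; running the questions as separate pipelines instead gives two simpler graphs.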


Create a PDF file containing your answers to the above questions. Zip it together with the output files (for the Dataflow tasks) and your Python files, then submit the zip file on Moodle.
