CSYE7245 - Lab 2: GCP Datalab/Dataflow

This lab demonstrates how GCP services such as Datalab, Dataflow, and BigQuery can be used for data analysis and preprocessing in a machine learning workflow.

                    

Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, file storage, and YouTube. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics and machine learning.

 

Google Cloud Platform provides infrastructure as a service, platform as a service, and serverless computing environments.

 

Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud Platform. It runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.

 

Google Cloud Dataflow is a cloud-based data processing service for both batch and real-time data streaming applications. It enables developers to set up processing pipelines for integrating, preparing and analyzing large data sets, such as those found in Web analytics or big data analytics applications.

 

Storing and querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. BigQuery is an enterprise data warehouse that solves this problem by enabling super-fast SQL queries using the processing power of Google's infrastructure. 

Dataset
We used the public Natality dataset to build an ML model that predicts a baby's weight from several factors about the pregnancy and the baby's mother.

 

We cloned the GitHub repository https://github.com/GoogleCloudPlatform/training-data-analyst and used the notebook training-data-analyst/blogs/babyweight/babyweight.ipynb for data processing and model creation.

 

 
Experiment Setup
Prerequisites
In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

Enable the BigQuery, AI Platform, Cloud Source Repositories, Dataflow, and Datalab APIs.
  



 
Launching Datalab
 

The steps to launch Datalab were as follows:

●      Opened the Cloud Shell Editor

●      Retrieved the Google Cloud project ID from the Cloud Shell command line

●      Created a Datalab instance

●      The connection was established and the Datalab instance was served on port 8081
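The exact commands were shown only in screenshots, so the following is a minimal sketch of what the steps above might look like, not a record of what was typed. It assumes the standard `gcloud` and `datalab` command-line tools available in Cloud Shell; the zone and the instance name `mydatalabvm` are hypothetical. The script only assembles and prints the commands so they can be reviewed before being run by hand.

```python
import shlex

ZONE = "us-central1-a"         # assumption: any zone in the us-central1 region
INSTANCE_NAME = "mydatalabvm"  # hypothetical instance name

# Prints the project ID currently configured in Cloud Shell.
get_project_cmd = ["gcloud", "config", "get-value", "project"]

# Creates a Datalab VM and then connects to it; the Datalab UI is
# normally reachable through Cloud Shell's web preview on port 8081.
create_datalab_cmd = ["datalab", "create", "--zone", ZONE, INSTANCE_NAME]

for cmd in (get_project_cmd, create_datalab_cmd):
    # shlex.join quotes arguments safely for pasting into a shell.
    print(shlex.join(cmd))
```

Pasting the printed lines into Cloud Shell one at a time would carry out the project lookup and instance creation.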

  

Cloning Datalab Notebook
●      Opened a new notebook and pasted the following command:

!git clone https://github.com/GoogleCloudPlatform/training-data-analyst

  

Use Cases
Data Exploration, Preprocessing and Visualization in Datalab
Project ID and bucket setup in the notebook

●      In the first cell, set the variable PROJECT to your project ID.

●      In the same cell, set the variable BUCKET to your bucket name. For the bucket name, use your project ID as a prefix followed by my-bucket, for example project-ID-my-bucket.

●      Leave REGION as us-central1.
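As a rough illustration of the three settings above, a first cell could look like the sketch below. The project ID `my-project-id` is a placeholder, and exporting the values as environment variables is an assumption about how later shell cells would pick them up.

```python
import os

PROJECT = "my-project-id"                # placeholder: replace with your project ID
BUCKET = "{}-my-bucket".format(PROJECT)  # project ID prefix followed by my-bucket
REGION = "us-central1"                   # left unchanged, as in the steps above

# Assumption: make the values visible to shell cells (e.g. `!gsutil ...`).
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION

print(PROJECT, BUCKET, REGION)
```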

  

 

 

Fetching data into a DataFrame using BigQuery
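The query itself appeared only in a screenshot, so the following is a sketch of how such a fetch might be written rather than the exact cell that was run. The table name `publicdata.samples.natality`, the selected columns, and the `year > 2000` filter are assumptions based on the public Natality sample. `fetch_dataframe` relies on the `google.datalab.bigquery` module that ships with Datalab and only works inside that environment with valid credentials, so it is defined here but not called.

```python
NATALITY_TABLE = "publicdata.samples.natality"  # assumption: public sample table


def build_natality_query(limit=None):
    """Return a SQL string selecting the columns used for the babyweight model."""
    query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  {table}
WHERE
  year > 2000
""".format(table=NATALITY_TABLE).strip()
    if limit is not None:
        # A small LIMIT keeps exploratory queries cheap.
        query += "\nLIMIT {}".format(int(limit))
    return query


def fetch_dataframe(query):
    """Run the query from Datalab and return a pandas DataFrame."""
    import google.datalab.bigquery as bq  # only available inside Datalab
    return bq.Query(query).execute().result().to_dataframe()


sample_query = build_natality_query(limit=100)
print(sample_query)
```

Inside Datalab, `fetch_dataframe(sample_query).head()` would then show the first few rows.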

 

  

 

Visualizing the count and average weight of babies
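One plausible way to obtain the numbers behind a chart like this is a grouped aggregation in BigQuery. The sketch below only builds the SQL text; the table name, the `year > 2000` filter, and the output column names `num_babies` and `avg_wt` are assumptions. The same query shape would apply to `mother_age`, `plurality`, and `gestation_weeks`, the columns behind the other views in this section.

```python
def build_distinct_values_query(column_name, table="publicdata.samples.natality"):
    """Return SQL that counts babies and averages their weight per column value."""
    return """
SELECT
  {col},
  COUNT(1) AS num_babies,
  AVG(weight_pounds) AS avg_wt
FROM
  {table}
WHERE
  year > 2000
GROUP BY
  {col}
""".format(col=column_name, table=table).strip()


is_male_query = build_distinct_values_query("is_male")
print(is_male_query)
```

Once the result is loaded into a pandas DataFrame, a call such as `df.plot(x="is_male", y="num_babies", kind="bar")` would draw the count chart, and swapping `y` for `avg_wt` would draw the average-weight chart.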

  
Visualizing the correlation between mother's age and the number of babies and their average weight

Visualizing plurality trends

Visualizing the correlation between gestation period and babies' weight and count

  

 
Preprocessing using Apache Beam
We modified the data to simulate what is known when no ultrasound has been performed. Had no preprocessing been needed, the web console would have been sufficient. We also prefer scripting the work over running queries in the user interface, so we used Cloud Dataflow for the preprocessing.
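The pipeline code was shown only in screenshots, so the following is a minimal sketch of the "no ultrasound" simulation as a plain Python generator, loosely modeled on the approach in the referenced notebook. The column list, the label strings, and the assumption that `plurality` is an integer from 1 to 5 are illustrative. Without an ultrasound, the baby's sex is unknown and plurality is known only as single versus multiple, so each input row yields two CSV lines.

```python
import copy

CSV_COLUMNS = ["weight_pounds", "is_male", "mother_age", "plurality", "gestation_weeks"]
PLURALITY_NAMES = ["Single(1)", "Twins(2)", "Triplets(3)", "Quadruplets(4)", "Quintuplets(5)"]


def to_csv_rows(row):
    """Yield two CSV lines for one row: without and with ultrasound knowledge."""
    no_ultrasound = copy.deepcopy(row)
    with_ultrasound = copy.deepcopy(row)

    # Without an ultrasound the sex is unknown and only single vs. multiple is known.
    no_ultrasound["is_male"] = "Unknown"
    no_ultrasound["plurality"] = "Multiple(2+)" if row["plurality"] > 1 else "Single(1)"

    # With an ultrasound the exact plurality is known (assumes values 1 to 5).
    with_ultrasound["plurality"] = PLURALITY_NAMES[row["plurality"] - 1]

    for result in (no_ultrasound, with_ultrasound):
        yield ",".join(str(result.get(col, "None")) for col in CSV_COLUMNS)


example_row = {  # made-up values for illustration only
    "weight_pounds": 7.5,
    "is_male": True,
    "mother_age": 28,
    "plurality": 2,
    "gestation_weeks": 39,
}
csv_lines = list(to_csv_rows(example_row))
for line in csv_lines:
    print(line)
```

In an Apache Beam pipeline, a generator like this would typically be applied with `beam.FlatMap(to_csv_rows)` between a BigQuery read and a text-file write, and running the pipeline with the Dataflow runner would execute it on Cloud Dataflow.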

  

  

  

 

Results
The Dataflow job took around 48 minutes to finish.

Throughput Metrics

  

 

CPU Utilization

  

The data was loaded into the GCP bucket in CSV form.
  
