This lab demonstrates how Airflow can be used to programmatically author, schedule, automate, and monitor workflows.
● Airflow is a platform to programmatically automate, schedule and monitor workflows.
● In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
● The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
● Rich command line utilities make performing complex surgeries on DAGs a snap.
● The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
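For example, a minimal DAG file might look like the following (a hedged sketch: the dag_id, task name, schedule, and command are placeholders, not this lab's actual DemoDag):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG is a collection of tasks plus the dependencies between them,
# all sharing one schedule.
with DAG(
    dag_id="demo_dag",               # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",      # run once per day
    catchup=False,                   # skip backfilling past runs
) as dag:
    task_1 = BashOperator(
        task_id="task_1",
        bash_command="echo 'hello from task_1'",
    )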
Experiment Setup
1. Create a new project and install the required dependencies:
pip install apache-airflow
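The Airflow docs recommend installing against a constraints file matched to your Airflow and Python versions; the versions below are placeholders, so adjust them to your environment:

pip install "apache-airflow==2.7.3" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.9.txt"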
2. Set AIRFLOW_HOME to the present working directory:
export AIRFLOW_HOME=$(pwd)
3. Initialize the metadata database:
airflow db init
4. Create an admin user (the CLI will prompt for a password):
airflow users create \
--username admin \
--firstname YourName \
--lastname YourLastName \
--role Admin \
--email example@example.com
5. Start the webserver daemon in the background:
airflow webserver -D
It runs on port 8080 by default.
6. To check whether the Airflow daemon is running, list the services on port 8080:
lsof -i tcp:8080
7. Start the scheduler:
airflow scheduler
Check the web UI at http://127.0.0.1:8080
8. Create a folder named 'dags' inside AIRFLOW_HOME and place the DemoDag python file under it (for instance, a file like the sketch in the introduction).
9. Restart the webserver (and the scheduler) so the new DAG shows up in the web UI:
lsof -i tcp:8080
Kill the PID of the process listening on port 8080, then start the webserver again with 'airflow webserver -D'.
The DAG can now be seen in the DAGs list on the web UI.
10. DAGs can be scheduled to run every minute, hourly, or daily.
11. You can also pause/unpause a DAG depending on the requirement; both are shown below.
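For example, the schedule is set in the DAG file with cron or preset syntax, and pausing/unpausing can be done from the CLI (demo_dag is the placeholder dag_id from the sketch above):

schedule_interval="*/1 * * * *"   # every minute (cron syntax)
schedule_interval="@hourly"       # preset: once an hour
schedule_interval="@daily"        # preset: once a day

airflow dags pause demo_dag
airflow dags unpause demo_dag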
12. Trigger your DAG.
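This can be done with the trigger (play) button in the UI, or from the CLI (placeholder dag_id):

airflow dags trigger demo_dag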
13. Check the logs for additional information
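Task logs are available in the web UI (click a task instance, then view its log) and on disk under $AIRFLOW_HOME/logs. For quick debugging, a single task can also be run with its log streamed to stdout (placeholder dag id, task id, and date):

airflow tasks test demo_dag task_1 2023-01-01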
14. Adding tasks to a DAG:
● Add task_2 by making changes in the code and clicking the 'Update' button (a sketch of the change follows this list).
● Check the logs for additional information.
● Restructure the code as needed.
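The change itself might look like this (a hedged sketch extending the placeholder DAG from the introduction):

    task_2 = BashOperator(
        task_id="task_2",
        bash_command="echo 'hello from task_2'",
    )
    task_1 >> task_2   # task_2 runs only after task_1 succeeds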
Airflow TFX
1. Install the dependencies listed in 'requirements.txt':
pip install -r requirements.txt
2. The 'taxi_pipeline' DAG can now be seen in the dags folder.
3. The Tree view looks something like this:
Successful execution of the 'taxi_pipeline' DAG.
● ExampleGen ingests and splits the input dataset.
● StatisticsGen calculates statistics for the dataset.
● SchemaGen examines the statistics and creates a data schema.
● ExampleValidator looks for anomalies and missing values in the dataset.
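In code, these components are chained by feeding one component's outputs into the next. A hedged sketch of the wiring (TFX 1.x-style API; data_root is a placeholder path, not this lab's actual pipeline file):

from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, ExampleValidator

data_root = "path/to/taxi/data"   # placeholder: directory holding the CSV input

# Ingest and split the input dataset.
example_gen = CsvExampleGen(input_base=data_root)

# Compute statistics over the ingested examples.
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

# Infer a schema from the statistics.
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])

# Flag anomalies and missing values against the schema.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'],
)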
Lessons learned
1. This lab helps us understand how Airflow allows users to create fine-grained workflows and track their progress as they execute.
2. Airflow provides a platform that can run and automate all the jobs on a schedule.
3. You can also add or modify jobs as and when required.