Launching Dataproc Jobs with Cloud Composer

GSP286

Google Cloud Self-Paced Labs

Overview

In this lab, you'll use Google Cloud Composer to automate the transform and load steps of an ETL data pipeline. The pipeline will create a Dataproc cluster, perform transformations on extracted data (via a Dataproc PySpark job), and then upload the results to BigQuery. You'll then trigger this pipeline in either of two ways (the HTTP trigger is sketched below):

  • An HTTP POST request to a Cloud Composer endpoint
  • A recurring schedule (similar to a cron job)
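
As an illustration of the HTTP trigger, here is a minimal sketch of POSTing to a Composer-hosted Airflow web server. It assumes the Airflow 2 stable REST API; the web-server URL, DAG ID, and token below are placeholders, and a real Composer environment requires an authenticated request whose exact endpoint and auth flow depend on the Composer/Airflow version backing it.

```python
# Hedged sketch: triggering a Composer-hosted DAG with an HTTP POST.
# All names here are placeholders, not values from the lab.
import requests

AIRFLOW_WEB_SERVER = "https://example-web-server.composer.example.com"  # placeholder
DAG_ID = "dataproc_etl"  # placeholder DAG ID
TOKEN = "..."            # bearer token from your auth flow (elided)

# Airflow 2 stable REST API: create a new DAG run for the given DAG.
response = requests.post(
    f"{AIRFLOW_WEB_SERVER}/api/v1/dags/{DAG_ID}/dagRuns",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"conf": {"input_path": "gs://example-bucket/raw/"}},  # optional run config
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```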

Cloud Composer workflows are composed of DAGs (directed acyclic graphs). You will design and implement your own DAG, working through both the design considerations and the implementation details needed to ensure that your prototype meets the requirements. A minimal DAG, for orientation, looks like this:
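
```python
# A minimal DAG sketch, assuming Airflow 2.x conventions (EmptyOperator
# is DummyOperator in releases before 2.3). Names are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # None = external trigger only; a cron string enables recurring runs
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # The >> operator declares the edges of the graph: extract must
    # finish before transform, which must finish before load.
    extract >> transform >> load
```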

What you will build

You're going to build an Apache Airflow DAG (sketched after this list) that will:

  • Begin running when triggered by an on-prem POST request
  • Spin up a Dataproc cluster
  • Run a PySpark job on the cluster
  • Tear down the cluster when the job completes
  • Upload the PySpark output to BigQuery
  • Remove any remaining intermediate files
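
The following is a hedged sketch of such a DAG using operators from the apache-airflow-providers-google package. Every project, region, bucket, cluster, and file name is a placeholder, and the lab's actual DAG may structure these steps differently.

```python
# Hedged sketch of the end-to-end DAG. All identifiers are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "my-project"      # placeholder
REGION = "us-central1"         # placeholder
CLUSTER_NAME = "etl-cluster"   # placeholder
BUCKET = "my-staging-bucket"   # placeholder

with DAG(
    dag_id="dataproc_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,    # run only when triggered externally
    catchup=False,
) as dag:
    # Spin up an ephemeral Dataproc cluster for the transform step.
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    )

    # Run the PySpark transformation against the extracted data.
    run_pyspark = DataprocSubmitJobOperator(
        task_id="run_pyspark",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/transform.py"},
        },
    )

    # Tear the cluster down whether or not the job succeeded.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    # Load the job's output files from Cloud Storage into BigQuery.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket=BUCKET,
        source_objects=["output/part-*"],
        destination_project_dataset_table=f"{PROJECT_ID}.etl.results",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    # Clean up intermediate files once the load completes.
    cleanup = GCSDeleteObjectsOperator(
        task_id="cleanup",
        bucket_name=BUCKET,
        prefix="output/",
    )

    create_cluster >> run_pyspark >> delete_cluster
    run_pyspark >> load_to_bq >> cleanup
```

Note the trigger_rule=ALL_DONE on the delete step: it tears down the cluster even if the PySpark job fails, so you aren't left paying for an idle cluster.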
