Launching Dataproc Jobs with Cloud Composer

Open the Google Cloud console

Caution: When you are in the console, do not deviate from the lab instructions. Doing so may cause your account to be blocked.

1 hour 30 minutes · 7 Credits

GSP286

Google Cloud Self-Paced Labs

Overview

In this lab you'll use Google Cloud Composer to automate the transform and load steps of an ETL data pipeline. The pipeline will create a Dataproc cluster, perform transformations on extracted data (via a Dataproc PySpark job), and then upload the results to BigQuery. You'll then trigger this pipeline using either:

  • HTTP POST request to a Cloud Composer endpoint
  • Recurring schedule (similar to cron job)

Cloud Composer workflows are composed of DAGs (Directed Acyclic Graphs). You will design and implement your own DAG, working through both the design considerations and the implementation details needed for your prototype to meet the requirements.
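
As a rough illustration of the HTTP POST trigger option above, the Python snippet below posts to the Airflow 2 stable REST API that a Cloud Composer environment's web server exposes. The web server URL, DAG ID, and bearer token are placeholders rather than values from this lab; Composer requires an authenticated request, and the lab covers the exact trigger mechanism it expects.

    import requests

    # Placeholders (assumptions) -- substitute your environment's values.
    AIRFLOW_WEB_SERVER = "https://example-composer-web-server.example.com"
    DAG_ID = "my_etl_dag"

    # Airflow 2 stable REST API: create a new DAG run for DAG_ID.
    # An empty "conf" object starts the run with default parameters.
    response = requests.post(
        f"{AIRFLOW_WEB_SERVER}/api/v1/dags/{DAG_ID}/dagRuns",
        json={"conf": {}},
        headers={"Authorization": "Bearer <TOKEN>"},  # placeholder credential
    )
    response.raise_for_status()
    print(response.json()["dag_run_id"])

The recurring-schedule option needs no external call at all: you set a cron-style schedule on the DAG itself, as noted in the sketch later in this lab.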

What you will build

You're going to build an Apache Airflow DAG (a rough sketch follows this list) that will:

  • Begin running when triggered by a POST request from an on-premises system
  • Spin up a Dataproc cluster
  • Run a PySpark job on the cluster
  • Tear down the cluster when the job completes
  • Upload the PySpark output to BigQuery
  • Remove any remaining intermediate files
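
The following is a minimal, hypothetical sketch of such a DAG using operators from the apache-airflow-providers-google package. The lab's own DAG may use different operators and names, and every project, region, bucket, and dataset identifier below is a placeholder you would replace.

    from datetime import datetime

    from airflow import models
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
        DataprocDeleteClusterOperator,
        DataprocSubmitJobOperator,
    )
    from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )
    from airflow.utils.trigger_rule import TriggerRule

    PROJECT_ID = "my-project"           # assumption: replace with your project
    REGION = "us-central1"              # assumption
    BUCKET = "my-staging-bucket"        # assumption
    CLUSTER_NAME = "ephemeral-cluster"  # assumption

    with models.DAG(
        dag_id="dataproc_etl_sketch",
        schedule_interval=None,  # triggered externally; a cron string here gives the recurring option
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        # 1. Spin up a Dataproc cluster.
        create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster",
            project_id=PROJECT_ID,
            region=REGION,
            cluster_name=CLUSTER_NAME,
            cluster_config={"worker_config": {"num_instances": 2}},
        )

        # 2. Run the PySpark transformation job on that cluster.
        run_pyspark = DataprocSubmitJobOperator(
            task_id="run_pyspark",
            project_id=PROJECT_ID,
            region=REGION,
            job={
                "placement": {"cluster_name": CLUSTER_NAME},
                "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/transform.py"},
            },
        )

        # 3. Tear the cluster down even if the job fails.
        delete_cluster = DataprocDeleteClusterOperator(
            task_id="delete_cluster",
            project_id=PROJECT_ID,
            region=REGION,
            cluster_name=CLUSTER_NAME,
            trigger_rule=TriggerRule.ALL_DONE,
        )

        # 4. Load the job output from Cloud Storage into BigQuery.
        load_to_bq = GCSToBigQueryOperator(
            task_id="load_to_bq",
            bucket=BUCKET,
            source_objects=["output/part-*"],
            destination_project_dataset_table=f"{PROJECT_ID}.my_dataset.results",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_TRUNCATE",
        )

        # 5. Remove the intermediate files once they are loaded.
        cleanup = GCSDeleteObjectsOperator(
            task_id="cleanup",
            bucket_name=BUCKET,
            prefix="output/",
        )

        create_cluster >> run_pyspark >> delete_cluster
        run_pyspark >> load_to_bq >> cleanup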


Score: —/25

  • Create a Cloud Storage bucket: /5
  • Export the Data: /5
  • Create Cloud Composer environment: /5
  • Uploading the DAG to Cloud Storage: /5
  • Triggering the DAG: /5
