Distributed Image Processing in Cloud Dataproc
In this hands-on lab, you will learn how to use Apache Spark on Cloud Dataproc to distribute a computationally intensive image processing task across a cluster of machines. This lab is part of a series of labs on processing scientific data.
What you'll learn
- How to create a managed Cloud Dataproc cluster with Apache Spark pre-installed.
- How to build and run jobs that use external packages that aren't already installed on your cluster.
- How to shut down your cluster.
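The three skills above map onto three gcloud invocations: create a cluster, submit a Spark job, and delete the cluster. The following is a minimal sketch, not the lab's own script. It only builds and prints the commands rather than running them. The cluster name, region, main class, bucket, and JAR path are hypothetical placeholders, and passing dependencies through --jars is one possible way to ship external packages, assumed here for illustration.

```python
import shlex


def create_cluster_cmd(cluster, region, num_workers=2):
    """Build the command that creates a managed Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]


def submit_spark_job_cmd(cluster, region, main_class, jars, job_args=()):
    """Build the command that submits a Spark job to an existing cluster.

    `jars` lists JAR files that bundle the job and its external
    dependencies (an assumption about how the packages are shipped).
    Everything after the bare "--" is passed to the job itself.
    """
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={','.join(jars)}",
    ]
    if job_args:
        cmd += ["--", *job_args]
    return cmd


def delete_cluster_cmd(cluster, region):
    """Build the command that shuts the cluster down without prompting."""
    return [
        "gcloud", "dataproc", "clusters", "delete", cluster,
        f"--region={region}",
        "--quiet",
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cluster, region = "demo-cluster", "us-central1"
    steps = [
        create_cluster_cmd(cluster, region, num_workers=3),
        submit_spark_job_cmd(
            cluster, region,
            main_class="example.FeatureDetector",
            jars=["gs://my-bucket/feature_detector.jar"],
            job_args=["gs://my-bucket/imgs/", "gs://my-bucket/out/"],
        ),
        delete_cluster_cmd(cluster, region),
    ]
    for step in steps:
        print(" ".join(shlex.quote(part) for part in step))
```

Building each command as a list of arguments keeps quoting explicit, so the same lists could later be handed to subprocess.run if you choose to execute them.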
This is an advanced-level lab. Familiarity with Cloud Dataproc and Apache Spark is recommended but not required. If you want to get up to speed on these services, check out the following labs:
- Dataproc: Qwik Start - Command Line
- Dataproc: Qwik Start - Console
- Introduction to Cloud Dataproc: Hadoop and Spark on Google Cloud Platform
Once you're ready, scroll down to learn more about the services that you'll be using in this lab.
Create a development machine in Compute Engine
Install software on the development machine
Create a GCS bucket
Download some sample images into your bucket
Create a Cloud Dataproc cluster
Submit your job to Cloud Dataproc