
Evaluating a Data Model


45 minutes 5 Credits

GSP204


Overview

In this lab you partition a dataset into two datasets: a training dataset and a test dataset. You use the training dataset to develop a model and the test dataset to evaluate the model's accuracy, which lets you evaluate predictive models in a repeatable manner.

After you use the training set to create a model, you evaluate the model against the test dataset. You store the data in BigQuery and use JupyterLab to perform the analysis.

This lab uses a dataset provided by the US Bureau of Transportation Statistics. The dataset provides historic information about internal flights in the United States and can be used to demonstrate a wide range of data science concepts and techniques.

BigQuery is a RESTful web service that enables interactive analysis of massive datasets. BigQuery works in conjunction with Google Cloud Storage. See BigQuery for more information.

JupyterLab is a web-based interactive development environment for notebooks, code, and data. See JupyterLab.

Objectives

  • Partition a BigQuery dataset into a training dataset and a test dataset

  • Create a predictive model using the training dataset

  • Evaluate the predictive model using the test dataset

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

What you need

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
  • Time to complete the lab.

Note: If you already have your own personal Google Cloud account or project, do not use it for this lab.

Note: If you are using a Chrome OS device, open an Incognito window to run this lab.

How to start your lab and sign in to the Google Cloud Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is a panel populated with the temporary credentials that you must use for this lab.

    Open Google Console

  2. Copy the username, and then click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.

    Sign in

    Tip: Open the tabs in separate windows, side-by-side.

  3. In the Sign in page, paste the username that you copied from the left panel. Then copy and paste the password.

    Important: You must use the credentials from the left panel. Do not use your Google Cloud Training credentials. If you have your own Google Cloud account, do not use it for this lab (avoids incurring charges).

  4. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Cloud Console opens in this tab.

Task 1. Prepare your environment

This lab uses a set of code samples and scripts developed for Data Science on the Google Cloud Platform, 2nd Edition from O'Reilly Media, Inc. For this lab, you clone the sample repository from GitHub to Cloud Shell and carry out all of the lab tasks from there.

Selecting a training dataset

Before selecting your training dataset, you must decide on a repeatable mechanism to select a subset of the data as the training dataset, which you use to create predictive models. You use the remaining data in the dataset as the test set, which you use to evaluate the effectiveness of your models.

You also need to provide a representative selection of the data. Possible options:

  • Add a WHERE clause that selects records using BigQuery's RAND() function. Unfortunately, that makes it virtually impossible to select an independent evaluation test dataset. Also, the RAND() function returns different results every time it runs, which makes it impossible to compare model performance between runs.

  • Select a specific date range, or just the first N records in the table. Unfortunately, you could inadvertently select training data that misses some date-based variation in the data.

  • Identify a specific set of dates, each initially chosen at random, as the training dataset, and save those dates in a separate database table. This lets you replay training and test queries in a consistent manner, and is the best approach.
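The trade-off between the first and third options can be seen in a few lines of Python. This is an illustrative sketch, not part of the lab: `random.random()` stands in for BigQuery's RAND(), and `zlib.crc32` stands in for a deterministic hash such as FARM_FINGERPRINT.

```python
# Illustrative sketch: a RAND()-style split changes on every run, while a
# hash-based split is stable across runs.
import random
import zlib

days = [f"2015-01-{d:02d}" for d in range(1, 32)]

def random_split(days):
    # Non-repeatable: a fresh random draw per call, like RAND().
    return {d for d in days if random.random() < 0.7}

def hashed_split(days):
    # Repeatable: the same date always hashes to the same bucket.
    # zlib.crc32 is just an illustrative deterministic hash here.
    return {d for d in days if zlib.crc32(d.encode()) % 100 < 70}

print(random_split(days) == random_split(days))  # usually False
print(hashed_split(days) == hashed_split(days))  # always True
```

Because the hashed split depends only on the date string, any later query can reproduce exactly the same train/test partition.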

To identify all of the unique dates in the dataset:

  1. In the Cloud Console, on the Navigation menu, click BigQuery to open the BigQuery console.

  2. In the left panel, notice your project name is the same as the GCP Project ID, for example, qwiklabs-gcp-01-90fdc8ba3e8b. Click to expand your project to see that the dsongcp dataset was imported for you during the lab setup.

  3. Add the following query into the Query editor, then click RUN.

SELECT DISTINCT(FL_DATE) AS FL_DATE
FROM `dsongcp.flights_tzcorr`
ORDER BY FL_DATE

To randomly select 70% of these dates to be the training days:

  1. Paste the following query into the Query editor and click RUN.

SELECT
  FL_DATE,
  IF(ABS(MOD(FARM_FINGERPRINT(CAST(FL_DATE AS STRING)), 100)) < 70,
     'True', 'False') AS is_train_day
FROM (
  SELECT DISTINCT(FL_DATE) AS FL_DATE
  FROM dsongcp.flights_tzcorr)
ORDER BY FL_DATE
LIMIT 5

In the preceding query, the hash value of each of the unique days from the inner query is computed using the FarmHash library. The is_train_day field is set to True if this hash value modulo 100 (in effect, its last two digits) is less than 70.
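The same hash-and-threshold mechanics can be sketched in Python. This is a hedged illustration: `hashlib.md5` stands in for BigQuery's FARM_FINGERPRINT, so the individual bucket assignments differ from the query's, but the repeatability property and the roughly 70/30 split are the same.

```python
# Sketch of a deterministic ~70% date split; hashlib.md5 stands in for
# FARM_FINGERPRINT (same input string always yields the same bucket).
import hashlib
from datetime import date, timedelta

def is_train_day(fl_date: str, train_pct: int = 70) -> bool:
    digest = hashlib.md5(fl_date.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < train_pct

# A year of dates; roughly 70% land in the training split, and the
# assignment is identical on every run.
days = [(date(2015, 1, 1) + timedelta(days=i)).isoformat() for i in range(365)]
train_days = [d for d in days if is_train_day(d)]
print(f"{len(train_days)} of {len(days)} days assigned to training")
```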

  1. Paste the following query into the Query editor and click RUN to save this result as a table in BigQuery.

CREATE OR REPLACE TABLE dsongcp.trainday AS
SELECT
  FL_DATE,
  IF(ABS(MOD(FARM_FINGERPRINT(CAST(FL_DATE AS STRING)), 100)) < 70,
     'True', 'False') AS is_train_day
FROM (
  SELECT DISTINCT(FL_DATE) AS FL_DATE
  FROM dsongcp.flights_tzcorr)
ORDER BY FL_DATE

Click Check my progress below to check your lab progress. Create a table

Task 2. Create the model using the Training dataset

Use JupyterLab to create the model.

  1. In the Google Cloud Console, on the Navigation Menu, click Vertex AI > Workbench.

  2. On the Notebook instances page, click New Notebook > TensorFlow Enterprise > TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.

  3. In the New notebook instance dialog, confirm the name of the deep learning VM, and then click Create. The new VM will take 2-3 minutes to start.

  4. Click Open JupyterLab. A JupyterLab window will open in a new tab.

Click Check my progress below to check your lab progress. Create a notebook instance

  1. Click the Python 3 card under Notebook.

You use Python 3 code for this Notebook.

  1. Copy and paste the following code into the empty cell to import the necessary modules and initialize a BigQuery client.

The BigQuery client sends and receives messages from the BigQuery API.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from google.cloud import bigquery as bq
  1. Run the cell by either pressing Shift + Enter or clicking the triangle in the top ribbon. Either way runs the cell and opens the next empty cell.

  2. In the new cell, enter the following code to create a variable with the new BigQuery query using just the training day partition of the dataset:

%%bigquery depdelay
SELECT
  DEP_DELAY,
  APPROX_QUANTILES(ARR_DELAY, 101)[OFFSET(70)] AS arrival_delay,
  COUNT(ARR_DELAY) AS numflights
FROM dsongcp.flights_tzcorr
JOIN dsongcp.trainday USING(FL_DATE)
WHERE is_train_day = 'True'
GROUP BY DEP_DELAY
HAVING numflights > 370
ORDER BY DEP_DELAY
  1. Run the cell.

  2. Enter the following code into the new cell to plot the intersection of the 15 minute delay line with the 30% arrival delay probability line for the training dataset:

ax = plt.axes()
depdelay.plot(kind='line', x='DEP_DELAY', y='arrival_delay',
              ax=ax, ylim=(0, 30), xlim=(0, 30), legend=False)
ax.set_xlabel('Departure Delay (minutes)')
ax.set_ylabel('> 30% prob of this Arrival Delay (minutes)');
x = np.arange(0, 30)
y = np.ones_like(x) * 15
ax.plot(x, y, color='r');
y = np.arange(0, 30)
x = np.ones_like(y) * 16
ax.plot(x, y, 'g.');
  1. Run the cell.

Ignore the deprecation warning.

On running it, the threshold (the x-axis value of the intersection point) remains consistent, as depicted in the figure below.

Intersection of the 15 minute delay line with the 30% arrival delay probability line

Notice you get the same answer—16 minutes—after creating the empirical probabilistic model on just the training data. This is good.
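The logic of the query and the plot above can be sketched in plain pandas. This is a hedged, self-contained illustration on made-up data, not the lab's dsongcp tables: group flights by departure delay, take the 70th percentile of arrival delay per group (the equivalent of APPROX_QUANTILES(ARR_DELAY, 101)[OFFSET(70)]), then read off the smallest departure delay whose 70th-percentile arrival delay reaches 15 minutes.

```python
import numpy as np
import pandas as pd

# Toy flight data (made up for illustration): arrival delay loosely
# tracks departure delay plus noise.
rng = np.random.default_rng(42)
dep = rng.integers(0, 30, size=20000)
arr = dep + rng.normal(-2, 4, size=20000)
flights = pd.DataFrame({"DEP_DELAY": dep, "ARR_DELAY": arr})

# Per departure delay: 70th percentile of arrival delay and flight count,
# mirroring the query's APPROX_QUANTILES and COUNT aggregations.
depdelay = (
    flights.groupby("DEP_DELAY")["ARR_DELAY"]
    .agg(arrival_delay=lambda s: s.quantile(0.70), numflights="count")
    .reset_index()
)
# Mirror the HAVING clause: keep only well-populated groups.
depdelay = depdelay[depdelay["numflights"] > 100]

# The decision threshold: smallest departure delay whose 70th-percentile
# arrival delay is at least 15 minutes.
threshold = depdelay.loc[depdelay["arrival_delay"] >= 15, "DEP_DELAY"].min()
print(threshold)
```

Because the toy data makes arrival delay track departure delay, the threshold lands near 15 minutes here; on the lab's real data the same read-off yields 16 minutes.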

In the next section, you test this model against the independent test dataset.

Task 3. Evaluate the data model

Evaluate how well your recommendation of 16 minutes does in terms of predicting an arrival delay of 15 minutes or more. To do that, find the number of times you would have wrongly canceled a meeting or missed a meeting. Compute these numbers by running the following query against days that are not training days:

  1. Paste the following code into the empty cell:

%%bigquery df_eval
SELECT
  SUM(IF(DEP_DELAY < 16 AND arr_delay < 15, 1, 0)) AS correct_nocancel,
  SUM(IF(DEP_DELAY < 16 AND arr_delay >= 15, 1, 0)) AS wrong_nocancel,
  SUM(IF(DEP_DELAY >= 16 AND arr_delay < 15, 1, 0)) AS wrong_cancel,
  SUM(IF(DEP_DELAY >= 16 AND arr_delay >= 15, 1, 0)) AS correct_cancel
FROM (
  SELECT DEP_DELAY, ARR_DELAY
  FROM dsongcp.flights_tzcorr
  JOIN dsongcp.trainday USING(FL_DATE)
  WHERE is_train_day = 'False'
)
  1. Run the cell.

  2. Paste the following code into the new empty cell:

print(df_eval['correct_nocancel'] / (df_eval['correct_nocancel'] + df_eval['wrong_nocancel']))
print(df_eval['correct_cancel'] / (df_eval['correct_cancel'] + df_eval['wrong_cancel']))
  1. Run the cell with Shift + Enter.

  2. Paste the following code into the new empty cell and run the cell.

df_eval.head()

This gives the following results:

Table showing cancel nocancel data

Note that this query does not exclude the outlier data. You are evaluating real data against the model to see how well it performs in all situations.
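The arithmetic behind the two print statements above can be made concrete with a self-contained sketch. The four counts below are made up for illustration; in the lab, df_eval holds the real values from the evaluation query.

```python
# Hypothetical contingency-table counts (illustrative only): the fraction
# of correct "no cancel" and correct "cancel" decisions.
correct_nocancel, wrong_nocancel = 720, 280
correct_cancel, wrong_cancel = 300, 100

nocancel_accuracy = correct_nocancel / (correct_nocancel + wrong_nocancel)
cancel_accuracy = correct_cancel / (correct_cancel + wrong_cancel)
print(nocancel_accuracy)  # 0.72
print(cancel_accuracy)    # 0.75
```

Each ratio divides the correct decisions of one kind by all decisions of that kind, which is why the lab reads the first number as "correct not to cancel about 72% of the time."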

  1. Enter the following code in the new cell to create a query string that captures the contingency table that allows you to score the effectiveness of the model:

%%bigquery df_eval
SELECT
  SUM(IF(DEP_DELAY = 15 AND arr_delay < 15, 1, 0)) AS correct_nocancel,
  SUM(IF(DEP_DELAY = 15 AND arr_delay >= 15, 1, 0)) AS wrong_nocancel,
  SUM(IF(DEP_DELAY = 16 AND arr_delay < 15, 1, 0)) AS wrong_cancel,
  SUM(IF(DEP_DELAY = 16 AND arr_delay >= 15, 1, 0)) AS correct_cancel
FROM (
  SELECT DEP_DELAY, ARR_DELAY
  FROM dsongcp.flights_tzcorr
  JOIN dsongcp.trainday USING(FL_DATE)
  WHERE is_train_day = 'False'
)
  1. Run the cell. Now enter and run the following cell:

df_eval.head()

Cell Output

  1. Now enter the following code to display the ratio of correct to incorrect calls for the full evaluation dataset:

print(df_eval['correct_nocancel'] / (df_eval['correct_nocancel'] + df_eval['wrong_nocancel']))
print(df_eval['correct_cancel'] / (df_eval['correct_cancel'] + df_eval['wrong_cancel']))
  1. Run the cell.

This reports the following:

Cell Output

Click Check my progress below to check your lab progress. Evaluating the data model

As expected, you are correct not to cancel the meeting 72% of the time, close to the target of 70%. You chose the departure delay threshold of 16 minutes on the training dataset because you expected to be 70% correct in not canceling at that threshold; you have now proved on an independent dataset that this is the case.

This model achieves the 70% correctness measure that was our target but does so by canceling fewer flights than the contingency-table-based model of Chapter 3.

Congratulations!

Now you know how to use JupyterLab to query and visualize data from BigQuery.

Data Science on Google Cloud Badge

Finish your Quest

This self-paced lab is part of the Quest, Data Science on Google Cloud. A Quest is a series of related labs that form a learning path. Completing this Quest earns you the badge above, to recognize your achievement. You can make your badge (or badges) public and link to them in your online resume or social media account. Enroll in this Quest and get immediate completion credit if you've taken this lab. See other available Quests.

Take your next lab

Continue your Quest with Visualizing Data with Google Data Studio, or check out:

Next steps / learn more

Google Cloud Training & Certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated April 1, 2022
Lab Last Tested March 18, 2022

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.