Practice transformation methods

Lab · 1 hour 30 minutes · 2 credits · Introductory

IMPORTANT:

  • Make sure to complete this hands-on lab on a desktop or laptop only.

  • There are only 5 attempts permitted per lab.

  • As a reminder, it is common not to get every question correct on your first try, and you may even need to redo a task; this is part of the learning process.

  • Once a lab is started, the timer cannot be paused. After 1 hour and 30 minutes, the lab will end and you’ll need to start again.

  • For more information, review the Lab technical tips reading.

Activity overview

As a cloud data analyst, you'll use data transformations to change the format, structure, or content of data to prepare it for storage and analysis.

In general, data transformation techniques help data professionals better understand a dataset’s distribution, main characteristics, and overall quality; for this reason, they are often the first step analysts take during exploratory data analysis.

Some common transformation techniques used for exploratory data analysis are limiting, sampling, and aggregation.

Data limiting is a technique that restricts the number of rows returned by a query. This is useful when you want to limit the amount of data that is displayed, and in some cases it can improve the query’s speed and performance.

Data sampling is a technique of selecting a segment of a dataset that is representative of the entire dataset in order to better understand its characteristics.

Data aggregation is a technique used to summarize the data in a more manageable format.
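
As a preview, here’s a minimal sketch of what each technique can look like in BigQuery SQL, using the products table from the dataset you’ll work with in this lab; you’ll run queries like these step by step in the tasks below.

-- Limiting: return only the first 10 rows
SELECT * FROM `thelook_ecommerce.products` LIMIT 10;

-- Sampling: return a random subset of roughly 10 percent of the table's data blocks
SELECT * FROM `thelook_ecommerce.products` TABLESAMPLE SYSTEM (10 PERCENT);

-- Aggregation: summarize the number of products in each category
SELECT category, COUNT(*) AS itemCount FROM `thelook_ecommerce.products` GROUP BY category;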

In this lab activity, you’ll explore different ways that you can use these techniques with SQL in BigQuery to explore the data and identify potential data quality issues.

Scenario

As a cloud data analyst for TheLook eCommerce, you’ve been asked to collaborate with a cross-functional team that includes merchandising, logistics, and marketing experts. This team is tasked with finding ways to improve delivery times and increase customer satisfaction across TheLook eCommerce’s entire product line.

You have prepared a report analyzing the number of returns for the team. But Meredith, the head merchandiser, has raised a concern that the number of products returned may not be correct.

To figure out the problem, you have been asked to explore the thelook_ecommerce dataset, which contains various tables related to product information, orders, and order items. Your job is to identify potential issues, such as duplicate data, that may be impacting the results Meredith has noted. To do this, you'll use SQL to limit, sample, and aggregate the data.

Here’s how you'll do this task: First, you'll explore the products table. Second, you'll retrieve the total number of rows and the number of products with distinct names. Third, you’ll determine the number of items per category. Fourth, you’ll filter the data to remove the categories with a small number of items. Fifth, you’ll sample the products table. Finally, you’ll explore the data contained in the order_items table.

Setup

Before you click Start Lab

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This practical lab lets you do the activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
Note: Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which might otherwise cause extra charges to be incurred on your personal account.
  • Time to complete the lab. Remember, once you start, you cannot pause a lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

How to start your lab and sign in to the Google Cloud console

  1. Click the Start Lab button. On the left is the Lab Details panel with the following:

    • Time remaining
    • The Open Google Cloud console button
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
    Note: If you need to pay for the lab, a pop-up opens for you to select your payment method.
  2. Click Open Google Cloud console (or right-click and select Open Link in Incognito Window) if you are running the Chrome browser. The Sign in page opens in a new browser tab.

    Tip: You can arrange the tabs in separate, side-by-side windows to easily switch between them.

    Note: If the Choose an account dialog displays, click Use Another Account.
  3. If necessary, copy the Google Cloud username below and paste it into the Sign in dialog. Click Next.

{{{user_0.username | "Google Cloud username"}}}

You can also find the Google Cloud username in the Lab Details panel.

  4. Copy the Google Cloud password below and paste it into the Welcome dialog. Click Next.
{{{user_0.password | "Google Cloud password"}}}

You can also find the Google Cloud password in the Lab Details panel.

Important: You must use the credentials the lab provides you. Do not use your Google Cloud account credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  5. Click through the subsequent pages:
    • Accept the terms and conditions
    • Do not add recovery options or two-factor authentication (because this is a temporary account)
    • Do not sign up for free trials

After a few moments, the Console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.

Task 1. Explore data in a table using data limiting

In this task, you’ll explore data in a table using data limiting.

  1. In the Google Cloud console, from the Navigation menu, select BigQuery.
Note: The Welcome to BigQuery in the Cloud Console message box may appear, providing links to the quickstart guide and the release notes for UI updates. Click Done to proceed.

The Navigation menu, Explorer pane, and Query Editor display.

  2. Expand the list of datasets by clicking the expander arrow next to the project ID.
  3. Scroll down to the thelook_ecommerce dataset.
  4. Expand the thelook_ecommerce dataset. The tables within this dataset are displayed.
  5. Select the products table. The table schema displays, where you can explore the table's structure and field information.

Now, explore the data by executing a query that returns ten rows from the products table. This can help you gain some insight into the contents of the table.

Note: To quickly inspect the contents of a table in BigQuery without running a query, you can also use the Preview tab. The Preview tab displays the first 1000 rows of a table.
  6. In the Query Editor, click on the Compose a new query (+) icon. The Untitled tab opens.
  7. Copy and paste the following command into the Untitled tab:
SELECT * FROM `thelook_ecommerce.products` LIMIT 10;

This query limits the results to the first 10 rows of the products table in the thelook_ecommerce dataset.

  8. Click Run.
Note: Each time you run a new query in the Query Editor, you can either replace the old query by copying and pasting the new query over the previous one in the same Untitled tab, or you can click on the Compose a new query (+) icon to open a new Untitled tab to run the query in.

BigQuery also lets you select and run a single query from among multiple queries in the Query Editor pane, which is useful when you have several queries in a single editor tab.
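
If you'd like to experiment further, you can also combine LIMIT with a column list to preview only the fields you care about. The sketch below assumes the products table includes name, category, and retail_price columns; confirm the exact column names in the table schema before running it.

-- Preview a few assumed columns only; adjust the column names to match the table schema
SELECT name, category, retail_price FROM `thelook_ecommerce.products` LIMIT 10;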

Click Check my progress to verify that you have completed this task correctly.

Explore data in a table using data limiting

Task 2. Identify duplicate rows using COUNT(DISTINCT name)

In this task, you’ll determine the total number of rows and the number of products with distinct names contained in the products table.

  1. Create a new query; copy the following query into the Query Editor:
SELECT COUNT(*) AS NumberOfRows, COUNT(DISTINCT name) AS NumberofProducts FROM `thelook_ecommerce.products`;

This query returns the total number of entries in the products table, and the number of unique product names.

  2. Click Run.

In BigQuery, the COUNT(DISTINCT name) SQL aggregation function is used to calculate the number of unique values in the name column of a dataset. It returns the count of distinct (unique) values present in that column.

This can help you to identify any duplicate products present in the table. Identifying and fixing duplicate data is an important step in data analysis because duplicate data can lead to skewed results and errors during analysis.
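
If the two counts differ, you can optionally drill down and list the specific product names that appear more than once. The following sketch uses the same GROUP BY and HAVING pattern you’ll apply in Task 4; it is not a required lab step.

-- List product names that appear more than once, and how often each one appears
SELECT name, COUNT(*) AS nameCount FROM `thelook_ecommerce.products` GROUP BY name HAVING nameCount > 1 ORDER BY nameCount DESC;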

Click Check my progress to verify that you have completed this task correctly.

Identify duplicate rows using COUNT(DISTINCT name)

Task 3. Use GROUP BY to identify number of items in category

In this task, you’ll determine the number of items per category in the products table by aggregating the number of products in each category using the GROUP BY SQL keyword.

  1. Create a new query; copy the following query into the Query Editor:
SELECT category, COUNT(*) AS itemCount FROM `thelook_ecommerce.products` GROUP BY category;

This query groups the products in the products table by category and counts the number of products in each category.

  2. Click Run.

  3. Create a new query; copy the following query into the Query Editor:
SELECT segment, COUNT(*) AS itemCount FROM `thelook_ecommerce.products` GROUP BY segment;

This query runs the same aggregation again to identify which segment each item has been assigned to. Note that category has been replaced with segment in both the SELECT and GROUP BY clauses.

  4. Click Run.
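
Optionally, to see at a glance which categories contain the most items, you can add an ORDER BY clause to the same aggregation. This is a small variation on the query above, not a required lab step.

-- Optional variation: sort the categories from largest to smallest item count
SELECT category, COUNT(*) AS itemCount FROM `thelook_ecommerce.products` GROUP BY category ORDER BY itemCount DESC;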

Click Check my progress to verify that you have completed this task correctly.

Use GROUP BY to identify number of items in category

Task 4. Filter the data using GROUP BY and HAVING

In this task, you’ll filter data to remove the categories with a small number of items before sampling the dataset using the GROUP BY and HAVING SQL keywords.

  1. Create a new query; copy the following query into the Query Editor:
SELECT category, COUNT(*) AS itemCount FROM `thelook_ecommerce.products` GROUP BY category HAVING itemCount > 1000;

This query returns the categories with a large number of items. Currently, the threshold is set at 1000, so only categories with more than 1000 items are returned. You can adjust this threshold up or down to get the results you need.

  2. Click Run.
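
To experiment with the threshold mentioned above, you could, for example, lower it and sort the remaining categories. This optional variation isn’t checked by the lab.

-- Optional variation: lower the threshold to 500 items and sort the results
SELECT category, COUNT(*) AS itemCount FROM `thelook_ecommerce.products` GROUP BY category HAVING itemCount > 500 ORDER BY itemCount DESC;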

Click Check my progress to verify that you have completed this task correctly.

Filter the data using GROUP BY and HAVING

Task 5. Sample a BigQuery table using TABLESAMPLE

In this task, you'll use data sampling to retrieve a random subset of rows from the products table.

Table sampling is useful for a variety of purposes, such as exploring data, testing queries, or getting a quick overview of a large dataset.

Sampling returns a random selection of rows while avoiding the costs associated with scanning and processing an entire table. This is because sampling only reads a subset of the data, which can significantly reduce the amount of time and resources required to run the query.

Unlike the LIMIT clause, which you used in a previous task, TABLESAMPLE returns a random subset of data from a table. This means that the results of a TABLESAMPLE query may vary each time it is run.

BigQuery does not cache the results of queries that include a TABLESAMPLE clause. This is because the results of a sampling query are always random, so caching them would not be helpful.

  1. Create a new query; copy the following query into the Query Editor:
SELECT * FROM `thelook_ecommerce.products` TABLESAMPLE SYSTEM (10 PERCENT);
  2. Click Run.
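
If you want an even quicker look, you can reduce the sampling percentage and select only a few columns. The sketch below assumes the products table includes an id column; check the schema to confirm. Because sampling is random, the number of rows returned will vary from run to run.

-- Optional variation: sample roughly 1 percent of the table and return only a few columns
SELECT id, name, category FROM `thelook_ecommerce.products` TABLESAMPLE SYSTEM (1 PERCENT);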

Click Check my progress to verify that you have completed this task correctly.

Sample a BigQuery table using TABLESAMPLE

Task 6. Explore the order_items table

In this task, you'll explore the data contained in the order_items table.

  1. Create a new query; copy the following query into the Query Editor:
SELECT * FROM `thelook_ecommerce.order_items` LIMIT 10;

This query returns the first 10 rows of the order_items table.

  2. Click Run.

  3. Create a new query; copy the following query into the Query Editor:

SELECT status, COUNT(*) AS total_orders FROM `thelook_ecommerce.order_items` GROUP BY status;

This query returns the count of order items for each order status.

  4. Click Run.

  5. Create a new query; copy the following query into the Query Editor:

SELECT user_id, SUM(sale_price) AS total_amount FROM `thelook_ecommerce.order_items` GROUP BY user_id ORDER BY total_amount DESC LIMIT 1;

This query returns the user ID associated with the greatest total order value.

  6. Click Run.
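
To tie this exploration back to Meredith’s concern about return counts, you could optionally focus on returned items only. The sketch below assumes the order_items table has a product_id column and that 'Returned' is one of its status values; use the status query above and the table schema to confirm both before running it.

-- Optional: count only the order items marked as returned, grouped by product
SELECT product_id, COUNT(*) AS return_count FROM `thelook_ecommerce.order_items` WHERE status = 'Returned' GROUP BY product_id ORDER BY return_count DESC LIMIT 10;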

Click Check my progress to verify that you have completed this task correctly.

Explore the order_items table

Conclusion

Great work!

You’ve successfully explored the data and identified data quality issues in the sales data. This is a great first step in making sure that the sales data used for decision-making is optimized.

First, you explored the data using limiting to return a limited number of results.

Second, you identified duplicate rows using the COUNT(DISTINCT name) aggregation technique.

Third, you identified the number of items per category in the products table using GROUP BY.

Fourth, you filtered the data using GROUP BY and HAVING.

Fifth, you sampled a table to return a random subset of the data.

Finally, you explored the data contained in the order_items table.

You’re well on your way to understanding how to use data limiting, sampling, and aggregation to better understand your data and your transformation needs.

End your lab

Before you end the lab, make sure you’re satisfied that you’ve completed all the tasks. When you're ready, click End Lab and then click Submit.

Ending the lab will remove your access to the lab environment, and you won’t be able to access the work you've completed in it again.

Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.