$2,999

Data Engineering on Google Cloud Platform, Toronto

Event Information

Date and Time

Location

Toronto

Canada

Event description

Description

This four-day instructor-led class provides participants with a hands-on introduction to designing and building data processing systems on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, participants will learn how to design data processing systems, build end-to-end data pipelines, analyze data, and carry out machine learning. The course covers structured, unstructured, and streaming data.

Objectives

This course teaches participants the following skills:

  • Design and build data processing systems on Google Cloud Platform
  • Process batch and streaming data by implementing autoscaling data pipelines on Cloud Dataflow
  • Derive business insights from extremely large datasets using Google BigQuery
  • Train, evaluate, and predict with machine learning models using TensorFlow and Cloud ML
  • Leverage unstructured data using Spark and ML APIs on Cloud Dataproc
  • Enable instant insights from streaming data

Prerequisites

To get the most out of this course, participants should have:

  • Completed the Google Cloud Fundamentals: Big Data and Machine Learning course OR have equivalent experience
  • Basic proficiency with a common query language such as SQL
  • Experience with data modeling and extract, transform, load (ETL) activities
  • Experience developing applications using a common programming language such as Python
  • Familiarity with Machine Learning and/or statistics

Audience

This class is intended for experienced developers who are responsible for managing big data transformations including:

  • Extracting, loading, transforming, cleaning, and validating data
  • Designing pipelines and architectures for data processing
  • Creating and maintaining machine learning and statistical models
  • Querying datasets, visualizing query results and creating reports

Course Outline

Module 1: Serverless data analysis with BigQuery

  • What is BigQuery
  • Advanced Capabilities
  • Performance and pricing
  • Lab: Queries and Functions
  • Lab: Load and Export data

Module 2: Serverless, autoscaling data pipelines with Dataflow

  • Introduction to Dataflow and capabilities
  • Lab: Data pipeline
  • Lab: MapReduce in Dataflow
  • Lab: Side inputs
  • Lab: Streaming

Module 3: Getting started with Machine Learning

  • What is machine learning (ML)
  • Effective ML: concepts, types
  • Evaluating ML
  • ML datasets: generalization
  • Lab: Explore and create ML datasets

Module 4: Building ML models with TensorFlow

  • Getting started with TensorFlow
  • Lab: Using tf.learn
  • TensorFlow graphs and loops + lab
  • Lab: Using low-level TensorFlow + early stopping
  • Monitoring ML training
  • Lab: Charts and graphs of TensorFlow training

Module 5: Scaling ML models with Cloud ML

  • Why Cloud ML?
  • Packaging up a TensorFlow model
  • End-to-end training
  • Lab: Run an ML model locally and in the cloud

Module 6: Feature Engineering

  • Creating good features
  • Transforming inputs
  • Synthetic features
  • Preprocessing with Cloud ML
  • Lab: Feature engineering

Module 7: ML architectures

  • Wide and deep
  • Image analysis
  • Lab: Custom image classification with transfer learning
  • Embeddings and sequences
  • Recommendation systems

Module 8: Google Cloud Dataproc Overview

  • Introducing Google Cloud Dataproc
  • Creating and managing clusters
  • Defining master and worker nodes
  • Leveraging custom machine types and preemptible worker nodes
  • Creating clusters with the Web Console
  • Scripting clusters with the CLI
  • Using the Dataproc REST API
  • Dataproc pricing
  • Scaling and deleting clusters
  • Lab: Creating Hadoop Clusters with Google Cloud Dataproc

Module 9: Running Dataproc Jobs

  • Controlling application versions
  • Submitting jobs
  • Accessing HDFS and GCS
  • Hadoop
  • Spark and PySpark
  • Pig and Hive
  • Logging and monitoring jobs
  • Accessing master and worker nodes with SSH
  • Working with PySpark REPL (command-line interpreter)
  • Lab: Running Hadoop and Spark Jobs with Dataproc

Module 10: Integrating Dataproc with Google Cloud Platform

  • Initialization actions
  • Programming Jupyter/Datalab notebooks
  • Accessing Google Cloud Storage
  • Leveraging relational data with Google Cloud SQL
  • Reading and writing streaming data with Google Bigtable
  • Querying data from Google BigQuery
  • Making Google API Calls from notebooks
  • Lab: Big Data Analysis with Dataproc

Module 11: Making Sense of Unstructured Data with Google’s Machine Learning APIs

  • Google’s Machine Learning APIs
  • Common ML Use Cases
  • Vision API
  • Natural Language API
  • Translate
  • Speech API
  • Lab: Adding Machine Learning Capabilities to Big Data Analysis

Module 12: Need for real-time streaming analytics

  • What is Streaming Analytics?
  • Use-cases
  • Batch vs Streaming (Real-time)
  • Related terminologies
  • GCP products that help build for high availability, resiliency, and high-throughput, real-time streaming analytics (review of Pub/Sub and Dataflow)
  • Lab: Setup project, enable APIs, setup storage

Module 13: Architecture of streaming pipelines

  • Streaming architectures and considerations
  • Choosing the right components
  • Lab: Explore the dataset
  • Windowing
  • Streaming aggregation
  • Events, triggers
  • Lab: Create architecture reference

Module 14: Stream data and events into Pub/Sub

  • Topics and Subscriptions
  • Publishing events into Pub/Sub
  • Lab: Streaming data ingest into Pub/Sub
  • Subscribing options: Push vs Pull
  • Alerts

Module 15: Build a stream processing pipeline

  • Pipelines, PCollections and Transforms
  • Windows, Events, and Triggers
  • Aggregation statistics
  • Streaming analytics with BigQuery
  • Low-volume alerts
  • Lab: alerting scenario for anomalies

Module 16: High throughput and low-latency with Bigtable

  • Latency considerations
  • Lab: create streaming data processing pipelines with Dataflow
  • What is Bigtable
  • Designing row keys
  • Performance considerations
  • Lab: high-volume event processing

Module 17: Real-time dashboards with Google Data Studio

  • What is Google Data Studio?
  • From data to decisions
  • Lab: build a real-time dashboard to visualize processed data