Spark and Delta Lake Workshop
Abstract
This 2-day workshop teaches you what Apache Spark™ and Delta Lake are and how to use them in your data architectures to build reliable, large-scale distributed data pipelines. The course shows how Delta Lake, alongside Spark SQL and Spark Structured Streaming, brings ACID transactions and time travel (data versioning) to your batch and streaming ETL workloads. Slides, demos, exercises, and Q&A sessions together will help you understand the concepts behind the modern data lakehouse architecture.
Motivation
Whether you are new to data analytics and data science or a seasoned practitioner, you know that working with large amounts of data is a critical need for businesses today. For the first time, SF Bay ACM is partnering with Databricks to bring you this exciting workshop on Apache Spark and Delta Lake. Together, these two technologies put the power of petabytes of data at your fingertips.
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
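To give a flavor of how this looks in code, here is a minimal PySpark sketch (the path and column name are illustrative, and a Delta-enabled Spark environment such as Databricks Community Edition is assumed):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Delta Lake connector available
# (Databricks notebooks provide one as `spark`; locally, install delta-spark).
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Write a small DataFrame as a Delta table (the path is a placeholder).
df = spark.range(0, 5).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Read it back with the same DataFrame API used for Parquet, CSV, etc.
spark.read.format("delta").load("/tmp/events_delta").show()
```

The sketches after each module below follow the same pattern and reuse this `spark` session.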
Sponsorship
Databricks is partially sponsoring this event, giving us a rare opportunity to offer this professional development activity at a significantly reduced price. Check below for details.
NOTE: While this is a virtual class, we will cap attendance at classroom size so that there is a strong focus on learning. There is a nominal charge for the 6 hours of lecture - please sign up early, as we will keep the attendee count low. This is NOT a MOOC. Registration also includes a 1-year SFBay ACM membership ($20 value).
Content: You will have access to all the notebooks and training material used in the hands-on workshop.
Who is the course for?
- Solution Architects
- Data Engineers
- Data Scientists
Structure
- Six 55-min modules (10-min break between modules)
- 15-min talk / 20-min lab / 15-min Q&A / 5-min buffer
Requirements
- Sign up for Databricks Community Edition
- Experience with SQL and Python
Saturday - Day 1: 10am-11:30am, Pacific Time
Module 1: The Fundamentals of Apache Spark
- Introduction to Databricks Community Edition
- Loading and saving datasets (/databricks-datasets) [SQL]
- Basic DataFrame Transformations [SQL]
- Working with Spark tables [SQL]
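Below is a minimal sketch of the kind of exercise Module 1 works through; the CSV path and column names are illustrative placeholders, not the actual workshop dataset:

```python
# Load a sample dataset, apply basic DataFrame transformations, and
# register it as a Spark table that can be queried with SQL.
# The path and column names are placeholders.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/databricks-datasets/some-sample/data.csv"))

# Basic transformations: select, rename, filter.
subset = (df.select("city", "state", "population")
            .withColumnRenamed("population", "pop")
            .filter("pop > 1000000"))

# Save as a managed Spark table and query it with SQL.
subset.write.mode("overwrite").saveAsTable("big_cities")
spark.sql("SELECT * FROM big_cities ORDER BY pop DESC").show()
```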
Module 2: Intermediate Spark SQL
- Aggregations [SQL]
- Joins [SQL]
- Basics of web UI
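A sketch of the aggregations and joins covered in Module 2; the tables and columns are made up for illustration:

```python
# Toy tables for illustrating GROUP BY aggregations and joins.
orders = spark.createDataFrame(
    [(1, "alice", 30.0), (2, "bob", 45.0), (3, "alice", 12.5)],
    ["order_id", "customer", "amount"])
customers = spark.createDataFrame(
    [("alice", "CA"), ("bob", "NY")],
    ["customer", "state"])
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

# Aggregation: order count and total amount per customer.
spark.sql("""
    SELECT customer, COUNT(*) AS num_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
""").show()

# Join: enrich each order with the customer's state.
spark.sql("""
    SELECT o.order_id, o.amount, c.state
    FROM orders o JOIN customers c ON o.customer = c.customer
""").show()
```

The Spark web UI (also covered in this module) shows the jobs, stages, and query plans these statements produce.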
Module 3: Advanced Spark SQL
- Windowed Aggregation [SQL]
- Introduction to Spark Structured Streaming [Python, SQL]
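A sketch in the spirit of Module 3: a windowed aggregation over the toy orders table from the previous sketch, followed by a minimal Structured Streaming query using Spark's built-in rate source (which simply generates test rows):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Windowed aggregation: rank each customer's orders by amount.
orders = spark.createDataFrame(
    [(1, "alice", 30.0), (2, "bob", 45.0), (3, "alice", 12.5)],
    ["order_id", "customer", "amount"])
w = Window.partitionBy("customer").orderBy(F.desc("amount"))
orders.withColumn("rank", F.row_number().over(w)).show()

# Minimal Structured Streaming query: count generated rows per 10-second
# window and keep the running result in an in-memory table for inspection.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()
query = (counts.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("rate_counts")
         .start())
# spark.sql("SELECT * FROM rate_counts").show()  # inspect, then query.stop()
```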
Sunday - Day 2: Delta Lake
Module 4: Introduction to Delta Lake
- Bringing Reliability to Data Lakes (Concepts)
- Convert existing tables to Delta Lake [SQL]
- Unified Batch and Streaming [Python, SQL]
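A sketch of the Module 4 ideas: converting an existing Parquet table to Delta in place, then using the same Delta table as both a batch source and a streaming source. The paths and the `event_type` column are illustrative placeholders:

```python
# Convert an existing Parquet directory to a Delta table in place.
spark.sql("CONVERT TO DELTA parquet.`/tmp/events_parquet`")

# The same table now serves batch queries ...
batch_df = spark.read.format("delta").load("/tmp/events_parquet")
batch_df.groupBy("event_type").count().show()

# ... and acts as a streaming source, so batch and streaming jobs
# share one copy of the data instead of two separate pipelines.
stream_df = spark.readStream.format("delta").load("/tmp/events_parquet")
query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/events_checkpoint")
         .start("/tmp/events_out"))
```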
Module 5: DML and Schema
- Create, Insert, Update, Delete, Merge
- Schema Enforcement and Evolution
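A sketch of the Delta DML covered in Module 5, issued as SQL through PySpark; the table and column names are illustrative:

```python
# Full DML on a Delta table: create, insert, update, delete, merge.
spark.sql("CREATE TABLE IF NOT EXISTS customers (id INT, name STRING) USING DELTA")
spark.sql("INSERT INTO customers VALUES (1, 'alice'), (2, 'bob')")
spark.sql("UPDATE customers SET name = 'alicia' WHERE id = 1")
spark.sql("DELETE FROM customers WHERE id = 2")

# MERGE: upsert a batch of changes into the table.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW updates AS
    SELECT * FROM VALUES (1, 'alice'), (3, 'carol') AS t(id, name)
""")
spark.sql("""
    MERGE INTO customers c
    USING updates u ON c.id = u.id
    WHEN MATCHED THEN UPDATE SET c.name = u.name
    WHEN NOT MATCHED THEN INSERT (id, name) VALUES (u.id, u.name)
""")

# Schema enforcement rejects writes whose schema does not match the table;
# schema evolution is an explicit, opt-in change.
spark.sql("ALTER TABLE customers ADD COLUMNS (state STRING)")
```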
Module 6: SQL and the Transaction Log
- Delta Lake SQL
- Time Travel
- Transaction Log Fundamentals
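A sketch of time travel and transaction-log inspection in the spirit of Module 6, reusing the `customers` table from the Module 5 sketch (the version number and timestamp are placeholders that must fall within the table's actual history):

```python
# Time travel: query an earlier snapshot of a Delta table by version or timestamp.
spark.sql("SELECT * FROM customers VERSION AS OF 0").show()
spark.sql("SELECT * FROM customers TIMESTAMP AS OF '2022-01-26'").show()

# The transaction log records every commit to the table;
# DESCRIBE HISTORY exposes it as a queryable table of operations.
spark.sql("DESCRIBE HISTORY customers").show(truncate=False)
```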
Organizer & SFBay ACM Prof Dev Chair: Yashesh Shroff @yashroff
For more information about registration, please contact the SF Bay Chapter of the ACM: yshroff at g | m | a i l
We look forward to seeing you at the workshop!