Note that this is the Day 1 content of the Databricks Apache Spark for Machine Learning and Data Science (Spark 301) course.
The course is written using Scala 2.11, Python 2.x, and Spark 2.0. All hands-on labs run on Databricks Community Edition, a free cloud-based Spark environment. This lets participants spend their time using open-source Apache Spark to solve real problems rather than wrestling with the complexities of setting up a Spark cluster. Labs can easily be ported to run on open-source Apache Spark after class.
This class is designed as an intermediate Spark course for engineers, data scientists, and analysts with a few months or less of Spark experience.
All participants need to bring a laptop with an up-to-date version of Chrome or Firefox (Internet Explorer and Safari are not supported). Participants should familiarize themselves with basic Scala syntax before the training.
General Apache Spark
- Improve performance through judicious caching and by applying best practices.
- Troubleshoot slow-running DataFrame queries using explain plans and the Spark UI (a caching and explain-plan sketch follows this list).
- Visualize how jobs are broken into stages and tasks and executed within Spark.
- Troubleshoot errors and program crashes using executor logs, driver stack traces, and local-mode runtimes.
- Troubleshoot Spark jobs using the administration UIs and logs inside Databricks.
- Find answers to common Spark and Databricks questions using the documentation and other resources.
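To make the caching and explain-plan items concrete, here is a minimal PySpark sketch. It assumes a SparkSession named `spark` (provided automatically in Databricks notebooks), and the dataset path is illustrative:

```python
# A minimal sketch: caching a reused DataFrame and inspecting a query plan.
# Assumes a SparkSession named `spark`; the JSON path below is illustrative.
events = spark.read.json("/databricks-datasets/structured-streaming/events/")

# Cache a DataFrame that several queries will reuse; the first action
# materializes it in memory.
events.cache()
events.count()  # action that triggers the caching

# Print the query plans for a slow query before digging into the Spark UI.
agg = events.groupBy("action").count()
agg.explain(True)  # parsed, analyzed, optimized, and physical plans
```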
Extracting, Processing and Analysing Data
- Extract, transform, and load (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames (an ETL and RDD sketch follows this list).
- Extract structured data from unstructured sources by parsing with Datasets (where possible) or RDDs (otherwise), using transformations and actions (map, flatMap, filter, reduce, reduceByKey).
- Extend the capabilities of DataFrames using user-defined functions (UDFs and UDAFs) in Python and Scala.
- Resolve missing fields in DataFrame rows using filtering and imputation (a UDF and imputation sketch also follows).
- Apply best practices for data analytics using Spark.
- Perform exploratory data analysis (EDA) using DataFrames and Datasets (an EDA sketch follows as well) to:
  - Compute descriptive statistics
  - Identify data quality issues
  - Better understand a dataset
- Integrate visualizations into a Spark application using Databricks and popular visualization libraries (d3, ggplot, matplotlib).
- Develop dashboards that provide “at-a-glance” summaries and reports.
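As promised above, here is a minimal PySpark sketch of the ETL and RDD objectives. All paths and column names are illustrative; `spark` and `sc` are the SparkSession and SparkContext provided in Databricks notebooks:

```python
# Extract-transform-load with DataFrames (paths and columns illustrative).
df = spark.read.json("/path/to/input.json")                # extract
clean = df.filter(df["userId"].isNotNull())                # transform
clean.write.mode("overwrite").parquet("/path/to/output")   # load

# Falling back to the RDD API to parse unstructured text: a word count
# built from flatMap, map, and reduceByKey.
lines = sc.textFile("/path/to/logs.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.take(5)  # action: return the first five (word, count) pairs
```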
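Similarly, a sketch of UDFs and missing-value handling. The tiny DataFrame, column names, and the normalization UDF are made up for illustration:

```python
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Toy data with missing fields (illustrative).
df = spark.createDataFrame(
    [("alice", "us", 34.0), ("bob", None, None), ("carol", "nz", 29.0)],
    ["userId", "country", "age"])

# A hypothetical UDF that normalizes country codes to upper case.
normalize = udf(lambda c: c.upper() if c else None, StringType())
df = df.withColumn("country", normalize(col("country")))

# Filtering vs. imputation for missing values.
df = df.na.drop(subset=["userId"])          # drop rows missing the key field
mean_age = df.selectExpr("avg(age)").first()[0]
df = df.na.fill({"age": mean_age})          # simple mean imputation
```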
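And a few one-liners for the EDA objectives, run here against the toy DataFrame `df` from the previous sketch but applicable to any DataFrame:

```python
from pyspark.sql.functions import col, sum as sql_sum

df.printSchema()                 # column names and types
df.describe("age").show()        # count, mean, stddev, min, max
df.groupBy("country").count().orderBy("count", ascending=False).show()

# Data quality check: count nulls per column.
df.select([sql_sum(col(c).isNull().cast("int")).alias(c)
           for c in df.columns]).show()
```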
Day 1 Agenda
• Spark Intro and Ecosystem
• Lab: Getting connected and learning the environment
• RDDs, DAGs, Executors, and Spark Architecture
• Lab: Extract-Transform-Load Operations (Map transformation)
• DataFrames and Spark SQL
• Lab: Exploring data with Spark SQL + Simple Visualizations
• Lab: DataFrames
• Spark Machine Learning (DataFrame-based Pipelines and the legacy RDD API)
• Lab: Linear Regression with Spark MLlib Pipelines (a minimal Pipeline sketch follows this agenda)
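As a taste of the final lab, here is a minimal spark.ml Pipeline for linear regression. The toy data is made up for illustration; the actual lab uses course datasets:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Toy training data (illustrative only).
data = spark.createDataFrame(
    [(1.0, 2.0, 5.1), (2.0, 1.0, 6.9), (3.0, 4.0, 13.2)],
    ["x1", "x2", "label"])

# Assemble raw columns into a feature vector, then fit a linear model.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(data)                     # returns a PipelineModel
model.transform(data).select("label", "prediction").show()
```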
Servian is a Databricks Consulting Partner providing advisory, consulting, and managed services in Apache Spark™ across Australia and New Zealand.
As the exclusive Databricks-certified Training Partner in the region, Servian offers both public and private corporate classes on Apache Spark™. Spark classes offered by Servian are delivered by Databricks-certified instructors using Databricks course material.