Leapfrog your competition, gain Apache Spark 2.0 skills.
The newly released Spark 2.0 enables workshop participants to build unified big data applications that combine machine learning, batch, streaming, and interactive analytics on all their datasets. With Spark, developers write sophisticated distributed, parallel applications that deliver faster and better decisions and real-time actions across a wide variety of use cases, architectures, and industries.
Gain a Competitive Advantage from Ecosystem Mastery
Apache Spark 2.0 is the next-generation successor to Hadoop MapReduce. Spark is a powerful, open source processing engine for large datasets, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs.
Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Spark ecosystem, learning topics such as:
- Using the Spark shell for interactive data analysis
- Understanding how Spark parallelizes task execution
- Working with RDDs, DataFrames, and Datasets
- Writing Spark applications in Scala
- Running Spark with cluster managers, e.g. Spark Standalone and Hadoop YARN
- Applying Machine Learning to data at rest and in motion
- Processing streaming data with Spark
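As a taste of the interactive analysis covered above, a minimal Spark shell session might look like the following sketch (the data is made up for illustration; in `spark-shell`, `spark` and `sc` are predefined):

```scala
// Launched with `spark-shell`: `spark` (SparkSession) and `sc` (SparkContext) exist already.

// RDD API: low-level, functional transformations over partitioned data
val rdd = sc.parallelize(1 to 100)
val evenSum = rdd.filter(_ % 2 == 0).sum()  // calling an action triggers distributed execution

// DataFrame API: higher-level and optimized by Spark's query planner
val df = spark.range(100).toDF("n")
df.filter("n % 2 = 0").count()
```

Both APIs express the same computation; the workshop contrasts when each is the better fit.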
Some of the hands-on labs will help you master the following Microsoft Azure architecture:
Audience & Prerequisites
This course is best suited for developers, data analysts, and engineers with prior experience in Scala, Java, R, or Python. Course examples and exercises are presented in Scala, so working knowledge of at least one of these programming languages is required.
See the instructor's Spark class notes at Mastering Apache Spark.
Early-bird fee (through Nov 15th): $999. Regular fee: $2500. For more information, email: firstname.lastname@example.org
Learning Outcomes
- Understand the benefits of Spark in the Big Data ecosystem
- Provision and configure your own Spark production cluster
- Master Spark fundamentals: RDDs, partitions, jobs, stages, and tasks
- Understand the roles of DAGScheduler, TaskScheduler, and SchedulerBackends
- Monitor Spark applications using Spark's web UI, SparkListeners, and log analysis
- Conduct data analytics using Spark SQL
- Train and run Machine Learning models using Spark MLlib
- Manage streaming data using Spark Streaming
- Learn about security and the RPC communication layer to integrate Python, R, and other applications
WORKSHOP Day 1
The Elements of Apache Spark’s Architecture – 4 hours
LEARNING / HANDS-ON (depending on the audience)
- RDDs, DataFrames, Datasets (Spark 2.0), and Structured Streaming (Spark 2.0)
- Jobs, Stages, Tasks, Shuffling
- DAGScheduler, TaskScheduler and SchedulerBackends
- Spark Modules (Spark SQL and Spark MLlib, less about Spark Streaming and Spark GraphX)
- Spark and cluster managers – Hadoop YARN, Apache Mesos and Spark Standalone
- Deployment Modes (client vs cluster)
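To connect a few of the architecture elements above in one sketch (assuming a running `spark-shell` where `sc` is predefined): each action submits a job, the DAGScheduler splits the job into stages at shuffle boundaries, and each stage runs as parallel tasks, one per partition.

```scala
// Three partitions, so the first stage runs three parallel tasks
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)

// reduceByKey introduces a shuffle, so this lineage becomes two stages
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// The action below submits one job; the TaskScheduler hands its tasks
// to a SchedulerBackend (local threads, YARN, Mesos, or Standalone)
counts.collect().foreach(println)
```

Watching this job in the web UI makes the stage/task breakdown concrete, which is exactly what the monitoring module drills into.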
Monitoring Spark Apps (using web UI, SparkListeners and log analysis) – 3 hours
LEARNING / HANDS-ON
- web UI
- SparkListeners (including developing custom SparkListeners)
- Log analysis
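A custom SparkListener, one of the monitoring hooks listed above, can be sketched as follows (the class name and log messages are illustrative, not from the course materials):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

// Logs job and stage completions; register it with sc.addSparkListener(new JobLogger)
// or via the spark.extraListeners configuration property.
class JobLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stage.stageInfo.stageId} ran ${stage.stageInfo.numTasks} tasks")
}
```

The same events feed Spark's web UI, so a custom listener is a natural next step once the UI's numbers are understood.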
Spark Setup and Your First Spark Application - 1 hour
LEARNING / HANDS-ON
- Setting Up Deployment Environment
- Developing Spark SQL Applications using Scala, sbt, and IntelliJ IDEA
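A minimal sbt build for such an application might look like this sketch (the project name is made up, and the version numbers are indicative for the Spark 2.0 era):

```scala
// build.sbt
name := "spark-workshop"
scalaVersion := "2.11.8"

// spark-sql pulls in spark-core and the DataFrame/Dataset API
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
```

Importing this build into IntelliJ IDEA gives code completion for the whole Spark API, which the hands-on labs rely on.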
WORKSHOP Day 2
Introduction to Scala (and sbt) – 2 hours
- Developing Scala Applications using the Scala Standard API
- Working with Files
- Scala Collection API
- Customizing sbt projects (using plugins)
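A few of the Scala Collection API idioms this module builds on, as a quick sketch:

```scala
// Transform, filter, and aggregate with the standard collections
val xs = List(1, 2, 3, 4, 5)

val doubledEvens = xs.filter(_ % 2 == 0).map(_ * 2)  // List(4, 8)
val total = xs.foldLeft(0)(_ + _)                    // 15
val byParity = xs.groupBy(_ % 2 == 0)                // Map(false -> List(1, 3, 5), true -> List(2, 4))
```

These same combinators (`map`, `filter`, `groupBy`, folds) reappear almost verbatim in Spark's RDD and Dataset APIs, which is why the Scala module comes first.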
Developing Spark SQL Applications using Datasets – 3 hours
HANDS-ON (Based on Pre-Assessment)
- Working with Structured Datasets in CSV and JSON files
- Using Dataset API
- Using User-Defined Functions (UDF)
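The Dataset API and UDF usage from this module can be sketched as follows (the `Person` case class and its data are invented for illustration; CSV/JSON sources would use `spark.read.csv` or `spark.read.json` instead of a local collection):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

case class Person(name: String, age: Int)

val spark = SparkSession.builder.appName("datasets").master("local[*]").getOrCreate()
import spark.implicits._

// A strongly typed Dataset built from a local collection
val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()

// A user-defined function usable inside DataFrame/Dataset expressions
val shout = udf((s: String) => s.toUpperCase)
people.select(shout($"name").as("name"), $"age").show()
```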
Developing Machine Learning Pipelines using Spark MLlib - 3 hours
- Create your first ML Pipeline
- Train a Logistic Regression model
- Using Random Forest and other classification algorithms
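An ML Pipeline of the kind built in this module can be sketched like this (the column names and a `training` DataFrame are assumed for illustration):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Assume a DataFrame `training` with numeric columns "f1", "f2" and a "label" column
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val lr = new LogisticRegression().setMaxIter(10)

// Stages run in order: feature assembly, then model fitting
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model = pipeline.fit(training)  // a PipelineModel ready to score new data
```

Swapping `LogisticRegression` for `RandomForestClassifier` changes only the estimator stage, which is the point of the Pipeline abstraction.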
Spark Security
- Secured web UI
- Secured RPC
- Hadoop YARN
Early-bird fee (through Nov 15th): $999. Regular fee: $2500.
email: email@example.com or go to www.valueamplify.com
If the class doesn't reach a minimum number of participants, you will be refunded.
Developer and trainer for Apache Spark, Scala, sbt, Hadoop YARN with some experience in Apache Kafka, Apache Hive, Apache Mesos, Akka, and Docker.
See the Spark Class Notes at Mastering Apache Spark.
- Jacek's contributions to Apache Spark 2.0 – https://github.com/apache/spark/commits?author=jaceklaskowski
- Speaker at Spark Summit in NYC 2016 – https://spark-summit.org/east-2016/speakers/jacek-laskowski/
- Submitted 5 talks about Apache Spark 2.0 to Spark Summit Europe – https://spark-summit.org/eu-2016/
(e.g. Spark MLlib without Machine Learning (theory) for Scala Developers -- Transformers, Estimators, and Pipelines)
Spark & Scala Workshops taught
- Plymouth Meeting
Feedback from the class
Adj. Prof. Giuseppe Mascarella,
16 years at Microsoft, manager and trainer of MTC and MCS
Teaches Social Media Analytics at FAU (Florida Atlantic University).
MS in Industrial Engineering, Major: Statistical Quality Control
Taught Machine Learning Recommenders, Churn and Predictive Maintenance at SQL PASS Analytics events.