Hands-On Introduction to the New Apache Spark 2.0 (On-Site)
Event Information
Description
Leapfrog your competition: gain Apache Spark 2.0 skills.
The newly released Spark 2.0 enables workshop participants to build unified big data applications, combining machine learning, batch, streaming, and interactive analytics on all their datasets. With Spark, developers can write sophisticated distributed and parallel applications that drive faster, better decisions and real-time actions across a wide variety of use cases, architectures, and industries.
Gain Competitive Advantage from Ecosystem Mastery
Apache Spark 2.0 is the next-generation successor to Hadoop MapReduce. Spark is a powerful, open source processing engine for large datasets, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs.
Hands-On Practice
Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Spark ecosystem, learning topics such as (see the spark-shell sketch after this list):
- Using the Spark shell for interactive data analysis
- How Spark parallelizes task execution
- RDDs, DataFrames, and Datasets
- Writing Spark applications in Scala
- How Spark runs with cluster managers, e.g. Spark Standalone and Hadoop YARN
- Applying machine learning to data at rest and in motion
- Processing streaming data with Spark
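To give a feel for the interactive style of the exercises, here is a minimal spark-shell sketch; the "people.csv" file and its columns are illustrative, and `spark` and `$` are predefined in the shell:

```scala
// Load a CSV file into a DataFrame and explore it interactively.
// Run inside spark-shell, where `spark` (a SparkSession) is predefined.
val people = spark.read
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // let Spark guess column types
  .csv("people.csv")              // hypothetical input file

people.printSchema()
people.groupBy("city").count().orderBy($"count".desc).show()
```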
Audience & Prerequisites
This course is best suited for developers, data analysts, and engineers with some prior programming experience, for example in R or Python. Course examples and exercises are presented in Scala, so working knowledge of a programming language is required. Other students will still benefit from the class, but for a few exercises they should be comfortable taking the observer role in pair-programming style.
See the instructor's Spark class notes at Mastering Apache Spark.
Registration
Early bird (until Sept 6th): $999. Regular fee: $2,500. For more info, email registration@valueamplify.com
_________________________________________________________________________________
Takeaways
- Understanding of the benefits of Spark in the big data ecosystem
- Ability to provision and configure your own Spark production cluster
- Mastery of Spark fundamentals: RDDs, partitions, jobs, stages, and tasks
- Understanding of the roles of the DAGScheduler, TaskScheduler, and SchedulerBackends
- Monitoring of Spark applications using Spark’s web UI, SparkListeners, and log analysis
- Data analytics using Spark SQL
- Training and running machine learning models using Spark MLlib
- Managing streaming data using Spark Streaming
- Knowledge of the security and RPC communication layers that let Python, R, and other applications work with Spark
_________________________________________________________________________________________
WORKSHOP Day 1
The Elements of Apache Spark’s Architecture – 4 hours
LEARNING / HANDS-ON (depending on the audience; a jobs-and-stages sketch follows this list)
- RDD and DataFrame, Dataset (Spark 2.0), Structured Streaming (Spark 2.0)
- Jobs, Stages, Tasks, Shuffling
- DAGScheduler, TaskScheduler and SchedulerBackends
- Spark Modules (Spark SQL and Spark MLlib, less about Spark Streaming and Spark GraphX)
- Spark and cluster managers – Hadoop YARN, Apache Mesos and Spark Standalone
- Deployment Modes (client vs cluster)
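As a taste of how jobs, stages, tasks, and shuffling relate, here is a minimal spark-shell sketch; the numbers are arbitrary, and `sc` is the SparkContext predefined in the shell:

```scala
// One action = one job. The shuffle introduced by reduceByKey splits
// the job into two stages, each made of per-partition tasks.
val nums = sc.parallelize(1 to 100, 4)                  // 4 partitions => 4 tasks per stage
val sums = nums.map(n => (n % 3, n)).reduceByKey(_ + _) // shuffle boundary => new stage
sums.collect().foreach(println)                         // triggers the job; inspect it in the web UI
```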
Monitoring Spark Apps (using the web UI, SparkListeners, and log analysis) – 3 hours
LEARNING / HANDS-ON (a custom SparkListener sketch follows this list)
- web UI
- SparkListeners (including developing custom SparkListeners)
- Log analysis
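Below is a minimal sketch of a custom SparkListener of the kind developed in class; the class name and log messages are illustrative:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

// Logs every completed stage and finished job to standard output.
class StageLoggingListener extends SparkListener {
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
    println(s"Stage ${event.stageInfo.stageId} completed with ${event.stageInfo.numTasks} tasks")

  override def onJobEnd(event: SparkListenerJobEnd): Unit =
    println(s"Job ${event.jobId} ended with result ${event.jobResult}")
}

// Register on a live SparkContext, e.g. in spark-shell:
//   sc.addSparkListener(new StageLoggingListener)
```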
Spark Setup and Your First Spark Application – 1 hour
LEARNING / HANDS-ON (a minimal project sketch follows this list)
- Setting Up Deployment Environment
- Developing Spark SQL Applications using Scala, sbt, and IntelliJ IDEA
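For orientation, here is a minimal sketch of the kind of project built in this module; the project name and version numbers are illustrative:

```scala
// build.sbt
name := "first-spark-app"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
```

```scala
// src/main/scala/FirstSparkApp.scala
import org.apache.spark.sql.SparkSession

object FirstSparkApp extends App {
  val spark = SparkSession.builder
    .appName("FirstSparkApp")
    .master("local[*]")  // run locally while developing in IntelliJ IDEA
    .getOrCreate()
  import spark.implicits._

  Seq(("spark", "2.0"), ("scala", "2.11")).toDF("name", "version").show()

  spark.stop()
}
```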
________________________________________________________________________________________
WORKSHOP Day 2
Introduction to Scala (and sbt) – 2 hours
HANDS-ON (a Scala collections sketch follows this list)
- Developing Scala Applications with the Scala Standard API
- Working with Files
- Scala Collection API
- Customizing sbt projects (using plugins)
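A minimal sketch of the style of Scala exercise in this module; the input file name is just an example:

```scala
import scala.io.Source

// Read a text file and summarize word lengths with the Collection API.
object WordLengths extends App {
  val words = Source.fromFile("build.sbt")  // example input file
    .getLines()
    .flatMap(_.split("\\s+"))
    .filter(_.nonEmpty)
    .toSeq

  val byLength = words.groupBy(_.length).mapValues(_.size)
  byLength.toSeq.sortBy(_._1).foreach { case (len, count) =>
    println(s"$len-character words: $count")
  }
}
```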
Developing Spark SQL Applications using Datasets – 3 hours
HANDS-ON (based on pre-assessment; a Dataset and UDF sketch follows this list)
- Working with Structured Datasets in CSV and JSON files
- Using Dataset API
- Using User-Defined Functions (UDF)
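A minimal sketch of the Dataset and UDF exercises; the "people.json" path and the Person fields are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

final case class Person(name: String, age: Long)

object DatasetsAndUDFs extends App {
  val spark = SparkSession.builder.appName("DatasetsAndUDFs").master("local[*]").getOrCreate()
  import spark.implicits._

  // Read a JSON file into a strongly typed Dataset[Person]
  val people = spark.read.json("people.json").as[Person]

  // A user-defined function that buckets ages into coarse groups
  val ageGroup = udf((age: Long) => if (age < 30) "young" else "senior")

  people.withColumn("group", ageGroup($"age")).show()

  spark.stop()
}
```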
________________________________________
WORKSHOP Day 3
Developing Machine Learning Pipelines using Spark MLlib – 3 hours
HANDS-ON (a Pipeline sketch follows this list)
- Creating your first ML Pipeline
- Training a Logistic Regression model
- Using Random Forest and other classification algorithms
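A minimal sketch of an ML Pipeline with Logistic Regression in the spirit of these exercises; the toy data and column names are illustrative:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object FirstPipeline extends App {
  val spark = SparkSession.builder.appName("FirstPipeline").master("local[*]").getOrCreate()
  import spark.implicits._

  // Tiny toy training set: two numeric features and a binary label
  val training = Seq(
    (1.0, 0.0, 0.0),
    (0.0, 1.5, 1.0),
    (1.2, 0.3, 0.0),
    (0.1, 2.0, 1.0)
  ).toDF("f1", "f2", "label")

  val assembler = new VectorAssembler()
    .setInputCols(Array("f1", "f2"))
    .setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)

  // The Pipeline chains feature assembly and model training
  val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
  model.transform(training).select("features", "label", "prediction").show()

  spark.stop()
}
```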
Spark Security (a configuration sketch follows this list)
- Secured web UI
- Secured RPC
- Hadoop YARN
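For reference, a minimal sketch of security-related settings covered here; values and paths are placeholders, and on YARN the authentication secret is generated automatically:

```scala
import org.apache.spark.SparkConf

// Security-related Spark properties, set programmatically for illustration
// (they are more commonly set in spark-defaults.conf).
val conf = new SparkConf()
  .set("spark.authenticate", "true")                   // shared-secret auth for the RPC layer
  .set("spark.ssl.enabled", "true")                    // TLS for the web UI and other endpoints
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")  // placeholder path
  .set("spark.ssl.keyStorePassword", "changeme")       // placeholder password
  .set("spark.acls.enable", "true")                    // enforce web UI view/modify ACLs
```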
______________________________________________________________________________________
Registration
Early bird (until Aug 30th): $999. Regular fee: $2,500.
Email: registration@valueamplify.com | www.valueamplify.com
If the class doesn't reach a minimum number of participants, you will be refunded.
__________________________________________________________________
Trainer
Jacek Laskowski
Developer of and trainer for Apache Spark, Scala, sbt, and Hadoop YARN, with some experience in Apache Kafka, Apache Hive, Apache Mesos, Akka, and Docker.
See the Spark Class Notes at Mastering Apache Spark.
- Jacek's contributions to Apache Spark 2.0 – https://github.com/apache/spark/commits?author=jaceklaskowski
- Speaker at Spark Summit in NYC 2016 – https://spark-summit.org/east-2016/speakers/jacek-laskowski/
- Submitted 5 talks about Apache Spark 2.0 to Spark Summit Europe – https://spark-summit.org/eu-2016/
(e.g. Spark MLlib without Machine Learning (theory) for Scala Developers -- Transformers, Estimators, and Pipelines)
Spark & Scala Workshops taught
- Toronto
- Mississauga
- Plymouth Meeting
- Montreal
- London
Feedback from the class
Notes from IMS Health employees in Plymouth Meeting, PA, on the Spark/Scala workshop: “Jacek, you are a great teacher.”
_____________________________________________________________________
Instructional Designer
Adj. Prof. Giuseppe Mascarella
giuseppe@valueamplify.com
16 years at Microsoft as a manager and trainer at the MTC and MCS
Teaching Social Media Analytics at FAU (Florida Atlantic University).
MS in Industrial Engineering, Major: Statistical Quality Control
Taught Machine Learning Recommenders, Churn and Predictive Maintenance at SQL PASS Analytics events.
Recording of a session on Recommenders
Recording of a session on customer churn predictions with Azure ML