Note: this covers Days 2 and 3 of the Databricks Apache Spark for Machine Learning and Data Science course (Spark 301).
This hands-on, 2-day Apache Spark training targets experienced data scientists wishing to perform data analysis at scale using Apache Spark. The course covers exploratory data analysis (EDA), building machine learning models, evaluating models, and performing cross-validation.
The course is written using Scala 2.11, Python 2.x and Spark 2.0. All hands-on labs run on Databricks Community Edition, a free cloud-based Spark environment. This allows participants to maximize their time using open source Apache Spark to solve real problems, rather than dealing with the complexity of setting up a Spark cluster. Labs can easily be ported to run on open source Apache Spark after class.
- Data scientists
- Software engineers with some machine learning background
All participants need to bring a laptop with an up-to-date version of Chrome or Firefox (Internet Explorer and Safari are not supported). Participants should familiarize themselves with basic Scala syntax before the training and have some understanding of machine learning.
- Learn to apply a variety of machine learning models, both supervised (regression and classification) and unsupervised (clustering).
- Train analytical models with Spark MLlib's DataFrame-based estimators, including linear regression, decision trees, logistic regression, and k-means.
- Use Spark MLlib transformers to pre-process a dataset prior to training, including standardization, normalization, one-hot encoding, and binarization.
- Build Spark MLlib Pipelines that chain transformations, estimators, and evaluation of analytical models into a single workflow.
- Evaluate model accuracy by dividing data into training and test datasets and computing metrics using Spark MLlib evaluators.
- Tune training hyper-parameters by integrating cross-validation into Spark MLlib Pipelines.
- Use RDD-based Spark MLlib functionality not present in the MLlib DataFrame API by converting DataFrames to RDDs and applying RDD transformations and actions. (Optional module)
- Troubleshoot and tune machine learning algorithms in Spark.
- Understand and build a general machine learning pipeline for Spark.
First Machine Learning Example
LDA Topic Modeling
Graphs, GraphX and GraphFrames
Servian is a Databricks Consulting Partner providing advisory, consulting and managed services in Apache Spark™ across Australia and New Zealand.
As the exclusive Databricks-certified training partner in the region, Servian offers both public and private corporate classes on Apache Spark™. Spark classes offered by Servian are delivered by Databricks-certified instructors using Databricks course material.