HDP Analyst: Data Science - Hortonworks Official Curriculum
COURSE OVERVIEW
This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, and scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.
COURSE CONTENT
DAY 1: AN INTRODUCTION TO HADOOP AND DATA SCIENCE
OBJECTIVES
- Using Hadoop for Data Science
- The Hadoop Distributed File System
- The MapReduce Framework
- Hadoop 2 and YARN
- Machine Learning from Data
LABS
- Setting up the Lab Environment
- Using HDFS Commands
- Demonstration: Understanding MapReduce
- Using Apache Mahout for Machine Learning
DAY 2: AN INTRODUCTION TO APACHE PIG AND PYTHON
OBJECTIVES
- Introduction to Apache Pig
- Python Programming
- Analyzing Data with Python
- Running Python on Hadoop
- Machine Learning Algorithms
LABS
- Getting Started with Apache Pig
- Using the IPython Notebook
- Demonstration: Understanding the NumPy Package
- Demonstration: The Pandas Library
- Performing Data Analysis with Python
- Interpolating Data Points
- Defining User Defined Functions in Python
- Streaming Python with Apache Pig
- Exploring Data with Apache Pig
- Demonstration: Classification with Scikit-Learn
- Computing K-Nearest Neighbor
- Generating a K-Means Clustering
DAY 3: MACHINE LEARNING ALGORITHMS
OBJECTIVES
- Machine Learning Algorithms Continued
- Natural Language Processing
- Apache Spark MLlib
- Taking Data Science to Production
LABS
- Demonstration: POS Tagging Using a Decision Tree
- Using the Python Natural Language Toolkit
- Classifying Text Using Naïve Bayes
- Using Spark Transformations and Actions
- Using Spark MLlib
- Creating a Spam Classifier Using Spark MLlib