Hadoop 3-Day Developer Course
The Hadoop 3-Day Developer Course gives you the essential skills to develop applications on Hadoop. Our goal is to make you productive as quickly as possible, while setting the stage for your future growth as a Hadoop developer.
This course gives you the essential grounding in the principles of Hadoop and its MapReduce computation model, HDFS for distributed file storage, the roles of other essential tools, and how to write applications effectively using these tools. If your team is new to Hadoop, you’ll learn the skills necessary to transition your existing data management applications over to Hadoop quickly and then start leveraging its capabilities for data analysis.
The following prerequisites ensure that you will gain the maximum benefit from the course.
- Programming experience: This is a developer's course. We will write Java, Hive, and Pig applications. Prior Java experience is strongly recommended.
- Linux shell experience: Basic Linux shell (bash) commands will be used extensively. Some prior experience is recommended.
- Experience with SQL databases: SQL experience is helpful for learning Hive and Pig, but not essential.
What You Must Bring
We will log into remote EMR clusters to build, test, and run our applications. You will also be provided with all the exercise software so you can view it on your laptop, if desired.
Bring your laptop with the following software installed in advance.
- JDK 1.6 or 1.7: The JDK (Java Development Kit), version 1.6 or newer, not just the JRE (Java Runtime Environment). (http://www.oracle.com/technetwork/java/javase/downloads/index.html)
- Ant: The Java-based Ant build tool, version 1.7 or newer, if you want to build and test the Java exercises on your laptop. (http://ant.apache.org/)
- A programmer’s source code editor: Whatever you prefer. Either Eclipse or IntelliJ IDEA is recommended for the Java exercises, and project files for both environments will be provided. You might find a separate programmer’s text editor more convenient for the Hive and Pig exercises.
What You Will Learn
Think Big Academy courses teach by doing: short lectures are interspersed with hands-on exercises. By the end of the course, you will have covered the following topics:
- Overview and Hadoop Architecture.
- Introduction to Amazon Elastic MapReduce
- Starting a Hadoop cluster in the cloud with EMR.
- Writing your first MapReduce job.
- Loading Data into the cluster.
- Data and Security.
- Hadoop Programming Models.
- Hadoop Streaming.
- Using Hadoop with Pig.
- Using Hadoop with Hive.
- Advanced Hadoop Features – UDFs, UDAFs.
- Hadoop Ecosystem.
The particular agenda for each day may be adjusted according to student interests, pace, and other considerations.
Introduction to Hadoop
- Introduction to Hadoop and the problems it solves.
- Hadoop's components, their roles, and how they work.
- Understanding and using HDFS and other file systems.
- The Hadoop ecosystem.
- Running Hadoop clusters in the cloud with Amazon EMR.
- Running “jobs” on Hadoop.
- Loading data into a Hadoop cluster.
- Exercise: Hadoop Walkthrough
Java MapReduce Programming
The Basics of Java MapReduce
- Data flow through a MapReduce application, using the classic Word Count algorithm.
- An overview of the Java MapReduce API and the anatomy of an application.
- Exercise: Java MapReduce development with Eclipse, unit testing with MRUnit, and running MapReduce jobs on Amazon EMR.
- Hadoop Streaming for writing map and reduce code in Ruby, Python, etc.
- Exercise: Word Count implemented using Hadoop Streaming.
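The Word Count data flow named above can be sketched as a local simulation. This is a minimal, single-process illustration in plain Python, not the Hadoop API or the course's exercise code; the function names `map_words`, `reduce_counts`, and `word_count` and the sample input are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_counts(word, counts):
    """Reduce phase: sum all counts that were grouped under one word."""
    return (word, sum(counts))


def word_count(lines):
    """Simulate map -> shuffle/sort -> reduce locally, in one process."""
    # Map: apply the mapper to every input record.
    pairs = [pair for line in lines for pair in map_words(line)]
    # Shuffle and sort: bring identical keys together, as the framework would.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the reducer once per distinct key.
    return [
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    ]


# Hypothetical sample input, for illustration only.
result = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(result)
```

On a real cluster the map and reduce steps run as separate tasks on different nodes and the framework performs the shuffle and sort between them; the sketch only shows the order in which the data moves.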
MapReduce Deep Dive
- Combiners for reducing IO overhead.
- Key-Value formats: WritableComparables and how to create your own.
- Partitioners and Comparators for custom sorting.
- The Secondary Sort algorithm.
- Exercise: Using the Secondary Sort algorithm.
- File formats: built-in formats and how to create your own custom formats. Compression options and the issue of splittable file formats.
- Distributed file systems: HDFS and others.
- Counters and logging: knowing what’s going on.
- Mapper reuse: composing mappers with
- Task scheduling: the queue, fair and capacity schedulers.
- The distributed cache and its use.
- Joins: map-side and reduce-side joins.
- Exercise: Indexing Twitter traffic.
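As a rough illustration of how partitioners and comparators combine in the Secondary Sort algorithm listed above, the following sketch simulates the idea in plain Python. It assumes a composite key of (natural key, value), partitions on the natural key only, sorts on the full composite key, and groups on the natural key. All names (`partition`, `secondary_sort`, the sensor readings) are hypothetical, and none of this is the Hadoop API.

```python
from itertools import groupby


def partition(natural_key, num_reducers):
    """Partitioner: use only the natural key, so every record for that key
    reaches the same (simulated) reducer. A simple deterministic hash."""
    return sum(natural_key.encode("utf-8")) % num_reducers


def secondary_sort(records, num_reducers=2):
    """records: iterable of (natural_key, value) pairs.
    Returns {natural_key: [values in ascending order]}."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        composite = (key, value)  # composite key carries the value to sort on
        partitions[partition(key, num_reducers)].append((composite, value))

    output = {}
    for part in partitions:
        # Sort comparator: order by the full composite key.
        part.sort(key=lambda kv: kv[0])
        # Grouping comparator: one reduce call per natural key.
        for natural, group in groupby(part, key=lambda kv: kv[0][0]):
            output[natural] = [value for _, value in group]
    return output


# Hypothetical sample input, for illustration only.
readings = [("sensor-b", 7), ("sensor-a", 3), ("sensor-b", 2),
            ("sensor-a", 9), ("sensor-a", 1)]
sorted_values = secondary_sort(readings)
print(sorted_values)
```

The point of the technique is that each reducer receives its values already ordered by the framework's sort, so it does not need to buffer and sort them in memory.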
Enterprise Application Considerations
- Monitoring applications.
- Scheduling workflows.
- ETL (extract, transform, and load) and data export techniques.
- Monitoring, profiling, debugging, and tuning applications.
Data Warehousing with Hive
- What is Hive and why would I use it?
- Exercise: Running Hive and basic queries.
- The Hive Query Language (HiveQL) by example.
- Running Exercises: Practice HiveQL concepts as they are introduced.
- Hive vs. Relational Databases.
- Extending Hive with user defined functions (UDFs).
- Exercise: Integrate a UDF into Hive.
- Extending Hive with new data formats.
- Exercise: Supporting a custom SerDe (record format) in Hive.
- Hive under the hood; understanding how Hive drives MapReduce.
- Notes on setting up the Hive Metadata repository.
- Hive tips and tricks.
- Exercise: Ngram analysis with Hive.
Data Flow Programming with Pig
- What is Pig and why would I use it?
- Pig for data flows vs. Hive for queries.
- Exercise: Running Pig and basic data flows.
- Pig Latin, the language of Pig, by example.
- Running Exercises: Practice Pig Latin concepts as they are introduced.
- Extending Pig with Java user defined functions (UDFs).
- Exercise: Extending a Pig application with UDFs.
- Pig under the hood; understanding how Pig drives MapReduce.
- Pig tips and tricks.
- Exercise: Pulling it all together: a complex data flow processing exercise using Pig.
- Recap of what we learned.
- Where to go from here: references and resources.