Think Big, Start Smart, Scale Fast
Hadoop Developer 3-Day Course
The Hadoop Developer 3-Day Course presented by Think Big Academy is designed to make you productive with Hadoop. The topics of this course reflect our experience building Hadoop analytics applications with clients. Once your Hadoop cluster is up and running, you can transition existing data management applications over to Hadoop. Explore new ways to use Hadoop and unstructured data and learn the comprehensive capabilities now available across the Hadoop ecosystem.
The following prerequisites ensure that you will gain the maximum benefit from the course.
- Programming experience: This is a developers' course. We will write Java, Hive, and Pig applications. Prior Java experience is recommended.
- Linux shell experience (recommended): Basic Linux shell (bash) commands will be used extensively.
- Experience with SQL databases (optional): SQL experience is useful for learning Hive and Pig, but not essential.
What You Must Bring
Bring your laptop with the following software installed.
- VMware or VirtualBox: A virtual machine (VM) will be provided for running the course exercises. Your laptop must have the free VirtualBox (https://www.virtualbox.org/) or a licensed copy of VMware (http://www.vmware.com/products/workstation/) installed to run the virtual machine.
- JDK 1.6: The JDK (Java Development Kit) version 1.6 or newer is required (not just the JRE - Java Runtime Environment). (http://www.oracle.com/technetwork/java/javase/downloads/index.html)
- Ant: The Java-based Ant build tool, version 1.7 or newer, will be used to build and test the exercises. (http://ant.apache.org/)
- A programmer's source code editor: Use whatever you prefer. Eclipse and IntelliJ IDEA are both recommended, and project files for both environments will be provided.
- Cygwin (Windows systems): Laptops running Windows must have the Cygwin environment installed, including the openssh package (discussed next). (http://www.cygwin.com/)
- Secure shell (ssh) software: for example, OpenSSH or a GUI secure shell application. Each computer will need the ability to use ssh to log into the provided virtual machine used for the course. The related scp (secure copy) command must also be available for transferring files between environments.
- The Academy courseware: Several days before the class, Think Big Academy will provide a courseware package including the virtual machine used for the course, the course exercises, installation instructions, etc. Please install this package before the first day of class!
What You Will Learn
Think Big Academy courses teach by doing: short lectures are interspersed with hands-on exercises. By the end of the course, you will have learned the following:
- The components of the Hadoop ecosystem and how they work together.
- The basics of installing and configuring Hadoop.
- The MapReduce programming model and the problems it solves.
- The Hadoop Distributed File System (HDFS).
- How to use the Hadoop command line tools and the web consoles.
- How to create Hadoop applications using the MapReduce Java and Streaming APIs.
- How to create Hadoop applications using Hive and Pig.
- How to create custom extensions for Hive and Pig and MapReduce.
- The roles of other Hadoop ecosystem tools, including HBase, data ingestion tools, data integration tools, job scheduling techniques, and monitoring options.
- Common pitfalls and problems, and how to avoid them.
- Lessons from real-world Hadoop implementations.
Agenda: Day 1
Introduction to Hadoop
- Introduction to the Hadoop Ecosystem and the problems it solves.
- The Hadoop components and how they work.
- Running Hadoop jobs.
- Working with files in HDFS.
- Exercise: Working with Hadoop on the virtual machine (VM).
Java MapReduce Programming
- Data flow through a MapReduce application.
- Anatomy of a Java MapReduce application.
- Exercise: Running Word Count on the VM.
- Hadoop Streaming for writing applications with Ruby, Python, etc.
- Exercise: Word Count using Hadoop streaming.
- Combiners and Partitioners.
- Effective use of compression.
- File formats: plain-text, sequence files and others.
- File systems: HDFS, S3, and others.
- The distributed cache and its use.
- Testing with MRUnit.
- More advanced MapReduce techniques.
- Improving performance.
- Exercise: Bringing it all together - Twitter Analytics.
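To give a flavor of the Streaming exercises above: Hadoop Streaming runs the map and reduce phases as ordinary scripts that read lines on stdin and write tab-separated key/value pairs on stdout, with the framework sorting by key between the two phases. Below is a minimal word-count sketch in Python; the file name and invocation are illustrative, not the exact course materials.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Reduce phase: input arrives sorted by key, so consecutive
    lines with the same word can be summed with groupby."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

if __name__ == "__main__" and sys.argv[1:]:
    # Run as `wordcount.py map` or `wordcount.py reduce`; under Hadoop
    # Streaming the framework performs the shuffle/sort in between.
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

Under Hadoop Streaming this script would be supplied as both `-mapper` and `-reducer` arguments to the hadoop-streaming JAR (paths and job options are covered in the exercise instructions).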
Agenda: Day 2
Data Warehousing with Hive
- What is Hive and why would I use it?
- Exercise: Running Hive and basic queries.
- The Hive Query Language (HQL) by example.
- Running Exercise: Practice HQL concepts as they are introduced.
- Hive vs. Relational Databases.
- Extending Hive with user defined functions (UDFs).
- Exercise: Integrate a UDF into Hive.
- Extending Hive with new data formats.
- Hive under the hood; understanding how Hive drives MapReduce.
- Setting up the Hive Metadata repository.
- Hive tips and tricks.
- Exercise: Improving the performance of a Hive application.
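Beyond Java UDFs, Hive can also stream rows through an external script with SELECT TRANSFORM(...) USING, which is one of the extension points touched on above. The sketch below is a hypothetical transform script: Hive sends each row as tab-separated text on stdin, and the script writes transformed rows to stdout (the column layout here is made up for illustration).

```python
import sys

def normalize(row):
    """Lower-case the first column of a tab-separated row and
    leave the remaining columns unchanged."""
    cols = row.rstrip("\n").split("\t")
    cols[0] = cols[0].lower()
    return "\t".join(cols)

if __name__ == "__main__" and not sys.stdin.isatty():
    # Hive pipes one row per line through this script.
    for line in sys.stdin:
        print(normalize(line))
```

In HQL this might be invoked as SELECT TRANSFORM(word, n) USING 'python normalize.py' AS (word, n) FROM some_table (table and columns hypothetical).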
A Brief Look at Other Higher-Level Hadoop Tools
- Cascading and Cascalog.
- Crunch and FlumeJava.
- ... and others.
Lessons Learned from Hadoop Implementations
- Troubleshooting application issues.
- Effective use of monitoring tools: JMX, Ganglia, Nagios, and others.
Agenda: Day 3
Dataflow Programming with Pig
- What is Pig and why would I use it?
- Pig vs. Hive.
- Exercise: Running Pig and basic dataflows.
- Pig Latin by example.
- Running Exercise: Practice Pig Latin concepts as they are introduced.
- Extending Pig with Java user defined functions (UDFs).
- Exercise: Extending a Pig application with UDFs.
- Pig under the hood; understanding how Pig drives MapReduce.
- Pig tips and tricks.
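Alongside Java UDFs, Pig can register Python functions through Jython (REGISTER 'udfs.py' USING jython AS udfs). A sketch of such a UDF is below; in a real Pig deployment the @outputSchema decorator comes from Pig's pig_util module, so it is stubbed here only to keep the file runnable stand-alone.

```python
def outputSchema(schema):
    """Stand-in for Pig's pig_util.outputSchema decorator, so this
    file also runs outside the Jython/Pig environment."""
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap

@outputSchema("domain:chararray")
def extract_domain(email):
    """Return the lower-cased domain part of an e-mail address,
    or None for malformed input (Pig treats None as null)."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()
```

In Pig Latin this would be called as, e.g., GENERATE udfs.extract_domain(email) after registering the script (relation and field names hypothetical).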
Hadoop in the Enterprise
- HBase: A Hadoop-oriented database.
- Scheduling workflows with bash/cron and other patterns.
- Enterprise integration patterns.
Final Group Exercise
- Pulling it all together with a group exercise.
- Recap of what we have learned.
- Where to go from here: references and resources.
Up and Running with Big Data: 3 Day Deep-Dive
Over three days, explore the Big Data tools, technologies, and techniques that allow organizations to gain insight and drive new business opportunities by finding signal in their data. Using Amazon Web Services, you'll learn how to use the flexible MapReduce programming model to scale your analytics, run Hadoop with Elastic MapReduce, write queries with Hive, develop real-world data flows with Pig, and understand the operational needs of a production data platform.
- MapReduce concepts
- Hadoop implementation: JobTracker, NameNode, TaskTracker, DataNode, Shuffle & Sort
- Introduction to Amazon AWS and EMR with console and command-line tools
- Implementing MapReduce with Java and Streaming
- Hive Introduction
- Hive Relational Operators
- How Hive compiles to MapReduce
- Hive Partitions
- Hive UDFs, UDAFs, UDTFs
- Pig Introduction
- Pig Relational Operators
- How Pig compiles to MapReduce
- Pig UDFs
- NoSQL discussion