Big Data Deep Dive, Boston, MA
Monday, November 5, 2012 at 9:00 AM - Wednesday, November 7, 2012 at 4:00 PM (EST)
Due to the severe weather associated with Hurricane Sandy, this course has been postponed by one week and will now be held Monday-Wednesday 11/5-11/7.
Amazon Elastic MapReduce 3-Day Developer Course
The Amazon Elastic MapReduce 3-Day Developer Course gives you the essential skills to develop applications on Amazon Elastic MapReduce (Amazon EMR). Our goal is to make you productive as quickly as possible, while setting the stage for your future growth as an EMR developer.
This course grounds you in the principles of Hadoop and the MapReduce computation model (the basis of EMR), the Amazon Simple Storage Service (Amazon S3) for distributed file storage, the roles of other essential tools, and how to write applications effectively using these tools. If your team is new to EMR, you’ll learn the skills necessary to transition your existing data management applications over to EMR quickly and then start leveraging the new capabilities EMR gives you for data analysis.
The following prerequisites ensure that you will gain the maximum benefit from the course.
- Programming experience: This is a developer's course. We will write Java, Hive, and Pig applications. Prior Java experience is strongly recommended.
- Linux shell experience: Basic Linux shell (bash) commands will be used extensively. Some prior experience is recommended.
- Experience with SQL databases: SQL experience is helpful for learning Hive and Pig, but not essential.
What You Must Bring
We will log into remote EMR instances to build, test, and run our applications. You will also be provided with all the exercise software so you can view it on your laptop, if desired.
Bring your laptop with the following software installed in advance.
- JDK 1.6 or 1.7: The Java Development Kit (JDK), version 1.6 or newer, not just the Java Runtime Environment (JRE). (http://www.oracle.com/technetwork/java/javase/downloads/index.html)
- Ant: The Java-based Ant build tool, version 1.7 or newer, if you want to build and test the Java exercises on your laptop. (http://ant.apache.org/)
- A programmer’s source code editor: Whatever you prefer. Either Eclipse or IntelliJ IDEA is recommended for the Java exercises, and project files for both environments will be provided. You might find a separate programmer’s text editor more convenient for the Hive and Pig exercises.
What You Will Learn
Think Big Academy courses teach by doing, interspersing short lectures with hands-on exercises. By the end of the course, you will have covered the following:
- Amazon Elastic MapReduce Overview and Hadoop Architecture.
- Amazon Elastic MapReduce Value Proposition.
- Starting and examining your first EMR cluster.
- Writing your first Amazon Elastic MapReduce job.
- Loading Data into the cluster.
- Amazon Elastic MapReduce Controls and Debugging EMR.
- Data and Security.
- Elastic MapReduce Programming Models.
- Amazon Elastic MapReduce with streaming.
- Amazon Elastic MapReduce with Pig.
- Amazon Elastic MapReduce with Hive.
- Advanced Hadoop Features – UDFs, UDAFs.
- Amazon Elastic MapReduce Ecosystem.
The particular agenda for each day may be adjusted according to student interests, pace, and other considerations.
Introduction to Amazon EMR
- Introduction to Amazon EMR and the problems it solves.
- The Amazon EMR components, their roles, and how they work.
- Understanding and using Amazon S3 and other file systems.
- The Amazon EMR ecosystem.
- The Amazon EMR value proposition.
- Running Amazon EMR clusters.
- Running “jobs” on Amazon EMR.
- Loading data into an Amazon EMR cluster.
- Exercise: Walkthrough of the Amazon EMR components.
Java MapReduce Programming
The Basics of Java MapReduce
- Data flow through a MapReduce application, using the classic Word Count algorithm.
- An overview of the Java MapReduce API and the anatomy of an application.
- Exercise: Java MapReduce development with Eclipse, unit testing with MRUnit, and running MapReduce jobs on Amazon EMR.
- Hadoop Streaming for writing map and reduce code in Ruby, Python, etc.
- Exercise: Word Count implemented using Hadoop Streaming.
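As a rough illustration of the Word Count data flow named above, the following is a minimal sketch in plain Java that mimics the three stages (map, shuffle and sort, reduce) in memory. It deliberately does not use the Hadoop MapReduce API; the class name `WordCountSketch` and its method names are hypothetical, chosen only for this example, and the course's own exercise code may look quite different.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the Word Count data flow.
// This is NOT the Hadoop API; it only imitates the stages.
public class WordCountSketch {

    // Map stage: emit a (word, 1) pair for every token in one input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs =
                new ArrayList<Map.Entry<String, Integer>>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort stage: group all values by key, with keys in sorted order.
    public static Map<String, List<Integer>> shuffle(
            List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the counts collected for one word.
    public static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Drive the three stages over a list of input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs =
                new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "the quick brown fox", "the lazy dog and the fox");
        // Prints {and=1, brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(wordCount(lines));
    }
}
```

In a real cluster the map and reduce stages run as distributed tasks and the framework performs the shuffle; the sketch collapses all of that into one process so the flow of key-value pairs is easy to follow.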
MapReduce Deep Dive
- Combiners for reducing IO overhead.
- Key-Value formats: WritableComparables and how to create your own.
- Partitioners and Comparators for custom sorting.
- The Secondary Sort algorithm.
- Exercise: Using the Secondary Sort algorithm.
- File formats: built-in formats and how to create your own custom formats. Compression options and the issue of splittable file formats.
- Distributed file systems: HDFS, S3, MapR, and others.
- Counters and logging: knowing what’s going on.
- Mapper reuse: composing mappers with
- Task scheduling: the FIFO queue, fair, and capacity schedulers.
- The distributed cache and its use.
- Joins: map-side and reduce-side joins.
- Exercise: Indexing Twitter traffic.
Enterprise Application Considerations
- Monitoring applications.
- Scheduling workflows.
- ETL (extract, transform, and load) and data export techniques.
- Monitoring, profiling, debugging, and tuning applications.
Data Warehousing with Hive
- What is Hive and why would I use it?
- Exercise: Running Hive and basic queries.
- The Hive Query Language (HiveQL) by example.
- Running Exercises: Practice HiveQL concepts as they are introduced.
- Hive vs. Relational Databases.
- Extending Hive with user defined functions (UDFs).
- Exercise: Integrate a UDF into Hive.
- Extending Hive with new data formats.
- Exercise: Supporting a custom SerDe (record format) in Hive.
- Hive under the hood: understanding how Hive drives MapReduce.
- Notes on setting up the Hive Metadata repository.
- Hive tips and tricks.
- Exercise: Ngram analysis with Hive.
Data Flow Programming with Pig
- What is Pig and why would I use it?
- Pig for data flows vs. Hive for queries.
- Exercise: Running Pig and basic data flows.
- Pig Latin, the language of Pig, by example.
- Running Exercises: Practice Pig Latin concepts as they are introduced.
- Extending Pig with Java user defined functions (UDFs).
- Exercise: Extending a Pig application with UDFs.
- Pig under the hood: understanding how Pig drives MapReduce.
- Pig tips and tricks.
- Exercise: Pulling it all together: a complex data flow processing exercise using Pig.
- Recap of what we learned.
- Where to go from here: references and resources.
Up and Running with Big Data: 3 Day Deep-Dive
Over three days, explore the Big Data tools, technologies, and techniques that allow organisations to gain insight and drive new business opportunities by finding signal in their data. Using Amazon Web Services, you'll learn how to use the flexible map/reduce programming model to scale your analytics, use Hadoop with Elastic MapReduce, write queries with Hive, develop real-world data flows with Pig, and understand the operational needs of a production data platform.
- MapReduce concepts
- Hadoop implementation: JobTracker, NameNode, TaskTracker, DataNode, Shuffle & Sort
- Introduction to AWS and EMR with console and command-line tools
- Implementing MapReduce with Java and Streaming
- Hive Introduction
- Hive Relational Operators
- How Hive Is Implemented on MapReduce
- Hive Partitions
- Hive UDFs, UDAFs, UDTFs
- Pig Introduction
- Pig Relational Operators
- How Pig Is Implemented on MapReduce
- Pig UDFs
- NoSQL discussion