Skip Main Navigation
Page Content
This event has ended

Introduction to Hive and Hive Master Class

Think Big Analytics

Wednesday, January 30, 2013 at 9:00 AM - Thursday, January 31, 2013 at 5:00 PM (MST)

Mountain View, CA

Introduction to Hive and Hive Master Class

Ticket Information

Ticket Type Sales End Price Fee Quantity
Intro & Master Class (2-days) Ended $2,295.00 $0.00

Who's Going

Loading your connections...

Share Introduction to Hive and Hive Master Class

Event Details

This is a two day course with one day for introduction and one for a deeper dive into Hive. You can attend only one day instead of both by clicking here for the introduction part and here for the Master class.

Introduction to Hive and Hive Master Class

This workshop is taught by Dean Wampler, Principal Consultant at Think Big Analytics and the co-author of Programming Hive. Our Introduction to Hive course provides a one day intensive introduction to Hive for data analysts. Students will learn how to use Hive to query data in Hadoop clusters using familiar SQL queries. The second day's Hive Master Class builds on that knowledge and explores more advanced Hive techniques such as optimizing Hive table design and queries, analyzing performance characteristics, debugging, and using sophisticated UDFs (user defined functions) and SerDes (serializer-deserializers) for different file formats.

Prerequisites

The workshop will be taught in a Linux environment, using the Hive command-line interface (CLI). Therefore, the perspective student for this workshop needs to meet the following prerequisites:

  • SQL Experience: The ability to write SQL queries is required.
  • Linux shell experience: The ability to log into Linux servers and use basic Linux shell (bash) commands is required.

What You Must Bring

Bring your laptop with the following software installed in advance.

  • Putty (Windows only): Students will log into a remote cluster for the workshop. Mac OSX and Linux environments include ssh (secure shell) support. Windows users will need to install Putty. Download putty.zip from here.
  • A Text Editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad, but not Word, NotePad++, but not Notepad, are suitable.

What You Will Learn

This workshop is taught by Dean Wampler, Principal Consultant at Think Big Analytics and the co-author of Programming Hive. The specific topics of a workshop will be calibrated to the needs of the students, but will generally cover the following topics:

Introduction to Hive


  • What is Hive and why would I use it?
  • Exercise: Running Hive and basic queries.
  • Hive vs. Relational Databases
  • The Hive Query Language (HiveQL) by example
  • Running Exercises: Practice HiveQL concepts as they are introduced: Select, Joins, Ordering, Grouping and built-in functions
  • Extending Hive with user defined functions (UDFs).
  • Exercise: Integrate a UDF into Hive.
  • Extending Hive with new data formats.
  • Exercise: Adding and using a custom SerDe (record format) in Hive.
  • Hive under the hood: understanding how Hive drives MapReduce.
  • Notes on setting up the Hive Metadata repository.
  • Hive tips and tricks


Hive Master Class


  • New features in Hive v0.9.X.
  • Hive table organization and schema considerations for optimizing different usage scenarios, including:
    • Using partitioning to optimize queries with range-bounding where clauses.
    • Using bucketing to optimize the performance of joins between tables of arbitrary size and to support data sampling.
  • Generating representative samples of large data sets.
  • Optimizing join performance.
  • Optimizing order by performance.
  • Using Hive’s support for table indexing.
  • Using UDFs to optimize specific operations and to perform more advanced analytics.
  • Using “explain” to understand how queries are translated to MapReduce.
  • Miscellaneous tuning tips to improve performance.
  • Using column-oriented storage with the RCFile format and corresponding SerDe.
  • Avro for interoperability with other tools.
  • Using Hive with HBase and other persistent data stores.
  • Using Sqoop to import data from and export data to relational data stores.
  • How Hive compares with Cloudera’s Impala.

Students will learn these techniques by performing exercises with real-world data samples and analysis problems.

Agenda

All topics are exercise driven and the students are provided with complete solutions for all exercises.


Introduction to Hive

Overview

The problems Hive solves and the place of Hive in the Hadoop ecosystem.
Exercise: Hands-on walkthrough of the Hive installation and commands.

Defining Databases and Tables

The meaning of these concepts in Hive, including Hive’s performance-driven extensions to conventional types in database schema (namely collection types), file formats and encodings, and the use of external tables for data sharing and partitioned tables for query optimization and data management.
Exercises: Several running exercises that illustrate these concepts using tables with complex data and actual data from NASDAQ and NYSE on stock prices.

Strategies for Loading Data into Hive Tables and Getting It Back Out Again

Techniques and performance considerations for importing data into Hive tables and exporting table data or query results.
Exercise: Practice different importing and exporting techniques.

Hive’s Select Statements

Hive’s version of the select statement, including extensions, plus several support clauses, such as group by.
Exercise: Practice the unique features of Hive’s select statement, such as working with Hive’s collection types, the use of regular expressions for queries, and other topics.

Joins

Joins as implemented in Hive, including limitations compared to other SQL dialects, as well as special optimizations implemented in Hive for higher performance.
Exercise: Understand the performance and behaviors of Hive joins using complex data sets and stock plus dividend data.

Ordering

The support for total ordering of data and the performance issues involved, with workarounds.
Exercise: Experiment with ordering performance.

Built-in and User Defined Functions

Using Hive’s built-in functions and adding your own.
Exercise: Experiment with Hive’s built-in functions and use third-party libraries of additional functions.

Native and Custom File Formats

Understanding the choices for data formats and how they affect performance, ease of use, etc.
Exercise: Use a third-party plugin to query a sample of Twitter traffic in JSON (JavaScript Object Notation) format.

Driving External Programs from Hive

Using Hive queries that delegate some of their work to external programs.

Conclusions

A recap of what we learned and where to go from here.

Hive Master Class

Hive v0.9.X Features

We’ll warm up with a quick review of the latest enhancements to Hive.
Exercise: Setup our exercise environment. Ensure that everyone can access the exercise cluster and run Hive queries.

Table Organization and Schema Design

When most queries of a table include range-bound where clause (e.g., WHERE year = 2012), partitioning the table by the corresponding columns improves performance. Usingexternal tables promotes easier interoperability with other tools. Finally, Hive supports three complex data types, ARRAYS, STRUCTS, and MAPS within records.
We’ll quickly review these concepts, then discuss the use of bucketing to organize data in large tables so that joins and sampling are accelerated. Bucketing can be used with partitioning, but it is also a useful alternative when a good partitioning strategy isn’t obvious.
Exercise: Set up our external, partitioned, and bucketed tables for the rest of the workshop. Run queries that sample data.

Sampling Large Data Sets

Bucketing is also useful when users frequently run queries to generate small, but representative samples of large data sets. We’ll discuss the TABLESAMPLE clause that’s used for this purpose.
Exercise: Generate different-sized samples using TABLESAMPLE queries.

Joins

After quickly reviewing Hive’s join support, including known issues with outer joins and workarounds, we’ll discuss how to optimize joins. Map-side joins work when all but one table is small and when larger tables are bucketed.
We’ll also discuss the performance of left-semi joins vs. inner joins.
Exercise: Analyze the performance of different Hive joins, including map-side and left-semi joins.

Ordering

Using the ORDER BY clause forces all the data to pass through a single reducer to perform a total ordering. We’ll examine ordering trade-offs and ways to avoid total ordering.
Exercise: Experiment with ordering options.

Indexing

Hive offers basic support for indexing tables. We’ll examine its strengths, including extensibility, and weaknesses.
Exercise: Experiment with indexing and its impact on query performance.

Exploiting Custom User Defined Functions for Particular Performance Scenarios

Some queries can be written using just the built-in functions and other features of Hive. However, in some cases, a custom UDF that optimizes the results based on your knowledge of the query can improve the query’s performance. Similarly, the complexity of some queries can be reduced with an appropriate UDF.
Exercise: Analyze the complexity and performance of queries in two forms, one form which uses a UDF and one form which uses only the “generic” features of Hive.

Understanding Explain

Explain describes how Hive translates a logical query into MapReduce jobs. However, it can be difficult to understand the output of explain. We’ll apply explain to several queries and discuss what the output means. Then we’ll examine how to use this output to diagnose performance problems and possible improvements.
Exercise: In this module, the exercise will be done together with the instructor.

Miscellaneous Tuning Tips

There are several techniques available to improve performance, from Hive-specific and to improve performance.

Column-oriented Storage with the RCFile Format

Sometimes schema have a large number of columns, but typical queries only select a small subset of them. Column-oriented storage reduces the overhead of scanning these tables, because data on disk for uninteresting columns is not read. Hive’s RCFile and corresponding ColumnarSerDe supports column-oriented storage.
Exercise: Create a table with column-oriented storage and run queries against it.

Using the Avro File Format

Avro is a data serialization system for tool interoperability, especially for Hadoop ecosystem tools. We’ll discuss its features, strengths and weaknesses.
Exercise: Use Avro-formatted data for a Hive table and explore interoperability with other Hive tables and third-party tools, such as Pig.

Accessing HBase Tables as Hive Tables

As an example of querying data managed in other tools, we will examine how Hive can query data resident in HBase tables, including performance considerations.
Exercise: Query data in HBase and compare the performance to queries from “normal” Hive tables.

Enterprise Integration with Sqoop

Sqoop is a tool for integrating Hadoop with relational and NoSQL databases, where it’s necessary to import data from these stores or export data to them, rather than query the tables directly. We’ll look at example sqoop commands that import relational data to Hive, including the creation of Hive tables on the fly with compatible schema. We’ll look at similar sqoop examples that export data back from Hive.
There is no exercise for this module.

Comparing Hive with Cloudera’s Impala

Cloudera’s Impala is a new, promising query tool that supports HiveQL, but promises significantly better performance than Hive. Impala is currently in beta and it is not yet a feature-complete substitute for Hive. We’ll discuss what features are currently supported or planned, and the relative advantages and disadvantages of using Impala vs. Hive.
There is no exercise for this module.

Final Exercise: Natural Language Processing with Hive

We’ll finish with a final exercise that pulls together the techniques we learned to implement some common natural language processing calculations.

Conclusions

A recap of what we learned and where to go from here.

Have questions about Introduction to Hive and Hive Master Class? Contact Think Big Analytics

When & Where


Think Big Analytics
520 San Antonio Road
#210
Mountain View, CA 94040

Wednesday, January 30, 2013 at 9:00 AM - Thursday, January 31, 2013 at 5:00 PM (MST)


  Add to my calendar

Please log in or sign up

In order to purchase these tickets in installments, you'll need an Eventbrite account. Log in or sign up for a free account to continue.