Hive Master Class
Monday, February 25, 2013 from 9:00 AM to 5:00 PM (MST)
Mountain View, CA
Hive Master Class
This workshop is taught by Dean Wampler, Principal Consultant at Think Big Analytics and the co-author of Programming Hive. This one-day workshop provides intensive, exercise-driven instruction in more advanced Hive techniques. It is designed for data analysts, who will learn how to optimize Hive table design and queries, analyze performance characteristics, debug problems, and use sophisticated UDFs (user defined functions) and SerDes (serializer-deserializers) for different file formats.
The workshop will be taught in a Linux environment, using the Hive command-line interface (CLI). Therefore, the perspective student for this workshop needs to meet the following prerequisites:
- Previous Hive Experience: The student must have the equivalent of the Think Big Analytics 1-Day Introduction to Hive training or similar hands-on experience.
- SQL Experience: The ability to write SQL queries is required.
- Linux shell experience: The ability to log into Linux servers and use basic Linux shell (bash) commands is required.
What You Must Bring
Bring your laptop with the following software installed in advance.
- Putty (Windows only): Students will log into a remote cluster for the workshop. Mac OSX and Linux environments include ssh (secure shell) support. Windows users will need to install Putty. Download putty.zip from here.
- A Text Editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad, but not Word, NotePad++, but not Notepad, are suitable.
What You Will Learn
You’ll learn Hive features and techniques not typically covered in introductory courses, including techniques for solving difficult query problems and optimizing the performance of your Hive queries, advanced tips for custom UDFs (user-defined functions) to improve performance and SerDes (serializer-deserializers) for custom file formats and interoperability, and how to understand and exploit information Hive gives you in the “explain” command and its log file output.
The specific topics of a workshop will be calibrated to the needs of the students, selected from the following list of possible topics:
- New features in Hive v0.9.X.
- Hive table organization and schema considerations for optimizing different usage scenarios, including:
- Using partitioning to optimize queries with range-bounding where clauses.
- Using bucketing to optimize the performance of joins between tables of arbitrary size and to support data sampling.
- Generating representative samples of large data sets.
- Optimizing join performance.
- Optimizing order by performance.
- Using Hive’s support for table indexing.
- Using UDFs to optimize specific operations and to perform more advanced analytics.
- Using “explain” to understand how queries are translated to MapReduce.
- Miscellaneous tuning tips to improve performance.
- Using column-oriented storage with the RCFile format and corresponding SerDe.
- Avro for interoperability with other tools.
- Using Hive with HBase and other persistent data stores.
- Using Sqoop to import data from and export data to relational data stores.
- How Hive compares with Cloudera’s Impala.
Students will learn these techniques by performing exercises with real-world data samples and analysis problems.
All topics are exercise driven and the students are provided with complete solutions for all exercises.
Hive v0.9.X Features
We’ll warm up with a quick review of the latest enhancements to Hive.
Exercise: Setup our exercise environment. Ensure that everyone can access the exercise cluster and run Hive queries.
Table Organization and Schema Design
When most queries of a table include range-bound where clause (e.g., WHERE year = 2012), partitioning the table by the corresponding columns improves performance. Usingexternal tables promotes easier interoperability with other tools. Finally, Hive supports three complex data types, ARRAYS, STRUCTS, and MAPS within records.
We’ll quickly review these concepts, then discuss the use of bucketing to organize data in large tables so that joins and sampling are accelerated. Bucketing can be used with partitioning, but it is also a useful alternative when a good partitioning strategy isn’t obvious.
Exercise: Set up our external, partitioned, and bucketed tables for the rest of the workshop. Run queries that sample data.
Sampling Large Data Sets
Bucketing is also useful when users frequently run queries to generate small, but representative samples of large data sets. We’ll discuss the TABLESAMPLE clause that’s used for this purpose.
Exercise: Generate different-sized samples using TABLESAMPLE queries.
After quickly reviewing Hive’s join support, including known issues with outer joins and workarounds, we’ll discuss how to optimize joins. Map-side joins work when all but one table is small and when larger tables are bucketed.
We’ll also discuss the performance of left-semi joins vs. inner joins.
Exercise: Analyze the performance of different Hive joins, including map-side and left-semi joins.
Using the ORDER BY clause forces all the data to pass through a single reducer to perform a total ordering. We’ll examine ordering trade-offs and ways to avoid total ordering.
Exercise: Experiment with ordering options.
Hive offers basic support for indexing tables. We’ll examine its strengths, including extensibility, and weaknesses.
Exercise: Experiment with indexing and its impact on query performance.
Exploiting Custom User Defined Functions for Particular Performance Scenarios
Some queries can be written using just the built-in functions and other features of Hive. However, in some cases, a custom UDF that optimizes the results based on your knowledge of the query can improve the query’s performance. Similarly, the complexity of some queries can be reduced with an appropriate UDF.
Exercise: Analyze the complexity and performance of queries in two forms, one form which uses a UDF and one form which uses only the “generic” features of Hive.
Explain describes how Hive translates a logical query into MapReduce jobs. However, it can be difficult to understand the output of explain. We’ll apply explain to several queries and discuss what the output means. Then we’ll examine how to use this output to diagnose performance problems and possible improvements.
Exercise: In this module, the exercise will be done together with the instructor.
Miscellaneous Tuning Tips
There are several techniques available to improve performance, from Hive-specific and to improve performance.
Column-oriented Storage with the RCFile Format
Sometimes schema have a large number of columns, but typical queries only select a small subset of them. Column-oriented storage reduces the overhead of scanning these tables, because data on disk for uninteresting columns is not read. Hive’s RCFile and corresponding ColumnarSerDe supports column-oriented storage.
Exercise: Create a table with column-oriented storage and run queries against it.
Using the Avro File Format
Avro is a data serialization system for tool interoperability, especially for Hadoop ecosystem tools. We’ll discuss its features, strengths and weaknesses.
Exercise: Use Avro-formatted data for a Hive table and explore interoperability with other Hive tables and third-party tools, such as Pig.
Accessing HBase Tables as Hive Tables
As an example of querying data managed in other tools, we will examine how Hive can query data resident in HBase tables, including performance considerations.
Exercise: Query data in HBase and compare the performance to queries from “normal” Hive tables.
Enterprise Integration with Sqoop
Sqoop is a tool for integrating Hadoop with relational and NoSQL databases, where it’s necessary to import data from these stores or export data to them, rather than query the tables directly. We’ll look at example sqoop commands that import relational data to Hive, including the creation of Hive tables on the fly with compatible schema. We’ll look at similar sqoop examples that export data back from Hive.
There is no exercise for this module.
Comparing Hive with Cloudera’s Impala
Cloudera’s Impala is a new, promising query tool that supports HiveQL, but promises significantly better performance than Hive. Impala is currently in beta and it is not yet a feature-complete substitute for Hive. We’ll discuss what features are currently supported or planned, and the relative advantages and disadvantages of using Impala vs. Hive.
There is no exercise for this module.
Final Execise: Natural Language Processing with Hive
We’ll finish with a final exercise that pulls together the techniques we learned to implement some common natural language processing calculations.
A recap of what we learned and where to go from here.