Introduction to Hive
Wednesday, January 30, 2013 from 9:00 AM to 5:00 PM (PST)
Mountain View, CA
Note: this introductory course is taught as the first day of a two-day Introduction to Hive and Hive Master Class program. Please visit the full course description if you are interested in attending both days. If you only want the second day, which is for more experienced Hive programmers, click here.
Introduction to Hive
This workshop is taught by Dean Wampler, Principal Consultant at Think Big Analytics and the co-author of Programming Hive. Our Introduction to Hive course provides an intensive introduction to Hive for data analysts. Students will learn how to use Hive to query data in Hadoop clusters using familiar SQL queries.
The workshop will be taught in a Linux environment, using the Hive command-line interface (CLI). Therefore, the prospective student for this workshop needs to meet the following prerequisites:
- Previous Hive Experience: The student must have the equivalent of the Think Big Analytics 1-Day Introduction to Hive training or similar hands-on experience.
- SQL Experience: The ability to write SQL queries is required.
- Linux shell experience: The ability to log into Linux servers and use basic Linux shell (bash) commands is required.
What You Must Bring
Bring your laptop with the following software installed in advance.
- PuTTY (Windows only): Students will log into a remote cluster for the workshop. Mac OS X and Linux environments include ssh (secure shell) support. Windows users will need to install PuTTY. Download putty.zip from here.
- A Text Editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad (but not Word) and Notepad++ (but not Notepad) are suitable.
What You Will Learn
The specific topics of each workshop will be calibrated to the needs of the students, but will generally include the following:
- What is Hive and why would I use it?
- Exercise: Running Hive and basic queries.
- Hive vs. Relational Databases
- The Hive Query Language (HiveQL) by example
- Running Exercises: Practice HiveQL concepts as they are introduced: Select, Joins, Ordering, Grouping and built-in functions
- Extending Hive with user defined functions (UDFs).
- Exercise: Integrate a UDF into Hive.
- Extending Hive with new data formats.
- Exercise: Adding and using a custom SerDe (record format) in Hive.
- Hive under the hood: understanding how Hive drives MapReduce.
- Notes on setting up the Hive Metadata repository.
- Hive tips and tricks
The particular agenda for each day may be adjusted according to student interests, pace, and other considerations. All topics are exercise driven, and students are provided with complete solutions for all exercises.
Introduction to Hive
The problems Hive solves and the place of Hive in the Hadoop ecosystem.
Exercise: Hands-on walkthrough of the Hive installation and commands.
Defining Databases and Tables
The meaning of these concepts in Hive, including Hive’s performance-driven extensions to conventional types in database schema (namely collection types), file formats and encodings, and the use of external tables for data sharing and partitioned tables for query optimization and data management.
Exercises: Several running exercises that illustrate these concepts using tables with complex data and actual data from NASDAQ and NYSE on stock prices.
Strategies for Loading Data into Hive Tables and Getting It Back Out Again
Techniques and performance considerations for importing data into Hive tables and exporting table data or query results.
Exercise: Practice different importing and exporting techniques.
Hive’s Select Statements
Hive’s version of the select statement, including extensions, plus several support clauses, such as group by.
Exercise: Practice the unique features of Hive’s select statement, such as working with Hive’s collection types, the use of regular expressions for queries, and other topics.
Joins
Joins as implemented in Hive, including limitations compared to other SQL dialects, as well as special optimizations implemented in Hive for higher performance.
Exercise: Understand the performance and behaviors of Hive joins using complex data sets and stock plus dividend data.
Ordering
The support for total ordering of data and the performance issues involved, with workarounds.
Exercise: Experiment with ordering performance.
Built-in and User Defined Functions
Using Hive’s built-in functions and adding your own.
Exercise: Experiment with Hive’s built-in functions and use third-party libraries of additional functions.
Native and Custom File Formats
Understanding the choices for data formats and how they affect performance, ease of use, etc.
Driving External Programs from Hive
Using Hive queries that delegate some of their work to external programs.
Exercise: Compute text analysis tasks with external programs in a Hive query.
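As a rough illustration of what delegating work to an external program can look like, the sketch below is a small Python script that a Hive query could call through its TRANSFORM clause. It is not taken from the course materials. The script name split_words.py, the table docs, and the column line are hypothetical names chosen for this example. The script relies on Hive's default streaming format, in which each row arrives on standard input as one line with columns separated by tabs.

```python
import re
import sys

# Hypothetical usage from the Hive CLI (table and column names are assumptions):
#   ADD FILE split_words.py;
#   SELECT word, SUM(cnt) AS total
#   FROM (
#     SELECT TRANSFORM(line) USING 'python split_words.py'
#            AS (word STRING, cnt INT)
#     FROM docs
#   ) t
#   GROUP BY word;

WORD = re.compile(r"[a-z0-9']+")


def tokenize(row):
    """Return the lowercase words found in the first tab-separated column."""
    text = row.rstrip("\n").split("\t")[0]
    return WORD.findall(text.lower())


def transform(rows):
    """Yield one 'word<TAB>1' output row for every word in every input row."""
    for row in rows:
        for word in tokenize(row):
            yield "%s\t1" % word


# Run only when data is piped in, as it is when Hive launches the script.
if __name__ == "__main__" and not sys.stdin.isatty():
    for out in transform(sys.stdin):
        print(out)
```

The script only splits text into words and leaves the counting to the GROUP BY in the surrounding query. Hive may start several copies of the script, each seeing only part of the table, so any totals computed inside the script would be partial.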