This event has ended

Introduction to Hive

Think Big Analytics

Wednesday, January 30, 2013 from 9:00 AM to 5:00 PM (PST)

Ticket Information

Ticket Type                   Sales End  Price    Fee
Introduction to Hive (1-day)  Ended      $995.00  $0.00

Event Details

Note: This introductory course is taught as the first day of a two-day Introduction to Hive and Hive Master Class program. Please visit the full course description if you are interested in attending both days. If you want only the second day, which is intended for more experienced Hive programmers, click here.

Introduction to Hive

This workshop is taught by Dean Wampler, Principal Consultant at Think Big Analytics and the co-author of Programming Hive. Our Introduction to Hive course provides an intensive introduction to Hive for data analysts. Students will learn how to use Hive to query data in Hadoop clusters using familiar SQL queries.

Prerequisites

The workshop will be taught in a Linux environment, using the Hive command-line interface (CLI). Therefore, the prospective student for this workshop needs to meet the following prerequisites:

  • Previous Hive Experience: The student must have the equivalent of the Think Big Analytics 1-Day Introduction to Hive training or similar hands-on experience.
  • SQL Experience: The ability to write SQL queries is required.
  • Linux shell experience: The ability to log into Linux servers and use basic Linux shell (bash) commands is required.

What You Must Bring

Bring your laptop with the following software installed in advance.

  • PuTTY (Windows only): Students will log into a remote cluster for the workshop. Mac OS X and Linux environments include ssh (secure shell) support. Windows users will need to install PuTTY by downloading putty.zip.
  • A Text Editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad (but not Word) and Notepad++ (but not Notepad) are suitable.


What You Will Learn

The specific topics will be calibrated to the needs of the students, but the workshop generally covers the following:

  • What is Hive and why would I use it?
  • Exercise: Running Hive and basic queries.
  • Hive vs. Relational Databases
  • The Hive Query Language (HiveQL) by example
  • Running Exercises: Practice HiveQL concepts as they are introduced: Select, Joins, Ordering, Grouping and built-in functions
  • Extending Hive with user defined functions (UDFs).
  • Exercise: Integrate a UDF into Hive.
  • Extending Hive with new data formats.
  • Exercise: Adding and using a custom SerDe (record format) in Hive.
  • Hive under the hood: understanding how Hive drives MapReduce.
  • Notes on setting up the Hive Metadata repository.
  • Hive tips and tricks

Agenda


The particular agenda for each day may be adjusted according to student interests, pace, and other considerations. All topics are exercise driven, and the students are provided with complete solutions for all exercises.

Introduction to Hive

The problems Hive solves and the place of Hive in the Hadoop ecosystem.
Exercise: Hands-on walkthrough of the Hive installation and commands.

Defining Databases and Tables

The meaning of these concepts in Hive, including Hive’s performance-driven extensions to conventional types in database schema (namely collection types), file formats and encodings, and the use of external tables for data sharing and partitioned tables for query optimization and data management.
Exercises: Several running exercises that illustrate these concepts using tables with complex data and actual data from NASDAQ and NYSE on stock prices.
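
To give a rough sense of why partitioned tables help with query optimization, here is a minimal Python sketch. It is not course material. It assumes Hive's convention of storing each partition in a directory named column=value, and it uses a hypothetical table root (/data/stocks) and hypothetical partition values. A filter on a partition column can then skip whole directories instead of scanning every file.

```python
from pathlib import PurePosixPath

# Hypothetical table root and partition values, for illustration only.
ROOT = "/data/stocks"
PARTITIONS = [
    (("exchange", "NASDAQ"), ("symbol", "AAPL")),
    (("exchange", "NASDAQ"), ("symbol", "INTC")),
    (("exchange", "NYSE"), ("symbol", "GE")),
]


def partition_path(table_root, partition_spec):
    """Build the directory for one partition, one column=value level per key."""
    path = PurePosixPath(table_root)
    for column, value in partition_spec:
        path = path / f"{column}={value}"
    return str(path)


def prune(partitions, **wanted):
    """Keep only the partitions whose values match every requested column."""
    return [
        spec for spec in partitions
        if all(dict(spec).get(col) == val for col, val in wanted.items())
    ]


def directories_to_scan(**wanted):
    """Return the directories a filter on partition columns would still read."""
    return [partition_path(ROOT, spec) for spec in prune(PARTITIONS, **wanted)]
```

With these sample values, a filter on exchange = 'NYSE' leaves a single directory to read, while a query with no partition filter would read all three.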

Strategies for Loading Data into Hive Tables and Getting It Back Out Again

Techniques and performance considerations for importing data into Hive tables and exporting table data or query results.
Exercise: Practice different importing and exporting techniques.

Hive’s Select Statements

Hive’s version of the select statement, including extensions, plus several support clauses, such as group by.
Exercise: Practice the unique features of Hive’s select statement, such as working with Hive’s collection types, the use of regular expressions for queries, and other topics.

Joins

Joins as implemented in Hive, including limitations compared to other SQL dialects, as well as special optimizations implemented in Hive for higher performance.
Exercise: Understand the performance and behaviors of Hive joins using complex data sets and stock plus dividend data.

Ordering

The support for total ordering of data and the performance issues involved, with workarounds.
Exercise: Experiment with ordering performance.

Built-in and User Defined Functions

Using Hive’s built-in functions and adding your own.
Exercise: Experiment with Hive’s built-in functions and use third-party libraries of additional functions.

Native and Custom File Formats

Understanding the choices for data formats and how they affect performance, ease of use, etc.
Exercise: Use a third-party plugin to query a sample of Twitter traffic in JSON (JavaScript Object Notation) format.

Driving External Programs from Hive

Using Hive queries that delegate some of their work to external programs.
Exercise: Compute text analysis tasks with external programs in a Hive query.
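
As an illustration of what such an external program can look like, here is a minimal Python sketch of a streaming script. It is not taken from the course. It assumes the usual streaming contract, in which Hive passes each input row to the program as a tab-separated line on standard input and reads tab-separated rows back from standard output. The function names are hypothetical.

```python
import sys


def tokenize_rows(lines):
    """Yield one lowercase word per output row for each tab-separated input row."""
    for line in lines:
        for field in line.rstrip("\n").split("\t"):
            for word in field.lower().split():
                yield word


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    """Read rows from stream_in and write one word per line to stream_out."""
    for word in tokenize_rows(stream_in):
        stream_out.write(word + "\n")

# When saved as a standalone script, call main() here. The call is left out
# of this sketch so the functions can be imported and tested without
# blocking on standard input.
```

A query could then invoke the script through Hive's TRANSFORM ... USING clause, for example SELECT TRANSFORM (line) USING 'python tokenize.py' AS (word) FROM docs, after registering the file with ADD FILE. The script name, table name, and column names here are placeholders. Counting the words would be left to an ordinary GROUP BY in Hive, because each copy of the script sees only part of the data.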

Exercise: Natural Language Processing with Hive

A final exercise using several built-in Hive functions for basic ngram analysis of text sources.
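
For readers unfamiliar with ngram analysis, the following Python sketch shows the underlying idea: split text into tokens, slide a window of n tokens across each sentence, and report the most frequent windows. It is a conceptual illustration with hypothetical function names and whitespace tokenization. It does not reproduce the implementation of Hive's built-in functions or the content of the exercise.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(sentences, n, k):
    """Count n-grams across all sentences and return the k most frequent."""
    counts = Counter()
    for sentence in sentences:
        # Simple whitespace tokenization; real text needs more careful handling.
        counts.update(ngrams(sentence.lower().split(), n))
    return counts.most_common(k)
```

For example, top_ngrams(["the quick fox", "the quick dog"], 2, 1) reports the bigram ("the", "quick") with a count of 2.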

Conclusions

A recap of what we learned and where to go from here.

Have questions about Introduction to Hive? Contact Think Big Analytics

When & Where


Think Big Analytics
520 San Antonio Road
#210
Mountain View, CA 94040

Wednesday, January 30, 2013 from 9:00 AM to 5:00 PM (PST)

