Teaching new Presto performance tricks to the Old-School DBA

I’ve spent much of my career working with relational databases such as Oracle and MySQL, and SQL performance has always been an area of focus for me. I’ve spent countless hours reviewing EXPLAIN plans, rewriting subqueries, adding new indexes, and chasing down table-scans. I’ve been trained to make performance improvements such as:  only choose columns in a SELECT that are absolutely necessary, stay away from LIKE clauses, review the cardinality of columns before adding indexes, and always JOIN on indexed columns.

It’s been an instinctual part of my life as a Database Administrator who supports OLTP databases that have sold in excess of 20K tickets per minute to your favorite events. I remember a specific situation where a missing index caused our production databases to get flooded with table-scans that brought a world-wide on-sale to an immediate halt. I had a lot of explaining to do that day as the missing index made it to QA and Stage but not Production!

In recent years, I’ve transitioned to Data Engineering and began supporting Big Data environments.  Specifically, I’m supporting Eventbrite’s Data Warehouse which leverages Presto and Apache Hive using the Presto/Hive connector. The data files can be of different formats, but we’re using HDFS and S3.  The Hive metadata describes how data stored in HDFS/S3 maps to schemas, tables, and columns to be queried via SQL. We persist this metadata information in Amazon Aurora and access it through the Presto/Hive connector via the Hive Metastore Service (HMS). 

The stakes have changed and so have the skill-sets required. I’ve needed to retrain myself in how to write optimal SQL for Presto. Some of the best practices for Presto are the same as relational databases and others are brand new to me. This blog post summarizes some of the similarities and some of the differences with writing efficient SQL on MySQL vs Presto/Hive. Along the way I’ve had to learn new terms such as “federated queries”, “broadcast joins”, “reshuffling”, “join reordering”, and “predicate pushdown”.

Let’s start with the basics:

What is MySQL? The world’s most popular open source database. The MySQL software delivers a fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL is intended for mission-critical, heavy-load production database usage.

What is Presto? Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto doesn’t use the map reduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel relational databases.

Presto uses ANSI SQL syntax/semantics to build its queries. The advantage of this is that analysts with experience with relational databases will find it very easy and straightforward to write Presto queries! That said, the best practices for developing efficient SQL via Presto/Hive are different from those used to query standard RDBMS databases.

Let’s transition to Presto performance tuning tips and how they compare to standard best practices with MySQL.

 

1. Only specify the columns you need

Restricting columns for SELECTs can improve your query performance significantly. Specify only needed columns instead of using a wildcard (*). This applies to Presto as well as MySQL! 

 

2. Consider the cardinality within GROUP BY

When using GROUP BY, order the columns by the highest cardinality (that is, most number of unique values) to the lowest.

The GROUP BY operator distributes rows based on the order of the columns to the worker nodes, which hold the GROUP BY values in memory. As rows are being ingested, the GROUP BY columns are looked up in memory and the values are compared. If the GROUP BY columns match, the values are then aggregated together.

 

3. Use LIMIT with ORDER BY

The ORDER BY clause returns the results of a query in sort order. To  process the sort, Presto must send all rows of data to a single worker and then sort them. This sort can be a very memory-intensive operation for large datasets that will put strain on the Presto workers. The end result will be long execution times and/or memory errors. 

If you are using the ORDER BY clause to look at the top N values, then use a LIMIT clause to reduce the cost of the sort significantly by pushing the sorting/limiting to individual workers, rather than the sorting being done by a single worker. 

I highly recommend you use the LIMIT clause not just for SQL with ORDER BY but in any situation when you’re validating new SQL. This is good practice for MySQL as well as Presto!

 

4. Using approximate aggregate functions

When exploring large datasets often an approximation (with standard deviation of 2.3%) is more than good enough! Presto has approximate aggregation functions that give you significant performance improvements. Using the approx_distinct(x) function on large data sets vs COUNT(DISTINCT x) will result in performance gains. 

When an exact number may not be required―for instance, if you are looking for a rough estimate of the number of New Years events in the Greater New York area then consider using approx_distinct(). This function minimizes the memory usage by counting unique hashes of values instead of entire strings. The drawback is that there is a small standard deviation.

 

5. Aggregating a series of LIKE clauses in one single regexp_like clause

The LIKE operation is well known to be slow especially when not anchored to the left (i.e. the search text is surrounded by ‘%’ on both sides) or when used with a series of OR conditions. So it is no surprise that Presto’s query optimizer is unable to improve queries that contain many LIKE clauses.  

We’ve found improved  LIKE performance on Presto by  substituting the LIKE/OR  combination with a single REGEXP_LIKE clause, which is Presto native.  Not only is it easier to read but it’s also more performant. Based on some quick performance tests, we see ~30% increase in run-times with REGEXP_LIKE vs comparable LIKE/OR combination.

For example:

SELECT  ...FROM zoo 
WHERE method LIKE '%monkey%' OR 
      method LIKE '%hippo%' OR 
      method LIKE '%tiger%' OR 
      method LIKE '%elephant%'

can be optimized by replacing the four LIKE clauses with a single REGEXP_LIKE clause:

SELECT  ...FROM zoo 
WHERE REGEXP_LIKE(method, 'monkey|hippo|tiger|elephant')

 

6. Specifying large tables first in join clause

When joining tables, specify the largest table first in the join. The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. If the right-hand side table is “small” then it can be replicated to all the join workers which will save CPU and network costs.  This type of join will be most efficient when the right-hand side table is small enough to fit within one node. 

If you receive an ‘Exceeded max memory’ error, then the right-hand side table is too large. Presto does not perform automatic join-reordering, so make sure your largest table is the first table in your sequence of joins. 

This was an interesting performance tip for me. As we know, SQL is a declarative language and the ordering of tables used in joins in MySQL, for example,  is *NOT* particularly important. The MySQL optimizer will re-order to choose the most efficient path. With Presto, the join order matters. You’ve been WARNED! Presto does not perform automatic join-reordering unless using the Cost Based Optimizer!

 

7. Turning on the distributed hash join

If you’re battling with memory errors then try a distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join works even if the right-hand side table is large, but the performance might be slower because the join increases the number of network data transfers. 

At Eventbrite we have the distributed_join variable set to ‘true’. (SHOW SESSION). Also it can be enabled by setting a session property (set session distributed_join = ‘true’).

 

8. Partition your data

Partitioning divides your table into parts and keeps the related data together based on column values such as date or country.  You define partitions at table creation, and they help reduce the amount of data scanned per query, thereby improving performance. 

Here are some hints on partitioning:

  • Columns that are used as WHERE filters are good candidates for partitioning.
  • Partitioning has a cost. As the number of partitions in your table increases, the higher the overhead of retrieving and processing the partition metadata, and the smaller your files. Use caution when partitioning and make sure you don’t partition too finely. 
  • If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.

A key partition column at Eventbrite is transaction date (txn_date).

CREATE TABLE IF NOT EXISTS fact_ticket_purchase
(
    ticket_id STRING,
....
    create_date STRING,
    update_date STRING
)
PARTITIONED BY (trx_date STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')

 

9. Optimize columnar data store generation

Apache Parquet and Apache ORC are popular columnar data stores. They provide features that store data efficiently by using column-wise compression based on data type, special encoding, and predicate pushdown. At Eventbrite, we define Hive tables as PARQUET using compression equal to SNAPPY….

CREATE TABLE IF NOT EXISTS dim_event
(
    dim_event_id STRING,
....
    create_date STRING,
    update_date STRING,

)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')

Apache Parquet is an open-source, column-oriented data storage format. Snappy is designed for speed and will not overload your CPU cores. The downside of course is that it does not compress as well as gzip or bzip2.

 

10. Presto’s Cost-Based Optimizer/Join Reordering 

We’re not currently using Presto’s Cost-Based Optimizer (CBO)! Eventbrite data engineering released Presto 330 in March 2020, but we haven’t tested CBO yet.

CBO inherently requires the table stats be up-to-date which we only calculate for a small subset of tables! Using the CBO, Presto will be able to intelligently decide the best sequence based on the statistics stored in the Hive Metastore.

As mentioned above, the order in which joins are executed in a query can have a big impact on performance. If we collect table statistics then the CBO can automatically pick the join order with the lowest computed costs. This is governed by the join_reordering_strategy (=AUTOMATIC) session property and I’m really excited to see this feature in action.

Another interesting join optimization is dynamic filtering. It relies on the stats estimates of the CBO to correctly convert the join distribution type to “broadcast” join. By using dynamic filtering via run-time predicate pushdown, we can squeeze out more performance gains for highly-selective inner-joins.  We look forward to using this feature in the near future!

 

11. Using WITH Clause

The WITH clause is used to define an inline view within a single query.  It allows for flattening nested subqueries. I find it hugely helpful for simplifying SQL, and making it more readable and easier to support.

 

12. Use Presto Web Interface

Presto provides a web interface for monitoring queries (https://prestodb.io/docs/current/admin/web-interface.html). 

The main page has a list of queries along with information like unique query ID, query text, query state, percentage completed, username and source from which this query originated. If Presto cluster is having any performance-related issues, this web interface is a good place to go to identify and capture slow running SQL!

 

13. Explain plan with Presto/Hive (Sample)

EXPLAIN is an invaluable tool for showing the logical or distributed execution plan of a statement and to validate the SQL statements. 

Logical Plan with Presto

explain select SUBSTRING(last_modified,1,4) ,count(*)  from hive.df_machine_learning.event_text where lower(name) like ‘%wilbraham%’ or (REGEXP_LIKE(lower(name), ‘.*wilbraham.*’)) group by 1 order by 1;

 

14. Explain plan with MySQL (Sample)

In this particular case you can see that the primary key is used on the ‘ejp_events’ table and the non-primary key on the “ejp_orders’ table. This query is going to be fast!

 

Conclusion

Presto is the “SQL-on-Anything” solution that powers Eventbrite’s data warehouse. It’s been very rewarding for me as the “Old School DBA” to learn new SQL tricks related to a distributed query engine such as Presto. In most cases, my SQL training on MySQL/Oracle has served me well but there are some interesting differences which I’ve attempted to call-out above. Thanks for reading and making it to the end. I appreciate it!

We look forward to giving Presto’s Cost-Based Optimizer a test drive and kicking the tires on new features such as dynamic filtering & partition pruning!

All comments are welcome, or you can message me at ed@eventbrite.com. You can learn more about Eventbrite’s use of Presto by checking out my previous post at Boosting Big Data workloads with Presto Auto Scaling.

Special thanks to Eventbrite’s Data Foundry team (Jeremy Bakker,  Alex Meyer, Jasper Groot, Rainu Ittycheriah, Gray Pickney, and Beck Cronin-Dixon) for the world-class Presto support, and Steven Fast for reviewing this blog post. Eventbrite’s Data Teams rock!

Create Meaningful (and Fun!) Remote Community

Create meaningful and fun remote community

What makes community meaningful? How can you create a meaningful culture and community in a time where we are all separated while working remote?

I want today to talk about community, and how important it is to be a proactive agent in fostering a meaningful (and fun!) remote work culture.  Just a smidge about me: before I was a software engineer at Eventbrite, I was on the ground figuring out how to create and grow community in various capacities – as a resident advisor in college, an event planner at Stanford, a coding teacher in various classrooms, and as a participant in various worldwide dance and performing communities.  Community is my thing.

I have seen a lot of articles floating around with recommendations on how to create a remote work culture. I want to point out that I’m very specifically using the phrase “remote community” as opposed to “remote culture” because to me, community is something present and tangible and interconnected, whereas culture can be this very nebulous thing.  Culture arises out of community – community is the foundation.

I am also writing this article from the perspective of a mostly remote work community, but these ideas can be taken and used in any type of community.  This article is not engineering specific but can easily used to benefit any engineering community.

What makes community meaningful?

Community, to me, is about caring for people.  A lot of times, community starts by happenstance – you and others happen to share an interest or work at the same job.  But, in my opinion, ‘happenstance’ community graduates to being a real community when people work together to create a place that everyone cares about.  That to me is where the meaning is derived – from people caring enough to want to make their community a place where people feel supported and feel joy.

How you can create meaningful community

I find that creating meaningful community is as simple as stepping up. Being proactive goes a long, long way – either on an individual level, like reaching out to someone you know who is having a hard time. Or being proactive with a broader group in mind, like creating activities that bring people together, etc.

I want to touch mostly on the latter – how to engage a broader group, with either active engagement or passive engagement.  One more thing to consider – how to make things fun! For me, it’s pretty easy – I think about what I enjoy and what I want to see in a community, and then I make it happen.  You can too!

Active Engagement

Active engagement involves events or activities that require people to be present and available at the time that it is happening.  Here are some ideas:

  1. Remote happy hour.  Schedule one once a week with your department or with cross sections of the company you normally don’t interact with.  Use this as an opportunity to talk to new people that you would normally don’t cross paths with.Pro tip – if you think more than 10 people are going to show up, figure out some way to create ‘rooms’ – either separate Hangouts that people can jump in and out of, or utilize Zoom’s room feature.
  2. Remote team lunch.  Suggestion – why not every day or twice a week?
  3. 2pm 2 minute planks.  Get stronger every day!  We do this via video chat as well to cheer each other on.

Passive Engagement

Passive engagement involves activities that people can circle back around to when they have time.  Ideas include:

  1. Daily Photo Share.  Post a theme for the photo of that day (Baby photo of you! Throwback photo of your parents! Photo of you doing an activity you love!) and see what you end up learning.  I have added a list of ideas that I compiled with my co-workers at the bottom of this article.
  2. Dear Diary chaining.  Remember when email chaining was a thing?  Well, my coworker came up with this idea – write a ‘Dear Diary’ entry about your day, post with a picture, and then tag a new coworker for the next day! It’s been interesting to get a more in-depth look into people’s mentality and how they are trying to stay optimistic.

Out of all the ideas above, so far the Daily Photo Share has seemed to have the most impact and been the most uplifting – I tend to post it around 12pm, and then spend the next hour scrolling and commenting on everyone’s amazing pictures of themselves, their families, and the things they love.  It’s an amazing way to see everyone share what they care about.

What’s your time commitment?

It’s up to you to decide your community time commitment.  It doesn’t have to be everything that’s listed above. I recommend the thought model of 2 minutes, 2 hours, or 2 months – what can you contribute? Can you contribute 2 minutes of your time, or 2 hours, or maybe even 2 months?  The 2pm 2 minute plank is a great example of a short amount of time, or even commenting on someone else’s photo of the daily photo share. 2 hours could be something like deciding to throw and promote a remote happy hour.

I think that now, more than ever, it is really important to take time out of our lives to reach out, to connect, to engage – to build community, and to receive from our communities.

I want to hear from you.  What ideas do you have for active or passive engagement in your community?  What makes community meaningful to you, and how can we surface that more in remote communities?

(Some) Daily Photo Share Theme Ideas:
  • Throwback photo of your parent(s)
  • Hobby/activity you love doing
  • Last meal you took a photo of
  • Your family (immediate or otherwise)
  • Your workspace at home
  • Silly photo of you
  • An item you cherish
  • Vacation you recently took
  • Photo you are proud of (either proud that you took, or of something you are proud of)
  • Photo with a story (and tell the story!)
  • Something that most people don’t know about you
  • Favorite costume photo of you (Halloween or otherwise)
  • Photo of you where you like your vibe
  • Share a photo of a coworker that you appreciate, and why you appreciate them
  • Photo of a new hobby or life hack have you acquired in the last few weeks
  • Something/someone you miss (and give back story!)
  • This could have been an ad for…
  • Everybody should try this once
  • “I didn’t want to be in this photo“

Leveraging AWS “spot” instances to drive down costs

I had the pleasure of participating in a Big Data Cost Optimization event in San Francisco (October 17) at the AWS Loft location. I was a panelist in a round-table discussion focusing on the use of AWS spot instances. It was a terrific opportunity to share how Eventbrite’s Data Engineering team is smartly using spot instances for Big Data & EMR to drive down costs. 

At Eventbrite, we’re using spot instances to leverage AWS auto scaling and I’ve published a blog entry on this topic (Big Data workloads with Presto Auto Scaling). Two key focus areas for the data engineering team are ephemeral computing (something that lasts for a very short time)  and idempotence (repeatable behaviour with the same outcome).  Making a commitment in these areas has allowed us to leverage spot instances to effectively manage costs.

Spot instances are a cost-effective choice if you can be flexible about when your applications run and if your applications can be interrupted. They are well-suited for data analysis, batch jobs, background processing, and non mission-critical tasks. Below are some of the panel questions and my answers, and I’d love to share what I learned based on the conversation.
Continue reading “Leveraging AWS “spot” instances to drive down costs”

Discover “Pro D3.js”, a new book to improve your JavaScript data visualizations

D3.js is the de facto standard JavaScript library for building interactive data visualizations on the Web. When beginning a new D3.js project, developers usually start by taking one of the community’s many examples and using it to jumpstart new work. This is a convenient way to get something working fast so that they can iterate over it. However, by working this way, you soon struggle to maintain, extend, or modify your charts. D3.js examples are made to demonstrate implementations and techniques, not to ship them to production.

Discover “Pro D3.js”, the new book that walks you through the creation of maintainable, modular, and testable charts. You will also learn how to package and document your visualizations, producing open-source libraries. Read on to learn more about the book and its contents. Continue reading “Discover “Pro D3.js”, a new book to improve your JavaScript data visualizations”

A Story of a React Re-Rendering Bug

As front-end developers, we often find ourselves getting into perplexing bugs when the page we build involves a lot of user interactions. When we find a bug, no matter how tricky it is, it means something is wrong in the code. There is no magic, and the code itself does not lie.

This blog will take you on a short journey about how I fixed a particularly annoying bug that existed in one of our products. Continue reading “A Story of a React Re-Rendering Bug”

How boba trips and event onsites made me a better engineer (a Briteling intern’s reflection on community at Eventbrite)

Written by 2019 summer Organizer App intern Vivian Phung

Hello World! I’m Vivian Phung, a Computer Science and Mathematics double major at Bryn Mawr College, and for the last few months, an iOS Software Engineering intern on Eventbrite’s Organizer App Team. Some of the best parts of my internship were the communities I had the opportunity to engage withboth at the office and within the broader Eventbrite community. In this blog post, I want to share how engaging with these communities helped me conquer imposter syndrome and get the most out of my summer internship.  Continue reading “How boba trips and event onsites made me a better engineer (a Briteling intern’s reflection on community at Eventbrite)”

How to fix the ugly focus ring and not break accessibility in React

header image

Creating beautiful, aesthetic designs while maintaining accessibility has always been a challenge in the frontend. One particular barrier is the dreaded “:focus” ring. It looks like this:

focus outline on a button

After clicking any button, the default styling in the browser displays a focus outline. This ugly outline easily mars a perfectly crafted interface.

A quick Stack Overflow search reveals an easy fix: just use a bit of CSS, outline: none; on the affected element. It turns out that many websites use this trick to make their sites beautiful and avoid the focus outline. Continue reading “How to fix the ugly focus ring and not break accessibility in React”

How to be a Successful Junior Engineer

I came to software engineering through a nontraditional background, and as a result, I entered the engineering world trepidatiously with a great fear of failing. As I gained experience in my first few engineering roles, I realized that being successful had to do with providing myself the right tools.

In this blog post, I outline some of the things that helped me get to where I am today: an accomplished software engineer that feels confident about the value I provide to my team and my company. Continue reading “How to be a Successful Junior Engineer”

The Truth about Boundaries, Curiosity, and Requests (Part 2 of 2)

In Part 1 of this series, we discussed boundaries in depth. As a refresher, a boundary is the line of demarcation between one person’s consent and another’s agency. This article will be covering curiosity and requests. The two things, used together, help us manage boundaries and navigate through other’s boundaries.

Practicing curiosity

I’m going to borrow from the Cognitive Behavioral Therapy model and look first at what is happening in our minds that drives actions. In this approach to therapy, there is a model called the Thought-Emotion-Action (TEA) Triangle. The model starts with some event that initiates a thought, the thought leads to an emotion, and that emotion leads to action. Breaking it down, here’s a possible TEA in action for the example we used in Part 1 of a hugger initiating a hug with someone who doesn’t like them: Continue reading “The Truth about Boundaries, Curiosity, and Requests (Part 2 of 2)”

The Truth about Boundaries, Curiosity, and Requests (Part 1 of 2)

Learning how to recognize and manage your boundaries and respect others’ boundaries is key to growing emotional intelligence and generally being a better human. Most importantly, managing boundaries is essential to healthy conflict, reduced stress, and creating a psychologically safe environment for yourself.

In the first of this two-part blog, we’ll dig into understanding boundaries. In the coming weeks, stay tuned for a blog on the skill of practicing curiosity and making requests which can help you manage the sometimes tumultuous landscape of your inner dialogue and maybe the panorama of someone else’s boundaries.

Would you like FRIES with that?

We spend just over 13 years at work in our lifetimes; that’s 676 weeks and over 27 thousand hours. Relationships at the office are arguably just as critical as our familial or social connections. With the abundance of social media, communication technologies, and increased connectedness the edges of our boundaries are blurring more and more every day. Continue reading “The Truth about Boundaries, Curiosity, and Requests (Part 1 of 2)”