MySQL High Availability at Eventbrite

Situation

Eventbrite has been using MySQL since its inception as a company in 2006. MySQL has served the company well as an OLTP database. We’ve leveraged its strong features, such as native replication and fast reads, and dealt with some of its pain points, such as table ALTERs that block writes. The production environment relies heavily on MySQL replication: a single primary instance for writes and a number of secondary replicas to distribute the reads.

Fast forward to 2019. Eventbrite is still powered by MySQL, but the version in production (MySQL 5.1) is woefully old and unsupported. Our MySQL production environment still leans heavily on native MySQL replication: a single primary for writes, numerous secondary replicas for reads, and an active-passive setup for failover in case the primary has issues. Our ability to fail over is complicated and risky, which has resulted in extended outages as we’ve fixed the root cause of an outage on the existing primary rather than failing over to a new primary.

If the primary database is not available, then our creators are not creating events and our consumers are not buying tickets for those events. The failover from active to passive primary is available as a last resort, but it requires us to rebuild a number of downstream replicas. Early in 2019, we had several issues with the primary MySQL 5.1 database, and due to our reluctance to fail over we incurred extended outages while we fixed the source of the problems.

The Database Reliability Engineering (DBRE) team in 2019 was tasked first and foremost with upgrading to MySQL 5.7, as well as implementing high availability and a number of other improvements to our production MySQL datastores. The goal was to implement an automatic failover strategy on MySQL 5.7 where an outage to our primary production MySQL environment would be measured in seconds rather than minutes or even hours. Below is a series of solutions and improvements we’ve implemented since mid-2019 that have made a huge positive impact on our MySQL production environment.

Solutions

MySQL 5.7 upgrade

Our first major hurdle was to get current with our version of MySQL. In July 2019 we completed the MySQL 5.1 to MySQL 5.7 (v5.7.19-17-log Percona Server, to be precise) upgrade across all MySQL instances. Due to the nature of the upgrade and the large gap between 5.1 and 5.7, we incurred downtime to make it happen. The maintenance window lasted ~30 minutes, and it went like clockwork. The DBRE team completed ~15 failover practice runs against Stage in the days leading up to the cutover, which is one of the reasons it was so smooth. The cutover required 50+ engineers, product managers, QA, and managers in a hangout to support it, with another 50+ engineers assuming on-call responsibilities through the weekend. It was not just a DBRE team effort but a full Engineering team effort!

Not only had MySQL 5.1 reached end of life more than 5 years earlier, but our MySQL 5.1 instances on EC2/AWS had limited storage and we were scheduled to run out of space at the end of July. Our backs were up against the wall and we had to deliver!

As part of the cutover to MySQL 5.7, we also took the opportunity to bake in a number of improvements. We converted all primary key columns from INT to BIGINT to prevent hitting the maximum value. We had a recent production incident related to hitting the max value on an INT auto-increment primary key column. When this happens in production, it’s an ugly situation where all new inserts result in a primary key constraint error. If you’ve experienced this pain yourself then you know what I’m talking about. If not, take my word for it. It’s painful!
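Here’s a minimal sketch of what each conversion looked like (the table and column names are hypothetical):

ALTER TABLE orders
    MODIFY COLUMN id BIGINT NOT NULL AUTO_INCREMENT;

A signed INT tops out at 2,147,483,647, while a signed BIGINT goes to roughly 9.2 quintillion, which takes auto-increment exhaustion off the table for the foreseeable future.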

In parallel with the MySQL 5.7 upgrade, we also upgraded Django to 1.6 due to a behavioral change in MySQL 5.7 related to how transactions/commits are handled for SELECT statements. This behavior change was resulting in errors with older versions of Python/Django running on MySQL 5.7.

Improved MySQL ALTERs

In December 2019, the Eventbrite DBRE team successfully implemented a table ALTER via gh-ost on one of our larger MySQL tables. The ALTER ran for 50 hours and completed with no application impact. So what’s the big deal?

The big deal is that we could now ALTER tables in our production environment with little to no impact on our application, and this included some of our larger tables that were ~500GB in size.

Here is a little background. The ALTER TABLE statement in MySQL is very expensive. There is a global write lock on the table for the duration of the ALTER statement, which leads to a concurrency nightmare. The duration of an ALTER is directly related to the size of the table, so the larger the table, the larger the impact. For OLTP environments, where lock waits need to be as minimal as possible for transactions, the native MySQL ALTER command is not a viable option. As a result, online schema-change tools have been developed that emulate the MySQL ALTER TABLE functionality using creative ways to circumvent the locking.

Eventbrite had traditionally used pt-online-schema-change (pt-osc) to ALTER MySQL tables in production. pt-osc uses MySQL triggers to move data from the original to the “duplicate” table, which is a very expensive operation and can cause replication lag. In fact, it directly resulted in several outages in H1 of 2019 due to replication lag or breakage.

GitHub introduced a new online schema migration tool for MySQL (gh-ost) that uses a binary log stream to capture table changes and asynchronously applies them to a “duplicate” table. gh-ost provides control over the migration process and allows for features such as pausing, suspending, and throttling the migration. In addition, it offers many operational perks that make it safer and more trustworthy to use. It is:

  • Triggerless
  • Pausable
  • Lightweight
  • Controllable
  • Testable

Orchestrator

Next on the list was implementing improvements to MySQL high availability and automatic failover using Orchestrator. In February of 2020 we implemented a new HAProxy layer in front of all DB clusters and we released Orchestrator to production!

Orchestrator is a MySQL high availability and replication management tool. It will detect a failure, promote a new primary, and then reassign the name/VIP. Here are some of the nice features of Orchestrator:

  • Discovery – Orchestrator actively crawls through your topologies and maps them. It reads basic MySQL info such as replication status and configuration. 
  • Refactoring – Orchestrator understands replication rules. It knows about binlog file:position and GTID. Moving replicas around is safe: Orchestrator will reject an illegal refactoring attempt.
  • Recovery – Based on information gained from the topology itself, Orchestrator recognizes a variety of failure scenarios. The recovery process utilizes Orchestrator’s understanding of the topology and its ability to perform refactoring.

Orchestrator can successfully detect the primary failure and promote a new primary. The goal was to implement Orchestrator with HAProxy first and then eventually move to Orchestrator with ProxySQL.

Manual failover tests

In March of 2020, the DBRE team completed several manual/controlled failovers using Orchestrator and HAProxy. Eventbrite experienced some AWS hardware issues on the MySQL primary, and completing these manual failovers was the first big test. Orchestrator passed with flying colors.

Automatic failover

In May of 2020 we enabled automatic failover for our production MySQL data stores. This was a big step forward in addressing the single-point-of-failure situation with our primary MySQL instance. The DBRE team also completed several rounds of testing in QA/Stage for ProxySQL in preparation for the move from HAProxy to ProxySQL.

Show time

In July 2020, Eventbrite experienced hardware failure on the primary MySQL instance that resulted in automatic failover. The new and improved automatic failover process via Orchestrator kicked in, and we failed over to the new MySQL primary in ~20 seconds. The impact to the business was minimal!

ProxySQL

In August of 2020 we made the jump to ProxySQL across our production MySQL environments. ProxySQL is a proxy specially designed for MySQL. It allows the Eventbrite DBRE team to control database traffic and the SQL queries that are issued against the databases. Some nice features include:

  • Query caching
  • Query Re-routing – to separate reads from writes
  • Connection pooling and automatic retry of queries

Also during this time period, we began our AWS Aurora evaluation as part of our “Managed Databases” journey. Personally, I prefer the term “Database-as-a-Service (DBaaS)”. Amazon Aurora is a high-availability database that consists of compute nodes replicated across multiple availability zones to gain increased read scalability and failover protection. It’s compatible with MySQL, which is a big reason why we picked it.

Schema Migration Tool

In September of 2020, we began testing a new self-service database migration tool that provides interactive schema changes (it invokes gh-ost behind the scenes). It supports all “ALTER TABLE…”, “CREATE TABLE…”, and “DROP TABLE…” DDL statements.

It includes a UI where you can see the status of migrations and run them with the click of a button.

Any developer can file and run migrations, and a DBRE is only required to approve the DDL (this is all configurable, though). The original source for the tool can be found in the open-source shift project. We’ve not pushed it to production yet, but we do have a running version in our non-prod environment.

Database-as-a-Service (DBaaS) and next steps

We’re barreling down the “Database-as-a-Service (DBaaS)” path with Amazon Aurora. 

DBaaS flips the DBRE role a bit by eliminating mundane operational tasks and allowing the DBRE team to align our work more closely with the business to derive value from the data assets that we manage. We can then focus on lending our DBA skills to application teams and end users—helping deliver new features, functionality, and proactive tuning value to the core business. 

Stay tuned for updates via future blog posts on our migration to AWS Aurora! We’re very interested in the scaled read operations and increased availability that an Aurora DB cluster provides. Buckle your seatbelt and get ready for a wild ride 🙂 I’ll provide an update on “Database-as-a-Service (DBaaS)” at Eventbrite in my next blog post!

All comments are welcome! You can message me at ed@eventbrite.com. Special thanks to Steven Fast for co-authoring this blog post.

Building a Protest Map: A Behind the Scenes Look!

Sometimes, a project that you’re most proud of or passionate about isn’t one that involves architecture diagrams or deep technical discussions. Sometimes, the most significant project is one that allows peers to come together to build something that can have a positive impact. When we can all collaborate for a cause we’re passionate about, and get quick buy-in from our colleagues and executives, we can create a feature we’re proud of. 

Such is the case with Eventbrite’s Protests map. If you’ve visited the Eventbrite homepage recently, you’ve probably seen our map highlighting protests and marches to support racial equality, hosted on Eventbrite.

The map features events across the U.S., and the data source for the events shown is a Curated Collection.

The birth of a map

On May 25, 2020, news of George Floyd’s murder in Minneapolis spread rapidly across the country, igniting protests and rallies seeking justice. On June 9, a small team consisting of a product manager, engineer, designer, and content editor came together to build a map for these events on Eventbrite.com.

The plan was to create a map on Eventbrite that led users to protests across the United States. This map would aggregate Black Lives Matter protests and allow our users to interact with them based on their location. While some projects exemplify the technical ingenuity of a team, this one relied on quick thinking and action, teamwork, and a driving force of passion. There are always a number of technical solutions to test out, but sometimes it’s simpler to leverage people.

Company buy-in and teamwork

The fact that this map lives on our homepage means that everyone, including event organizers and consumers, would see this high-touch feature. That meant we needed quick buy-in from the whole company, especially impacted department leads, to launch this feature while protests were still occurring nationwide. This project was greenlit in a matter of days.

The trust involved in getting this map out the door was important to our success. Legal, PR, and marketing all came together to bring this project to life; it was a true cross-functional effort.

Collections and discovery

When we were in the brainstorming phase, we faced the question of how to populate the map with events that were specifically protests in support of Black Lives Matter. Our first instinct was to construct a query to identify events that would fit into the theme.

For a simple example, we could return events where the organizer had selected the “Rally” format when they created the event, then further filter those results to events where the organizer had entered a description containing the text “BLM” or “George Floyd.”
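In SQL terms, a rough sketch of that naive filter might have looked like this (the table and column names are hypothetical, and our search actually runs on Elasticsearch):

SELECT event_id, name
FROM events
WHERE format = 'Rally'
  AND (description LIKE '%BLM%'
       OR description LIKE '%George Floyd%');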

We quickly realized this wasn’t going to produce good results. Variations on this approach would either be over-specific or too broad, including far too many irrelevant events or excluding relevant ones. Plus, it would have been vulnerable to trolling and would have required us to constantly adjust the query as people learned how to configure their events to show up on the map.

It was probably naive to have even tried that approach, but the obvious alternative had its downsides too: manual, human-driven curation through the use of an existing feature on our platform called “Collections.”

Collections allow Eventbrite users to create collections of events. Any user on Eventbrite can create a collection simply by clicking the “heart” icon on an event card.

Our marketing team uses this feature to highlight particularly interesting events that adhere to certain themes: online yoga classes, drive-in concerts, etc. Sometimes you’ll see it in our product under the name of “Editor’s Picks.”

But as mentioned above, this approach has downsides. It takes significant manual work to identify all the events that we want to be part of the collection. And event inventory has a unique challenge: events end. At some point in time, a planned event will take place and the content is no longer relevant to show consumers. The Collections feature depends on the same Elasticsearch index we use for search, and due to various constraints we actively remove events from the index after they end. 

Since we remove ended events, we run the risk of showing a sparse or empty map if we don’t maintain the collection. This meant that someone would have to constantly keep an eye on this collection and keep it updated with new events as people create them on our platform.

Luckily, we had help.

The initial list of protests came together through help from the rest of the company, specifically our company’s global diversity group, collaborating on Slack and Google Sheets.

After we launched, Eventbrite co-founder and CEO Julia Hartz was so inspired by the work that she offered extra funding for the marketing team to continue identifying relevant events for the map. That’s some of the most positive internal feedback you can get!

The nuts-and-bolts

While this was a straightforward project from a technical implementation standpoint, for the most part leveraging an existing feature rather than building something entirely new, it’s worth highlighting the work we’ve done in Eventbrite Engineering over the past year to invest in our architecture and infrastructure. Like many companies, Eventbrite is undergoing the difficult process of moving away from a monolithic architecture to a microservices architecture (ours has a domain-driven spin). We call the monolith “Core.”

Although we’re nowhere close to finished, we’re already starting to reap the benefits of the hard work invested in moving features out of Core, piece by piece. A little over a year ago, it wouldn’t have been feasible to ship a project of this nature so quickly because deploying any part of the site, no matter how small, required deploying all of Core. Today, we can ship changes to the homepage within a matter of minutes. 

Just as importantly, we have reliable alerting to notify us of problems that arise after shipping a change. As a matter of fact, shortly after we launched this project we were alerted of an unusual spike in errors on the homepage. Within 5 minutes of receiving this notification, we rolled back the release containing the map. After another 10 minutes, we re-released the map with the fix.

Bringing people together

Our company’s mission is simple: Bring the world together through live experiences. Protests themselves bring together like-minded individuals to fight towards a common goal. Working on a feature that exemplifies our mission as a company was a truly inspiring opportunity. This lit a fire in our team that translated to the company as a whole.

Authors: Eloise Porter, Jaylum Chen, Alex Rice, and Allen Jilo

How are you building/maintaining team cohesion?

Let’s face it: 2020 has been a year of huge change! One of the biggest changes is how we work. Like many other companies, Eventbrite has transitioned to a fully remote work environment with engineers and managers located across the globe. We have development centers in San Francisco (USA), Nashville (USA), Mendoza (Argentina), and Madrid (Spain), with teams spread across all four locations. The challenge for managers and individual contributors alike is to build and maintain cohesion on remote teams!

This is surely one of the most common topics in management these days. How do you keep cohesion when your team has transitioned to fully remote work? A group of Eventbrite managers decided to try a few strategies and document the process.

At Eventbrite, our mission is to bring the world together. In a pre-pandemic world that meant live events; these days it’s more about connecting, somehow, with people.

Building a team is more art than science; nurturing the culture, the bonds, and, most importantly, the trust between members takes time and dedication. Doing all of this is difficult even when everyone can be in the same place, make jokes, laugh, and work side by side. In this new COVID-19 world, we lost many of these tools. We decided it was time for some new tricks!

Daily standup meetings

Right from the beginning of the COVID-19 pandemic in March, one of our Engineering Managers (Nacho) suggested that we try to sync every morning. Before COVID-19, we didn’t do this regularly. We gave it a try and it worked! Since then we talk every day, we are very active on Slack, and it’s not always about work. It’s helped us both professionally and emotionally as we’ve shared problems, frustrations, … We make the daily standups a top priority, as these virtual meetings are a perfect chance to build rapport with our teammates.

It was super hard to get everyone on board, but eventually we were all there, always, every morning. It’s great to know that I have my team when I need them, available and willing to help.

Weekly Board Games

Starting in April, we set up a board game meeting for 4pm every Tuesday. We play Settlers of Catan on colonist.io while chatting over video conference. It has become a weekly ritual for one of our teams in Argentina, with between 3 and 5 people playing. It lets us spend some time together, have fun and socialize. 

  • Catan:
https://colonist.io/
  • An online version of Basta! (also known as Tutti frutti, Lápiz quieto, ¡Mercadito!, or Dulce de Membrillo), the game we all know. Up to 8 players.
https://bastaonline.net/

Coffee breaks

There’s always time for a coffee break to talk about anything, whether work or non-work related. We created a Slack shortcut (!cafebreak) to easily share the meeting details so anyone can quickly jump in. These short, spontaneous recesses provide distraction and let us relax from the daily hustle.

Keeping the Culture alive

Eventbrite’s company culture makes it a unique place to work! The challenge is to continue to foster this culture while functioning as a fully remote team. It might be easy for a team that is accustomed to spending 9 hours a day, 5 days a week together to forget what makes them unique when everyone is scattered, without interaction besides Slack or Hangout calls. Knowing what makes your team unique is a key thing to understand.

On our team, it is our love for problem solving. So every Friday afternoon, we stopped work early to spend a couple of hours on HackerRank, solving different exercises and sharing our solutions. It was an incredible way of having fun and learning from each other by seeing how each of us approached the solutions. It helped us remember who we were as a team.

Engineering Development Academy

Eventbrite implemented the EDA (Engineering Development Academy) in Argentina. This initiative is a new way to find and incorporate engineering talent into our company. The EDA group of professionals is trained in technical subjects (Python, Django, testing, CI, CD, JavaScript, React, CSS) and non-technical subjects (agile methodologies, Scrum, English) in a 3-month program.

Every day at 9:00 AM, we start with a breakfast chat about things that do and do not work: movies, news, etc. For the first talks, we deliberately looked for topics to share because the whole team was new, which got everyone speaking. We also do 30-minute afternoon checkpoints at 2:00 PM for both technical and non-technical questions, to help make the training as positive as possible.

Lightning Talks

Emilio (Principal Engineer) organizes Lightning Talks on a weekly basis. These are open talks about anything that is important to an engineer and that they are interested in sharing with others, for example: what have we learned during COVID-19?

Spontaneous 1:1’s

There are often times when discussions are better handled in 1:1 or group meetings rather than in Slack or via a Google Doc or email. We encourage our team members to create a Google Hangout or Zoom session at any time if it helps resolve issues more quickly.

Quick text messages (via Slack) are great for clarifying simple matters. Often a more detailed discussion is required, and this is when a Google Hangout is preferred. It’s important to recognize the distinction, as too many calls can burn engineers out, but you can also waste lots of time exchanging messages when a five-minute call could answer multiple questions.

Virtual Meeting tips

It can be challenging to connect with teammates during virtual/online meetings. Simple things such as keeping your camera on and using verbal and non-verbal cues (such as head nods or thumbs-up) are great ways to make a connection during virtual meetings. Also, go out of your way to recognize teammates who have gone above and beyond. While working remotely, it is often easy to miss the great work that others are doing.

Many virtual meetings start 5 minutes past the scheduled time. Use this time to show personal interest in the other participants. Ask them about their weekend. Break the ice while you’re waiting for all participants to join. It’s a great way to learn more about co-workers who you may not know very well.

 

This article’s co-authors are Henry Lyne, Gabriel Flores, Ed Presz, Emiliano André and Juan Pablo Marsano. Reviewed by Rainu Ittycheriah.

Teaching new Presto performance tricks to the Old-School DBA

I’ve spent much of my career working with relational databases such as Oracle and MySQL, and SQL performance has always been an area of focus for me. I’ve spent countless hours reviewing EXPLAIN plans, rewriting subqueries, adding new indexes, and chasing down table-scans. I’ve been trained to make performance improvements such as: only choosing the columns in a SELECT that are absolutely necessary, staying away from LIKE clauses, reviewing the cardinality of columns before adding indexes, and always JOINing on indexed columns.

It’s been an instinctual part of my life as a Database Administrator supporting OLTP databases that have sold in excess of 20K tickets per minute to your favorite events. I remember a specific situation where a missing index caused our production databases to get flooded with table-scans that brought a worldwide on-sale to an immediate halt. I had a lot of explaining to do that day, as the missing index made it to QA and Stage but not Production!

In recent years, I’ve transitioned to Data Engineering and begun supporting Big Data environments. Specifically, I’m supporting Eventbrite’s data warehouse, which leverages Presto and Apache Hive using the Presto/Hive connector. The data files can be of different formats and are stored in HDFS and S3. The Hive metadata describes how data stored in HDFS/S3 maps to schemas, tables, and columns to be queried via SQL. We persist this metadata in Amazon Aurora and access it through the Presto/Hive connector via the Hive Metastore Service (HMS).

The stakes have changed and so have the skill-sets required. I’ve needed to retrain myself in how to write optimal SQL for Presto. Some of the best practices for Presto are the same as for relational databases, and others are brand new to me. This blog post summarizes some of the similarities and some of the differences in writing efficient SQL on MySQL vs Presto/Hive. Along the way I’ve had to learn new terms such as “federated queries”, “broadcast joins”, “reshuffling”, “join reordering”, and “predicate pushdown”.

Let’s start with the basics:

What is MySQL? The world’s most popular open source database. The MySQL software delivers a fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL is intended for mission-critical, heavy-load production database usage.

What is Presto? Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto doesn’t use the MapReduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel relational databases.

Presto uses ANSI SQL syntax/semantics to build its queries. The advantage of this is that analysts with experience with relational databases will find it very easy and straightforward to write Presto queries! That said, the best practices for developing efficient SQL via Presto/Hive are different from those used to query standard RDBMS databases.

Let’s transition to Presto performance tuning tips and how they compare to standard best practices with MySQL.

 

1. Only specify the columns you need

Restricting columns for SELECTs can improve your query performance significantly. Specify only the columns you need instead of using a wildcard (*). This applies to Presto as well as MySQL!
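For example (the tickets table here is hypothetical):

-- Scans every column:
SELECT * FROM tickets;

-- Scans only what we need; with a columnar format such as Parquet,
-- the unreferenced columns are never read off disk:
SELECT ticket_id, event_id, create_date FROM tickets;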

 

2. Consider the cardinality within GROUP BY

When using GROUP BY, order the columns from the highest cardinality (that is, the greatest number of unique values) to the lowest.

The GROUP BY operator distributes rows based on the order of the columns to the worker nodes, which hold the GROUP BY values in memory. As rows are being ingested, the GROUP BY columns are looked up in memory and the values are compared. If the GROUP BY columns match, the values are then aggregated together.
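A quick sketch (the census table is hypothetical; assume state has many unique values and gender only a few):

-- High-cardinality column first, low-cardinality column last:
SELECT state, gender, COUNT(*)
FROM census
GROUP BY state, gender;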

 

3. Use LIMIT with ORDER BY

The ORDER BY clause returns the results of a query in sorted order. To process the sort, Presto must send all rows of data to a single worker and then sort them. This can be a very memory-intensive operation for large datasets and will put strain on the Presto workers. The end result will be long execution times and/or memory errors.

If you are using the ORDER BY clause to look at the top N values, then use a LIMIT clause to reduce the cost of the sort significantly by pushing the sorting/limiting to individual workers, rather than the sorting being done by a single worker. 

I highly recommend you use the LIMIT clause not just for SQL with ORDER BY but in any situation when you’re validating new SQL. This is good practice for MySQL as well as Presto!
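A minimal sketch (the table and column names are hypothetical):

-- Each worker sorts and trims its own data down to 10 rows,
-- so the final single-worker step only merges a handful of rows:
SELECT event_id, tickets_sold
FROM daily_ticket_counts
ORDER BY tickets_sold DESC
LIMIT 10;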

 

4. Using approximate aggregate functions

When exploring large datasets, an approximation (with a standard deviation of 2.3%) is often more than good enough! Presto has approximate aggregation functions that give you significant performance improvements. Using the approx_distinct(x) function instead of COUNT(DISTINCT x) on large datasets will result in performance gains.

When an exact number is not required (for instance, if you are looking for a rough estimate of the number of New Year’s events in the Greater New York area), consider using approx_distinct(). This function minimizes memory usage by counting unique hashes of values instead of entire strings. The drawback is that there is a small standard deviation.
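A sketch of the swap (the attendee_id column is hypothetical):

-- Exact, but memory-hungry on large datasets:
SELECT COUNT(DISTINCT attendee_id) FROM fact_ticket_purchase;

-- Approximate (standard error of about 2.3% by default), much cheaper:
SELECT approx_distinct(attendee_id) FROM fact_ticket_purchase;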

 

5. Aggregating a series of LIKE clauses in one single regexp_like clause

The LIKE operation is well known to be slow, especially when not anchored to the left (i.e., the search text is surrounded by ‘%’ on both sides) or when used with a series of OR conditions. So it is no surprise that Presto’s query optimizer is unable to improve queries that contain many LIKE clauses.

We’ve found improved LIKE performance on Presto by substituting the LIKE/OR combination with a single REGEXP_LIKE clause, which is Presto native. Not only is it easier to read, but it’s also more performant. Based on some quick performance tests, we saw run-times improve by ~30% with REGEXP_LIKE versus the comparable LIKE/OR combination.

For example:

SELECT ... FROM zoo
WHERE method LIKE '%monkey%' OR
      method LIKE '%hippo%' OR
      method LIKE '%tiger%' OR
      method LIKE '%elephant%'

can be optimized by replacing the four LIKE clauses with a single REGEXP_LIKE clause:

SELECT ... FROM zoo
WHERE REGEXP_LIKE(method, 'monkey|hippo|tiger|elephant')

 

6. Specifying large tables first in join clause

When joining tables, specify the largest table first in the join. The default join algorithm in Presto is the broadcast join, which partitions the left-hand table of a join and sends (broadcasts) a copy of the entire right-hand table to all of the worker nodes that hold the partitions. If the right-hand table is “small,” it can be replicated to all the join workers, which saves CPU and network costs. This type of join is most efficient when the right-hand table is small enough to fit within one node.

If you receive an ‘Exceeded max memory’ error, then the right-hand side table is too large. Presto does not perform automatic join-reordering, so make sure your largest table is the first table in your sequence of joins. 

This was an interesting performance tip for me. As we know, SQL is a declarative language, and the ordering of tables used in joins in MySQL, for example, is *NOT* particularly important. The MySQL optimizer will re-order joins to choose the most efficient path. With Presto, the join order matters. You’ve been WARNED! Presto does not perform automatic join-reordering unless you are using the Cost-Based Optimizer!
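Here’s a sketch using the warehouse tables defined later in this post (the dim_event_id join key on the fact table is an assumption):

-- Large fact table on the left, small dimension table on the right,
-- so the dimension table is what gets broadcast:
SELECT d.dim_event_id, COUNT(*)
FROM fact_ticket_purchase f
JOIN dim_event d ON f.dim_event_id = d.dim_event_id
GROUP BY d.dim_event_id;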

 

7. Turning on the distributed hash join

If you’re battling memory errors, then try the distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys, so the distributed join works even if the right-hand table is large. The trade-off is that performance might be slower, because the join increases the number of network data transfers.

At Eventbrite we have the distributed_join session property set to true (you can verify this with SHOW SESSION). It can also be enabled per session with SET SESSION distributed_join = true.

 

8. Partition your data

Partitioning divides your table into parts and keeps the related data together based on column values such as date or country.  You define partitions at table creation, and they help reduce the amount of data scanned per query, thereby improving performance. 

Here are some hints on partitioning:

  • Columns that are used as WHERE filters are good candidates for partitioning.
  • Partitioning has a cost. As the number of partitions in your table increases, the overhead of retrieving and processing the partition metadata grows, and your files get smaller. Use caution when partitioning, and make sure you don’t partition too finely.
  • If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.

A key partition column at Eventbrite is the transaction date (trx_date).

CREATE TABLE IF NOT EXISTS fact_ticket_purchase
(
    ticket_id STRING,
....
    create_date STRING,
    update_date STRING
)
PARTITIONED BY (trx_date STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
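A minimal sketch of a query that benefits from this layout (the date literal is hypothetical):

-- The filter on the partition column means Presto only reads the files
-- under the trx_date=2020-08-01 partition instead of the whole table:
SELECT COUNT(*)
FROM fact_ticket_purchase
WHERE trx_date = '2020-08-01';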

 

9. Optimize columnar data store generation

Apache Parquet and Apache ORC are popular columnar data stores. They provide features that store data efficiently by using column-wise compression based on data type, special encoding, and predicate pushdown. At Eventbrite, we define Hive tables as PARQUET with compression set to SNAPPY:

CREATE TABLE IF NOT EXISTS dim_event
(
    dim_event_id STRING,
....
    create_date STRING,
    update_date STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')

Apache Parquet is an open-source, column-oriented data storage format. Snappy is designed for speed and will not overload your CPU cores. The downside, of course, is that it does not compress as well as gzip or bzip2.

 

10. Presto’s Cost-Based Optimizer/Join Reordering 

We’re not currently using Presto’s Cost-Based Optimizer (CBO)! Eventbrite data engineering released Presto 330 in March 2020, but we haven’t tested CBO yet.

The CBO inherently requires that table stats be up-to-date, which we currently calculate for only a small subset of tables! Using the CBO, Presto will be able to intelligently decide the best join sequence based on the statistics stored in the Hive Metastore.

As mentioned above, the order in which joins are executed in a query can have a big impact on performance. If we collect table statistics, then the CBO can automatically pick the join order with the lowest computed cost. This is governed by the join_reordering_strategy (=AUTOMATIC) session property, and I’m really excited to see this feature in action.
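A sketch of what we expect enabling it to look like (we haven’t run this in production yet):

-- Collect statistics so the CBO has costs to work with:
ANALYZE fact_ticket_purchase;

-- Let the optimizer reorder joins based on those statistics:
SET SESSION join_reordering_strategy = 'AUTOMATIC';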

Another interesting join optimization is dynamic filtering. It relies on the stats estimates of the CBO to correctly convert the join distribution type to a “broadcast” join. By using dynamic filtering via run-time predicate pushdown, we can squeeze out more performance gains for highly selective inner joins. We look forward to using this feature in the near future!

 

11. Using WITH Clause

The WITH clause is used to define an inline view within a single query. It allows for flattening nested subqueries. I find it hugely helpful for simplifying SQL and making it more readable and easier to support.
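A minimal sketch using the fact table from tip 8 (the threshold is arbitrary):

-- The inline view reads like a named building block
-- instead of a nested subquery:
WITH daily_sales AS (
    SELECT trx_date, COUNT(*) AS tickets
    FROM fact_ticket_purchase
    GROUP BY trx_date
)
SELECT trx_date, tickets
FROM daily_sales
WHERE tickets > 1000;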

 

12. Use Presto Web Interface

Presto provides a web interface for monitoring queries (https://prestodb.io/docs/current/admin/web-interface.html). 

The main page shows a list of queries along with information such as the unique query ID, query text, query state, percentage completed, username, and the source from which the query originated. If the Presto cluster is having any performance-related issues, this web interface is a good place to go to identify and capture slow-running SQL!

 

13. Explain plan with Presto/Hive (Sample)

EXPLAIN is an invaluable tool for showing the logical or distributed execution plan of a statement and for validating SQL statements.

Logical Plan with Presto

EXPLAIN
SELECT SUBSTRING(last_modified, 1, 4), COUNT(*)
FROM hive.df_machine_learning.event_text
WHERE LOWER(name) LIKE '%wilbraham%'
   OR REGEXP_LIKE(LOWER(name), '.*wilbraham.*')
GROUP BY 1
ORDER BY 1;

 

14. Explain plan with MySQL (Sample)

In this particular case, you can see that the primary key is used on the ‘ejp_events’ table and a non-primary (secondary) key on the ‘ejp_orders’ table. This query is going to be fast!
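As a rough sketch (the join and filter columns here are hypothetical), the statement under EXPLAIN took a form like this:

EXPLAIN
SELECT e.id, o.id
FROM ejp_events e
JOIN ejp_orders o ON o.event_id = e.id
WHERE e.id = 12345;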

 

Conclusion

Presto is the “SQL-on-Anything” solution that powers Eventbrite’s data warehouse. It’s been very rewarding for me as the “old-school DBA” to learn new SQL tricks for a distributed query engine such as Presto. In most cases, my SQL training on MySQL/Oracle has served me well, but there are some interesting differences, which I’ve attempted to call out above. Thanks for reading and making it to the end. I appreciate it!

We look forward to giving Presto’s Cost-Based Optimizer a test drive and kicking the tires on new features such as dynamic filtering & partition pruning!

All comments are welcome, or you can message me at ed@eventbrite.com. You can learn more about Eventbrite’s use of Presto by checking out my previous post at Boosting Big Data workloads with Presto Auto Scaling.

Special thanks to Eventbrite’s Data Foundry team (Jeremy Bakker,  Alex Meyer, Jasper Groot, Rainu Ittycheriah, Gray Pickney, and Beck Cronin-Dixon) for the world-class Presto support, and Steven Fast for reviewing this blog post. Eventbrite’s Data Teams rock!

Create Meaningful (and Fun!) Remote Community


What makes community meaningful? How can you create a meaningful culture and community in a time where we are all separated while working remote?

Today I want to talk about community, and how important it is to be a proactive agent in fostering a meaningful (and fun!) remote work culture. Just a smidge about me: before I was a software engineer at Eventbrite, I was on the ground figuring out how to create and grow community in various capacities – as a resident advisor in college, an event planner at Stanford, a coding teacher in various classrooms, and as a participant in various worldwide dance and performing communities. Community is my thing.

I have seen a lot of articles floating around with recommendations on how to create a remote work culture. I want to point out that I’m very specifically using the phrase “remote community” as opposed to “remote culture” because to me, community is something present and tangible and interconnected, whereas culture can be this very nebulous thing.  Culture arises out of community – community is the foundation.

I am also writing this article from the perspective of a mostly remote work community, but these ideas can be taken and used in any type of community. This article is not engineering specific, but it can easily be used to benefit any engineering community.

What makes community meaningful?

Community, to me, is about caring for people.  A lot of times, community starts by happenstance – you and others happen to share an interest or work at the same job.  But, in my opinion, ‘happenstance’ community graduates to being a real community when people work together to create a place that everyone cares about.  That to me is where the meaning is derived – from people caring enough to want to make their community a place where people feel supported and feel joy.

How you can create meaningful community

I find that creating meaningful community is as simple as stepping up. Being proactive goes a long, long way, whether on an individual level, like reaching out to someone you know who is having a hard time, or with a broader group in mind, like creating activities that bring people together.

I want to touch mostly on the latter – how to engage a broader group, with either active engagement or passive engagement.  One more thing to consider – how to make things fun! For me, it’s pretty easy – I think about what I enjoy and what I want to see in a community, and then I make it happen.  You can too!

Active Engagement

Active engagement involves events or activities that require people to be present and available at the time that it is happening.  Here are some ideas:

  1. Remote happy hour.  Schedule one once a week with your department or with cross sections of the company you normally don’t interact with.  Use this as an opportunity to talk to new people whose paths you wouldn’t normally cross. Pro tip – if you think more than 10 people are going to show up, figure out some way to create ‘rooms’ – either separate Hangouts that people can jump in and out of, or utilize Zoom’s room feature.
  2. Remote team lunch.  Suggestion – why not every day or twice a week?
  3. 2pm 2 minute planks.  Get stronger every day!  We do this via video chat as well to cheer each other on.

Passive Engagement

Passive engagement involves activities that people can circle back around to when they have time.  Ideas include:

  1. Daily Photo Share.  Post a theme for the photo of that day (Baby photo of you! Throwback photo of your parents! Photo of you doing an activity you love!) and see what you end up learning.  I have added a list of ideas that I compiled with my co-workers at the bottom of this article.
  2. Dear Diary chaining.  Remember when email chaining was a thing?  Well, my coworker came up with this idea – write a ‘Dear Diary’ entry about your day, post it with a picture, and then tag a new coworker for the next day! It’s been interesting to get a more in-depth look into people’s mentality and how they are trying to stay optimistic.

Out of all the ideas above, so far the Daily Photo Share has seemed to have the most impact and been the most uplifting – I tend to post it around 12pm, and then spend the next hour scrolling and commenting on everyone’s amazing pictures of themselves, their families, and the things they love.  It’s an amazing way to see everyone share what they care about.

What’s your time commitment?

It’s up to you to decide your community time commitment.  It doesn’t have to be everything that’s listed above. I recommend the thought model of 2 minutes, 2 hours, or 2 months – what can you contribute? Can you contribute 2 minutes of your time, or 2 hours, or maybe even 2 months? The 2pm 2-minute plank is a great example of a short amount of time, as is commenting on someone else’s photo in the daily photo share. 2 hours could be something like deciding to throw and promote a remote happy hour.

I think that now, more than ever, it is really important to take time out of our lives to reach out, to connect, to engage – to build community, and to receive from our communities.

I want to hear from you.  What ideas do you have for active or passive engagement in your community?  What makes community meaningful to you, and how can we surface that more in remote communities?

(Some) Daily Photo Share Theme Ideas:
  • Throwback photo of your parent(s)
  • Hobby/activity you love doing
  • Last meal you took a photo of
  • Your family (immediate or otherwise)
  • Your workspace at home
  • Silly photo of you
  • An item you cherish
  • Vacation you recently took
  • Photo you are proud of (either proud that you took, or of something you are proud of)
  • Photo with a story (and tell the story!)
  • Something that most people don’t know about you
  • Favorite costume photo of you (Halloween or otherwise)
  • Photo of you where you like your vibe
  • Share a photo of a coworker that you appreciate, and why you appreciate them
  • Photo of a new hobby or life hack you have acquired in the last few weeks
  • Something/someone you miss (and give back story!)
  • This could have been an ad for…
  • Everybody should try this once
  • “I didn’t want to be in this photo”

Leveraging AWS “spot” instances to drive down costs

I had the pleasure of participating in a Big Data Cost Optimization event in San Francisco (October 17) at the AWS Loft location. I was a panelist in a round-table discussion focusing on the use of AWS spot instances. It was a terrific opportunity to share how Eventbrite’s Data Engineering team is smartly using spot instances for Big Data & EMR to drive down costs. Thanks to Jasper Groot from the Data Engineering team for helping to co-author this blog post!

At Eventbrite, we’re using spot instances to leverage AWS auto scaling, and I’ve published a blog entry on this topic (Big Data workloads with Presto Auto Scaling). Two key focus areas for the data engineering team are ephemeral computing (something that lasts for a very short time) and idempotence (repeatable behavior with the same outcome). Making a commitment in these areas has allowed us to leverage spot instances to effectively manage costs.

Spot instances are a cost-effective choice if you can be flexible about when your applications run and if your applications can be interrupted. They are well-suited for data analysis, batch jobs, background processing, and non-mission-critical tasks. Below are some of the panel questions and my answers; I’d love to share what I learned from the conversation.
Continue reading “Leveraging AWS “spot” instances to drive down costs”

Discover “Pro D3.js”, a new book to improve your JavaScript data visualizations

D3.js is the de facto standard JavaScript library for building interactive data visualizations on the Web. When beginning a new D3.js project, developers usually start by taking one of the community’s many examples and using it to jumpstart new work. This is a convenient way to get something working fast so that they can iterate over it. However, by working this way, they soon struggle to maintain, extend, or modify their charts. D3.js examples are made to demonstrate implementations and techniques, not to ship them to production.

Discover “Pro D3.js”, the new book that walks you through the creation of maintainable, modular, and testable charts. You will also learn how to package and document your visualizations, producing open-source libraries. Read on to learn more about the book and its contents. Continue reading “Discover “Pro D3.js”, a new book to improve your JavaScript data visualizations”

A Story of a React Re-Rendering Bug

As front-end developers, we often find ourselves getting into perplexing bugs when the page we build involves a lot of user interactions. When we find a bug, no matter how tricky it is, it means something is wrong in the code. There is no magic, and the code itself does not lie.

This blog will take you on a short journey about how I fixed a particularly annoying bug that existed in one of our products. Continue reading “A Story of a React Re-Rendering Bug”

How boba trips and event onsites made me a better engineer (a Briteling intern’s reflection on community at Eventbrite)

Written by 2019 summer Organizer App intern Vivian Phung

Hello World! I’m Vivian Phung, a Computer Science and Mathematics double major at Bryn Mawr College, and for the last few months, an iOS Software Engineering intern on Eventbrite’s Organizer App Team. Some of the best parts of my internship were the communities I had the opportunity to engage with, both at the office and within the broader Eventbrite community. In this blog post, I want to share how engaging with these communities helped me conquer imposter syndrome and get the most out of my summer internship.  Continue reading “How boba trips and event onsites made me a better engineer (a Briteling intern’s reflection on community at Eventbrite)”

How to fix the ugly focus ring and not break accessibility in React


Creating beautiful, aesthetic designs while maintaining accessibility has always been a challenge in the frontend. One particular barrier is the dreaded “:focus” ring. It looks like this:

[Image: focus outline on a button]

After clicking any button, the default styling in the browser displays a focus outline. This ugly outline easily mars a perfectly crafted interface.

A quick Stack Overflow search reveals an easy fix: just use a bit of CSS, outline: none; on the affected element. It turns out that many websites use this trick to make their sites beautiful and avoid the focus outline. Continue reading “How to fix the ugly focus ring and not break accessibility in React”