Packaging generated code from protobuf files for gRPC Services

Background

At Eventbrite, we identified in our 3-year technical vision that one of our goals is to enable autonomous dev teams to own their code and architecture so that they can deliver reliable, high quality and cost effective solutions to our customers. However, this autonomy does not mean that teams have to work in complete isolation from other teams in order to achieve their goals.

Over the past year, we have started our transition from our monolithic Django + Python approach to a microservices architecture; we selected gRPC as our low-latency protocol for inter-microservice communication. One of the main challenges we face is sharing Protobuf files between teams for generating client libraries. We want this to be as easy as possible by avoiding unnecessary ceremony and integrating into team development cycles.

Challenges managing Protobuf definitions

Since our teams have full autonomy over their code and infrastructure, they will have to share Protobuf files. Multiple sharing strategies are available, so we identified key questions:

Should we copy and paste .proto files into every repository where they are needed? This is not a good idea and could be frustrating for the consuming teams. We should avoid any error-prone or manual activity in favor of a fully automated process. This will drive consistency and reduce toil.

How will changes in .proto files impact clients? We should implement a versioning strategy to support changes.

How do we communicate changes to clients? We need a common place to share multiple versions with other teams and adopt standard headers to set client expectations, such as Deprecation and Sunset.

Our proposed solution

We will maintain protobuf files within the owning service’s repository to simplify ownership. The code owners are responsible for generating the needed packages for their clients. Their CI/CD pipeline will automatically generate the library code from the protobuf file for each target language.

Packages will be published in a central place to be consumed by all client teams. Each package will be versioned for consistency and communication. Before deprecating and sunsetting any package version, all clients must be notified and given enough time to upgrade.

Repository Structure

In our opinion, having a monorepo for all protobuf definitions would slow down the teams’ development cycles: each modification to a Protobuf definition would require a PR to publish the change in the monorepo, waiting for an approval before generating the required artifacts and distributing them to clients. Once the package was published, teams would have to update the package and publish a new version of their services. We need to keep the Protobuf files with their owning service.

Project Structure

The project’s organization should provide a clear distinction between the services that exist in the project and the underlying Protobuf version that the package is implementing. The proto folder will hold the definition of each proto file, with a correctly formed version using the package specifier. The service folder will hold the implementation of each gRPC service, which is registered against the server.


This approach will allow us to publish a v2 version of our service with breaking changes, while we continue supporting the v1 version. We should take the following points into consideration when we publish a new version of our service:

  • Try to avoid breaking changes (maintain backward and forward compatibility).
  • Do not change the version unless making breaking changes.
  • Do change the version when making breaking changes.
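
For example, because each major version lives under its own package specifier in the proto folder, the generated Python modules for v1 and v2 can be installed and imported side by side while clients migrate. Here is a minimal sketch; the package, module and message names below are hypothetical, not our real APIs:

# Hypothetical generated modules; each proto major version maps to its own Python package.
from eventbrite.orders.v1 import orders_pb2 as orders_v1
from eventbrite.orders.v2 import orders_pb2 as orders_v2

# A client still on v1 keeps working...
legacy_request = orders_v1.GetOrderRequest(order_id="1234")
# ...while a migrated client opts into the breaking v2 shape (renamed field).
new_request = orders_v2.GetOrderRequest(order_reference="1234")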

Proto file validation

To make sure the proto files do not contain errors, and to enforce good API design choices, we recommend using Buf as a linter and breaking change detector. It should be used on a daily basis as part of the development workflow, for example by adding a pre-commit check to ensure our proto files do not contain any errors.

Following our “reduce toil through automation” principle, we added a task to our CI/CD pipelines in CircleCI. A Docker image is available that adds steps for linting and breaking change detection, which helps us ensure that we publish error-free packages.



If a developer pushes breaking changes or changes with linter problems, our CI/CD pipelines in CircleCI will fail, as can be seen in the examples below:


Linter problems

Example linter problems

Breaking changes

Example breaking changes

Versioning packages

Another challenge is building and versioning artifacts from the code generated from the protobuf files. We selected Semantic Versioning as the way to publish and release package versions.

The package name should reflect the service name and follow the conventions established by the language, platform, framework and community.

Generating code for libraries

We have set up an automated process in CircleCI to generate code for libraries. Once a proto file is changed and tagged, CircleCI detects the changes and begins generating the code from the proto file.

We compile it using protoc. To avoid the burden of installing it, we use a Docker image that contains it. This facilitates our local development as well as our CI/CD pipelines. Here is the CircleCI configuration:

CircleCI configuration that compiles the proto files using the protoc Docker image.
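
For local development, the same generation step can also be run without Docker, because protoc is shipped as a Python package (grpcio-tools). A minimal sketch, with hypothetical paths and proto file names:

# pip install grpcio-tools
import os

from grpc_tools import protoc

os.makedirs("generated", exist_ok=True)

exit_code = protoc.main([
    "grpc_tools.protoc",
    "--proto_path=proto",                  # where the .proto files live (hypothetical layout)
    "--python_out=generated",              # protobuf message classes
    "--grpc_python_out=generated",         # gRPC client/server stubs
    "proto/eventbrite/orders/v1/orders.proto",
])
if exit_code != 0:
    raise RuntimeError("protoc failed")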

In the previous example we are generating code for Python, but it can also be generated for Java, Ruby, Go, Node, C#, etc.

Once the code is generated and persisted into a CircleCI workspace, it’s time to publish our package.

Publishing packages

This process could be overwhelming for teams if they had to figure out how to package and publish each artifact in all the languages supported in our Golden Path. For this reason, we took the same approach as docker-protoc and dockerized a tool that we developed called protop.

Protop is a simple Python project that combines typer and cookiecutter to give us a way to package the generated code into a library for each language. At the moment it only supports PyPI using Twine, because most of our consumers’ codebases are in Python, but we are planning to add Gradle support soon.

The use of protop is very similar to docker-protoc. We published a dockerized version of protop to an AWS Elastic Container Registry to allow teams to use it in their CI/CD pipelines in CircleCI.


At Eventbrite we use AWS CodeArtifact to store other internal libraries, so we decided to re-use it to store our gRPC service libraries. You can see a diagram of the overall process below.

AWS CodeArtifact stores both internal libraries and our gRPC service libraries.

This AWS CodeArtifact repository should be shared by all teams so that there is a single, common place to find these packages, instead of having to ask each team which repository they store their packages in and managing lots of keys to access them.

The teams that want to consume those packages should configure their CI/CD pipelines to pull the libraries down from AWS CodeArtifact when their services are built.

This process will help us reduce the amount of time spent on service integration without diminishing the teams’ code ownership.

Using the packages

The last step is to use our package. With the package uploaded to AWS CodeArtifact, we need to update our Pipfile:

Updated Pipfile to use the artifact.

or requirements.txt:

Alternative way of using Protobuf files.
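
Once the package is installed from CodeArtifact, calling the service is plain gRPC Python code. A minimal sketch, where the module, stub and message names are hypothetical placeholders for whatever the generated package actually exposes:

import grpc

# Hypothetical generated modules from the published package.
from eventbrite.orders.v1 import orders_pb2, orders_pb2_grpc

def get_order(order_id):
    # Example target; real code would use the service's address and TLS credentials.
    with grpc.insecure_channel("orders.internal:50051") as channel:
        stub = orders_pb2_grpc.OrderServiceStub(channel)
        return stub.GetOrder(orders_pb2.GetOrderRequest(order_id=order_id))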

Conclusion

We started out by defining the challenges of managing Protobuf definitions at Eventbrite, explaining the key questions about where to store these definitions, how to manage changes and how to communicate those changes. We’ve also explained the repository and project structure.

Then we covered protobuf validation, using Buf as a linter and breaking change detector in our CI/CD pipelines, and versioning, using Semantic Versioning to publish and release package versions.

After that, we turned our focus to how to generate, publish and consume our libraries as a kind of SDK for the service’s domain, allowing other teams to consume gRPC services in a simple way.

But of course, this is the first iteration of the project and we are already planning actions to be more efficient and further reduce toil through automation. For example, we are working on generating package versions automatically, using something similar to Semantic Release, to avoid teams having to update the package version manually and thereby avoid error-prone interactions.

To summarize, if you want to drastically reduce the time that teams spend on service integration and avoid a lot of manual errors, consider automating as much as you can of the process of generating, publishing and consuming your gRPC client libraries.

Reflecting on Eventbrite’s Journey From Centralized Ops to DevOps

Once a scrappy startup, Eventbrite has quickly grown into the market leader for live event ticketing globally. Our technical stack changed during the first few years, but as with most things that reach production, pieces and patterns lingered. 

Over the years, we leaned heavily into a Django, Python, MySQL stack, and our monolith grew. We changed how our monolith was deployed and scaled as we went into the AWS cloud as an early adopter. This entailed building internal tooling and processes to solve specific problems we were facing, and doubling down on our internal tooling while the cloud matured around us. 

Keeping up with traffic bursts from high-demand events

Part of the fun and challenge of being a primary ticketing company is handling burst traffic from high-demand on-sales — these are events that generate traffic spikes when tickets are released for purchase at a specific time. Creators (how we refer to folks that host events) will often gate traffic externally, and post a direct link to an Eventbrite listing on a social network or their own websites. Traffic builds on these sites while customers wait for a link to be posted. The result is hundreds of thousands of customers hitting our site at once.

Ten-plus years ago, this was incredibly difficult to solve, and it’s still a fun challenge from a speed of scale and cost perspective. Ultimately, challenges around the reliability of our monolithic architecture led to us investing in specialized engineering teams to help manually scale the site up during these traffic bursts as well as address the day-to-day maintenance and upkeep of the infrastructure we were running. 

A monolithic architecture isn’t a bad thing — it just means there are tradeoffs 

On one hand, our monolithic setup allowed us to move fast. Having some of Django’s core contributors helped us solve complex industry problems, such as high-volume on-sales in which small numbers of tickets go on sale to large numbers of customers. On the other hand, as we and our platform’s features grew, things became unwieldy, and we centralized our production and deployment maintenance in response to site incidents and bug triage. 

This led to us trying to break up the monolith. The result? Things got worse because we didn’t address the data layer and ended up with mini Django monoliths, which we incorrectly called services.

The decision to move from an Ops model to a DevOps model, and the hurdles along the way

Enter our three-year technical vision. In order to address our slowing developer velocity and improve our reliability, performance, and scale, we made an engineering-wide declaration to move away from an Ops model — in which a centralized team had all the keys to our infrastructure and our deployments — to a DevOps model in which each team had full ownership.

An initial hurdle we had to jump over was a process hurdle. In order for teams to take any ownership, they’d have to be on call 24×7 for the services and code they owned. We had a small number of teams with production access that were on call, but the vast majority of our teams were not. This was an important moment in our ownership journey. And our engineering teams had many questions about the implications of what was not only a cultural but also a process change.

There are many technical hurdles to providing team-level ownership, and it’s tempting to get drawn into a “boil-the-ocean” moment and throw away all the historic learnings and business logic we developed over our history. Our primary building block towards team autonomy was leveraging a multi-AWS sub-account strategy. Using Terraform, we were able to build an account vending system allowing teams to design clear walls between their workloads, frontends, and services. With these walls in place each team had better control and visibility into the code they owned. 

Technical debt, generally, is a complicated ball of yarn to unwind

We had many centralized EC2-based data clusters: MySQL, Redis, Memcache, ElasticSearch, Kafka, etc. Migrating these to managed versions — and the transfer of ownership between our legacy centralized ownership directly to teams — required a high degree of cross-team coordination and focused team capacity. 

As an example, the migration of our primary MySQL cluster to Aurora required 60 engineers during the off-hours writer cutover — they represented all of our development teams. The effort towards the decentralization of our data is leading us to develop full-featured infrastructure as code building blocks that teams can pull off the shelf to leverage the full capabilities of best-in-class managed data services.

Our systems powering our frontend as well as our backend services are process-wise similar to our data-ownership journey. We have examples of innovation around serverless compute patterns and new architectural approaches to address scale and reliability. We’re making big bets on some of our largest and most-impactful services — two of which still live as libraries in our core monolith. The learnings that are accrued through these efforts will power the second and third year of our three-year tech vision journey. 

The impact thus far, with more unlocks to come

By now, you’re probably realizing that at least some of our teams were shocked at the amount of change happening as their ownership responsibilities increased. We were confident that this short-term pain was worth it. After all, our teams were demanding this through direct feedback in our dev and culture surveys. 

The prize for us on this journey is customer value delivered through increased team velocity. While our monolithic architecture — both on the code and data sides of the house — got us to where we are today, teams were not happy with their ability to bring change and improvements to things that they owned. This was frustrating for everyone involved, and the gold at the end of the rainbow for us is that teams can make fundamental changes with modern tools and processes.

In the first year of our three-year technical vision, big changes in ownership have been unlocked. As an example, we have migrated to Aurora, where teams have ownership of their data. We’ve also provided direct team-level ownership of teams’ CI pipelines, improved our overall code coverage for testing, provided team autonomy for feature flag releases, and started re-architecting our two largest tier-1 services. It’s exciting to see new sets of challenges arise along the way — knowing these hurdles also unveil opportunities.

Creating the 3 Year Frontend Strategy

In the last post we talked about Developing the 3 Year Frontend Vision; in this post we will go into how that vision, along with the tenets, requirements, and challenges, shaped the Strategy moving forward.

One of the key themes at Eventbrite since I joined is DevOps: moving ownership from a single team that has been responsible for ops, and distributing that responsibility to each individual team, giving them ownership over decisions and infrastructure, and control over their own destiny. The first step in defining the Strategy was to put together what a Technical Strategy is, and the foundation for that strategy.

Technical Strategy

The overall Technical Strategy is based on availability and ownership, spanning the way we build our services and frontends through to the way we deploy and serve assets to our customers. The architecture is designed to reduce the blast radius of errors, increase our uptime, and give each team as much control over their space as possible.

Availability

Moving forward we will achieve High Availability (HA), in which our frontends and systems are resilient to faults and traffic, and will operate continuously without human intervention. In order to achieve HA, we will utilize Managed AWS Services or redundant, fault-tolerant software, and content delivery networks (CDNs) to increase our performance and resilience by putting our code as close to the customer as possible. We will ensure that all aspects of the system are tested, fault tolerant, and resilient, and that both the client-side and server-side gracefully degrade when downstream services fail.

Ownership

DevOps combines the traditional software development by one team and operations and infrastructure by another into a single team responsible for the full lifecycle of development and infrastructure management. This combination enables organizations to deliver applications at a higher velocity, evolving and improving their products at a faster pace than traditional split teams. The goal of DevOps is to shift the ownership of decision making from the management structure to the developers, improve processes, and remove unproductive barriers that have been put in place over the years.

Frontend

Once we had the foundation of the strategy defined, it was time to define the scope. To understand how to develop a strategy, or to even define one, we need to understand what makes up a “frontend”. In our case, the frontend is everything from the backend service API calls to the customer. Because of this, we need to design a solution that allows code to be run in a browser or on a server, and for service calls to be made from a browser. Once you define the surface area of the solution, it becomes apparent that the scope and complexity of this problem quickly compound.

High Level Architecture

We need to define an architecture for everything above the red line in the above graphic. In order to simplify the design, I broke this down into three main areas: the UI Layer, consisting of a micro-frontend framework with team-built Custom Components; a shared Content Delivery Network (CDN) to front all customer facing pages; and a deployable set of bundled software that we code named Oberon, including a UI Rendering Service and a Backend-For-Frontend.

UI Layer

The UI leverages the micro-frontend architecture and modern web framework best practices to build frontends that leverage browser specifications while being resilient and team owned.

Micro-Frontend

When first approaching the micro-frontend architecture I realized that there is no clear definition of what a micro-frontend is.

Martin Fowler has a very high level definition which he states as

“An architectural style where independently deliverable frontend applications are composed into a greater whole”.

Xenon Stack describes a Micro-frontend as

“a Microservice Testing approach to front-end web development.”

Reading through the many opinions and definitions, I felt it was necessary to get a clearer understanding, and for everyone to agree what a micro-frontend architecture is. I worked with a couple of other Frontend Engineers to put together the following definition for a Micro-Frontend.

Definition

A Micro-Frontend is an Architecture for building reusable and shareable frontends. They are independently deployable, composable frontends made up of components which can stand on their own or be combined with other components to form a cohesive user experience. This architecture is generally supported by hosting a parent application which dynamically slots in child components. Components within a micro-frontend should not explicitly communicate with external entities, but instead publish and subscribe to state updates to maintain loose coupling. 

Micro-frontends are inspired by the move to microservices on the backend, bringing the same level of ownership and team independent development and delivery to the frontend.

Self-Contained Components

In order to avoid frontends that over time inadvertently tightly couple themselves and create fragile un-reusable components, we must build components that are encapsulated, isolated, and able to render without the requirement of any other component on the page. 

Component Rendering Pipeline

The Component Rendering pipeline renders components to the customer while the framework defines a set of Interfaces, Application Context, and a predictable state container for use across all of the rendering components.

State Management

State management is responsible for maintaining the application state, inter-component communication and API calls. State updates are unidirectional; updates trigger state changes which in turn invoke the appropriate components so they can act on the changes. 
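
As a rough, framework-agnostic illustration of that flow (sketched in Python purely for brevity, with hypothetical names), components subscribe to a single store, actions dispatch updates, and the store notifies subscribers so they can react to the new state:

class Store:
    # Minimal unidirectional state container sketch (illustrative only).
    def __init__(self, initial_state):
        self._state = dict(initial_state)
        self._subscribers = []

    def subscribe(self, callback):
        # Components register here instead of talking to each other directly.
        self._subscribers.append(callback)

    def dispatch(self, update):
        # Updates flow one way: dispatch -> new state -> notified subscribers.
        self._state = {**self._state, **update}
        for callback in self._subscribers:
            callback(self._state)

store = Store({"cart_items": 0})
store.subscribe(lambda state: print("header badge shows", state["cart_items"]))
store.dispatch({"cart_items": 2})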

Content Delivery Network

Our current architecture has resilience issues, where one portion of the site may become slow or unresponsive, which has a direct impact on the rest of the domain and in many cases causes an overall site availability issue. To mitigate this, we add a CDN at the ingress of our call stack. Every downstream frontend rendering will contain Cache-Control headers in order to control the caching of assets and pages in the CDN. During a site availability issue, the rendering fleet may increase the cache control header, caching for small amounts of time (60 seconds – 5 minutes max) pages that don’t require dynamic rendering or customer content. This takes load off the fleet and increases its resource availability for other areas.
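
A hedged sketch of that idea in Python (the header values and the degraded-mode window below are illustrative, not our production settings):

def cache_control_header(is_static_page, degraded_mode):
    # Pick a Cache-Control value for a rendered page.
    if not is_static_page:
        # Dynamic or customer-specific pages are never cached at the CDN.
        return "private, no-store"
    if degraded_mode:
        # During an availability issue, cache static pages a little longer
        # (60 seconds to 5 minutes) to shed load from the rendering fleet.
        return "public, max-age=300"
    return "public, max-age=60"

response_headers = {"Cache-Control": cache_control_header(True, degraded_mode=False)}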

Oberon

Oberon is a collection of software and Infrastructure-as-Code (IaC) that enables teams to set up frontends quickly and get in front of customers faster. It includes a configurable Gateway pre-configured for authentication as needed, a UI Rendering Service to server-side render UIs, a UI Asset Server to serve client-side assets, and a stubbed-out Backend-For-Frontend.

Server Side UI Rendering Service

The UI Rendering Service defines a runtime environment for rendering applications, their components, and is responsible for serving pages to customers. The service maps incoming requests to applications and pages, gathers dependency bundles, and renders the layout to the customer. Oberon will leverage the traffic absorbing nature of a CDN with the scaling of a full serverless architecture. 

Backend-For-Frontends (BFF)

A BFF is part of the application layer, bridging the user experience and adding an abstraction layer over the backend microservices. This abstraction layer fills a gap that is inherent in the microservice architecture, where microservices must compete to be as generic as possible while the frontends need to be customer driven.  

BFFs are optimized for each specific user interface, resulting in a backend that is smaller, less complex, and faster than a generic one, and allowing the frontend code to 1) limit over-requesting on the client, 2) be simpler, and 3) see a unified version of the backend data. Each interface team will have a BFF, giving them the autonomy to control their own interface calls, and the ability to choose their own languages and deploy as early or as often as they would like.

Next Steps

Now that we’ve published the 3 Year Frontend Strategy, the hard work begins. Over the next few months we will be defining the low level architecture of Oberon, and working on a Proof Of Concept that teams can start to leverage in early 2022.

Creating a 3 Year Frontend Vision

JC Fant IV
Oct-5th-2021

History

Over the course of the last 21 years I’ve spent time in nearly every aspect of the technical stack; however, I’ve always been drawn to the frontend as the best place to be able to impact customers. I’ve enjoyed the rapid iterations, and the ability to visualize those changes in the browser. It’s why I spent much of the last 14 years prior to Eventbrite at Amazon (AWS) evangelizing the frontend stack. That passion led me to co-found one of the largest internal conferences at Amazon, reaching over 7500 engineers across 6 continents. The conference is focused on all aspects of the Frontend, and helped to highlight technologies that teams could adopt and leverage to solve customer problems.

In March of 2021 I joined Eventbrite to help solve some of those same challenges that I’ve spent much of my career trying to solve. As part of my onboarding I was asked to ramp up on the current problem space and the technical challenges the company faces, and to dive into the issues impacting many of our frontend developers and designers. With all of that knowledge, I was tasked to come up with a 3 Year Frontend Strategy. 

Many of you have already read the first 3 posts in this series, Creating our 3 year technical vision, Writing our 3 year technical vision, and Writing our Golden Path. If you haven’t had a chance, those 3 posts help to set the context for how we defined and delivered our 3 year Frontend Strategy.

Current Challenges and Limitations

In those previous posts, Vivek Sagi and Daniel Micol described many of the problems that backend engineers, and engineers in general face at Eventbrite. My first task was to engage and listen to the Frontend Engineers around the company and to identify more specific frontend challenges and limitations that we face every day.

  • A monolithic architecture leads to teams having unnecessary dependencies and being forced to move at the speed of the monolith. They are often blocked by other changes or the release schedule of the monolith.
  • Our performance is suboptimal, leading to some poor customer interfaces and low Lighthouse scores.
  • We lack automation in how we test, deploy, monitor and roll back our frontend code.
  • Our frontends are currently written in both a legacy framework and a more modern framework where the rendering patterns have diverged, and are no longer swappable without a migration. 
  • Service or datastore performance issues have a high blast radius where all aspects of the site are degraded, including pages that are static in nature.
  • Our front end experiences are inconsistent across our product portfolio and making changes to deliver against our 3-year self service strategy requires too much coordination.

Developing Requirements

Now that we had a decent understanding of the issues we’ve been facing, we turned our attention to understanding the requirements to solve these problems. 

  1. Features. As our product offering evolves to deliver high quality self-service experiences for creators and attendees, we must ensure that our technology stack enables teams to efficiently create, optimize, and maintain the net new functionality we provide.
  2. Performance. User perception of our product’s performance is paramount: a slow product is a poor product that impacts our customers’ trust. 
  3. Search Engine Optimized. Through page speed, optimized content, and an improved User Experience, our frontends must employ the proper techniques to maintain or increase our SEO.
  4. Scale. Our frontends must out-scale our traffic, absorbing load spikes when necessary, and deliver a consistent customer experience.  
  5. Resilient. Our frontends will respond to customer requests, regardless of the status of downstream services. 
  6. Accessible. Our frontends will be developed to ensure equal access and opportunity to everyone with a diverse set of abilities.
  7. Quality. The quality of our experiences should be prioritized to deliver customer value, solve customer problems, and be at a level of performance that meets our SLAs and reduces customer-reported bugs.

Defining Our Tenets

We set out to define a core set of tenets for this strategy: a set of principles designed to guide our decision making. These tenets help us to align the vision and decisions against our end goals. I wanted these tenets to be focused on driving the solution to be something that Frontend Engineers want to adopt, not something they must. We need to deliver something that is seductive, makes engineers’ lives better, and in turn directly impacts our customers, as engineers are able to move more quickly and have the autonomy and ownership to make decisions.

  1. Developer Experience. Start with the developer and work backwards. Tools and frameworks must enable rapid development. Developing inside the Frontend Strategy must be easy and fast, with limited friction.
  2. Metric Driven. We make decisions through the use of metrics; measuring how our pages and components behave and their latencies to drive changes.
  3. Ownership. Teams control their own destiny from end-to-end. From the infrastructure to the software development lifecycle (SDLC), owning the full stack leads to better customer focus, team productivity, and higher quality code.
  4. No Obstacles. We remove gatekeepers from the process by providing self-service options, reusable templates, and tooling.
  5. Features Over Infrastructure. We leverage solutions that unlock frontend engineer productivity, in order to focus on customer features rather than maintaining our infrastructure. 
  6. Pace of Innovation. We build solutions to obstacles that interfere with getting features in front of customers.
  7. Every Briteling. We build tools and leverage technology that allows every Briteling to build customer facing features. 

Developing Our Vision

Now that we had the challenges, requirements and tenets outlined, we needed to define a vision for this 3 year frontend strategy. Following the tenets, we want to empower Britelings to deliver customer-impactful features and make our customers’ lives better. We want this vision to be something everyone in the company can get behind, and as such we don’t actually reference Frontend Engineers; instead we strive to empower ALL Britelings to deliver customer-impactful experiences.

Vision

Delight creators and attendees by empowering Britelings to easily design, build, and deliver best in class user experiences. 

In the next post we will talk about the Strategy and the architecture.

Writing our 3-year technical vision

I joined Eventbrite as their first Technical Fellow, the most senior engineering individual contributor role in the company. One of my initial goals was to come up with an overarching technical vision for the whole company aligned with our 3-year business strategy, and that would move us away from a monolithic architecture and central SRE team to a distributed system where we shift ownership to each team. In our most recent post, Vivek Sagi described the list of problems that we identified and our future-looking goals, which to recap are:

  • Deliver reliable, high quality, cost effective software solutions to our creators and consumers that allows the business to grow revenue 5x by 2023.
  • Enable autonomous dev teams that own their code and architecture. Provide these teams the platform, tooling, and access required to own end-to-end production support for their services.
  • Improve dev team accountability to deliver against high level OKRs while giving them autonomy to decide on the path to get there.
  • Drive automation and reduce toil. All feature dev teams should be able to apply 60% of their capacity to deliver new business value by 2023. This balance is an estimate based on best performing mature product teams that we have seen in our past experience.
  • Establish an operational excellence bar. Deliver 99.99% uptime across all customer facing services.

To accomplish these goals, I started working with other engineers and product leaders to understand the history of our technical architecture and the challenges that we were facing, including developer productivity issues, site reliability problems, and scalability limitations. From these goals, we derived a set of requirements for our 3-year technical vision:

  1. Features. As our product offering evolves to deliver high quality self-service experiences for Super Creators and Consumers, we must ensure that our technology stack enables teams to efficiently create, optimize, and maintain the net new functionality we will need to provide. For example, Super Creators require multi-event creating/editing, organization level reporting, and multi-event cart support – all of which will require significant architectural changes relative to our current offering. In addition, a new bundle of marketing tools will enhance creators’ ability to acquire new audiences and grow existing ones, especially by leveraging automation and machine learning to simplify the experience while increasing the impact. We seek to improve our offering for consumers to discover, and attend events and to maintain trust in our platform.
  2. Leveraging Data. We have the opportunity to power new data differentiated products based on data from over a decade of past events and round out our focused product offering with key 3rd party integrations (e.g. Mailchimp, Zoom).
  3. Performance. User perception of our product’s performance is paramount: a slow product is a poor product. In addition, better page performance leads to better SEO rankings. We decided to leverage Lighthouse’s performance score, an industry standard web dev performance metric, and we endeavor to achieve a green score (90 to 100) across our customer facing features. We also must enforce low latency in our internal infrastructure and API response times, and set reduction goals year-over-year.
  4. Scale. We will support two types of scaling improvements. We will scale our systems to handle 5x the current load as we grow our business and we need to have systems that support this load and scale to such limits. The second one is related to spikiness in our traffic due to large event sales, where today we use a Waiting Room to throttle calls to our services and DB. We will design systems that can autoscale and descale in certain events and avoid having to overprovision our infrastructure on a manual basis.
  5. Quality. The defect rate of our product offering can either make or break the experience for our users. In the past year, we have reduced the quantity of critical open bugs from 311 down to 175 and also reduced the number of bugs that missed our fix SLA from 200 to 110. We should aggressively lean into this trend and continue to reduce both by 50% YoY. We will improve our ability to deliver along that trend by increasing our test coverage, reducing our code complexity, having better tooling and increasing our level of automation.
  6. Self-Service. We will improve self-service both externally and internally. For the former we will aim for a 50% YoY customer support contact rate reduction relative to total ticket sales, while ensuring that help center page views don’t disproportionately grow – the point being that we deliver product experiences that have sufficient in-line guidance to result in successful experiences. Internally we will ensure that data is accessible by teams, each of the data sources and services has clear documentation and runbooks as well as contracts and use cases. We will define these in “How We Work” guidelines that every team will follow.
  7. Development Process. Finally, we must streamline our internal development processes and progress along the DevOps Big 4 to these levels: Deployment frequency: Elite (Daily for web and backend services and up to weekly for native apps), Lead time for changes: Elite (Less than one hour), Mean time to restore service: Elite (Less than one hour), and Change failure rate: Elite (0-15%).

Applying these principles to the problems that were outlined in the previous post, we thought about the following solutions to them:

  1. Our monolith became a bottleneck to our developer velocity and overall site reliability and scalability. We need to decouple our monolith into smaller microservices that can evolve and scale independently. This is a similar trend that many other companies have followed as they grow, and based on our professional experience prior to Eventbrite, we know it works.
  2. Our initial partial attempt to move to a Services Oriented Architecture (SOA) compounded the problem. In our prior attempt, we lacked a clear vision of what moving to SOA meant and how to accomplish it. We moved business logic out but not data, compounding the problem. This time around, we’ve prioritized this architecture transition at a company level, focusing first on the core business logic, including segregating and migrating the underlying data with every service.
  3. Our performance became suboptimal leading to a poor utilization of our hardware resources. We planned to fix this in two ways: by moving to managed services, letting cloud providers deal with this responsibility, and choosing technologies that would autoscale properly based on our traffic patterns, which are spiky by nature due to large onsale events.
  4. Our SDLC process was ad hoc and lacked sufficient controls in a few places. We’ve defined and set ownership boundaries between services and logical components. We’ve also enacted Architecture Review Committees to review designs to ensure we are building extensible services that don’t become monolithic themselves.
  5. Given all the intricate moving parts to release the monolith, we trust our Site Reliability Engineers (SREs) to be the only ones who can coordinate all that infrastructure. We are transitioning to DevOps where each team is the owner of the end-to-end lifecycle of their services. Similar to an earlier point, we’ve implemented this successfully in the past at other companies and we know it works.
  6. We lack automation in how we test, deploy, monitor and roll back our code. Our vision document has sections specifically addressing deployments, testing and operations, indicating that we should aspire to full automation and minimize (and remove, if possible) any manual intervention.
  7. Our core “eb” database is not only monolithic but also mutable, and capturing historical changes has been challenging. We see this as an architectural issue where our data boundaries were never established and we had many different services reading from the same tables and writing to them. We also used the same database technology for all of our use cases which has proven to be inefficient.
  8. We also built homegrown tools such as our own RPC protocol, PySOA. We are no longer investing our time in areas that are not business critical and where we can’t build competitive differentiation. For everything else, where we need a commoditized solution, we evaluate buying instead of building whenever possible. This allows us to focus on providing customer value.

As we can see, we’re trying to move ownership from a centralized SRE team and monolithic architecture to empower teams to build and own their systems. But moving from a situation where the technology set for building features is very limited to another one which is much more open has its risks as well, and we didn’t want to end up with a technology spectrum so wide that it would be difficult to maintain. This is why we wrote our Golden Path, a living document that details the technologies that teams are allowed to use in production for their services, and covers areas such as RPC protocols, storage layers or programming languages. We say it’s a living document because teams are still encouraged to evaluate other technologies when designing their systems, and, if proven they’re the right choice, we’ll update our Golden Path to reflect these. We’ll write another post with more details about this Golden Path.

From an architecture perspective we also depicted a high level view of how we’d design our end system, starting from the client-facing applications and APIs:


And then we described the set of components that we would have in our internal network:

Our 3-year technical vision was a collaborative effort where the entire engineering team was involved. We reviewed the proposal multiple times with different stakeholders, including all engineers, data scientists, product managers and other roles in the company. We received hundreds of comments that enriched and made the whole proposal better. We hosted several Q&As to ensure that all aspects of the vision were clear and there were no outstanding items to be resolved. We also presented it to our CEO and the board of directors. We needed the entire company to become owners of this vision, and leaders in achieving it. After our 3-year technical vision was finalized, a few subsequent long-term thinking proposals were driven by our engineering organization, such as:

  • Operational Model. We describe the infrastructure and networking that we’ll have to support our shift from centrally-owned infrastructure to a distributed mindset where each team owns the end-to-end lifecycle of their services.
  • Data. We describe our future internal and external reporting capabilities, and how these will work with a service oriented architecture where each service has its own storage layer and is not limited to a centralized MySQL DB. It also covers how to have a centralized data lake that our data scientists can rely on to build their ML models.
  • Frontend. We propose how to unify our frontend stack and extract our server side rendering from our monolith to Backend-for-Frontends for each application.
  • Mobile. We are rethinking our integration with our core services and how to share logic between the different applications that we have today.

Apart from this, the roadmaps from all of our teams have been adapted to align with our vision and now include areas of focus such as moving away from the monolith into their own service, having their own storage layer, or moving from in-house technology to industry standards. This is also reflected in all recent technical proposals that have been written, all of which start by clarifying that what’s outlined in the proposal is in alignment with our 3-year technical vision and the Golden Path.

But writing a document and sharing that proposal was just the seed of the vision. We’re now making tangible progress to get there, such as:

  • We have deprecated our in-house RPC protocol PySOA in favor of gRPC (tenet: we will choose conforming over creating/reforming). We did an initial evaluation where we compared PySOA with gRPC, and did a proof-of-concept to understand which would better suit our use cases. We decided to move to gRPC because it allows us to focus on our business needs instead of maintaining our own RPC protocol, and gRPC is superior since it supports HTTP2 (while PySOA relies on Redis), has a smaller payload size since it relies on protocol buffers and binary serialization, supports multiple programming languages instead of just Python, and has TLS/SSL support, among other advantages. We have also started writing new services using this new protocol.
  • We are enabling self-service AWS account provisioning and defining our networking and security layers so that teams can own their service’s infrastructure (tenet: teams will have end-to-end ownership of their systems and services).
  • We are migrating our unmanaged MySQL database to AWS Aurora (tenet: we favor cloud managed services or serverless for commoditized systems and components).
  • We have worked on several long-term designs for some of our key components such as Ordering and Event Management instead of focusing on shorter-term and incremental improvements (tenet: we will favor long-term maintainability and scale over short-term deliveries for strategic solutions). We have also started their implementation and expect our initial deliveries later this year.
  • We are writing new designs that break the previous limitation/guidance to only use Python/Django and MySQL and consider databases such as DynamoDB or QLDB, Kotlin or Go, and SNS or Kinesis, as a few examples (tenet: we will standardize on a few stacks but also empower teams to choose the right tool for the job).
  • We have recently launched an Operational Readiness Review process to analyze the reliability of our current codebase, as well as new designs, are overhauling our Security Review process, moving to full CI/CD, Dockerizing our monolith, raising the bar in our testing and quality processes, and several other initiatives that we have in place (tenet: we will strive for continuous improvement and will ask why not instead of why?).

These are just a few examples that show how having a clearly outlined long-term technical direction can have a significant impact on an organization’s architecture and processes. We will detail many more of these examples and their actual impact in upcoming posts.

We are excited about this new, long-term thinking technical vision that will provide the right guidance to our teams, indicate how the different pieces in our system should fit together, and help our every-day decision-making process. And what’s even more exciting is that the whole company participated in its definition and has embraced it with energy and passion.

MySQL High Availability at Eventbrite

Situation

Eventbrite has been using MySQL since its inception as a company in 2006. MySQL has served the company well as an OLTP database and we’ve leveraged the strong features of MySQL such as native replication and fast reads, as well as dealt with some of its pain points such as impactful table ALTERs. The production environment relies heavily on MySQL replication. It includes a single primary instance for writes and a number of secondary replicas to distribute the reads.  

Fast forward to 2019.  Eventbrite is still being powered by MySQL but the version in production (MySQL 5.1) is woefully old and unsupported. Our MySQL production environment still leans heavily on native MySQL replication. We have a single primary for writes, numerous secondary replicas for reads, and an active-passive setup for failover in case the primary has issues. Our ability to fail over is complicated and risky, which has resulted in extended outages as we’ve fixed the root cause of an outage on the existing primary rather than failing over to a new primary.

If the primary database is not available, then our creators are not creating events and our consumers are not buying tickets for these events. The failover from active to passive primary is available as a last resort but requires us to rebuild a number of downstream replicas. Early in 2019, we had several issues with the primary MySQL 5.1 database, and due to reluctance to fail over we incurred extended outages while we fixed the source of the problems.

The Database Reliability Engineering team in 2019 was tasked first and foremost with upgrading to MySQL 5.7 as well as implementing high availability and a number of other improvements to our production MySQL datastores. The goal was to implement an automatic failover strategy on MySQL 5.7 where an outage to our primary production MySQL environment would be measured in seconds rather than minutes or even hours. Below is a series of solutions/improvements that we’ve implemented since mid-year 2019 that have made a huge positive impact on our MySQL production environment. 

Solutions

MySQL 5.7 upgrade

Our first major hurdle was to get current with our version of MySQL. In July 2019 we completed the MySQL 5.1 to MySQL 5.7 (v5.7.19-17-log Percona Server to be precise) upgrade across all MySQL instances. Due to the nature of the upgrade and the large gap between 5.1 and 5.7, we incurred downtime to make it happen. The maintenance window lasted ~30 minutes and it went like clockwork. The DBRE team completed ~15 failover practice runs against Stage in the days leading up to the cut-over, and it’s one of the reasons the cutover was so smooth. The cut-over required 50+ engineers, product, QA, and managers in a hangout to support it, with another 50+ engineers assuming on-call responsibilities through the weekend. It was not just a DBRE team effort but a full Engineering team effort!

Not only had support for MySQL 5.1 reached End-of-Life (more than 5 years earlier), but our MySQL 5.1 instances on EC2/AWS had limited storage and we were scheduled to run out of space at the end of July. Our backs were up against the wall and we had to deliver!

As part of the cut-over to MySQL 5.7, we also took the opportunity to bake in a number of improvements. We converted all primary key columns from INT to BIGINT to prevent hitting MAX value. We had a recent production incident that was related to hitting the max value on an INT auto-increment primary key column. When this happens in production, it’s an ugly situation where all new inserts result in a primary key constraint error. If you’ve experienced this pain yourself then you know what I’m talking about. If not then take my word for it.  It’s painful!

In parallel with the MySQL 5.7 upgrade we also upgraded Django to 1.6, due to a behavioral change in MySQL 5.7 related to how transactions/commits were handled for SELECT statements. This behavior change was resulting in errors with older versions of Python/Django running on MySQL 5.7.

Improved MySQL ALTERs

In December 2019, the Eventbrite DBRE team successfully implemented a table ALTER via gh-ost on one of our larger MySQL tables.  The duration of the ALTER was 50 hours and it completed with no application impact. So what’s the big deal?

The big deal is that we could now ALTER tables in our production environment with little to no impact on our application, and this included some of our larger tables that were ~500GB in size.

Here is a little background. The ALTER TABLE statement in MySQL is very expensive. There is a global write lock on the table for the duration of the ALTER statement, which leads to a concurrency nightmare.  The duration of an ALTER is directly related to the size of the table, so the larger the table, the larger the impact.  For OLTP environments where lock waits need to be as minimal as possible for transactions, the native MySQL ALTER command is not a viable option. As a result, online schema-change tools have been developed that emulate the MySQL ALTER TABLE functionality using creative ways to circumvent the locking.

Eventbrite had traditionally used pt-online-schema-change (pt-osc) to ALTER MySQL tables in production. pt-osc uses MySQL triggers to move data from the original to the “duplicate” table, which is a very expensive operation and can cause replication lag.  As a matter of fact, it had directly resulted in several outages in H1 of 2019 due to replication lag or breakage.

GitHub introduced a new Online Schema Migration tool for MySQL (gh-ost) that uses a binary log stream to capture table changes and asynchronously applies them onto a “duplicate” table. gh-ost provides control over the migration process and allows for features such as pausing, suspending and throttling the migration. In addition, it offers many operational perks that make it safer and more trustworthy to use. It is:

  • Triggerless
  • Pausable
  • Lightweight
  • Controllable
  • Testable

Orchestrator

Next on the list was implementing improvements to MySQL high availability and automatic failover using Orchestrator. In February of 2020 we implemented a new HAProxy layer in front of all DB clusters and we released Orchestrator to production!

Orchestrator is a MySQL high availability and replication management tool. It will detect a failure, promote a new primary, and then reassign the name/VIP. Here are some of the nice features of Orchestrator:

  • Discovery – Orchestrator actively crawls through your topologies and maps them. It reads basic MySQL info such as replication status and configuration. 
  • Refactoring – Orchestrator understands replication rules. It knows about binlog file:position and GTID. Moving replicas around is safe: orchestrator will reject an illegal refactoring attempt.
  • Recovery – Based on information gained from the topology itself, Orchestrator recognizes a variety of failure scenarios. The recovery process utilizes the Orchestrator’s understanding of the topology and its ability to perform refactoring. 

Orchestrator can successfully detect the primary failure and promote a new primary. The goal was to implement Orchestrator with HAProxy first and then eventually move to Orchestrator with ProxySQL.

Manual failover tests

In March of 2020 the DBRE team completed several manual/controlled fail-overs using Orchestrator and HAProxy. Eventbrite experienced some AWS hardware issues on the MySQL primary and completing manual failovers was the first big test. Orchestrator passed the tests with flying colors.

Automatic failover

In May of 2020 we enabled automatic fail-over for our production MySQL data stores. This is a big step forward in addressing the single-point-of-failure situation with our primary MySQL instance. The DBRE team also completed several rounds of testing in QA/Stage for ProxySQL in preparation for the move from HAProxy to ProxySQL.

Show time

In July 2020, Eventbrite experienced hardware failure on the primary MySQL instance that resulted in automatic failover.  The new and improved automatic failover process via Orchestrator kicked in and we failed over to the new MySQL primary in ~20 seconds. The impact to the business was astronomically low! 

ProxySQL

In August of 2020 we made the jump to ProxySQL across our production MySQL environments.  ProxySQL is a proxy specially designed for MySQL. It allows the Eventbrite DBRE team to control database traffic and SQL queries that are issued against the databases. Some nice features include:

  • Query caching
  • Query Re-routing – to separate reads from writes
  • Connection pool and automatic retry of queries

Also during this time period we began our AWS Aurora evaluation as we started our “Managed Databases” journey. Personally, I prefer to use the term “Database-as-a-Service (DBaaS)”. Amazon Aurora is a high-availability database that consists of compute nodes replicated across multiple availability zones to gain increased read scalability and failover protection.  It’s compatible with MySQL, which is a big reason why we picked it.

Schema Migration Tool

In September of 2020, we began testing a new Self-Service Database Migration Tool to provide interactive schema changes (it invokes gh-ost behind the scenes).  It supports all “ALTER TABLE…”, “CREATE TABLE…”, and “DROP TABLE…” DDL statements.

It includes a UI where you can see the status of migrations and run them with the click of a button.

Any developer can file and run migrations, and a DBRE is only required to approve the DDL (this is all configurable, though). The original source for the tool can be found here: shift. We’ve not pushed it to production yet, but we do have a running version in our non-prod environment.

Database-as-a-Service (DBaaS) and next steps

We’re barreling down the “Database-as-a-Service (DBaaS)” path with Amazon Aurora. 

DBaaS flips the DBRE role a bit by eliminating mundane operational tasks and allowing the DBRE team to align our work more closely with the business to derive value from the data assets that we manage. We can then focus on lending our DBA skills to application teams and end users—helping deliver new features, functionality, and proactive tuning value to the core business. 

Stay tuned for updates via future blog posts on our migration to AWS Aurora! We’re very interested in the scaled read operations and increased availability that Aurora DB cluster provides. Buckle your seatbelt and get ready for a wild ride 🙂  I’ll provide an update on “Database-as-a-Service (DBaaS)” at Eventbrite in my next blog post!

All comments are welcome! You can message me at ed@eventbrite.com. Special thanks to Steven Fast for co-authoring this blog post.

Building a Scalable Reserved Seating Ticketing Solution with Redis and Lua

After working in online ticketing for many years, I’ve seen how speed is everything, especially during large on-sales where the general public swarms on a site as if it were a DoS attack.  Since the items being sold are unique inventory, the system has to be much more fluid than found in your typical retail store.  Tickets must be locked, reserved, or released back into the pool when users reject their seat choice or simply walk away from their browser.  This may seem straightforward, but the concurrent nature of ticketing, where multiple users compete for the same inventory, is what can make the system behavior very unpredictable.  Too much latency from the added bloat of things like message buses and ORM libraries will sink a ticketing system quickly, which means it must be extremely lean and efficient in order to survive.  Teamed up with a small group of ticketing veterans, we set out to build Eventbrite’s first reserved seating system and to demonstrate the value of keeping things simple for the sake of performance and long-term maintainability.

Continue reading “Building a Scalable Reserved Seating Ticketing Solution with Redis and Lua”

Replayable Pub/Sub Queues with Cassandra and ZooKeeper

When first playing around with Cassandra and discovering how fast it is at giving you columns for a row, it appears to be an excellent choice for implementing a distributed queue. However, in reality queues tend to bring out the worst of Cassandra’s thorniest areas, tombstones and consistency levels, and are thus seen as an antipattern.

Row-Based vs Column-Based

To implement a queue in Cassandra, you must choose either a row-based or a column-based design.  In the row-based design, the item to be processed is stored as a row key. In the column-based design, the item to be processed is stored as a column in a specific row.

With the item to be processed stored as a row key, consistency becomes a bottleneck. Since the items to process are unknown, getting range slices across row keys is the only way to fetch data; this operation ends up querying every node when all keys are needed, as the location and number of keys are unknown ahead of time. Since not all nodes are available at any given time, this is less than ideal.

Continue reading “Replayable Pub/Sub Queues with Cassandra and ZooKeeper”

Optimizing a table with composite primary keys

To scale our data storage, Eventbrite’s strategy has been a combination of: move data to NoSQL solutions, aggressively move queries to slave databases, buy better database hardware, maintain different indexes on database slaves that receive different queries, and finally: design the most optimal tables possible for large and highly-utilized data-sets.

This is a story of optimizing a design for a single MySQL table to store multiple email-addresses per-user (needed by some forward-looking infrastructure we are building). We’ll discuss the Django implementation in a future post.

Multiple Email Address Table

To support multiple email-addresses per-user in MySQL, we need a one-to-many table. A typical access pattern is lookup by email-address, and a join to the users table.

Here is the basic design, followed by our improvements.

The Naïve Implementation

The basic design’s one-to-many table would have an auto-increment primary-key, a column for the email-address, and an index on the email-address. Lookups by email-address will pass through that index.

DROP TABLE IF EXISTS `user_emails`;
CREATE TABLE `user_emails` (
 `id` int NOT NULL AUTO_INCREMENT,
 `email_address` varchar(255) NOT NULL,
 … -- other columns about the user
 `user_id` int, -- foreign key to users
 PRIMARY KEY (`id`),
 KEY (`email_address`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Continue reading “Optimizing a table with composite primary keys”

Watching Metadata Changes in a Distributed Application Using ZooKeeper

We created a distributed ETL system we affectionately call Mandoline. It is configurable, distributed, scalable, and easy to manage – here’s how we did it.

One of the hardest parts of building a distributed system is ensuring proper coordination between nodes across a cluster, and we decided to do it using Apache ZooKeeper. ZooKeeper can be imagined as a remote file system, where every file is also a folder (these are referred to as “znodes”). For example, let’s say we have the znode /mandoline where we store the system version, "1". Under /mandoline we may also store items like the configuration, so /mandoline/load_config stores our load configuration (in our case, a json dictionary for every source). The magic sauce of ZooKeeper is that it guarantees “strictly ordered access” to znodes, allowing synchronization and coordination to be built on top of it.

Mandoline coordinates its tasks using watchers, a neat ZooKeeper feature. You can specify a znode, and whether you’d like to watch changes to its data or its children, and ZooKeeper will notify you once those have happened. In the Mandoline architecture, processes choose specific znodes and use those to communicate changes in the data they are responsible for.

For example, we have a component that loads orders from our Orders table in the database, and we have two other components that need to track: 1. the purchase history of a given user, and 2. the total sales for that event. This is how the loading data component does it:

# Track the most recent checkpoint seen in this batch of query results.
latest_timestamp = 0
for datum in query_data:
    key = datum.pop(primary_key)

    # Each datum carries the timestamp at which it was last changed.
    timestamp = datum.pop(MANDOLINE_TIME_CHECKPOINT, 0)
    if timestamp > latest_timestamp:
        latest_timestamp = timestamp

    # The actual work items are written to the Cassandra batch, not to ZooKeeper.
    main_batch.insert(key, datum)

# Publish only the small checkpoint value to the notification znode;
# retry() re-runs the set if the ZooKeeper connection hiccups.
self.zk_client.retry(
    self.zk_client.set,
    self.load_notification_node,
    str(latest_timestamp),
)

Notice that many operations are performed for a given query, yet only a small value (a timestamp, in this case) is written to ZooKeeper. Znodes have a restriction in that they cannot hold large values, so the queue containing the items to actually perform work on is stored in Cassandra, while ZooKeeper handles the notification part.
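
On the consuming side, a component watching that znode only needs to register a watcher and then fetch the real work items from Cassandra. A minimal sketch using the kazoo client (kazoo and the znode path are assumptions for illustration; the post does not prescribe a specific client library):

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Hypothetical notification znode, mirroring self.load_notification_node above.
NOTIFICATION_NODE = "/mandoline/orders/load_notification"

@zk.DataWatch(NOTIFICATION_NODE)
def on_checkpoint_change(data, stat):
    # Called at registration and again whenever the znode's data changes.
    if data is None:
        return  # the znode does not exist yet
    latest_timestamp = int(data.decode("utf-8"))
    # The timestamp is only a signal; the actual items to process live in Cassandra.
    print("orders loaded up to", latest_timestamp)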

Continue reading “Watching Metadata Changes in a Distributed Application Using ZooKeeper”