Creating the 3 Year Frontend Strategy

In the last post we talked about Developing the 3 Year Frontend Vision. In this post we will go into how that vision, together with the tenets, requirements, and challenges, shaped the Strategy moving forward.

One of the key themes at Eventbrite since I joined has been DevOps: moving ownership away from a single team that has been responsible for ops and distributing that responsibility to each individual team, giving teams ownership over their decisions and infrastructure and control over their own destiny. The first step in defining the Strategy was to put together what a Technical Strategy is, and the foundation for that strategy.

Technical Strategy

The overall Technical Strategy is based on availability and ownership. From the way we build our services and frontends to the way we deploy and serve assets to our customers, the architecture is designed to reduce the blast radius of errors, increase our uptime, and give each team as much control over their space as possible.

Availability

Moving forward we will achieve High Availability (HA), in which our frontends and systems are resilient to faults and traffic spikes and operate continuously without human intervention. To achieve HA, we will use managed AWS services or redundant, fault-tolerant software, and we will use content delivery networks (CDNs) to increase our performance and resilience by putting our code as close to the customer as possible. We will ensure that all aspects of the system are tested, fault tolerant, and resilient, and that both the client side and the server side degrade gracefully when downstream services fail.
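
To make graceful degradation a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `with_fallback`, `fetch_recommendations`, and `render_event_page` are hypothetical names, not part of any existing Eventbrite code.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(call: Callable[[], T], fallback: T) -> T:
    """Return the result of `call`, or `fallback` if the downstream call fails."""
    try:
        return call()
    except Exception:
        # Degrade gracefully: serve a reduced experience instead of an error page.
        return fallback


def fetch_recommendations() -> list:
    # Hypothetical downstream service call that is currently failing.
    raise ConnectionError("recommendation service unavailable")


def render_event_page() -> dict:
    # The page still renders; only the recommendations module comes back empty.
    recommendations = with_fallback(fetch_recommendations, fallback=[])
    return {"title": "Event details", "recommendations": recommendations}
```

The point of the sketch is that a failure in one downstream dependency shrinks the page rather than failing the whole request.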

Ownership

DevOps combines the traditional software development by one team and operations and infrastructure by another into a single team responsible for the full lifecycle of development and infrastructure management. This combination enables organizations to deliver applications at a higher velocity, evolving and improving their products at a faster pace than traditional split teams. The goal of DevOps is to shift the ownership of decision making from the management structure to the developers, improve processes, and remove unproductive barriers that have been put in place over the years.

Frontend

Once we had the foundation of the strategy defined, it was time to define the scope. To develop a strategy, or even to define one, we need to understand what makes up a “frontend”. In our case, the Frontend is everything from the backend service API calls to the customer. Because of this, we need to design a solution that allows code to run in a browser and on a server, and that supports service calls made from a browser. Once you define the surface area of the solution, it becomes apparent that the scope and complexity of this problem compound quickly.

High Level Architecture

We need to define an architecture for everything above the red line in the above graphic. In order to simplify the design, I broke this down into three main areas: the UI Layer, consisting of a micro-frontend framework with team-built Custom Components; a shared Content Delivery Network (CDN) to front all customer facing pages; and a deployable set of bundled software that we code named Oberon, including a UI Rendering Service and a Backend-For-Frontend.

UI Layer

The UI leverages the micro-frontend architecture and modern web framework best practices to build frontends that leverage browser specifications while being resilient and team owned.

Micro-Frontend

When first approaching the micro-frontend architecture I realized that there is no clear definition of what a micro-frontend is.

Martin Fowler has a very high level definition which he states as

“An architectural style where independently deliverable frontend applications are composed into a greater whole”.

Xenon Stack describes a Micro-frontend as

“a Microservice Testing approach to front-end web development.”

Reading through the many opinions and definitions, I felt it was necessary to get a clearer understanding, and for everyone to agree on what a micro-frontend architecture is. I worked with a couple of other Frontend Engineers to put together the following definition for a Micro-Frontend.

Definition

A Micro-Frontend is an architecture for building reusable and shareable frontends. Micro-frontends are independently deployable, composable frontends made up of components which can stand on their own or be combined with other components to form a cohesive user experience. This architecture is generally supported by hosting a parent application which dynamically slots in child components. Components within a micro-frontend should not communicate explicitly with external entities, but should instead publish and subscribe to state updates to maintain loose coupling.

Micro-frontends are inspired by the move to microservices on the backend, bringing the same level of ownership and team independent development and delivery to the frontend.

Self-Contained Components

In order to avoid frontends that over time inadvertently tightly couple themselves and create fragile un-reusable components, we must build components that are encapsulated, isolated, and able to render without the requirement of any other component on the page. 

Component Rendering Pipeline

The Component Rendering pipeline renders components to the customer while the framework defines a set of Interfaces, Application Context, and a predictable state container for use across all of the rendering components.

State Management

State management is responsible for maintaining the application state, inter-component communication and API calls. State updates are unidirectional; updates trigger state changes which in turn invoke the appropriate components so they can act on the changes. 
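
The unidirectional flow described above can be sketched as a small state container. This is a minimal illustration in Python rather than a design for our framework; `Store`, `cart_reducer`, and the `ticket_added` action are hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Action = Dict[str, Any]
Reducer = Callable[[State, Action], State]
Listener = Callable[[State], None]


class Store:
    """Hypothetical predictable state container with one-way data flow."""

    def __init__(self, reducer: Reducer, initial_state: State) -> None:
        self._reducer = reducer
        self._state = initial_state
        self._listeners: List[Listener] = []

    def get_state(self) -> State:
        return self._state

    def subscribe(self, listener: Listener) -> Callable[[], None]:
        """Register a component callback; returns a function that unsubscribes it."""
        self._listeners.append(listener)
        return lambda: self._listeners.remove(listener)

    def dispatch(self, action: Action) -> None:
        """Apply an action through the reducer, then notify every subscriber."""
        self._state = self._reducer(self._state, action)
        for listener in list(self._listeners):
            listener(self._state)


def cart_reducer(state: State, action: Action) -> State:
    # Return a new state object rather than mutating the old one.
    if action["type"] == "ticket_added":
        return {**state, "tickets": state["tickets"] + action["quantity"]}
    return state


store = Store(cart_reducer, {"tickets": 0})
seen: List[int] = []
# A "cart badge" component subscribes to state instead of calling other components.
unsubscribe = store.subscribe(lambda state: seen.append(state["tickets"]))
store.dispatch({"type": "ticket_added", "quantity": 2})
store.dispatch({"type": "ticket_added", "quantity": 1})
unsubscribe()
store.dispatch({"type": "ticket_added", "quantity": 5})
```

Components never call each other directly here: they dispatch actions and react to the resulting state, which is what keeps them loosely coupled.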

Content Delivery Network

Our current architecture has resilience issues: when one portion of the site becomes slow or unresponsive, it has a direct impact on the rest of the domain and in many cases causes an overall site availability issue. To mitigate this, we add a CDN at the ingress of our call stack. Every downstream frontend rendering will contain Cache-Control headers in order to control the caching of assets and pages in the CDN. During a site availability issue, the rendering fleet may increase the cache lifetime in the Cache-Control header, caching for short periods (60 seconds to 5 minutes at most) those pages that don’t require dynamic rendering or customer content. This takes load off the fleet and increases its resource availability for other areas.
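
As a rough sketch of how a rendering service might choose that header, consider the Python function below. The function name, its parameters, and the healthy-state default of `no-cache` are assumptions made for illustration; only the 60 second to 5 minute window comes from the strategy itself.

```python
def cache_control_header(is_dynamic: bool, site_degraded: bool, ttl_seconds: int = 60) -> str:
    """Pick a Cache-Control value for a rendered page.

    Assumptions: pages with dynamic or customer-specific content are never
    cached at the CDN, and the degraded-mode lifetime is clamped to the
    60 to 300 second window.
    """
    if is_dynamic:
        # Personalized pages must not be stored in a shared cache.
        return "private, no-store"
    if site_degraded:
        ttl = max(60, min(ttl_seconds, 300))
        # s-maxage applies only to shared caches such as the CDN.
        return f"public, s-maxage={ttl}"
    # Healthy fleet (assumed default): the CDN revalidates with the origin each time.
    return "no-cache"
```

For example, `cache_control_header(False, True, 900)` clamps the requested 15 minutes down to `public, s-maxage=300`.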

Oberon

Oberon is a collection of software and Infrastructure-as-Code (IaC) that enables teams to set up frontends quickly and to get in front of customers faster. It includes a configurable Gateway pre-configured for authentication as needed, a UI Rendering Service to server-side render UIs, a UI Asset Server to serve client-side assets, and a stubbed-out Backend-For-Frontend.

Server Side UI Rendering Service

The UI Rendering Service defines a runtime environment for rendering applications and their components, and is responsible for serving pages to customers. The service maps incoming requests to applications and pages, gathers dependency bundles, and renders the layout to the customer. Oberon will combine the traffic-absorbing nature of a CDN with the scaling of a fully serverless architecture.
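
The request-to-page mapping could look something like the following Python sketch. Everything in it is hypothetical: the route patterns, the application and component names, and the `/assets/` bundle paths are placeholders, since the low level architecture of Oberon has not been defined yet.

```python
import re
from typing import Optional

# Hypothetical route table: URL pattern -> application, page, and its components.
ROUTES = [
    (
        re.compile(r"^/e/(?P<event_id>[\w-]+)$"),
        {"app": "listings", "page": "event", "components": ["header", "event-details"]},
    ),
    (
        re.compile(r"^/d/(?P<city>[\w-]+)/events$"),
        {"app": "discovery", "page": "city", "components": ["header", "event-grid"]},
    ),
]


def resolve(path: str) -> Optional[dict]:
    """Map an incoming request path to a page and gather its dependency bundles."""
    for pattern, page in ROUTES:
        match = pattern.match(path)
        if match:
            bundles = [f"/assets/{name}.js" for name in page["components"]]
            return {**page, "params": match.groupdict(), "bundles": bundles}
    return None


def render(path: str) -> str:
    """Render a layout with one slot per component, followed by its bundles."""
    resolved = resolve(path)
    if resolved is None:
        return "<h1>Not found</h1>"
    slots = "".join(f'<div data-component="{c}"></div>' for c in resolved["components"])
    scripts = "".join(f'<script src="{b}" defer></script>' for b in resolved["bundles"])
    return f"<main>{slots}</main>{scripts}"
```

A real service would render the components server side rather than emit empty slots; the sketch only shows the three steps of mapping, gathering, and laying out.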

Backend-For-Frontends (BFF)

A BFF is part of the application layer, bridging the user experience and adding an abstraction layer over the backend microservices. This abstraction layer fills a gap that is inherent in the microservice architecture, where microservices must compete to be as generic as possible while the frontends need to be customer driven.  

BFFs are optimized for each specific user interface, resulting in a backend that is smaller, less complex, and faster than a generic one. This allows the frontend code to 1) limit over-requesting on the client, 2) be simpler, and 3) see a unified version of the backend data. Each interface team will have a BFF, giving them the autonomy to control their own interface calls, choose their own languages, and deploy as early or as often as they would like.
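
Here is a minimal sketch of what a BFF endpoint does, written in Python. The three `get_*` functions are hypothetical stand-ins for generic backend microservices and return hard-coded data; a real BFF would call those services over the network.

```python
def get_event(event_id: str) -> dict:
    # Hypothetical stand-in for a call to a generic event microservice.
    return {"id": event_id, "name": "Jazz Night", "venue_id": "v1", "internal_flags": ["x"]}


def get_venue(venue_id: str) -> dict:
    # Hypothetical stand-in for a call to a generic venue microservice.
    return {"id": venue_id, "name": "Blue Room", "city": "Nashville", "capacity": 200}


def get_ticket_classes(event_id: str) -> list:
    # Hypothetical stand-in for a call to a generic ticketing microservice.
    return [
        {"name": "General", "price_cents": 2500, "remaining": 12},
        {"name": "VIP", "price_cents": 6000, "remaining": 0},
    ]


def event_page_view(event_id: str) -> dict:
    """BFF endpoint: one call from the UI, shaped for exactly one page."""
    event = get_event(event_id)
    venue = get_venue(event["venue_id"])
    tickets = get_ticket_classes(event_id)
    available = [t for t in tickets if t["remaining"] > 0]
    return {
        "title": event["name"],
        "location": f'{venue["name"]}, {venue["city"]}',
        "from_price_cents": min((t["price_cents"] for t in available), default=None),
        "sold_out": not available,
    }
```

The client makes a single request and receives only the fields the page needs, while fields such as `internal_flags` and `capacity` never leave the backend.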

Next Steps

Now that we’ve published the 3 Year Frontend Strategy, the hard work begins. Over the next few months we will be defining the low level architecture of Oberon, and working on a Proof Of Concept that teams can start to leverage in early 2022.

Creating a 3 Year Frontend Vision

JC Fant IV
Oct-5th-2021

History

Over the course of the last 21 years I’ve spent time in nearly every aspect of the technical stack. However, I’ve always been drawn to the frontend as the best place to be able to impact customers. I’ve enjoyed the rapid iterations, and the ability to visualize those changes in the browser. It’s why I spent much of the last 14 years prior to Eventbrite at Amazon (AWS) evangelizing the frontend stack. That passion led me to co-found one of the largest internal conferences at Amazon, reaching over 7500 engineers across 6 continents. The conference is focused on all aspects of the Frontend, and helped to highlight technologies that teams could adopt and leverage to solve customer problems.

In March of 2021 I joined Eventbrite to help solve some of those same challenges that I’ve spent much of my career trying to solve. As part of my onboarding I was asked to ramp up on the current problem space and the technical challenges the company faces, and to dive into the issues impacting many of our frontend developers and designers. With all of that knowledge, I was tasked with coming up with a 3 Year Frontend Strategy.

Many of you have already read the first 3 posts in this series, Creating our 3 year technical vision, Writing our 3 year technical vision, and Writing our Golden Path. If you haven’t had a chance, those 3 posts help to set the context for how we defined and delivered our 3 year Frontend Strategy.

Current Challenges and Limitations

In those previous posts, Vivek Sagi and Daniel Micol described many of the problems that backend engineers, and engineers in general face at Eventbrite. My first task was to engage and listen to the Frontend Engineers around the company and to identify more specific frontend challenges and limitations that we face every day.

  • A monolithic architecture leads to teams having unnecessary dependencies and being forced to move at the speed of the monolith. They are often blocked by other changes or the release schedule of the monolith.
  • Our performance is suboptimal, leading to some poor customer interfaces and low Lighthouse scores.
  • We lack automation in how we test, deploy, monitor and roll back our frontend code.
  • Our frontends are currently written in both a legacy framework and a more modern framework where the rendering patterns have diverged, and are no longer swappable without a migration. 
  • Service or datastore performance issues have a high blast radius, where all aspects of the site are degraded, including pages that are static in nature.
  • Our frontend experiences are inconsistent across our product portfolio, and making changes to deliver against our 3-year self-service strategy requires too much coordination.

Developing Requirements

Now that we had a decent understanding of the issues we’ve been facing, we turned our attention to understanding the requirements to solve these problems. 

  1. Features. As our product offering evolves to deliver high quality self-service experiences for creators and attendees, we ensure that our technology stack enables teams to efficiently create, optimize, and maintain the net new functionality we provide. 
  2. Performance. User perception of our product’s performance is paramount: a slow product is a poor product that impacts our customers’ trust. 
  3. Search Engine Optimized. Through page speed, optimized content, and an improved User Experience, our frontends must employ the proper techniques to maintain or increase our SEO.
  4. Scale. Our frontends must out-scale our traffic, absorbing load spikes when necessary, and deliver a consistent customer experience.  
  5. Resilient. Our frontends will respond to customer requests, regardless of the status of downstream services. 
  6. Accessible. Our frontends will be developed to ensure equal access and opportunity to everyone with a diverse set of abilities.
  7. Quality. The quality of our experiences should be prioritized to deliver customer value, solve customer problems, and be at a level of performance that meets our SLAs and reduces customer reported bugs.

Defining Our Tenets

We set out to define a core set of tenets for this strategy: principles designed to guide our decision making. These tenets help us to align the vision and decisions against our end goals. I wanted these tenets to be focused on driving the solution to be something that Frontend Engineers want to adopt, not something they must adopt. We need to deliver something that is seductive, makes engineers’ lives better, and in turn directly impacts our customers, as engineers are able to move quicker and have the autonomy and ownership to make decisions.

  1. Developer Experience. Start with the developer and work backwards. Tools and frameworks must enable rapid development. Developing inside the Frontend Strategy must be easy and fast, with limited friction.
  2. Metric Driven. We make decisions through the use of metrics; measuring how our pages and components behave and their latencies to drive changes.
  3. Ownership. Teams control their own destiny from end-to-end. From the infrastructure to the software development lifecycle (SDLC), owning the full stack leads to better customer focus, team productivity, and higher quality code.
  4. No Obstacles. We remove gatekeepers from the process by providing self-service options, reusable templates, and tooling.
  5. Features Over Infrastructure. We leverage solutions that unlock frontend engineer productivity, in order to focus on customer features rather than maintaining our infrastructure. 
  6. Pace of Innovation. We build solutions to obstacles that interfere with getting features in front of customers.
  7. Every Briteling. We build tools and leverage technology that allows every Briteling to build customer facing features. 

Developing Our Vision

Now that we had the challenges, requirements, and tenets outlined, we needed to define a vision for this 3 year frontend strategy. Following the tenets, we want to empower Britelings to deliver customer impactful features and make our customers’ lives better. We want this vision to be something everyone in the company can get behind, and as such we don’t actually reference Frontend Engineers. Instead, we strive to empower ALL Britelings to deliver customer impactful experiences.

Vision

Delight creators and attendees by empowering Britelings to easily design, build, and deliver best in class user experiences. 

In the next post we will talk about the Strategy and the architecture.

A day in the life of a Technical Fellow

In my two most recent blog posts, I talked about how to write a Long-Term Technical Vision and a Golden Path. These are future-looking and high-level artifacts so the question I keep hearing is: do I need to give up coding to grow in my career and become a Technical Fellow? In this post I will explain what it’s like being a Technical Fellow and how to strike a good balance between breadth and depth. Let’s also forget about the specific title for a moment, since different companies will have other names such as Distinguished or Senior Principal Engineer. What really matters is the scope and how to be able to cope with it while ensuring that you don’t become a person who’s too detached from the details and provides overly generic feedback and guidance.

Eventbrite has roughly 40 engineering teams and in theory I could say that my scope covers all of them. However, it’s unrealistic to be involved in so many of them and have enough context to provide meaningful contributions to each team. The two critical aspects for making this work are: knowing how to prioritize my time, and being able to delegate. But how did I learn this?

Earlier in my career, I was the tech lead for a small team with two other engineers. Over time the product that we had built was successful and we grew to three feature teams, with me being the uber tech lead for them. At first I was trying to be as embedded into each of them as I was when I belonged to just one team: attending their standups, being part of the technical design reviews, coding, etc. Soon enough I realized that this approach would not scale and I sought feedback on how to manage the situation. One piece of advice that was critical in my career was: “in order to grow, you need to find or grow other people to do what you’re doing now, so you can then become dispensable and start focusing on something else”. That “something else” could be taking on a larger scope or just finding another area to work on, but the key here is that what we should be aspiring to is growing others so that they end up doing a similar job to what we’re doing now, and we should become dispensable in our current role. It is interesting to think that our goal should be to reach a point where we’re almost irrelevant, and that took me time to properly understand, but it’s really key for career growth.

After growing tech leads in the three teams I was overseeing, I could start focusing on the larger picture. However, I didn’t want to become too detached from the lower level details, so I opted for working in a rotating way with the three teams, where each quarter I would become a part-time IC for each of those teams, including coding tasks, designs, code reviews and being on call. And I say part-time because I still had to invest time in my breadth activities and thinking about the long term. I structured my schedule in a way where my mornings would be mostly IC work and the afternoons would be filled with leading the overall organization and being a force multiplier. This dual approach where I oversaw the larger organization but also had time to tackle lower level aspects allowed me to focus on the bigger picture while being attached to the actual problems that teams were facing, and have enough context to be useful when providing them feedback and guidance.

Time has passed and at Eventbrite I now follow a similar model but with a larger set of teams. Since the rotational approach won’t work as well (rotating a team per quarter will take me 10+ years to complete each rotation), we decided to implement a model where Principal Engineers and above (including Technical Fellows) would have different engagement levels with each team, which could be divided into the three categories listed below:

  • Sponsors are part of a team and spend ~2 days/week working with that team, which includes attending the standup, participating in system designs, coding and being on call. We expect Principal+ engineers to sponsor at most 2 areas at any given point in time.
  • Guides spend ~2 hours/week on a given project. They are aware of the team’s mission and roadmap, protect the long-term architecture, provide the long-term direction of the product, and may be active in the code base.
  • Participants are available to a team for any questions they have or to help disambiguate areas of concern. They are active in meetings but may not be deep in the code base. Participants spend a few hours a month on the project/team.

With the above in mind, I am sponsoring two teams right now, and that is expected to rotate based on the teams who will need my involvement the most. As of today this means that I’m more involved in the Ordering and Event Infrastructure teams, including coding, working on technical designs, mentoring others in the team, etc.

So what does a day in my life look like?

As I mentioned before, I structure my day so that in the mornings I will do IC work and the afternoons will be for breadth work. Right now my main area of focus as an IC is getting our new Ordering Pipeline implemented and that’s where I spend most of my coding cycles. This is a brand new service written in Kotlin that uses gRPC and AWS technologies such as DynamoDB and Lambda. It’s particularly critical not only because Ordering is at the core of Eventbrite, but because it’s paving the way for the new generation of services that we’re starting to build in the company, since this is the first one with the technologies and processes outlined in the 3-Year Technical Vision and the Golden Path. As such, many of the services that follow will use what Ordering is building today as their reference architecture, and we’re also finding a few unpredicted gaps that we have to solve before other teams find them. I was also on call for this team a couple of weeks ago.

In contrast, my afternoons are typically filled with breadth work: 1:1s, syncs with other people in Argentina or the US, company tech talks, design reviews, and others. For example, I was recently heavily involved in coming up with a new engineering career guide for the company (which we’ll blog about at some point), and I attend leadership syncs with our CTO and CPO about the current state of our Foundations and the challenges ahead of us.

As time passes my focus will move away from Ordering to other areas where I can contribute in depth, and by then I expect to have grown the team to a state where they don’t miss me and can keep moving forward without my help. Breadth work is here to stay and can be very different each week depending on what the company needs the most at that particular moment.

Writing our Golden Path

In my last blog post I explained how we defined our 3-year technical vision for the company. One of the key pillars of this vision is shifting from a model where we used the same tool for every job (mostly a combination of Python + Django + MySQL), to the right tool(s) for each job. Given that this would be a new way of working for our organization, we wanted to have some guidelines that teams would follow to ensure that our services and applications wouldn’t have a completely different tech stack depending on the team developing them, which would harm the maintainability of our overall architecture. This is why we decided to write a Golden Path document that would guide teams on the best set of technologies for each potential scenario and recommended tools for common repeatable use cases like logging, security, etc. 

The Golden Path is a document that explains the allowed technologies available for use at Eventbrite when building software. It has been built collaboratively by the entire development organization and is in continuous evolution as teams find better solutions for the problems to be solved. We require any technology choice that is not included in this list to have explicit approval from the Architecture Review Committee (ARC), which is our engineering governance body, before implementing it.

Therefore, one principle around our Golden Path is that we are recommending the use of the “right tool for the job,” which most often means opting for industry standard technologies (enabling us to focus our limited innovation tokens on technological advancements unique to live experiences). Teams are encouraged to evaluate other alternatives that are not in this document when working on their system designs, or challenge currently deprecated ones, and propose these edits to ARC if they find them superior or better suited for their use case than the currently approved ones. This is the way we keep this as a living document that improves over time and adapts to new industry trends.

We divide technologies into the following life cycle phases:

  • Emerging. New technologies that are very likely to become recommended but are not production-ready yet.
  • Recommended. The default choice as of today.
  • Allowed. Technologies that we allow although the recommended one should be used if possible.
  • Deprecated. Discouraged for new development but could be maintained for currently-existing systems.
  • Rejected. Technologies that we don’t use, or haven’t used in the past, and that were rejected in previous evaluations.
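
To illustrate how these phases might drive a decision, here is a small Python sketch. The `GOLDEN_PATH` table is an illustrative excerpt whose phases are taken from the examples given in this post, and the rule in `needs_arc_approval` is my simplifying assumption, not the actual ARC process.

```python
from enum import Enum


class Phase(Enum):
    EMERGING = "emerging"
    RECOMMENDED = "recommended"
    ALLOWED = "allowed"
    DEPRECATED = "deprecated"
    REJECTED = "rejected"


# Illustrative excerpt only; the real document covers many more technologies.
GOLDEN_PATH = {
    "kotlin": Phase.RECOMMENDED,
    "python": Phase.RECOMMENDED,
    "go": Phase.EMERGING,
    "grpc": Phase.RECOMMENDED,
    "pysoa": Phase.DEPRECATED,
    "aws rds": Phase.REJECTED,
}


def needs_arc_approval(technology: str) -> bool:
    """True when a new design using `technology` should go to ARC first.

    Assumption: only recommended and allowed technologies can be used in new
    development without review; anything else, including technologies that
    are not listed at all, requires explicit approval.
    """
    phase = GOLDEN_PATH.get(technology.lower())
    return phase not in (Phase.RECOMMENDED, Phase.ALLOWED)
```

Under this assumption `needs_arc_approval("Kotlin")` is false, while an unlisted technology such as Cassandra would need a review.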

Our Golden Path contains several sections such as programming languages (for microservices, data science, frontend), source package managers, web frameworks, databases and caching, among others. The guidance for how to apply the Golden Path when working on a technical design is as follows:

  • Every section in the document should have a matrix that outlines the best path forward for the use cases that we’ve faced in the past, or a description that clearly specifies this. If our use case is in that list, we should choose the best technology outlined in the matrix.
  • Even if we choose a technology that has been already evaluated in the past, we still need to come up with data for our specific scenario in key dimensions such as cost, latency, etc. to ensure that it will work for this specific scenario. 
  • If a section doesn’t have a matrix yet, or our use case is not included, we will conduct a technology evaluation and contribute to the matrix. The guidelines for this are:
    • We should consider at least two options and do a full bake off before we pick a winner. Choose based on the dimensions that are important for our scenario (features, use case fit, ease of use, cost, latency, consistency, etc).
    • We are not limited to AWS technologies. For the decisions that we make, we should evaluate both the AWS offering and any other leading non-AWS contender (e.g. DynamoDB and Cassandra), including compatibility and integration with other tools of the stack. We will not favor AWS by default and will only use it as a tie-breaker if both offerings are equivalent.
    • Technologies that are deprecated shouldn’t be re-evaluated unless there’s a strong belief that the particular scenario that is being designed will be different than the reasons why that technology was deprecated (e.g. we shouldn’t be looking into unmanaged solutions since those are deprecated). These exceptions will need to be approved by ARC.

Our Golden Path was published in early 2021, a few weeks after we finalized our 3-year tech vision, and every technical design or proposal that has emerged since then is following this new standard. We do envision that in a few years from now we should be able to remove these barriers since teams will have enough internal examples to decide the best tool for the job without the risk of significantly diverging the chosen options for similar use cases.

Here are a few examples of sections extracted from the Golden Path document:

Native Libraries and Wrappers

  • Native Libraries (recommended). We should favor using the native libraries of the tools that we use (e.g. AWS SDKs, feature flags, metrics, etc). Each team consuming those SDKs is responsible for upgrading to newer versions when needed.
  • Wrappers (deprecated). We do not want to use wrappers unless they provide clear additional benefit over native libraries (such as extended capabilities or ease of use), and we do not believe in the argument that using native libraries is a lock-in to a specific technology, as the downside of building and consuming our own wrappers is a bigger problem. Wrappers tie us to specific underlying library versions, require migration effort as new native library versions are released, and are always a subset of the functionality that those libraries provide.

Microservice Programming Languages

  • Kotlin (recommended). This is the recommended language based on the JVM. It has several benefits over Python such as true multi-threading, improved performance, and static typing, among others. We should use this language whenever we need to build services that are scalable or performant.
  • Python (recommended). We support it given our extensive in-house knowledge and current stack. We should be careful when using it with services that are expected to have significant load, since the global interpreter lock in CPython lets only one thread execute Python code at a time, and interpreted languages are typically slower than compiled ones.
  • Node.js (emerging). We have experience with Node.js for frontend development but not microservices, although we’re evaluating it.
  • Go (emerging). We built the integration service in this language. We believe that Go has potential and we should do a feature evaluation at some point.

Service-to-service Communication

This is the communication that happens when a service calls another one directly, and can be either synchronous or asynchronous.

  • gRPC (recommended). This is the only recommended RPC protocol.
  • PySOA / Legacy SOA (deprecated). We support the services currently in production that use these protocols but don’t allow any new ones to use them.

Relational Databases

Useful when there are multiple entities in the data model that are strongly related.

  • AWS Aurora (recommended). We recommend AWS Aurora which is a managed database compatible with MySQL and PostgreSQL. However, we support only the MySQL flavor.
  • AWS RDS (rejected). We don’t allow RDS since it is less scalable than Aurora although it offers very similar functionality.
  • MySQL (deprecated). We maintain the current databases that we have on MySQL but don’t allow any new functionality to be implemented on this database.

Writing our 3-year technical vision

I joined Eventbrite as their first Technical Fellow, the most senior engineering individual contributor role in the company. One of my initial goals was to come up with an overarching technical vision for the whole company aligned with our 3-year business strategy, and that would move us away from a monolithic architecture and central SRE team to a distributed system where we shift ownership to each team. In our most recent post, Vivek Sagi described the list of problems that we identified and our future-looking goals, which to recap are:

  • Deliver reliable, high quality, cost effective software solutions to our creators and consumers that allow the business to grow revenue 5x by 2023.
  • Enable autonomous dev teams that own their code and architecture. Provide these teams the platform, tooling, and access required to own end-to-end production support for their services.
  • Improve dev team accountability to deliver against high level OKRs while giving them autonomy to decide on the path to get there.
  • Drive automation and reduce toil. All feature dev teams should be able to apply 60% of their capacity to deliver new business value by 2023. This balance is an estimate based on best performing mature product teams that we have seen in our past experience.
  • Establish an operational excellence bar. Deliver 99.99% uptime across all customer facing services.
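
To put the 99.99% uptime goal in perspective, the short Python calculation below converts an availability target into a downtime budget. The function name is my own; the arithmetic is standard.

```python
def downtime_budget_minutes(availability: float, days: int = 365) -> float:
    """Maximum downtime, in minutes, allowed over `days` at a given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability)


yearly = downtime_budget_minutes(0.9999)            # about 52.6 minutes per year
monthly = downtime_budget_minutes(0.9999, days=30)  # about 4.3 minutes per 30 days
```

In other words, four nines leaves less than an hour of total downtime per year across all customer facing services.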

To accomplish these goals, I started working with other engineers and product leaders to understand the history of our technical architecture and the challenges that we were facing including developer productivity issues, site reliability problems or scalability limitations. From these goals, we derived a set of requirements for our 3-year technical vision:

  1. Features. As our product offering evolves to deliver high quality self-service experiences for Super Creators and Consumers, we must ensure that our technology stack enables teams to efficiently create, optimize, and maintain the net new functionality we will need to provide. For example, Super Creators require multi-event creating/editing, organization level reporting, and multi-event cart support – all of which will require significant architectural changes relative to our current offering. In addition, a new bundle of marketing tools will enhance creators’ ability to acquire new audiences and grow existing ones, especially by leveraging automation and machine learning to simplify the experience while increasing the impact. We seek to improve our offering for consumers to discover and attend events and to maintain trust in our platform.
  2. Leveraging Data. We have the opportunity to power new data differentiated products based on data from over a decade of past events and round out our focused product offering with key 3rd party integrations (e.g. Mailchimp, Zoom).
  3. Performance. User perception of our product’s performance is paramount: a slow product is a poor product. In addition, better page performance leads to better SEO rankings. We decided to leverage Lighthouse’s performance score, an industry standard web dev performance metric, and we endeavor to achieve a green score (90 to 100) across our customer facing features. We also must enforce low latency in our internal infrastructure and API response times, and set reduction goals year-over-year.
  4. Scale. We will support two types of scaling improvements. First, we will scale our systems to handle 5x the current load as we grow our business. Second, we will address the spikiness in our traffic caused by large event sales, where today we use a Waiting Room to throttle calls to our services and DB. We will design systems that can scale up and down automatically in response to these events, so that we avoid overprovisioning our infrastructure on a manual basis.
  5. Quality. The defect rate of our product offering can either make or break the experience for our users. In the past year, we have reduced the quantity of critical open bugs from 311 down to 175 and also reduced the number of bugs that missed our fix SLA from 200 to 110. We should aggressively lean into this trend and continue to reduce both by 50% YoY. We will improve our ability to deliver along that trend by increasing our test coverage, reducing our code complexity, having better tooling and increasing our level of automation.
  6. Self-Service. We will improve self-service both externally and internally. For the former we will aim for a 50% YoY customer support contact rate reduction relative to total ticket sales, while ensuring that help center page views don’t disproportionately grow – the point being that we deliver product experiences that have sufficient in-line guidance to result in successful experiences. Internally we will ensure that data is accessible by teams, each of the data sources and services has clear documentation and runbooks as well as contracts and use cases. We will define these in “How We Work” guidelines that every team will follow.
  7. Development Process. Finally, we must streamline our internal development processes and progress along the DevOps Big 4 to these levels: Deployment frequency: Elite (Daily for web and backend services and up to weekly for native apps), Lead time for changes: Elite (Less than one hour), Mean time to restore service: Elite (Less than one hour), and Change failure rate: Elite (0-15%).
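
The Quality requirement above sets a 50% year-over-year reduction target. As a rough sketch of what that trend implies over the three years of the vision, the arithmetic can be written out as follows. The function name is hypothetical, and the starting counts are the ones quoted in the requirement.

```python
from typing import List


def project_reduction(start: float, years: int, reduction: float = 0.5) -> List[float]:
    """Return the value at the end of each year given a fixed year-over-year reduction."""
    values = []
    current = start
    for _ in range(years):
        current *= 1 - reduction
        values.append(current)
    return values


if __name__ == "__main__":
    # Starting points quoted in the Quality requirement.
    print("critical open bugs:", project_reduction(175, 3))
    print("bugs that missed the fix SLA:", project_reduction(110, 3))
```

Under this target the critical open bug count would fall from 175 to about 22 after three years, and the SLA misses from 110 to about 14.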

Applying these principles to the problems that were outlined in the previous post, we thought about the following solutions to them:

  1. Our monolith became a bottleneck to our developer velocity and overall site reliability and scalability. We need to decouple our monolith into smaller microservices that can evolve and scale independently. This is a similar trend that many other companies have followed as they grow, and based on our professional experience prior to Eventbrite, we know it works.
  2. Our initial partial attempt to move to a Services Oriented Architecture (SOA) compounded the problem. In our prior attempt, we lacked a clear vision of what moving to SOA meant and how to accomplish it. We moved business logic out but not data, compounding the problem. This time around, we’ve prioritized this architecture transition at a company level, focusing first on the core business logic, including segregating and migrating the underlying data with every service.
  3. Our performance became suboptimal leading to a poor utilization of our hardware resources. We planned to fix this in two ways: by moving to managed services, letting cloud providers deal with this responsibility, and choosing technologies that would autoscale properly based on our traffic patterns, which are spiky by nature due to large onsale events.
  4. Our SDLC process was ad hoc and lacked sufficient controls in a few places. We’ve defined and set ownership boundaries between services and logical components. We’ve also enacted Architecture Review Committees to review designs to ensure we are building extensible services that don’t become monolithic themselves.
  5. Given all the intricate moving parts to release the monolith, we trust our Site Reliability Engineers (SREs) to be the only ones who can coordinate all that infrastructure. We are transitioning to DevOps where each team is the owner of the end-to-end lifecycle of their services. Similar to an earlier point, we’ve implemented this successfully in the past at other companies and we know it works.
  6. We lack automation in how we test, deploy, monitor and roll back our code. Our vision document has sections specifically addressing deployments, testing and operations, indicating that we should aspire to full automation and minimize (and remove, if possible) any manual intervention.
  7. Our core “eb” database is not only monolithic but also mutable, and capturing historical changes has been challenging. We see this as an architectural issue where our data boundaries were never established and we had many different services reading from the same tables and writing to them. We also used the same database technology for all of our use cases which has proven to be inefficient.
  8. We also built homegrown tools such as our own RPC protocol, PySOA. We are no longer investing our time in areas that are not business critical and where we can’t build competitive differentiation. Wherever we need a commoditized solution, we will evaluate buying instead of building whenever possible. This allows us to focus on providing customer value.

As we can see, we’re trying to move ownership from a centralized SRE team and monolithic architecture to empower teams to build and own their systems. But moving from a situation where the technology set for building features is very limited to another one which is much more open has its risks as well, and we didn’t want to end up with a technology spectrum so wide that it would be difficult to maintain. This is why we wrote our Golden Path, a living document that details the technologies that teams are allowed to use in production for their services, and covers areas such as RPC protocols, storage layers or programming languages. We say it’s a living document because teams are still encouraged to evaluate other technologies when designing their systems, and, if proven they’re the right choice, we’ll update our Golden Path to reflect these. We’ll write another post with more details about this Golden Path.

From an architecture perspective we also depicted a high level view of how we’d design our end system, starting from the client-facing applications and APIs:


And then describing the set of components that we would have in our internal network:

Our 3-year technical vision was a collaborative effort where the entire engineering team was involved. We reviewed the proposal multiple times with different stakeholders, including all engineers, data scientists, product managers and other roles in the company. We received hundreds of comments that enriched and made the whole proposal better. We hosted several Q&As to ensure that all aspects of the vision were clear and there were no outstanding items to be resolved. We also presented it to our CEO and the board of directors. We needed the entire company to become owners of this vision, and leaders in achieving it. After our 3-year technical vision was finalized, a few subsequent long-term thinking proposals were driven by our engineering organization, such as:

  • Operational Model. We describe the infrastructure and networking that we’ll have to support our shift from centrally-owned infrastructure to a distributed mindset where each team owns the end-to-end lifecycle of their services.
  • Data. We describe our future internal and external reporting capabilities, and how these will work with a service oriented architecture where each service has its own storage layer, and not limited to a centralized MySQL DB. It also covers how to have a centralized data lake our data scientists can rely on to build their ML models.
  • Frontend. We propose how to unify our frontend stack and extract our server side rendering from our monolith to Backend-for-Frontends for each application.
  • Mobile. We are rethinking our integration with our core services and how to share logic between the different applications that we have today.

Apart from this, the roadmaps from all of our teams have been adapted to align with our vision and now include areas of focus such as moving away from the monolith into their own service, having their own storage layer, or moving from in-house technology to industry standards. This is also reflected in all recent technical proposals that have been written, all of which start by clarifying that what’s outlined in the proposal is in alignment with our 3-year technical vision and the Golden Path.

But writing a document and sharing that proposal was just the seed of the vision. We’re now making tangible progress to get there, such as:

  • We have deprecated our in-house RPC protocol PySOA in favor of gRPC (tenet: we will choose conforming over creating/reforming). We did an initial evaluation where we compared PySOA with gRPC, and did a proof-of-concept to understand which would better suit our use cases. We decided to move to gRPC because it allows us to focus on our business needs instead of maintaining our own RPC protocol, and gRPC is superior since it supports HTTP2 (while PySOA relies on Redis), has a smaller payload size since it relies on protocol buffers and binary serialization, supports multiple programming languages instead of just Python, and has TLS/SSL support, among other advantages. We have also started writing new services using this new protocol.
  • We are enabling self-service AWS account provisioning and defining our networking and security layers so that teams can own their service’s infrastructure (tenet: teams will have end-to-end ownership of their systems and services).
  • We are migrating our unmanaged MySQL database to AWS Aurora (tenet: we favor cloud managed services or serverless for commoditized systems and components).
  • We have worked on several long-term designs for some of our key components such as Ordering and Event Management instead of focusing on shorter-term and incremental improvements (tenet: we will favor long-term maintainability and scale over short-term deliveries for strategic solutions). We have also started their implementation and expect our initial deliveries later this year.
  • We are writing new designs that break the previous limitation/guidance to only use Python/Django and MySQL and consider databases such as DynamoDB or QLDB, Kotlin or Go, and SNS or Kinesis, as a few examples (tenet: we will standardize on a few stacks but also empower teams to choose the right tool for the job).
  • We have recently launched an Operational Readiness Review process to analyze the reliability of our current codebase as well as new designs. We are also overhauling our Security Review process, moving to full CI/CD, Dockerizing our monolith, and raising the bar in our testing and quality processes, among several other initiatives (tenet: we will strive for continuous improvement and will ask why not instead of why?).
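
The gRPC item above mentions smaller payloads thanks to binary serialization. The sketch below illustrates only that general idea, using Python's standard `struct` module as a stand-in. It is not the protocol buffers wire format (which uses field tags and variable-length integers), it does not represent PySOA's encoding, and the `order` message is invented for illustration.

```python
import json
import struct

# Hypothetical message, invented for illustration only.
order = {"order_id": 123456789, "event_id": 987654321, "quantity": 2}

# Text encoding: field names and digits travel as characters.
json_bytes = json.dumps(order).encode("utf-8")

# Fixed binary layout agreed on by both sides ahead of time:
# "<QQH" = little-endian, two unsigned 64-bit integers, one unsigned 16-bit integer.
LAYOUT = "<QQH"
binary_bytes = struct.pack(LAYOUT, order["order_id"], order["event_id"], order["quantity"])

# The receiver recovers the values using the same layout.
decoded = struct.unpack(LAYOUT, binary_bytes)

print(len(json_bytes), "bytes as JSON")
print(len(binary_bytes), "bytes as packed binary")
```

The saving comes from moving field names and types out of every message and into a schema shared ahead of time, which is the same trade-off a `.proto` file makes.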

These are just a few examples that show how having a clearly outlined long-term technical direction can have a significant impact on an organization’s architecture and processes. We will detail many more of these examples for actual impact in upcoming posts.

We are excited about this new, long-term thinking technical vision that will provide the right guidance to our teams, indicate how the different pieces in our system should fit together, and help our everyday decision-making process. And what’s even more exciting is that the whole company participated in its definition and has embraced it with energy and passion.

How we created our 3-year technical vision

Eventbrite is a global company. We have employees in many countries around the globe, and we believe in the future of remote work. This is no different in Engineering. With offices in San Francisco, Nashville, Mendoza and Madrid, and many of our team members working remotely, our global team has a rich and diverse culture. We have grown organically and also by acquiring several companies such as Ticketfly, ToneDen, Ticketea or Eventioz.

Our tech stack grew rapidly as we scaled and we ended up with a Python/Django monolith, called Core, with a centralized MySQL database. Starting with a code base that over time becomes monolithic is a common occurrence in startups that scale since they need to deliver fast and provide business value. However, in the long run this approach impacts the speed at which the company can operate and innovate due to the number of hard dependencies between the code, which is now fractionally owned by many different teams.

When I came onboard, approximately nine months ago, there were multiple initiatives underway to reduce our technical debt. However, we did not have a unifying 3-year technical vision that would act as a guiding principle, our north star, to keep us on the right path and enable us to deliver against our business strategy.

Having that vision for the future is paramount to the success of our team, our company, and the event creators and attendees we build for.

This is the first of many posts from our team on how we built our 3-year technical vision and are executing against it to increase our code quality and development velocity while reducing infrastructure costs.

Our intent in sharing this is twofold. For anyone considering a similar exercise, hopefully some or all of this resonates and will help you on your journey. We also acknowledge there are better ways to accomplish this and hope to learn these methods from you through your comments and feedback below.

The first step in creating a technical vision is to have a shared understanding of the problems with your current architecture. The following were our problems.

Problems

  • Our monolith became a bottleneck to our developer velocity and overall site reliability and scalability. A monolithic architecture is a software pattern where all the codebase and infrastructure are tightly coupled, live in the same artifact, and have the same development and deployment lifecycle. This contrasts with distributed architectures where each component or service has its own set of artifacts and lifecycle. A monolithic architecture leads to teams having unnecessary dependencies and being forced to move at the speed of the monolith. They are often blocked by other changes or the release schedule of the monolith.
  • Our initial partial attempt to move to SOA compounded the problem. We retained a single data store for most services and writes continued to happen from the monolith. We also had issues where multiple data stores were being updated based on a single action and had multiple code paths that CRUD data, leading to consistency issues. In effect we created a complex distributed monolith because of the depth of dependencies, circular calls, data coupling and Eventbrite’s specific architecture. This also increases our blast radius exposure, meaning that a failure in a given part of the system could potentially affect many others, increasing the severity of the issue and the customer impact.
  • Our performance became suboptimal leading to a poor utilization of our hardware resources. There are two main reasons for this:
    • Relying on a relational MySQL database allows us to scale vertically but not horizontally, impacting the overall performance and scalability of our architecture. It is also not the ideal solution for many scenarios like reporting, data science model development, etc. We are also affected by inconsistent data model and query design, which requires a lot of human effort to overcome.
    • Our Python code is inherently single-threaded, therefore unable to use most of the capacity that our hosts have, leading to overprovisioning in some cases and requiring mitigation features such as waiting rooms to handle very spiky traffic patterns when we have large events on sale. We do not autoscale the core monolith given all its complexities. There’s a lack of clear code, service or data ownership, and we have some orphaned services. Side effects between service interactions are also a common problem.
  • Our SDLC process was ad hoc and lacked sufficient controls in a few places. Software Engineers (SWEs) make code contributions to different repositories without a consistent review or approval process.
  • Given all the intricate moving parts to release the monolith, we trust our Site Reliability Engineers (SREs) to be the only ones who can coordinate all that infrastructure. That has led to SREs being the only team with production access, which inadvertently leads to them being perceived as a bottleneck even as they try to do what we asked them to do. In addition, our architecture tends to be limited to the tools that SREs use, instead of being able to choose the best technology.
  • We lack automation in how we test, deploy, monitor and roll back our code, placing an undue burden on some of our engineering teams who need to spend their time on these mundane tasks instead of delivering value.
  • The core “eb” database is not only monolithic but also mutable, and capturing historical changes has been challenging. We’ve introduced new ledger style datastores to capture history that have led to consistency challenges.
  • We also built homegrown tools such as our own RPC protocol, PySOA, and an open source library because of some unique needs in our business and also because there were no off-the-shelf tools that did the job well a decade ago. As the industry has evolved we now have off-the-shelf solutions that offer similar or better functionality. Maintaining undifferentiated homegrown systems, which consume time from our engineering team and make it difficult to integrate with other industry standards, is no longer prudent. We need to start looking at buy vs build options for all of our non-differentiated technical needs.

The next step was to describe our end goals. After we executed on our technical vision what would we want to accomplish?

Goals

  • Deliver reliable, high quality, cost effective software solutions to our creators and consumers that allows the business to grow revenue 5x by 2023.
  • Enable autonomous dev teams that own their code and architecture. Provide these teams the platform, tooling, and access required to own end-to-end production support for their services.
  • Improve dev team accountability to deliver against high level OKRs while giving them autonomy to decide on the path to get there.
  • Drive automation and reduce toil. All feature dev teams should be able to apply 60% of their capacity to deliver new business value by 2023. This balance is an estimate based on best performing mature product teams that we have seen in our past experience.
  • Establish an operational excellence bar. Deliver 99.99% uptime across all customer facing services.

Defining aspirational goals was great but then we needed a set of tenets that would guide our decisions going forward. Here was the set of tenets we came up with.

Tenets

  • We will choose conforming over creating/reforming. We will research industry standards and will favor using them instead of building our own standard(s). By default we choose to not build wrappers on top of such standards. We will only build our own custom standard when it gives us a competitive advantage.
  • Teams will have end-to-end ownership of their systems and services. This includes the business proposal, system design, implementation, testing, documentation, deploying to production, monitoring and maintenance and responding to on-call for any production issues. We believe this ownership leads to more productive teams and higher quality code. We will minimize the delegation of any of these activities to other teams or roles, although we expect some to have a central responsibility such as SREs being responsible for the overall health of our site and apps.
  • We favor cloud managed services or serverless for commoditized systems and components. By default we choose not to maintain and scale our own hosts, databases or other infrastructure, and rely on cloud computing. We will focus on creating business value over non-value added commoditized tasks.
  • We will favor long-term maintainability and scale over short-term deliveries for strategic solutions. We will have a long-term vision for strategic solutions that are a core component of our 3-year strategy. We value having a maintainable and scalable solution, and accept having short-term and iterative steps but always with the long-term in mind.
  • We will standardize on a few stacks but also empower teams to choose the right tool for the job. We will choose a preferred logging, metrics, alerting, front end development and issue management stack. However teams are encouraged to pick appropriate programming languages, data stores, frameworks for their use cases after a thorough analysis. We will not allow an unreasonable growth in the number of technologies that we use, though.
  • We will strive for continuous improvement and will ask why not instead of why? We will be bold in our choices and will constantly seek new and better ways of solving problems. We will be rigorous in our analysis and will take risks in pursuit of excellence.

This is where our 3-Year technical vision started to take shape. Daniel Micol, our tech fellow, will shed more light on that process in our next blog post.

MySQL High Availability at Eventbrite

Situation

Eventbrite has been using MySQL since its inception as a company in 2006. MySQL has served the company well as an OLTP database and we’ve leveraged the strong features of MySQL such as native replication and fast reads, as well as dealt with some of its pain points such as impactful table ALTERs. The production environment relies heavily on MySQL replication. It includes a single primary instance for writes and a number of secondary replicas to distribute the reads.  

Fast forward to 2019.  Eventbrite is still being powered by MySQL but the version in production (MySQL 5.1) is woefully old and unsupported. Our MySQL production environment still leans heavily on native MySQL replication. We have a single primary for writes, numerous secondary replicas for reads, and an active-passive setup for failover in case the primary has issues. Our ability to failover is complicated and risky which has resulted in extended outages as we’ve fixed the root cause of the outage on the existing primary rather than failing over to a new primary. 

If the primary database is not available then our creators are not creating events and our consumers are not buying tickets for these events. The failover from active to passive primary is available as a last resort but requires us to rebuild a number of downstream replicas. Early in 2019, we had several issues with the primary MySQL 5.1 database and due to reluctance to failover we incurred extended outages while we fixed the source of the problems. 

The Database Reliability Engineering team in 2019 was tasked first and foremost with upgrading to MySQL 5.7 as well as implementing high availability and a number of other improvements to our production MySQL datastores. The goal was to implement an automatic failover strategy on MySQL 5.7 where an outage to our primary production MySQL environment would be measured in seconds rather than minutes or even hours. Below is a series of solutions/improvements that we’ve implemented since mid-year 2019 that have made a huge positive impact on our MySQL production environment. 

Solutions

MySQL 5.7 upgrade

Our first major hurdle was to get current with our version of MySQL. In July 2019 we completed the MySQL 5.1 to MySQL 5.7 (v5.7.19-17-log Percona Server to be precise) upgrade across all MySQL instances. Due to the nature of the upgrade and the large gap between 5.1 and 5.7, we incurred downtime to make it happen. The maintenance window lasted ~30 minutes and it went like clockwork. The DBRE team completed ~15 failover practice runs against Stage in the days leading up to the cut-over, and that is one of the reasons the cut-over was so smooth. The cut-over required 50+ people from Engineering, Product, QA and management in a hangout for support, with another 50+ engineers assuming on-call responsibilities through the weekend. It was not just a DBRE team effort but a full Engineering team effort!

Not only was support for MySQL 5.1 at End-of-Life (more than 5 years ago) but our MySQL 5.1 instances on EC2/AWS had limited storage and we were scheduled to run out of space at the end of July. Our backs were up against the wall and we had to deliver! 

As part of the cut-over to MySQL 5.7, we also took the opportunity to bake in a number of improvements. We converted all primary key columns from INT to BIGINT to prevent hitting MAX value. We had a recent production incident that was related to hitting the max value on an INT auto-increment primary key column. When this happens in production, it’s an ugly situation where all new inserts result in a primary key constraint error. If you’ve experienced this pain yourself then you know what I’m talking about. If not then take my word for it.  It’s painful!
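
To put the INT versus BIGINT limits in numbers, here is a small sketch. The column maximums are the standard MySQL ranges for these types; the current ID and insert rate are hypothetical values chosen only to show the calculation.

```python
# Upper bounds of MySQL integer column types.
INT_SIGNED_MAX = 2**31 - 1       # 2,147,483,647
INT_UNSIGNED_MAX = 2**32 - 1     # 4,294,967,295
BIGINT_SIGNED_MAX = 2**63 - 1    # 9,223,372,036,854,775,807


def days_until_exhausted(current_max_id: int, inserts_per_day: int, limit: int) -> int:
    """Return the whole days of headroom left before an auto-increment key hits its limit."""
    return (limit - current_max_id) // inserts_per_day


if __name__ == "__main__":
    # Hypothetical table: the largest ID so far is 2 billion, growing by 1 million rows a day.
    current_id, daily_inserts = 2_000_000_000, 1_000_000
    print("INT headroom (days):", days_until_exhausted(current_id, daily_inserts, INT_SIGNED_MAX))
    print("BIGINT headroom (days):", days_until_exhausted(current_id, daily_inserts, BIGINT_SIGNED_MAX))
```

With these made-up numbers a signed INT key has under five months of headroom, while the BIGINT limit is effectively unreachable at the same insert rate.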

In parallel with the MySQL 5.7 upgrade we also upgraded Django to 1.6 due to a behavioral change in MySQL 5.7 related to how transactions/commits were handled for SELECT statements. This behavior change was resulting in errors with older versions of Python/Django running on MySQL 5.7.

Improved MySQL ALTERs

In December 2019, the Eventbrite DBRE successfully implemented a table ALTER via gh-ost on one of our larger MySQL tables. The duration of the ALTER was 50 hours and it completed with no application impact. So what’s the big deal?

The big deal is that we could now ALTER tables in our production environment with little to no impact on our application, and this included some of our larger tables that were ~500GB in size.

Here is a little background. The ALTER TABLE statement in MySQL is very expensive. There is a global write lock on the table for the duration of the ALTER statement which leads to a concurrency nightmare. The duration of an ALTER is directly related to the size of the table, so the larger the table, the larger the impact. For OLTP environments where lock waits for transactions need to be as short as possible, the native MySQL ALTER command is not a viable option. As a result, online schema-change tools have been developed that emulate the MySQL ALTER TABLE functionality using creative ways to circumvent the locking.

Eventbrite had traditionally used pt-online-schema-change (pt-osc) to ALTER MySQL tables in production. pt-osc uses MySQL triggers to move data from the original to the “duplicate” table, which is a very expensive operation and can cause replication lag. As a matter of fact, it directly resulted in several outages in H1 of 2019 due to replication lag or breakage.

GitHub introduced a new Online Schema Migration tool for MySQL (gh-ost) that uses a binary log stream to capture table changes, and asynchronously applies them onto a “duplicate” table. gh-ost provides control over the migration process and allows for features such as pausing, suspending and throttling the migration. In addition, it offers many operational perks that make it safer and more trustworthy to use. It is:

  • Triggerless
  • Pausable
  • Lightweight
  • Controllable
  • Testable
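
The triggerless approach described above can be sketched as a toy model in plain Python. This is not gh-ost's code: the tables are dictionaries, the "binlog" is a list, and `online_alter` and its parameters are hypothetical names. It only shows the shape of the idea, which is to copy rows in chunks into a duplicate table while replaying the changes that land on the original in the meantime.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Event = Tuple[str, int, Row]  # (op, primary key, new row image); op is "write" or "delete"


def online_alter(live: Dict[int, Row],
                 writes_during_copy: List[List[Event]],
                 transform: Callable[[Row], Row],
                 chunk_size: int = 2) -> Dict[int, Row]:
    """Build a migrated copy of `live` while application writes keep arriving."""
    ghost: Dict[int, Row] = {}
    binlog: List[Event] = []
    applied = 0
    keys_to_copy = sorted(live)          # the key range is fixed when the copy starts
    batches = iter(writes_during_copy)   # one batch of application writes per chunk

    def run_application_writes(batch: List[Event]) -> None:
        # The application only ever touches the live table; the binlog records it.
        for op, key, row in batch:
            if op == "delete":
                live.pop(key, None)
            else:
                live[key] = row
            binlog.append((op, key, row))

    def replay_binlog() -> None:
        # Apply every not-yet-applied change to the duplicate table, in order.
        nonlocal applied
        while applied < len(binlog):
            op, key, row = binlog[applied]
            if op == "delete":
                ghost.pop(key, None)
            else:
                ghost[key] = transform(row)
            applied += 1

    for start in range(0, len(keys_to_copy), chunk_size):
        # Copy one chunk, never overwriting a row the binlog already delivered.
        for key in keys_to_copy[start:start + chunk_size]:
            if key in live and key not in ghost:
                ghost[key] = transform(live[key])
        run_application_writes(next(batches, []))
        replay_binlog()

    # Drain any writes that arrived after the last chunk, then cut over.
    for batch in batches:
        run_application_writes(batch)
    replay_binlog()
    return ghost


if __name__ == "__main__":
    table = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
    writes = [[("write", 3, {"name": "c2"}), ("write", 4, {"name": "d"})],
              [("delete", 1, {})]]
    migrated = online_alter(table, writes, lambda row: {**row, "status": None})
    print(migrated)
```

Because the copy happens in small chunks and the change replay is asynchronous, a real tool can pause or throttle between chunks, which is where the "Pausable" and "Controllable" properties come from.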

Orchestrator

Next on the list was implementing improvements to MySQL high availability and automatic failover using Orchestrator. In February of 2020 we implemented a new HAProxy layer in front of all DB clusters and we released Orchestrator to production!

Orchestrator is a MySQL high availability and replication management tool. It will detect a failure, promote a new primary, and then reassign the name/VIP. Here are some of the nice features of Orchestrator:

  • Discovery – Orchestrator actively crawls through your topologies and maps them. It reads basic MySQL info such as replication status and configuration. 
  • Refactoring – Orchestrator understands replication rules. It knows about binlog file:position and GTID. Moving replicas around is safe: orchestrator will reject an illegal refactoring attempt.
  • Recovery – Based on information gained from the topology itself, Orchestrator recognizes a variety of failure scenarios. The recovery process utilizes the Orchestrator’s understanding of the topology and its ability to perform refactoring. 

Orchestrator can successfully detect the primary failure and promote a new primary. The goal was to implement Orchestrator with HAProxy first and then eventually move to Orchestrator with ProxySQL.
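
The detect-and-promote flow can be illustrated with a simplified sketch. This is not Orchestrator's implementation or API: the `Instance` class, the `executed_position` field (a stand-in for a binlog position or GTID set) and the "most advanced replica wins" rule are assumptions made for illustration, and Orchestrator weighs more factors than this when choosing a candidate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    host: str
    is_primary: bool
    reachable: bool
    executed_position: int  # stand-in for binlog file:position or GTID progress


def failover(topology: List[Instance]) -> Optional[Instance]:
    """Promote the most up-to-date healthy replica if the primary is unreachable.

    Returns the newly promoted instance, or None when no failover is needed.
    """
    primary = next(i for i in topology if i.is_primary)
    if primary.reachable:
        return None  # nothing to do

    candidates = [i for i in topology if not i.is_primary and i.reachable]
    if not candidates:
        raise RuntimeError("no healthy replica available to promote")

    # Prefer the replica that has applied the most changes, to minimize data loss.
    new_primary = max(candidates, key=lambda i: i.executed_position)
    primary.is_primary = False
    new_primary.is_primary = True
    return new_primary


if __name__ == "__main__":
    cluster = [
        Instance("db-primary", is_primary=True, reachable=False, executed_position=1000),
        Instance("db-replica-1", is_primary=False, reachable=True, executed_position=990),
        Instance("db-replica-2", is_primary=False, reachable=True, executed_position=998),
    ]
    promoted = failover(cluster)
    print("promoted:", promoted.host if promoted else "none")
```

After promotion, the proxy layer in front of the cluster is what redirects writes to the new primary, so applications do not need to know which host was chosen.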

Manual failover tests

In March of 2020 the DBRE team completed several manual/controlled fail-overs using Orchestrator and HAProxy. Eventbrite experienced some AWS hardware issues on the MySQL primary and completing manual failovers was the first big test. Orchestrator passed the tests with flying colors.

Automatic failover

In May of 2020 we enabled automatic fail-over for our production MySQL data stores. This is a big step forward in addressing the single-point-of-failure situation with our primary MySQL instance. The DBRE team also completed several rounds of testing in QA/Stage for ProxySQL in preparation for the move from HAProxy to ProxySQL.

Show time

In July 2020, Eventbrite experienced hardware failure on the primary MySQL instance that resulted in automatic failover. The new and improved automatic failover process via Orchestrator kicked in and we failed over to the new MySQL primary in ~20 seconds. The impact to the business was minimal!

ProxySQL

In August of 2020 we made the jump to ProxySQL across our production MySQL environments.  ProxySQL is a proxy specially designed for MySQL. It allows the Eventbrite DBRE team to control database traffic and SQL queries that are issued against the databases. Some nice features include:

  • Query caching
  • Query Re-routing – to separate reads from writes
  • Connection pool and automatic retry of queries
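
Read/write splitting of the kind listed above is typically driven by pattern rules on the query text. The sketch below shows the idea in plain Python. It is not ProxySQL configuration, and the `RULES` table, the target names and the `route` function are hypothetical.

```python
import re

WRITER, READER = "writer", "reader"

# Rules are checked in order; the first match wins.
RULES = [
    # SELECT ... FOR UPDATE takes locks, so it must go to the primary.
    (re.compile(r"^\s*SELECT\b.*\bFOR\s+UPDATE\b", re.IGNORECASE | re.DOTALL), WRITER),
    # Any other SELECT can be served by a replica.
    (re.compile(r"^\s*SELECT\b", re.IGNORECASE), READER),
]


def route(query: str) -> str:
    """Return which pool a query should be sent to; anything unmatched goes to the writer."""
    for pattern, target in RULES:
        if pattern.search(query):
            return target
    return WRITER


if __name__ == "__main__":
    for sql in (
        "SELECT id, name FROM events WHERE id = 42",
        "SELECT id FROM orders WHERE id = 7 FOR UPDATE",
        "UPDATE orders SET status = 'paid' WHERE id = 7",
    ):
        print(f"{route(sql):6} <- {sql}")
```

Defaulting unmatched statements to the writer is the conservative choice: sending a read to the primary costs some capacity, while sending a write to a replica would fail or corrupt replication.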

Also during this time period we began our AWS Aurora evaluation as the first step in our “Managed Databases” journey. Personally, I prefer to use the term “Database-as-a-Service (DBaaS)”. Amazon Aurora is a high-availability database that consists of compute nodes replicated across multiple availability zones to gain increased read scalability and failover protection. It’s compatible with MySQL, which is a big reason why we picked it.

Schema Migration Tool

In September of 2020, we began testing a new Self-Service Database Migration Tool to provide interactive schema changes (it invokes gh-ost behind the scenes). It supports all “ALTER TABLE…”, “CREATE TABLE…”, or “DROP TABLE…” DDL statements.

It includes a UI where you can see the status of migrations and run them with the click of a button:

Any developer can file and run migrations, and a DBRE is only required to approve the DDL (this is all configurable though). The original source for the tool can be found here: shift. We’ve not pushed to production yet but we do have a running version in our non-prod environment.

Database-as-a-Service (DBaaS) and next steps

We’re barreling down the “Database-as-a-Service (DBaaS)” path with Amazon Aurora. 

DBaaS flips the DBRE role a bit by eliminating mundane operational tasks and allowing the DBRE team to align our work more closely with the business to derive value from the data assets that we manage. We can then focus on lending our DBA skills to application teams and end users—helping deliver new features, functionality, and proactive tuning value to the core business. 

Stay tuned for updates on our migration to AWS Aurora! We’re very interested in the scaled read operations and increased availability that an Aurora DB cluster provides. Buckle your seatbelt and get ready for a wild ride 🙂 I’ll provide an update on “Database-as-a-Service (DBaaS)” at Eventbrite in my next blog post!

All comments are welcome! You can message me at ed@eventbrite.com. Special thanks to Steven Fast for co-authoring this blog post.

Building a Protest Map: A Behind the Scenes Look!

Sometimes, a project that you’re most proud of or passionate about isn’t one that involves architecture diagrams or deep technical discussions. Sometimes, the most significant project is one that allows peers to come together to build something that can have a positive impact. When we can all collaborate for a cause we’re passionate about, and get quick buy-in from our colleagues and executives, we can create a feature we’re proud of. 

Such is the case with Eventbrite’s Protests map. If you’ve visited the Eventbrite homepage recently, you’ve probably seen our map highlighting protests and marches to support racial equality, hosted on Eventbrite.

The map features events across the U.S., and the data source for the events shown is a Curated Collection.

The birth of a map

On May 25, 2020, news of George Floyd’s murder in Minneapolis spread rapidly across the country, igniting protests and rallies seeking justice. On June 9, a small team consisting of a product manager, engineer, designer, and content editor came together to build a map for these events on Eventbrite.com.

The plan was to create a map on Eventbrite that led users to protests across the United States. This map would aggregate Black Lives Matter protests and allow our users to interact based on their location. While some projects exemplify technical ingenuity of a team, this one relied on quick thinking and action, teamwork, and a driving force of passion. There are always a number of technical solutions to test out, but sometimes it’s simpler to leverage people.

Company buy-in and teamwork

The fact that this map lives on our homepage means that everyone, including event organizers and consumers, would see this high-touch feature. That meant we needed quick buy-in from the whole company, especially impacted department leads, to launch this feature while protests were still occurring nationwide. This project was greenlit in a matter of days.

The trust involved in getting this map out the door was important to our success. Legal, PR, and marketing all came together to bring this project to life; it was a true cross-functional effort.

Collections and discovery

When we were in the brainstorming phase, we faced the question of how to populate the map with events that were specifically protests in support of Black Lives Matter. Our first instinct was to construct a query to identify events that would fit into the theme.

For a simple example, we could return events where the event’s organizer had selected the “Rally” format when they created the event. Then further filter those results by events where the organizer had entered a description containing the text “BLM” or “George Floyd.”

We quickly realized this wasn’t going to produce good results. Variations on this approach would either be over-specific or too broad, including far too many irrelevant events or excluding relevant ones. Plus, it would have been vulnerable to trolling and require us to constantly adjust the query as people learned how to configure their events to show up on the map.

It was probably naive to have even tried that approach, but the obvious alternative had its downsides too. The alternative being manual, human-driven curation through the use of an existing feature on our platform called “Collections.”

Collections allow Eventbrite users to create collections of events. Any user on Eventbrite can create a collection simply by clicking the “heart” icon on an event card.

Our marketing team uses this feature to highlight particularly interesting events that adhere to certain themes: online yoga classes, drive-in concerts, etc. Sometimes you’ll see it in our product under the name of “Editor’s Picks.”

But as mentioned above, this approach has downsides. It takes significant manual work to identify all the events that we want to be part of the collection. And event inventory has a unique challenge: events end. At some point in time, a planned event will take place and the content is no longer relevant to show consumers. The Collections feature depends on the same Elasticsearch index we use for search, and due to various constraints we actively remove events from the index after they end. 

Since we remove ended events, we run the risk of showing a sparse or empty map if we don’t maintain the collection. This meant that someone would have to constantly keep an eye on this collection and keep it updated with new events as people create them on our platform.

Luckily, we had help.

The initial list of protests came together through help from the rest of the company, specifically our company’s global diversity group, collaborating on Slack and Google Sheets.

After we launched, Eventbrite Co-founder and CEO Julia Hartz was so inspired by the work that she offered extra funding for the marketing team to continue the work of identifying relevant events for the map. That’s some of the most positive internal feedback you can get!

The nuts-and-bolts

While this was a straightforward project from a technical implementation standpoint, for the most part leveraging an existing feature rather than building something entirely new, it’s worth highlighting the work we’ve done in Eventbrite Engineering over the past year to invest in our architecture and infrastructure. Like many companies, Eventbrite is undergoing the difficult process of moving away from a monolithic architecture to a microservices architecture (ours has a domain-driven spin). We call the monolith “Core.”

Although we’re nowhere close to finished, we’re already starting to reap the benefits of the hard work invested in moving features out of Core, piece by piece. A little over a year ago, it wouldn’t have been feasible to ship a project of this nature so quickly because deploying any part of the site, no matter how small, required deploying all of Core. Today, we can ship changes to the homepage within a matter of minutes. 

Just as importantly, we have reliable alerting to notify us of problems that arise after shipping a change. As a matter of fact, shortly after we launched this project we were alerted to an unusual spike in errors on the homepage. Within 5 minutes of receiving this notification, we rolled back the release containing the map. After another 10 minutes, we re-released the map with the fix.

Bringing people together

Our company’s mission is simple: Bring the world together through live experiences. Protests themselves bring together like-minded individuals to fight towards a common goal. Working on a feature that exemplifies our mission as a company was a truly inspiring opportunity. This lit a fire in our team that translated to the company as a whole.

Authors: Eloise Porter, Jaylum Chen, Alex Rice, and Allen Jilo

How are you building/maintaining team cohesion?

Let’s face it! 2020 has been a year of huge change! One of the biggest changes is with how we work. Like many other companies, Eventbrite has transitioned to a fully remote work environment with engineers and managers located across the globe. We have development centers in San Francisco (USA), Nashville (USA), Mendoza (Argentina), and Madrid (Spain) with teams spread across all four locations. The challenge for managers and individual contributors alike is to build/maintain cohesion for remote teams! 

This is for sure one of the most common topics in management these days. How do you keep cohesion when your team has transitioned to fully remote? A group of Eventbrite managers decided to try a few strategies and to document the process. 

At Eventbrite, our mission is to bring the world together. In a pre-pandemic world that happened through live events; these days it’s more about connecting, somehow, with people.

Building a team is more art than science. Nurturing the culture and the bonds and, most importantly, building trust between the members takes time and dedication. Doing all of this is difficult even when everyone can be at the same place, make jokes, laugh, and work side by side. In this new COVID-19 world, we lost many of these tools. We decided it was time for some new tricks!

Daily standup meetings

Right from the beginning of the COVID-19 pandemic in March, one of our Engineering Managers (Nacho) suggested that we try to sync every morning. Before COVID-19, we wouldn’t do it regularly. We gave it a try and it worked! Since then we talk every day, we are very active on Slack, and it’s not always about work. It’s helped us both professionally and emotionally as we share problems, frustrations, and more. We make the daily standups a top priority as these virtual meetings are a perfect chance to build rapport with our teammates.

It was super hard to get everyone onboard, but eventually we were all there, always, every morning. It’s great to think that I have my team when I need them, available and willing to help.

Weekly Board Games

Starting in April, we set up a board game meeting for 4pm every Tuesday. We play Settlers of Catan on colonist.io while chatting over video conference. It has become a weekly ritual for one of our teams in Argentina, with between 3 and 5 people playing. It lets us spend some time together, have fun and socialize. 

  • Catan: https://colonist.io/
  • Basta!: an online version of the game we all know as Basta!, Tutti frutti, Lápiz quieto, ¡Mercadito!, or Dulce de Membrillo (8 players): https://bastaonline.net/

Coffee breaks

There’s always time for a coffee break to talk about anything, work or non-work related stuff. We created a Slack shortcut (!cafebreak) to easily share the meeting details for anyone to quickly jump in. These short spontaneous recesses provide distraction and allow us to relax from the daily hustle. 

Keeping the Culture alive

Eventbrite’s company culture makes it a unique place to work! The challenge is to continue to foster this culture while functioning as a fully remote team. It might be easy for a team that is accustomed to spending 9 hours a day, 5 days a week together to forget what makes them unique when everyone is scattered, without interaction besides Slack or Hangout calls. Knowing what makes your team unique is key.

On our team, it is our love for problem solving. Thus every Friday afternoon, we stopped work early to spend a couple of hours on HackerRank, solving different exercises and sharing our solutions. It was an incredible way of having fun and learning from each other by seeing how each of us approached the problems. It helped us remember who we were as a team.

Engineering Development Academy

Eventbrite implemented EDA (Engineering Development Academy) in Argentina. This initiative is a new way to find and incorporate engineering talent in our company. The EDA group of professionals is trained in technical (Python, Django, Testing, CI, CD, JavaScript, React, CSS) and non-technical subjects (agile methodologies, Scrum, English) in a 3-month program.

Every day at 9:00 am, we start with breakfast and a chat about things that do and do not work, movies, news, etc. For the first talks, we picked topics to share because the whole team was new, which got everyone to speak. We also hold 30-minute afternoon checkpoints at 2:00 pm for both technical and non-technical questions, to help make the training as positive as possible.

Lightning Talks

Emilio (Principal Engineer) organizes Lightning Talks on a weekly basis. These are open talks about anything that matters to an engineer and that they want to share with others, for example: what have we learned during COVID-19?

Spontaneous 1:1’s

There are often times when a discussion works better in a 1:1 or group meeting than in Slack, a Google doc, or email. We encourage our team members to create a Google Hangout or Zoom session at any time if it helps resolve issues quicker.

Quick text messages (via Slack) are great for clarifying simple matters. Often a more detailed discussion is required and this is when a Google Hangout is preferred.  It’s important to recognize the distinction as too many calls can burn engineers out. But you can also waste lots of time exchanging messages when a five-minute call could provide answers to multiple questions.

Virtual Meeting tips

It can be challenging to connect with teammates during virtual/online meetings. Simple things such as keeping your camera on and using verbal and non-verbal cues (such as head nods or thumbs-up) are great ways to make a connection during virtual meetings. Also, go out of your way to recognize teammates who have gone above and beyond during online meetings. While working remotely, it is often easy to miss the great work that others are doing.

Many virtual meetings start 5 minutes past the scheduled time. Use this time to show personal interest in the other participants. Ask them about their weekend. Break the ice while you’re waiting for all participants to join. It’s a great way to learn more about some of your co-workers who you may not necessarily know very well.

 

This article’s co-authors are Henry Lyne, Gabriel Flores, Ed Presz, Emiliano André and Juan Pablo Marsano. Reviewed by Rainu Ittycheriah.

Teaching new Presto performance tricks to the Old-School DBA

I’ve spent much of my career working with relational databases such as Oracle and MySQL, and SQL performance has always been an area of focus for me. I’ve spent countless hours reviewing EXPLAIN plans, rewriting subqueries, adding new indexes, and chasing down table-scans. I’ve been trained to make performance improvements such as:  only choose columns in a SELECT that are absolutely necessary, stay away from LIKE clauses, review the cardinality of columns before adding indexes, and always JOIN on indexed columns.

It’s been an instinctual part of my life as a Database Administrator who supports OLTP databases that have sold in excess of 20K tickets per minute to your favorite events. I remember a specific situation where a missing index caused our production databases to get flooded with table-scans that brought a world-wide on-sale to an immediate halt. I had a lot of explaining to do that day as the missing index made it to QA and Stage but not Production!

In recent years, I’ve transitioned to Data Engineering and began supporting Big Data environments. Specifically, I’m supporting Eventbrite’s Data Warehouse, which leverages Presto and Apache Hive using the Presto/Hive connector. The data files can be of different formats, and we store them in HDFS and S3. The Hive metadata describes how data stored in HDFS/S3 maps to schemas, tables, and columns to be queried via SQL. We persist this metadata information in Amazon Aurora and access it through the Presto/Hive connector via the Hive Metastore Service (HMS).

The stakes have changed and so have the skill-sets required. I’ve needed to retrain myself in how to write optimal SQL for Presto. Some of the best practices for Presto are the same as relational databases and others are brand new to me. This blog post summarizes some of the similarities and some of the differences with writing efficient SQL on MySQL vs Presto/Hive. Along the way I’ve had to learn new terms such as “federated queries”, “broadcast joins”, “reshuffling”, “join reordering”, and “predicate pushdown”.

Let’s start with the basics:

What is MySQL? The world’s most popular open source database. The MySQL software delivers a fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL is intended for mission-critical, heavy-load production database usage.

What is Presto? Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto doesn’t use the MapReduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel relational databases.

Presto uses ANSI SQL syntax/semantics to build its queries. The advantage of this is that analysts with experience with relational databases will find it very easy and straightforward to write Presto queries! That said, the best practices for developing efficient SQL via Presto/Hive are different from those used to query standard RDBMS databases.

Let’s transition to Presto performance tuning tips and how they compare to standard best practices with MySQL.

 

1. Only specify the columns you need

Restricting columns for SELECTs can improve your query performance significantly. Specify only needed columns instead of using a wildcard (*). This applies to Presto as well as MySQL! 

 

2. Consider the cardinality within GROUP BY

When using GROUP BY, order the columns from the highest cardinality (that is, the largest number of unique values) to the lowest.

The GROUP BY operator distributes rows based on the order of the columns to the worker nodes, which hold the GROUP BY values in memory. As rows are being ingested, the GROUP BY columns are looked up in memory and the values are compared. If the GROUP BY columns match, the values are then aggregated together.

 

3. Use LIMIT with ORDER BY

The ORDER BY clause returns the results of a query in sort order. To process the sort, Presto must send all rows of data to a single worker and then sort them. This sort can be a very memory-intensive operation for large datasets and will put strain on the Presto workers. The end result will be long execution times and/or memory errors.

If you are using the ORDER BY clause to look at the top N values, then use a LIMIT clause to reduce the cost of the sort significantly by pushing the sorting/limiting to individual workers, rather than the sorting being done by a single worker. 

I highly recommend you use the LIMIT clause not just for SQL with ORDER BY but in any situation when you’re validating new SQL. This is good practice for MySQL as well as Presto!
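
The effect of combining LIMIT with ORDER BY can be sketched in a few lines of Python. This is an illustration of the general top-N idea rather than Presto's actual implementation; the function name and the sample data are made up.

```python
import heapq


def top_n_distributed(worker_rows, n):
    """Each worker keeps only its n largest rows; a final merge picks the global top n."""
    # Step 1: every worker reduces its own rows to at most n candidates (parallel in a real cluster).
    partial = [heapq.nlargest(n, rows) for rows in worker_rows]
    # Step 2: a single node merges the small candidate lists instead of sorting every row.
    return heapq.nlargest(n, (row for chunk in partial for row in chunk))


workers = [[5, 42, 7], [99, 1, 13], [8, 64, 2]]
print(top_n_distributed(workers, 3))  # [99, 64, 42]
```

The final node only ever sees `n` rows per worker, which is why the memory pressure drops so sharply compared with a full sort on one machine.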

 

4. Using approximate aggregate functions

When exploring large datasets, an approximation (with a standard error of 2.3%) is often more than good enough! Presto has approximate aggregation functions that give you significant performance improvements. Using the approx_distinct(x) function on large data sets instead of COUNT(DISTINCT x) will result in performance gains.

When an exact number is not required, for instance if you are looking for a rough estimate of the number of New Year’s events in the Greater New York area, consider using approx_distinct(). This function minimizes memory usage by counting unique hashes of values instead of entire strings. The drawback is that the result carries a small standard error.
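
To show why hashing makes this cheap, here is a small HyperLogLog-style estimator in Python. To my understanding approx_distinct is built on this family of sketches, but the code below is a simplified illustration and not Presto's implementation. The function name and the choice of SHA-1 are assumptions. With 2^11 = 2048 registers the theoretical standard error is about 1.04 / sqrt(2048) ≈ 2.3%.

```python
import hashlib
import math


def approx_distinct(values, p=11):
    """Estimate the number of distinct values using 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for value in values:
        # Hash each value to 64 bits; only the hash is inspected, never the full string.
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        index = h >> (64 - p)                      # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining bits
        rank = (64 - p) - rest.bit_length() + 1    # position of the first 1-bit
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)             # small-range correction
    return raw


# 60,000 rows but only 20,000 distinct values.
rows = [f"event-{i % 20000}" for i in range(60000)]
print(round(approx_distinct(rows)))
```

Memory use is fixed at 2048 small counters no matter how many rows are scanned, whereas an exact COUNT(DISTINCT x) has to remember every distinct value it has seen.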

 

5. Aggregating a series of LIKE clauses in one single regexp_like clause

The LIKE operation is well known to be slow, especially when not anchored to the left (i.e. the search text is surrounded by ‘%’ on both sides) or when used with a series of OR conditions. So it is no surprise that Presto’s query optimizer is unable to improve queries that contain many LIKE clauses.

We’ve found improved performance on Presto by substituting the LIKE/OR combination with a single REGEXP_LIKE clause, which is Presto native. Not only is it easier to read but it’s also more performant. Based on some quick performance tests, we see a ~30% reduction in run time with REGEXP_LIKE vs. a comparable LIKE/OR combination.

For example:

SELECT ... FROM zoo
WHERE method LIKE '%monkey%' OR
      method LIKE '%hippo%' OR
      method LIKE '%tiger%' OR
      method LIKE '%elephant%'

can be optimized by replacing the four LIKE clauses with a single REGEXP_LIKE clause:

SELECT ... FROM zoo
WHERE REGEXP_LIKE(method, 'monkey|hippo|tiger|elephant')
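
The two predicates are logically equivalent because REGEXP_LIKE only needs the pattern to be contained somewhere in the string. A quick Python check of that equivalence, using Python's `re` module as a stand-in for Presto's regex engine (the helper names and sample strings are made up):

```python
import re

ANIMALS = ["monkey", "hippo", "tiger", "elephant"]


def like_any(text):
    """Stand-in for: text LIKE '%monkey%' OR text LIKE '%hippo%' OR ..."""
    return any(animal in text for animal in ANIMALS)


# One compiled alternation replaces the four separate substring tests.
PATTERN = re.compile("|".join(re.escape(animal) for animal in ANIMALS))


def regexp_like(text):
    """Stand-in for: REGEXP_LIKE(text, 'monkey|hippo|tiger|elephant')"""
    return PATTERN.search(text) is not None


samples = ["feed the tiger", "hippopotamus bath", "clean the aquarium", ""]
print([regexp_like(s) for s in samples])  # [True, True, False, False]
```

Note that no leading or trailing `.*` is needed in the pattern, since the match is a search rather than a full-string match.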

 

6. Specifying large tables first in join clause

When joining tables, specify the largest table first in the join. The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. If the right-hand side table is “small” then it can be replicated to all the join workers which will save CPU and network costs.  This type of join will be most efficient when the right-hand side table is small enough to fit within one node. 

If you receive an ‘Exceeded max memory’ error, the right-hand side table is probably too large. Unless the Cost-Based Optimizer is in use, Presto does not reorder joins automatically, so make sure your largest table is the first table in your sequence of joins.

This was an interesting performance tip for me. SQL is a declarative language, and in MySQL, for example, the order of the tables in a join is *NOT* particularly important because the MySQL optimizer will reorder them to choose the most efficient path. With Presto, the join order matters. You’ve been WARNED!
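
Here is a rough Python sketch of what a broadcast join does, under the assumption that the left table is already split across workers. The table contents and function name are hypothetical, and a real engine streams data rather than holding lists in memory.

```python
from collections import defaultdict


def broadcast_join(left_partitions, right_rows, key):
    """Join each left partition against a full copy of the (small) right table."""
    results = []
    for partition in left_partitions:              # one iteration per worker
        # Every worker builds its own hash table from the whole right-hand table,
        # which is why that table must fit in a single node's memory.
        lookup = defaultdict(list)
        for row in right_rows:
            lookup[row[key]].append(row)
        for left_row in partition:
            for right_row in lookup.get(left_row[key], []):
                results.append({**left_row, **right_row})
    return results


orders = [  # large table, already split across two workers
    [{"event_id": 1, "order_id": "a"}, {"event_id": 2, "order_id": "b"}],
    [{"event_id": 1, "order_id": "c"}, {"event_id": 3, "order_id": "d"}],
]
events = [{"event_id": 1, "name": "Jazz Night"}, {"event_id": 2, "name": "5K Run"}]

joined = broadcast_join(orders, events, "event_id")
print(len(joined))  # 3
```

The large table never moves between workers; only the small table is copied, once per worker. Swap the two tables and every worker would have to hold a copy of the large one, which is where the memory errors come from.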

 

7. Turning on the distributed hash join

If you’re battling with memory errors then try a distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join works even if the right-hand side table is large, but the performance might be slower because the join increases the number of network data transfers. 

At Eventbrite we have the distributed_join session property set to 'true' (you can check it with SHOW SESSION). It can also be enabled for a single session with SET SESSION distributed_join = 'true'.
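
For comparison with the broadcast approach, this is a minimal Python sketch of a distributed (partitioned) hash join. Function names, the worker count, and the sample rows are assumptions made for illustration only.

```python
def partition_by_key(rows, key, num_workers):
    """Assign each row to a worker based on the hash of its join key."""
    buckets = [[] for _ in range(num_workers)]
    for row in rows:
        buckets[hash(row[key]) % num_workers].append(row)
    return buckets


def distributed_hash_join(left_rows, right_rows, key, num_workers=3):
    """Shuffle both tables by join key, then join matching buckets independently."""
    left_parts = partition_by_key(left_rows, key, num_workers)
    right_parts = partition_by_key(right_rows, key, num_workers)
    results = []
    for left_part, right_part in zip(left_parts, right_parts):  # one pair per worker
        # Each worker only holds its own slice of the right-hand table.
        lookup = {}
        for row in right_part:
            lookup.setdefault(row[key], []).append(row)
        for left_row in left_part:
            for right_row in lookup.get(left_row[key], []):
                results.append({**left_row, **right_row})
    return results


order_rows = [
    {"event_id": 1, "order_id": "a"}, {"event_id": 2, "order_id": "b"},
    {"event_id": 1, "order_id": "c"}, {"event_id": 3, "order_id": "d"},
]
event_rows = [{"event_id": 1, "name": "Jazz Night"}, {"event_id": 2, "name": "5K Run"}]

result = distributed_hash_join(order_rows, event_rows, "event_id")
print(sorted(r["order_id"] for r in result))  # ['a', 'b', 'c']
```

Rows with the same key always land on the same worker, so no worker needs the full right-hand table. The price is that both tables are shuffled across the network.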

 

8. Partition your data

Partitioning divides your table into parts and keeps the related data together based on column values such as date or country.  You define partitions at table creation, and they help reduce the amount of data scanned per query, thereby improving performance. 

Here are some hints on partitioning:

  • Columns that are used as WHERE filters are good candidates for partitioning.
  • Partitioning has a cost. As the number of partitions in your table increases, so does the overhead of retrieving and processing the partition metadata, and your files get smaller. Use caution when partitioning and make sure you don’t partition too finely.
  • If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.

A key partition column at Eventbrite is transaction date (trx_date).

CREATE TABLE IF NOT EXISTS fact_ticket_purchase
(
    ticket_id STRING,
....
    create_date STRING,
    update_date STRING
)
PARTITIONED BY (trx_date STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
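
Why a filter on the partition column is so cheap can be shown with a toy Python model. The `PartitionedTable` class is hypothetical and keeps everything in memory; in Hive each partition is a separate directory of files.

```python
from collections import defaultdict


class PartitionedTable:
    """Rows are stored in separate buckets, one per partition value."""

    def __init__(self, partition_column):
        self.partition_column = partition_column
        self.partitions = defaultdict(list)

    def insert(self, row):
        self.partitions[row[self.partition_column]].append(row)

    def scan(self, partition_value=None):
        """Return (rows, partitions_read); a partition filter skips every other partition."""
        if partition_value is None:
            chosen = list(self.partitions)
        else:
            chosen = [p for p in self.partitions if p == partition_value]
        rows = [row for p in chosen for row in self.partitions[p]]
        return rows, len(chosen)


table = PartitionedTable("trx_date")
for day in ("2020-07-01", "2020-07-02", "2020-07-03"):
    for i in range(2):
        table.insert({"ticket_id": f"{day}-{i}", "trx_date": day})

rows, partitions_read = table.scan("2020-07-02")
print(len(rows), partitions_read)  # 2 1
```

A query without a filter on `trx_date` has to read all three partitions, while the filtered scan touches only one.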

 

9. Optimize columnar data store generation

Apache Parquet and Apache ORC are popular columnar data stores. They store data efficiently by using column-wise compression based on data type, special encoding, and predicate pushdown. At Eventbrite, we define Hive tables as PARQUET with SNAPPY compression:

CREATE TABLE IF NOT EXISTS dim_event
(
    dim_event_id STRING,
....
    create_date STRING,
    update_date STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')

Apache Parquet is an open-source, column-oriented data storage format. Snappy is designed for speed and will not overload your CPU cores. The downside of course is that it does not compress as well as gzip or bzip2.

 

10. Presto’s Cost-Based Optimizer/Join Reordering 

We’re not currently using Presto’s Cost-Based Optimizer (CBO)! Eventbrite data engineering released Presto 330 in March 2020, but we haven’t tested CBO yet.

The CBO inherently requires table stats to be up-to-date, and we currently calculate them for only a small subset of tables! Using the CBO, Presto will be able to intelligently decide the best join sequence based on the statistics stored in the Hive Metastore.

As mentioned above, the order in which joins are executed in a query can have a big impact on performance. If we collect table statistics then the CBO can automatically pick the join order with the lowest computed costs. This is governed by the join_reordering_strategy (=AUTOMATIC) session property and I’m really excited to see this feature in action.

Another interesting join optimization is dynamic filtering. It relies on the stats estimates of the CBO to correctly convert the join distribution type to “broadcast” join. By using dynamic filtering via run-time predicate pushdown, we can squeeze out more performance gains for highly-selective inner-joins.  We look forward to using this feature in the near future!
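
The idea behind dynamic filtering can be sketched in Python as follows. This is a conceptual illustration only, not how Presto implements it; the function name and sample tables are hypothetical.

```python
def join_with_dynamic_filter(probe_rows, build_rows, key):
    """Inner join that filters the probe side using keys seen on the build side."""
    # 1. Build the hash table from the small, selective side of the join.
    lookup = {}
    for row in build_rows:
        lookup.setdefault(row[key], []).append(row)

    # 2. The set of build-side keys becomes a run-time predicate for the big scan.
    dynamic_filter = set(lookup)
    surviving = [row for row in probe_rows if row[key] in dynamic_filter]

    # 3. Only the surviving rows reach the join itself.
    joined = [{**probe, **build} for probe in surviving for build in lookup[probe[key]]]
    return joined, len(surviving)


# 1,000 orders spread over 100 events, joined to a single selected event.
order_facts = [{"order_id": i, "event_id": i % 100} for i in range(1000)]
selected_events = [{"event_id": 7, "name": "Jazz Night"}]

joined_rows, rows_after_filter = join_with_dynamic_filter(order_facts, selected_events, "event_id")
print(len(joined_rows), rows_after_filter)  # 10 10
```

In a real engine the filter is pushed down to the table scan, so the 990 non-matching rows are discarded before they are ever shipped to the join.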

 

11. Using WITH Clause

The WITH clause is used to define an inline view within a single query.  It allows for flattening nested subqueries. I find it hugely helpful for simplifying SQL, and making it more readable and easier to support.
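
As a small runnable illustration, the snippet below compares a nested subquery with the equivalent WITH clause. It uses Python's built-in SQLite as a stand-in for Presto (the WITH syntax shown is standard SQL), and the `orders` table and its values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (event_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 20.0), (1, 30.0), (2, 15.0), (3, 5.0);
""")

# Nested form: the aggregation is buried inside the FROM clause.
nested = """
    SELECT event_id, total
    FROM (SELECT event_id, SUM(amount) AS total FROM orders GROUP BY event_id) AS t
    WHERE total > 10
    ORDER BY event_id
"""

# WITH form: the same aggregation is named up front and reads top to bottom.
with_clause = """
    WITH event_totals AS (
        SELECT event_id, SUM(amount) AS total
        FROM orders
        GROUP BY event_id
    )
    SELECT event_id, total
    FROM event_totals
    WHERE total > 10
    ORDER BY event_id
"""

assert conn.execute(nested).fetchall() == conn.execute(with_clause).fetchall()
print(conn.execute(with_clause).fetchall())  # [(1, 50.0), (2, 15.0)]
```

Both queries return the same rows; the benefit of the WITH form is readability, which grows quickly once several levels of subqueries are involved.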

 

12. Use Presto Web Interface

Presto provides a web interface for monitoring queries (https://prestodb.io/docs/current/admin/web-interface.html). 

The main page has a list of queries along with information like unique query ID, query text, query state, percentage completed, username and source from which this query originated. If the Presto cluster is having any performance-related issues, this web interface is a good place to go to identify and capture slow running SQL!

 

13. Explain plan with Presto/Hive (Sample)

EXPLAIN is an invaluable tool for showing the logical or distributed execution plan of a statement and for validating SQL statements.

Logical Plan with Presto

EXPLAIN
SELECT SUBSTRING(last_modified, 1, 4), COUNT(*)
FROM hive.df_machine_learning.event_text
WHERE LOWER(name) LIKE '%wilbraham%'
   OR REGEXP_LIKE(LOWER(name), '.*wilbraham.*')
GROUP BY 1
ORDER BY 1;

 

14. Explain plan with MySQL (Sample)

In this particular case you can see that the primary key is used on the ‘ejp_events’ table and a non-primary key on the ‘ejp_orders’ table. This query is going to be fast!

 

Conclusion

Presto is the “SQL-on-Anything” solution that powers Eventbrite’s data warehouse. It’s been very rewarding for me as the “Old School DBA” to learn new SQL tricks related to a distributed query engine such as Presto. In most cases, my SQL training on MySQL/Oracle has served me well but there are some interesting differences which I’ve attempted to call-out above. Thanks for reading and making it to the end. I appreciate it!

We look forward to giving Presto’s Cost-Based Optimizer a test drive and kicking the tires on new features such as dynamic filtering & partition pruning!

All comments are welcome, or you can message me at ed@eventbrite.com. You can learn more about Eventbrite’s use of Presto by checking out my previous post at Boosting Big Data workloads with Presto Auto Scaling.

Special thanks to Eventbrite’s Data Foundry team (Jeremy Bakker,  Alex Meyer, Jasper Groot, Rainu Ittycheriah, Gray Pickney, and Beck Cronin-Dixon) for the world-class Presto support, and Steven Fast for reviewing this blog post. Eventbrite’s Data Teams rock!