3 Questions With Sapna Nair — Eventbrite’s New VP of Engineering in India

Sapna Nair joins Eventbrite as our new Managing Director and Vice President of Engineering in India. Sapna is a dynamic leader who will lead Eventbrite’s expansion into India and add to our engineering expertise.

Her experience building distributed teams will accelerate hiring of top-tier talent in India, helping to deliver on our ambitious technical vision and high-growth business strategy.

Learn why Sapna chose Eventbrite and the approach she’s taking to build out her new team with these three questions.

Sapna Nair

Q. What attracted you to Eventbrite?

There are three reasons that attracted me to Eventbrite. First, I strongly believe that life is more enjoyable and meaningful when people come together for shared experiences. Eventbrite has built a phenomenal ecosystem that powers event creators all over the world to cultivate connection, build community, and scale their businesses.

Second, my discussions with all the leaders of Eventbrite were very candid, inspiring and confident. The leadership had a clear multi-year strategy for the accelerated growth of the company. They have such a strong belief in the mission of the company. The entire experience was so welcoming, indicating the oft-sought after people oriented culture, with a motivating vision.
Third, given the first two reasons, the opportunity to build and grow that same organization ground up in India was a rare chance to leverage my past skills and experience most effectively, and at the same time, I also continue to learn more through this journey.

Q. What excites you most about building and developing engineering teams?

I find the opportunity to define the best practices in people, process and technology, on a clean slate — with no bias or baggage — highly challenging and satisfying. Having said that, now contradicting my own earlier statement of having a clean slate, even though there is no bias or baggage for the specific team(s), there always exists a reference with respect to another team in another geography or another company. That makes the entire dynamics very interesting.

I love the enormous prospect it offers to coach managers and ICs. Building engineering teams comes with a lot of learning moments. Though I have done it numerous times in the past, every new cycle teaches me something new.

There is a very common impression that engineering teams are solely focussed on technology. That is true, but it is also true that engineering teams need to understand the purpose of the use of their technology. That is what triggers their innovation and inspires them to deliver their best. It means engineering teams must remain connected with the geographically distributed business teams and leadership.

I am exhilarated when, keeping engineering teams in front and center, I get to bring together all the stakeholders, across different cultures/time zones/accountabilities, with a common purpose of delighting our customers. Ultimately, the pride and satisfaction I see on the faces of our engineers is priceless, when they establish themselves as the CoE, surpassing
all the teething troubles!

Q: How do you prioritize your well-being in a remote-first environment?

Setting clear expectations starting with:
  • Remote-first does not equal to 24/7 availability.
  • Making my work hours known to all.
  • Defining everyone’s accountability
  • Defining rules of engagement with all the stakeholders
  • Empowering and encouraging others to manage their own flexibility, like declining meetings if it’s not convenient to them.

Advocating use of technology and automation as much as possible (like dashboards, Slack) to reduce online meeting fatigue and avoid information silos. Blocking slots in my calendar for my ‘Me-Time’.

Looking to join Sapna’s team? She’s hiring! Check out her open roles here.

Monitoring Your System

As Eventbrite engineering leans into team-owned infrastructure, or DevOps, we’re obviously learning a lot of new technologies in order to stand up our infrastructure, but owning the infrastructure also means it’s up to us to make sure that infrastructure is stable as we continue to release software. Obviously, the answer is that we need to own monitoring our services, but thinking about what and how to monitor your system requires a different type of mental muscle than day-today software engineering, so I thought it might be helpful to walk through a recent example of how we began monitoring a new service.

The Use-Case

My team recently launched our first production use-case for our new service hosted within our own infrastructure. This use-case was fairly small but vital, serving data to our permissioning service to help it build its permissions graph in its calculate_permissions endpoint. Although we were serving data to only a single client, this client is easily the most trafficked service in our portfolio (outside of the monolith we’re currently decomposing) and calculate_permissions processes around 2000 requests per second. Additionally, performance is paramount as said endpoint is used by a wide variety of services multiple times within a single user request, so t, so much so that the service has traditionally had direct database access, entirely circumventing the existing service we were re-architecting. We needed to ensure that our new service architecture could handle the load and performance demands of a direct database call. If this sounds like a daunting first use-case, you’re not wrong. We chose this use-case because it would be a great test for the primary advantage of the new service architecture: scalability.

The Dashboards

For our own service, we created a general dashboard for service-level metrics like latency, error rate and the performance of our infrastructure dependencies like our Application Load Balancer, ECS and DynamoDB. Additionally, we knew that in order for the ramp-up to be successful, we’d need to closely monitor not only our service’s performance, but even more importantly, we’d need to monitor our impact to the permissions service to which we were serving data. For that, we created another dashboard focused on the use case combining metrics from both our service and our client.

We tracked the performance of the relevant endpoints in both services to ensure we were meeting the target SLOs:

We added charts for our success metrics, in this case, we wanted to decrease the number of direct-database calls from the permissions service which we watched fall as we ramped up the new service.

We added metrics inside of our client code to measure how long the permission service was waiting on calls to our service. In the example below, you can see that the client-implementation was causing very erratic latency (which was not visible on the server side). Seeing this discrepancy in performance on the client and server sides, we detected an issue in our client implementation which had a dramatic impact on performance stability. We addressed this volatility by implementing connection pooling in the client. 

As the ramp-up progressed, we also added new charts to the dashboard as we tested various theories. For instance, our cache hit rate was underwhelming. We hypothesized that the format of the ramp up (percentage of requests) actually meant that low percentages would artificially lower our cache hit rate so we added this chart to compare the hit rate against similar time periods. It’s important to keep context in mind; fluctuations may be expected throughout the course of a day or week (I actually disabled the day-over-day comparison below because the previous day was a weekend and traffic was impacted as a result). This new chart made it very easy to confirm our suspicions.

This is just a sampling of the data we’re tracking and the metrics collection we implemented, but the important lesson is that your dashboard is a living project of its own and will evolve as you make new discoveries about your specific system. Are you processing batch jobs? Add metrics to compare how various batch sizes impact performance and how often you get requests with those batch sizes. Are you rolling out a big feature? Consider metrics that allow you to compare system behavior with and without your feature flag on. Think about what it means for your system or project to be successful and think about what additional metrics will help you quantify that impact.

Alerting

Monitoring is particularly great when launching a new system or feature and it can be very helpful when debugging problematic behavior, but for day-to-day operations you’re not likely to pour over your monitors very closely. Instead, it’s essential to be alerted when certain thresholds are approached or specific events happen. Use what you’ve learned during monitors to create meaningful alerts (or alarms).

Let’s revisit the monitor above that leveled out after a client configuration change.

I would probably like to know if something else causes the latency to increase that dramatically. We can see that the P95 latency leveled off around 17ms or so. Perhaps I’d start with an alert triggered when the P95 latency rises above 25ms. Depending on various parameters (time of day, normal usage spikes, etc.), maybe it’s possible for the P95 to spike that high without the need to sound the alarms, so I’d set up the alert to only fire when that performance is sustained over a 5 minute period. Maybe I set up that alert, and it goes off 5 times in the first week and we choose not to investigate based on other priorities. In that case, I should consider adapting the alerts (maybe increasing the threshold or span of time) to better align with my team’s priorities. Alerts should be actionable and the only thing worse than no alert, is an alert that trains your teams to ignore alerts.

Like with monitoring, there is no cookie-cutter solution for alerting. One team’s emergency may be business as usual for another team. You should think carefully about what is a meaningful alert to your team based on the robustness of the infrastructure, your real-world usage patterns and any SLA’s the team is responsible for upholding. Once you’re comfortable with your alerts, they’ll make great triggers for your on-call policies. Taking ownership over your own infrastructure takes a lot of work and can feel very daunting at first, but with these tools in place, you’re much more likely to enjoy the benefits of DevOps (like faster deployments and triage) and spend less time worrying.

Packaging generated code from protobuf files for gRPC Services

Background

At Eventbrite, we identified in our 3-year technical vision that one of our goals is to enable autonomous dev teams to own their code and architecture so as to be able to deliver reliable, high quality and cost effective solutions to our customers. However,  this autonomy does not mean that our team has to work in complete isolation from other teams in order to achieve their goals.

Over the past year, we have started our transition from our monolithic Django + Python approach to a microservices architecture; we selected gRPC as our low-latency protocol for inter-microservice communication. One of the main challenges that we face is sharing Protobuf files between teams for generating client libraries. We want it to be as easy as possible by avoiding unnecessary ceremonies and integrating into  team development cycles.

Challenges managing Protobuf definitions

Since our teams have full autonomy of their code and infrastructure, they will have to share Protobuf files. Multiple sharing  strategies are available, so we identified key questions:

Should we copy and paste .proto files in every repository where they are needed? This is not a good idea and could be frustrating for the consuming teams. We should avoid any error-prone or manual activity in favor of a fully automated process. This will drive consistency and reduce toil.

How will changes in .proto files impact clients? We  should implement a versioning strategy to support changes.

How do we communicate changes to clients? We need a common place to share multiple versions with other teams and adopt a standard header to client  expectations, such as Deprecation and Sunset.

Our proposed solution

We will maintain protobuf files within the owning service’s repository to simplify ownership. The code owners are responsible for generating the needed packages for their clients. Their CI/CD pipeline will automatically generate the library code from the protobuf file for each target language.

Packages will be published in a central place to be consumed by all client teams. Each package will be versioned for consistency and communication. Before deprecating and sunsetting any package version, all clients must  be notified and given enough time to upgrade.

Repository Structure

In our opinion, having a monorepo for all protobuf definitions would slow down the teams’ development cycles: each  modification to a Protobuf definition would require a PR to publish  the change in the monorepo, waiting for an approval  before  generating required  artifacts and distributing them to clients. Once the package was published, teams would have to update the package and publish a new version of their services. We need to keep the Protobuf files with  their owning service. 

Project Structure

The project’s organization should  provide a clear distinction between the services that exist in the project and the underlying Protobuf version that the package is implementing. The proto folder will hold the definition of each proto file with a correctly formed version using the package specifier. The service folder will hold the implementation of each gRPC service which is registered against the server. 

The proto folder will hold the definition of each proto file with a correctly formed version using the package specifier.
The proto folder will hold the definition of each proto file with a correctly formed version using the package specifier.

This approach will allow us to publish a v2 version of our service with breaking change, while we continue supporting the v1 version. We should take into consideration the next points when we publish a new version of our service:

  • Try to avoid breaking changes (Backward and forward compatibility)
  • Do not change the version unless making breaking changes.
  • Do change the version when making breaking changes.

Proto file validation

To make sure the proto files do not contain errors and to enforce good API design choices we recommend using Buf as a linter and a breaking change detector. It should be used on a daily basis as part of the development workflow, for example, by adding a pre-commit check to ensure our proto files do not contain any errors.

Following our “reduce toil over automation” principle, we added a task in our CI/CD pipelines in CircleCI. A Docker image is available to add some steps for linting and breaking change detection. It helps us to ensure that we publish error-free packages:

Following our “reduce toil over automation” principle, we added a task in our CI/CD pipelines in CircleCI.
Following our “reduce toil over automation” principle, we added a task in our CI/CD pipelines in CircleCI.


If a developer pushes breaking changes or changes with linter problems, our CI/CD pipelines in CircleCI will fail as can be  seen in the pictures below:

breaking changes or changes with linter problems, our CI/CD pipelines in CircleCI will fail
Breaking changes or changes with linter problems, our CI/CD pipelines in CircleCI will fail.

Linter problems

Example Linter problems
Example Linter problems

Breaking changes

Example Breaking changes
Example breaking changes

Versioning packages

Another challenge is building and versioning artifacts from the protobuf file-generated code. We selected Semantic Versioning as a way to publish and release packages’ versions.

The package name should reflect the service name and follow the conventions established by the language, platform, framework and community.

Generating code for libraries

We have set up an automated process in CircleCI to generate code for libraries. Once a proto file is changed and tagged, CircleCI detects the changes and begins generating the code from the proto file.

We compile it using protoc. To avoid the burden of installing it, we use a Docker image that contains it. This facilitates our local development as well as CI/CD pipelines. Here is the CircleCI configurations:

We compile it using protoc. To avoid the burden of installing it, we use a Docker image that contains it.
We compile it using protoc. To avoid the burden of installing it, we use a Docker image that contains it.

In the previous example, we are generating code for python but it can also be generated for Java, Ruby, Go, Node, C#, etc.

Once code is generated and persisted into a CircleCI workspace it’s time to publish our package.

Publishing packages

This process could be overwhelming for teams if they had to figure out how to package and publish each artifact in all supported languages in our Golden Path. For this reason we took the same approach as docker-protoc and we dockerized a tool that we developed called protop.

Protop is a simple Python project that combines typer and cookiecutter to provide us a way to package the code into a library for each language. At the moment it only supports PyPI using Twine because our main codebase of consumers are in Python, but we are planning to addGradle support soon.

The use of protop is very similar to docker-protoc. We published a dockerized version of protop to an AWS Elastic Container Registry to allow teams to use it in their CI/CD pipelines in CircleCI:

We published a dockerized version of protop to an AWS Elastic Container Registry to allow teams to use it in their CI/CD pipelines in CircleCI
We published a dockerized version of protop to an AWS Elastic Container Registry to allow teams to use it in their CI/CD pipelines in CircleCI

At Eventbrite we use AWS CodeArtifact  in order to store other internal libraries so we decided to re-use it to store our gRPC service libraries. You can see a diagram of the overall process below.

 AWS CodeArtifact  stores both internal libraries and our gRPC service libraries.
AWS CodeArtifact  stores both internal libraries and our gRPC service libraries.

This AWS CodeArtifact repository should be shared by all teams in order to have only one common place to find those packages instead of having to ask each team what repository they have stored their packages in and having lots of keys to access them.

The teams that want to consume those packages should configure their CI/CD pipelines to pull the libraries down from AWS CodeArtifact when their services are built.

This process will help us reduce the amount of time spent in service integration without diminishing the teams’ code ownership..

Using the packages

The last step is to use our package. With the package uploaded to AWS CodeArtifact, we need to update our Pipfile:

Updated PIp File to use the artifact.
Updated PIP File to use the artifact.

or requirements txt

Alternative way of using Protobuf files.
Alternative way of using Protobuf files.

Conclusion

We started out by defining the challenges of managing Protobuf definitions at Eventbrite, explaining the key questions about where to store these definitions, how to manage changes and how to communicate those changes. We’ve also explained the repository and project structure.

Then, we proceed to cover protobuf validation using Buf as a linter and a breaking change detector in our CI/CD pipelines and how to version using Semantic Versioning as a way to publish and release packages’ versions.

After that, we’ve turned out to focus on how to generate, publish and consume our libraries as a kind of SDK for the service’s domain allowing other teams to consume gRPC services in a simple way..

But of course, this is the first iteration of the project and we are already planning actions to be more efficient and further reduce toil over automation. For example, we are working on generating the packages’ version automatically using something similar to Semantic Release to avoid teams having to update the package version manually and therefore avoiding error-prone interactions. 

To summarize, if you want to drastically reduce the time that teams waste on service integration avoiding a lot of manual errors, consider automating as much as you can the process of generating, publishing and consuming your gRPC client libraries.

Reflecting on Eventbrite’s Journey From Centralized Ops to DevOps

Once a scrappy startup, Eventbrite has quickly grown into the market leader for live event ticketing globally. Our technical stack changed during the first few years, but as with most things that reach production, pieces and patterns lingered. 

Over the years, we leaned heavily into a Django, Python, MySQL stack, and our monolith grew. We changed how our monolith was deployed and scaled as we went into the AWS cloud as an early adopter. This entailed building internal tooling and processes to solve specific problems we were facing, and doubling down on our internal tooling while the cloud matured around us. 

Keeping up with traffic bursts from high-demand events

Part of the fun and challenge of being a primary ticketing company is handling burst traffic from high traffic on-sales — these are high-demand events that generate traffic spikes when tickets are released to purchase at a specific time. Creators (how we refer to folks that host events) will often gate traffic externally, and post a direct link to an Eventbrite listing on a social network or their own websites. Traffic builds on these sites while customers wait for a link to be posted. The result is hundreds of thousands of customers hitting our site at once

Ten-plus years ago, this was incredibly difficult to solve, and it’s still a fun challenge from a speed of scale and cost perspective. Ultimately, challenges around the reliability of our monolithic architecture led to us investing in specialized engineering teams to help manually scale the site up during these traffic bursts as well as address the day-to-day maintenance and upkeep of the infrastructure we were running. 

A monolithic architecture isn’t a bad thing — it just means there are tradeoffs 

On one hand, our monolithic setup allowed us to move fast. Having some of Django’s core contributors helped us solve complex industry problems, such as high-volume on-sales in which small numbers of tickets go on sale to large numbers of customers. On the other hand, as we and our platform’s features grew, things became unwieldy, and we centralized our production and deployment maintenance in response to site incidents and bug triage. 

This led to us trying to break up the monolith. The result? Things got worse because we didn’t address the data layer and ended up with mini Django monoliths, which we incorrectly called services.

The decision to move from an Ops model to a DevOps model, and the hurdles along the way

Enter our three-year technical vision. In order to address our slowing developer velocity and improve our reliability, performance, and scale, we made an engineering-wide declaration to move away from an Ops model — in which a centralized team had all the keys to our infrastructure and our deployments— to a DevOps model in which each team had full ownership. 

An initial hurdle we had to jump over was a process hurdle. In order for teams to take any ownership, they’d have to be on call 24×7 for the services and code they owned. We had a small number of teams with production access that were on call, but the vast majority of our teams were not. This was an important moment in our ownership journey. And our engineering teams had many questions about the implications of what was not only a cultural but also a process change.

There are many technical hurdles to providing team-level ownership, and it’s tempting to get drawn into a “boil-the-ocean” moment and throw away all the historic learnings and business logic we developed over our history. Our primary building block towards team autonomy was leveraging a multi-AWS sub-account strategy. Using Terraform, we were able to build an account vending system allowing teams to design clear walls between their workloads, frontends, and services. With these walls in place each team had better control and visibility into the code they owned. 

Technical debt, generally, is a complicated ball of yarn to unwind

We had many centralized EC2-based data clusters: MySQL, Redis, Memcache, ElasticSearch, Kafka, etc. Migrating these to managed versions — and the transfer of ownership between our legacy centralized ownership directly to teams — required a high degree of cross-team coordination and focused team capacity. 

As an example, the migration of our primary MySQL cluster to Aurora required 60 engineers during the off-hours writer cutover — they  represented all of our development teams. The effort towards the decentralization of our data is leading us to develop full-featured infrastructure as code building blocks that teams can pull off the shelf to leverage the full capabilities of best-in-class managed data services.

Our systems powering our frontend as well as our backend services are process-wise similar to our data-ownership journey. We have examples of innovation around serverless compute patterns and new architectural approaches to address scale and reliability. We’re making big bets on some of our largest and most-impactful services — two of which still live as libraries in our core monolith. The learnings that are accrued through these efforts will power the second and third year of our three-year tech vision journey. 

The impact thus far, with more unlocks to come

By now, you’re probably realizing that at least some of our teams were shocked at the amount of change happening as their ownership responsibilities increased. We were confident that this short-term pain was worth it. After all, our teams were demanding this through direct feedback in our dev and culture surveys. 

The prize for us on this journey is customer value delivered through increased team velocity. While our monolithic architecture — both on the code and data sides of the house -— got us to where we are today, teams were not happy with their ability to bring change and improvements to things that they owned. This was frustrating for everyone involved, and the gold at the end of the rainbow for us is that teams can make fundamental changes with modern tools and processes. 

In the first year of our three-year technical vision big changes in ownership have been unlocked. As an example, we have migrated to Aurora where teams have ownership of their data. We’ve also provided direct team-level ownership of teams CI pipelines, improved our overall code coverage for testing, provided team autonomy for feature flag releases, and started re-architecting our two largest tier-1 services. It’s exciting to see new sets of challenges arise along the way — knowing these hurdles also unveil opportunities.

Crafting Eventbrite’s Data Vision

Data-driven decisions are the irrefutable holy grail for any company, especially one like Eventbrite, whose mission is to connect the world through live experiences.

I joined the Briteland to lead the Data Org, merging data-platform engineering, analytics engineering, product analytics, strategic insights and data science under one umbrella with a North Star of leveraging our scale and driving actionable insights from data.

When I first met fellow Britelings earlier this year, what immediately won me over was their infectious enthusiasm about the company’s mission, potential for impact — and, importantly, that data is a critical strategic asset to realizing Eventbrite’s vision.

Challenges and Opportunities

The Data Nerd in me couldn’t wait to uncover insights from this rich trove of data: social dynamics and the evolution of live experiences during and post the pandemic, regional microtrends, correlation to vaccinations rates, and so much more. 

However, to first get grounded in reality, I’ve had to play a couple of different roles. As a Data Lobbyist, I’ve been encouraging everyone, from our leaders to our engineers, to seek out data to guide their decisions. As a Data Therapist, I listen and learn from Britelings across all functions about the obstacles they encounter in gaining insights from data. Britelings’ current pain points broadly fall into three buckets: people, process, and technology.

People

Britelings are not aware of what data exists where, and how to start self-serving, especially when they may not have access to get started. As a result, “quick answers” aren’t quick enough, and thorough answers are even more time-consuming, especially when an analyst needs to spare cycles on techniques, tooling, or data semantics.

In addition, development teams are currently dependent on the data engineering team to aggregate and provide data for use in products. This does not align with our technical vision to have each development team own their solution end-to-end, including design, code, quality, deployment, monitoring, data, and infrastructure.

Focus areas: build data culture, remove knowledge silos

Process

There are multiple sources of truth (internal systems, data marts, etc.) that do not always reconcile. There are also several holes between data consumers and data producers in which context gets lost, and there’s a lack of standardized processes to define and update metrics.

Added to this, various manual processes are used for business-critical reporting due to legacy pipelines, data gaps, and incomplete context, causing transformation logic to be siloed with key employees rather than codified systematically with disciplined documentation.

Quick wins: align stakeholders, build end-to-end runbooks, alert proactively

Technology

Dated data infrastructure and stale models have been challenging to maintain and use. With insufficient isolation between production, development, and testing, some production pipelines have emerged as bottlenecks hindering quick iteration.

In addition, in the absence of consistent tooling and guidelines for getting data instrumented in products and integrated into existing pipelines, there are gaps in data coverage and quality that need to be addressed.

Slow down to speed up: modernize infrastructure, implement SDLC for data

These challenges are certainly not unique to Eventbrite; it’s an operational reality for businesses in the modern world. As a Data Leader, it’s heartening to know that Britelings are eager to lift barriers and invest in opportunities that deliver tangible value to our customers!

Goals

With a better understanding of where the gaps are across people, process, and technology, we set the following five goals:

  1. Provide a single source of truth with high-quality data for operational and financial reporting needs.  
  2. Provide tooling, training, and automation for Britelings to make informed decisions autonomously. 
  3. Provide reliable, resilient, scalable, and cost-effective data infrastructure.
  4. Make data more actionable to internal stakeholders, enabling a 360-degree perspective for strategic decisions.
  5. Make data actionable, insightful, and valuable to customers in-product — help creators grow their audience, help attendees find relevant events, and make the product more self-service. 

Tenets

To achieve these goals, we converged on tenets that would guide our execution, especially when confronted with tradeoffs.

  • Democratization over gatekeeping: We favor making data accessible to people (Britelings and Customers) and easier to create/collect more broadly to maximize creativity and value from data — but only within the boundaries of maintaining security and compliance.
  • Self-service over full-service: We will provide tools and consultation for people to better self-serve on data and not rely solely on a centralized team for insights.
  • Agility over uniformity: We believe in not blocking teams from their deliverables if they have a “good enough” option to run with sooner, and iteratively improve based on feedback. We will aim for developer autonomy.
  • Connect and enrich over clone and customize: We prefer to create and enrich modular datasets with additional context, annotations, information for consistent interpretation and use instead of making multiple copies that may eventually diverge and cause inaccuracies or confusion.
  • Comprehensive accuracy over partial freshness: We will prioritize having correct and complete information as of a (recent) point in time over up-to-date information that has not been vetted or reconciled, unless there is a use-case that demands otherwise.

Vision

With goals and ground rules established, we realize that the data team’s mission is to enable Creators, Attendees, and Britelings to self-serve on high-quality data at scale to derive actionable insights that drive business impact. 

It is our vision that: 

  • Creators obtain actionable insights to build their audience, increase ticket purchases, manage their events, and build loyalty amongst their attendees.
  • Attendees find interesting and relevant events from creators they trust. 
  • Britelings have accurate, timely, and actionable insights for operating the business, building better products, and delighting our creators and attendees.

In upcoming posts, we will talk about the Data Strategy and our plans to deliver on this vision. 

Creating the 3 Year Frontend Strategy

Last post we talked about Developing the 3 Year Frontend Vision, in this post we will go into how that vision, the tenets, requirements, and challenges shaped the Strategy moving forward.

One of the key themes in Eventbrite since I joined is DevOps, moving ownership from a single team who has been responsible for ops and distributing that responsibility to each individual team. To give them ownership over decisions, infrastructure, and to control their own destiny. The first step in defining the Strategy was to put together what a Technical Strategy is, and the foundation for that strategy.

Technical Strategy

The overall Technical Strategy is based on availability and ownership. Starting with the way we build our services and frontends, to the way we deploy and serve assets to our customers. The architecture is designed to reduce the blast radius of errors, increase our uptime, and give each team as much control over their space as possible.

Availability

Moving forward we will achieve High Availability (HA), in which our frontends and systems are resilient to faults and traffic, and will operate continuously without human intervention. In order to achieve HA, we will utilize Managed AWS Services or redundant fault tolerant software, and by utilizing content delivery networks (CDN) to increase our performance and resilience by putting our code as close to the customer as possible. We will ensure that all aspects of the system are tested, fault tolerant, and resilient, and that both the client-side and server-side gracefully degrade when downstream services fail.

Ownership

DevOps combines the traditional software development by one team and operations and infrastructure by another into a single team responsible for the full lifecycle of development and infrastructure management. This combination enables organizations to deliver applications at a higher velocity, evolving and improving their products at a faster pace than traditional split teams. The goal of DevOps is to shift the ownership of decision making from the management structure to the developers, improve processes, and remove unproductive barriers that have been put in place over the years.

Frontend

Once we had the foundation of the strategy defined, it was time to define the scope. To understand how to develop a strategy, or to even define one, we need to understand what makes up a “frontend”. In our case, the Frontend is everything from the backend service api calls to the customer. Because of this, we need to design a solution that allows for code to be run in a browser, on a server, service calls from a browser. Once you define the surface area of the solution, it becomes apparent that the scope and complexity of this problem is quickly compounding.

High Level Architecture

We need to define an architecture for everything above the red line in the above graphic. In order to simplify the design, I broke this down into three main areas; The UI Layer consisting of a micro-frontend framework with team built 

Custom Components, a shared Content Delivery Network (CDN) to front all customer facing pages, and a deployable set of bundled software that we code named Oberon, including a UI Rendering Service and a Backend-For-Frontend.

UI Layer

The UI leverages the micro-frontend architecture and modern web framework best practices to build frontends that leverage browser specifications while being resilient and team owned.

Micro-Frontend

When first approaching the micro-frontend architecture I realized that there is no clear definition of what a micro-frontend is.

Martin Fowler has a very high level definition which he states as

“An architectural style where independently deliverable frontend applications are composed into a greater whole”.

Xenon Stack describes a Micro-frontend as

“a Microservice Testing approach to front-end web development.”

Reading through the many opinions and definitions, I felt it was necessary to get a clearer understanding, and for everyone to agree what a micro-frontend architecture is. I worked with a couple of other Frontend Engineers to put together the following definition for a Micro-Frontend.

Definition

A Micro-Frontend is an Architecture for building reusable and shareable frontends. They are independently deployable, composable frontends made up of components which can stand on their own or be combined with other components to form a cohesive user experience. This architecture is generally supported by hosting a parent application which dynamically slots in child components. Components within a micro-frontend should not explicitly communicate with external entities, but instead publish and subscribe to state updates to maintain loose coupling. 

Micro-frontends are inspired by the move to microservices on the backend, bringing the same level of ownership and team independent development and delivery to the frontend.

Self-Contained Components

In order to avoid frontends that over time inadvertently tightly couple themselves and create fragile un-reusable components, we must build components that are encapsulated, isolated, and able to render without the requirement of any other component on the page. 

Component Rendering Pipeline

The Component Rendering pipeline renders components to the customer while the framework defines a set of Interfaces, Application Context, and a predictable state container for use across all of the rendering components.

State Management

State management is responsible for maintaining the application state, inter-component communication and API calls. State updates are unidirectional; updates trigger state changes which in turn invoke the appropriate components so they can act on the changes. 

Content Delivery Network

Our current architecture has resilience issues, where one portion of the site may become slow or unresponsive and that has a direct impact on the rest of the domain, and in many cases cause an overall site availability issue. In order to get around some of this issue, we add a CDN at the ingress of our call stack. Every downstream frontend rendering will contain Cache-Control headers, in order to control the caching of assets and pages in the CDN. During a site availability issue, the rendering fleet may increase the cache control header, caching for small amounts of time (60 seconds – 5 minutes max), for pages that don’t require dynamic rendering, or customer content. Thus taking load off the fleet and increasing it’s resource availability for other areas.

Oberon

Oberon is a collection of software and Infrastructure-as-Code (IaC) that enables teams to set up frontends quickly and to get in front of customers faster. It includes a configurable Gateway pre-configured for authentication as needed, a UI Rendering Service to server-side render UI’s, UI Asset Server to serve client side assets, and a stubbed out Backend-For-Frontend. 

Server Side UI Rendering Service

The UI Rendering Service defines a runtime environment for rendering applications, their components, and is responsible for serving pages to customers. The service maps incoming requests to applications and pages, gathers dependency bundles, and renders the layout to the customer. Oberon will leverage the traffic absorbing nature of a CDN with the scaling of a full serverless architecture. 

Backend-For-Frontends (BFF)

A BFF is part of the application layer, bridging the user experience and adding an abstraction layer over the backend microservices. This abstraction layer fills a gap that is inherent in the microservice architecture, where microservices must compete to be as generic as possible while the frontends need to be customer driven.  

BFFs are optimized for each specific user interface, resulting in a smaller, less complex, and faster than generic backend, allowing the frontend code to 1) limit over-requesting on the client, 2) to be simpler, and 3) see a unified version of the backend data. Each interface team will have a BFF, allowing them autonomy to control their own interface calls, giving them the ability to choose their own languages and deploy as early or as often as they would like.

Next Steps.

Now that we’ve published the 3 Year Frontend Strategy, the hard work begins. Over the next few months we will be defining the low level architecture of Oberon, and working on a Proof Of Concept that teams can start to leverage in early 2022.

Creating a 3 Year Frontend Vision

JC Fant IV
Oct-5th-2021

History

Over the course of the last 21 years I’ve spent time in nearly every aspect of the technical stack, however, I’ve always been drawn to the frontend as the best place to be able to impact customers. I’ve enjoyed the rapid iterations, and the ability to visualize those changes in the browser. It’s why I  spent much of the last 14 years prior to Eventbrite at Amazon (AWS) evangelizing the frontend stack. That passion led me to co-found one of the largest conferences internally to Amazon reaching over 7500 engineers across 6 continents. The conference is focused on all aspects of the Frontend, and helped to highlight technologies that teams could adopt and leverage to solve customer problems.

In March of 2021 I joined Eventbrite to help solve some of those same challenges that I’ve spent much of my career trying to solve. As part of my onboarding I was asked to ramp up on the current problem space and the technical challenges the company faces, and to dive into the issues impacting many of our frontend developers and designers. With all of that knowledge, I was tasked to come up with a 3 Year Frontend Strategy. 

Many of you have already read the first 3 posts in this series, Creating our 3 year technical vision, Writing our 3 year technical vision, and Writing our Golden Path. If you haven’t had a chance, those 3 posts help to set the context for how we defined and delivered our 3 year Frontend Strategy.

Current Challenges and Limitations

In those previous posts, Vivek Sagi and Daniel Micol described many of the problems that backend engineers, and engineers in general face at Eventbrite. My first task was to engage and listen to the Frontend Engineers around the company and to identify more specific frontend challenges and limitations that we face every day.

  • A monolithic architecture leads to teams having unnecessary dependencies and being forced to move at the speed of the monolith. They are often blocked by other changes or the release schedule of the monolith.
  • Our performance is suboptimal leading to some poor customer interfaces and low lighthouse scores. 
  • We lack automation in how we test, deploy, monitor and roll back our frontend code.
  • Our frontends are currently written in both a legacy framework and a more modern framework where the rendering patterns have diverged, and are no longer swappable without a migration. 
  • Service or datastore performance issues have a high blast radius where  all aspects of the site are degraded including pages that are static in nature.
  • Our front end experiences are inconsistent across our product portfolio and making changes to deliver against our 3-year self service strategy requires too much coordination.

Developing Requirements

Now that we had a decent understanding of the issues we’ve been facing, we turned our attention to understanding the requirements to solve these problems. 

  1. Features. As our product offering evolves to deliver high quality self-service experiences for creators and attendees, we ensure that our technology stack enables teams to efficiently create, optimize, and maintain the net new functionality we provide. 
  2. Performance. User perception of our product’s performance is paramount: a slow product is a poor product that impacts our customers’ trust. 
  3. Search Engine Optimized. Through page speed, optimized content, and an improved User Experience, our frontends must employ the proper techniques to maintain or increase our SEO.
  4. Scale. Our frontends must out-scale our traffic, absorbing load spikes when necessary, and deliver a consistent customer experience.  
  5. Resilient. Our frontends will respond to customer requests, regardless of the status of downstream services. 
  6. Accessible. Our frontends will be developed to ensure equal access and opportunity to everyone with a diverse set of abilities.
  7. Quality.  The quality of our experiences should be prioritized to deliver customer value, solve customer problems, and be at a level of performance that meets our SLA’s and reduces customer reported bugs. 

Defining Our Tenets

We set out to define a core set of tenets for this strategy; a core set of principles designed to guide our decision making. These tenets help us to align the vision and decisions against our end goals. I wanted these tenets to be focused on driving the solution to be something that Frontend Engineers want to adopt, not something they must. We need to deliver something that is seductive, makes engineers’ lives better, and in turn is able to directly impact our customers; as engineers are able to move quicker, and have the autonomy and ownership to make decisions.

  1. Developer Experience. Start with the developer and work backwards. Tools and frameworks must enable rapid development. Developing inside the Frontend Strategy must be easy and fast, with limited friction.
  2. Metric Driven. We make decisions through the use of metrics; measuring how our pages and components behave and their latencies to drive changes.
  3. Ownership. Teams control their own destiny from end-to-end. From the infrastructure to the software development lifecycle (SDLC), owning the full stack leads to better customer focus, team productivity, and higher quality code.
  4. No Obstacles. We remove gatekeepers from the process by providing self-service options, reusable templates, and tooling.
  5. Features Over Infrastructure. We leverage solutions that unlock frontend engineer productivity, in order to focus on customer features rather than maintaining our infrastructure. 
  6. Pace of Innovation. We build solutions to obstacles that interfere with getting features in front of customers.
  7. Every Briteling. We build tools and leverage technology that allows every Briteling to build customer facing features. 

Developing Our Vision

Now that we had the challenges, requirements and tenets outlined, we needed to define a vision for this 3 year frontend strategy. Following the tenets, we want to empower Britelings to deliver customer impactful features, and make our customers lives better. We want this vision to be something everyone in the company can get behind, and as such we don’t actually reference Frontend Engineers, instead we strive to empower ALL Britelings to deliver customer impactful experiences.

Vision

Delight creators and attendees by empowering Britelings to easily design, build, and deliver best in class user experiences. 

Next Post we will talk about the Strategy and the architecture.

A day in the life of a Technical Fellow

In my two most recent blog posts, I talked about how to write a Long-Term Technical Vision and a Golden Path. These are future-looking and high-level artifacts so the question I keep hearing is: do I need to give up coding to grow in my career and become a Technical Fellow? In this post I will explain what it’s like being a Technical Fellow and how to strike a good balance between breadth and depth. Let’s also forget about the specific title for a moment, since different companies will have other names such as Distinguished or Senior Principal Engineer. What really matters is the scope and how to be able to cope with it while ensuring that you don’t become a person who’s too detached from the details and provides overly generic feedback and guidance.

Eventbrite has roughly 40 engineering teams and in theory I could say that my scope covers all of them. However, it’s unrealistic to be involved in so many of them and have enough context to provide meaningful contributions to each team. The two critical aspects for making this work are: knowing how to prioritize my time, and being able to delegate. But how did I learn this?

Earlier in my career, I was the tech lead for a small team with two other engineers. Over time the product that we had built was successful and we grew to three feature teams, with me being the uber tech lead for them. At first I was trying to be as embedded into each of them as I was when I belonged to just one team: attending their standups, being part of the technical design reviews, coding, etc. Soon enough I realized that this approach would not scale and I sought feedback on how to manage the situation. One piece of advice that was critical in my career was: “in order to grow, you need to find or grow other people to do what you’re doing now, so you can then become dispensable and start focusing on something else”. That “something else” could be taking on a larger scope or just finding another area to work on, but the key here is that what we should be aspiring to is growing others so that they end up doing a similar job to what we’re doing now, and we should become dispensable in our current role. It is interesting to think that our goal should be to reach a point where we’re almost irrelevant, and that took me time to properly understand, but it’s really key for career growth.

After growing tech leads in the three teams I was overseeing, I could start focusing on the larger picture. However, I didn’t want to become too detached from the lower level details, so I opted for working in a rotating way with the three teams, where each quarter I would become a part-time IC for each of those teams, including coding tasks, designs, code reviews and being on call. And I say part-time because I still had to invest time in my breadth activities and thinking about the long term. I structured my schedule in a way where my mornings would be mostly IC work and the afternoons would be filled with leading the overall organization and being a force multiplier. This dual approach where I oversaw the larger organization but also had time to tackle lower level aspects allowed me to focus on the bigger picture while being attached to the actual problems that teams were facing, and have enough context to be useful when providing them feedback and guidance.

Time has passed and at Eventbrite I now follow a similar model but with a larger set of teams. Since the rotational approach won’t work as well (rotating a team per quarter will take me 10+ years to complete each rotation), we decided to implement a model where Principal Engineers and above (including Technical Fellows) would have different engagement levels with each team, which could be divided into the three categories listed below:

  • Sponsors are part of a team and spend ~2 days/week working with that team, which includes attending the standup, participating in system designs, coding and being on call. We expect Principal+ engineers to sponsor at most 2 areas at any given point in time.
  • Guides spend ~2 hours/week on a given project. They are aware of the team’s mission and roadmap, protect the long-term architecture, provide the long-term direction of the product, and may be active in the code base.
  • Participants are available to a team for any questions they have or to help disambiguate areas of concern, they are active in meetings but may not be deep in the code base. Participants spend a few hours a month on the project/team.

With the above in mind, I am sponsoring two teams right now, and that is expected to rotate based on the teams who will need my involvement the most. As of today this means that I’m more involved in the Ordering and Event Infrastructure teams, including coding, working on technical designs, mentoring others in the team, etc.

So what’s a day in my life look like? 

As I mentioned before, I structure my day so that in the mornings I will do IC work and the afternoons will be for breadth work. Right now my main area of focus as an IC is getting our new Ordering Pipeline implemented and that’s where I spend most of my coding cycles. This is a brand new service written in Kotlin, gRPC and uses AWS technologies such as DynamoDB and Lambdas. It’s particularly critical not only because Ordering is at the core of Eventbrite, but because it’s paving the way for the new generation of services that we’re starting to build in the company, since this is the first one with the technologies and processes outlined in the 3-Year Technical Vision and the Golden Path. And such, many other services who will follow will use what Ordering is building today as their reference architecture, and we’re also finding a few unpredicted gaps that we have to solve before other teams find them. I was also on call for this team a couple of weeks ago.

In contrast, my afternoons are typically filled with breadth work, that is, with 1:1s, syncs with other people in Argentina or the US, company tech talks, design reviews, and others. For example, I was recently heavily involved in coming up with a new engineering career guide for the company (which we’ll blog about at some point), or attending leadership syncs with our CTO and CPO about the current state of our Foundations and the challenges ahead of us.

As time passes my focus will move away from Ordering to other areas where I can contribute in depth, and by then I will have expected to grow the team to a state where they don’t miss me and they can keep moving forward without my help. Breadth work is there to stay and can be very different each week depending on what the company needs the most at that particular moment.

Writing our Golden Path

In my last blog post I explained how we defined our 3-year technical vision for the company. One of the key pillars of this vision is shifting from a model where we used the same tool for every job (mostly a combination of Python + Django + MySQL), to the right tool(s) for each job. Given that this would be a new way of working for our organization, we wanted to have some guidelines that teams would follow to ensure that our services and applications wouldn’t have a completely different tech stack depending on the team developing them, which would harm the maintainability of our overall architecture. This is why we decided to write a Golden Path document that would guide teams on the best set of technologies for each potential scenario and recommended tools for common repeatable use cases like logging, security, etc. 

The Golden Path is a document that explains the allowed technologies available for use at Eventbrite when building software. It has been built collaboratively by the entire development organization and is in continuous evolution as teams find better solutions for the problems to be solved. We require any technology choice that is not included in this list to have explicit approval from the Architecture Review Committee (ARC), which is our engineering governance body, before implementing it.

Therefore, one principle around our Golden Path is that we are recommending the use of the “right tool for the job,” which most often means opting for industry standard technologies (enabling us to focus our limited innovation tokens on technological advancements unique to live experiences). Teams are encouraged to evaluate other alternatives that are not in this document when working on their system designs, or challenge currently deprecated ones, and propose these edits to ARC if they find them superior or better suited for their use case than the currently approved ones. This is the way we keep this as a living document that improves over time and adapts to new industry trends.

We divide technologies into the following life cycle phases:

  • Emerging. New technologies that are very likely to become recommended but are not production-ready yet.
  • Recommended. The default choice as of today.
  • Allowed. Technologies that we allow although the recommended one should be used if possible.
  • Deprecated. Discouraged for new development but could be maintained for currently-existing systems.
  • Rejected. Technologies that we don’t use or haven’t used in the past but have been rejected in previous evaluations.

Our Golden Path contains several sections such as programming languages (for microservices, data science, frontend), source package managers, web frameworks, databases and caching, among others. The guidance for how to apply the Golden Path when working on a technical design is as follows:

  • Every section in the document should have a matrix that outlines the best path forward for the use cases that we’ve faced in the past, or a description that clearly specifies this. If our use case is in that list, we should choose the best technology outlined in the matrix.
  • Even if we choose a technology that has been already evaluated in the past, we still need to come up with data for our specific scenario in key dimensions such as cost, latency, etc. to ensure that it will work for this specific scenario. 
  • If a section doesn’t have a matrix yet, or our use case is not included, we will conduct a technology evaluation and contribute to the matrix. The guidelines for this are:
    • We should consider at least two options and do a full bake off before we pick a winner. Choose based on the dimensions that are important for our scenario (features, use case fit, ease of use, cost, latency, consistency, etc).

    • We are not limited to AWS technologies. For the decisions that we make, we should evaluate both the AWS offering and any other leading non-AWS contender (e.g. DynamoDB and Cassandra), including compatibility and integration with other tools of the stack. We will not favor AWS by default and will only use it as a tie-breaker if both offerings are equivalent.


    • Technologies that are deprecated shouldn’t be re-evaluated unless there’s a strong belief that the particular scenario that is being designed will be different than the reasons why that technology was deprecated (e.g. we shouldn’t be looking into unmanaged solutions since those are deprecated). These exceptions will need to be approved by ARC.


Our Golden Path was published in early 2021, a few weeks after we finalized our 3-year tech vision, and every technical design or proposal that has emerged since then is following this new standard. We do envision that in a few years from now we should be able to remove these barriers since teams will have enough internal examples to decide the best tool for the job without the risk of significantly diverging the chosen options for similar use cases.

Here are a few examples of sections extracted from the Golden Path document:

Native Libraries and Wrappers

  • Native Libraries (recommended). We should favor using the native libraries of the tools that we use (e.g. AWS SDKs, feature flags, metrics, etc). Each team consuming those SDKs is responsible for upgrading to newer versions when needed.
  • Wrappers (deprecated). We do not want to use wrappers unless they provide clear additional benefit over native libraries (such as extended capabilities or use simplicity), and we do not believe in the argument that using native libraries is a lock-in to a specific technology, as the downside of building and consuming our own wrappers is a bigger problem. Wrappers tie us to specific underlying library versions, require migration effort as new native library versions are released, and are always a subset of the functionality that those libraries provide.

Microservice Programming Languages

  • Kotlin (recommended). This is the recommended language based on the JVM. It has several benefits over Python such as being multi-threaded, improved performance, and being strongly typed, among others. We should use this language whenever we need to build services that are scalable or performant.
  • Python (recommended). We support it given our extensive in-house knowledge and current stack. We should be careful when using it with services that are expected to have significant load since it’s single-threaded and interpreted languages are typically slower than compiled ones.
  • Node.js (emerging). We have experience with Node.js for frontend development but not microservices, although we’re evaluating it.
  • Go (emerging). We built the integration service in this language. We believe that Go has potential and we should do a feature evaluation at some point.

Service-to-service Communication

This is the communication that happens when a service calls another one directly, and can be either synchronous or asynchronous.

  • gRPC (recommended). This is the only recommended RPC protocol.
  • PySOA / Legacy SOA (deprecated). We support the services that are written in these protocols that are currently in production but don’t allow any new ones to use them.

Relational Databases

Useful when there are multiple entities in the data model that are strongly related.

  • AWS Aurora (recommended). We recommend AWS Aurora which is a managed database compatible with MySQL and PostgreSQL. However, we support only the MySQL flavor.
  • AWS RDS (rejected). We don’t allow RDS since it is less scalable than Aurora although it offers very similar functionality.
  • MySQL (deprecated). We maintain the current databases that we have on MySQL but don’t allow any new functionality to be implemented on this database.

Writing our 3-year technical vision

I joined Eventbrite as their first Technical Fellow, the most senior engineering individual contributor role in the company. One of my initial goals was to come up with an overarching technical vision for the whole company aligned with our 3-year business strategy, and that would move us away from a monolithic architecture and central SRE team to a distributed system where we shift ownership to each team. In our most recent post, Vivek Sagi described the list of problems that we identified and our future-looking goals, which to recap are:

  • Deliver reliable, high quality, cost effective software solutions to our creators and consumers that allows the business to grow revenue 5x by 2023.
  • Enable autonomous dev teams that own their code and architecture. Provide these teams the platform, tooling, and access required to own end-to-end production support for their services.
  • Improve dev team accountability to deliver against high level OKRs while giving them autonomy to decide on the path to get there.
  • Drive automation and reduce toil. All feature dev teams should be  able to apply 60% of their capacity to deliver new business value by 2023. This balance is an estimate based on best performing mature product teams that we have seen in our past experience.
  • Establish an operational excellence bar. Deliver 99.99% uptime across all customer facing services.

To accomplish these goals, I started working with other engineers and product leaders to understand the history of our technical architecture and the challenges that we were facing including developer productivity issues, site reliability problems or scalability limitations. From these goals, we derived a set of requirements for our 3-year technical vision:

  1. Features. As our product offering evolves to deliver high quality self-service experiences for Super Creators and Consumers, we must ensure that our technology stack enables teams to efficiently create, optimize, and maintain the net new functionality we will need to provide. For example, Super Creators require multi-event creating/editing, organization level reporting, and multi-event cart support – all of which will require significant architectural changes relative to our current offering. In addition, a new bundle of marketing tools will enhance creators’ ability to acquire new audiences and grow existing ones, especially by leveraging automation and machine learning to simplify the experience while increasing the impact. We seek to improve our offering for consumers to discover, and attend events and to maintain trust in our platform.
  2. Leveraging Data. We have the opportunity to power new data differentiated products based on data from over a decade of past events and round out our focused product offering with key 3rd party integrations (e.g. Mailchimp, Zoom).
  3. Performance. User perception of our product’s performance is paramount: a slow product is a poor product. In addition, better page performance leads to better SEO rankings. We decided to leverage Lighthouse’s performance score, an industry standard web dev performance metric, and we endeavor to achieve a green score (90 to 100) across our customer facing features. We also must enforce low latency in our internal infrastructure and API response times, and set reduction goals year-over-year.
  4. Scale. We will support two types of scaling improvements. We will scale our systems to handle 5x the current load as we grow our business and we need to have systems that support this load and scale to such limits. The second one is related to spikiness in our traffic due to large event sales, where today we use a Waiting Room to throttle calls to our services and DB. We will design systems that can autoscale and descale in certain events and avoid having to overprovision our infrastructure on a manual basis.
  5. Quality. The defect rate of our product offering can either make or break the experience for our users. In the past year, we have reduced the quantity of critical open bugs from 311 down to 175 and also reduced the number of bugs that missed our fix SLA from 200 to 110. We should aggressively lean into this trend and continue to reduce both by 50% YoY. We will improve our ability to deliver along that trend by increasing our test coverage, reducing our code complexity, having better tooling and increasing our level of automation.
  6. Self-Service. We will improve self-service both externally and internally. For the former we will aim for a 50% YoY customer support contact rate reduction relative to total ticket sales, while ensuring that help center page views don’t disproportionately grow – the point being that we deliver product experiences that have sufficient in-line guidance to result in successful experiences. Internally we will ensure that data is accessible by teams, each of the data sources and services has clear documentation and runbooks as well as contracts and use cases. We will define these in “How We Work” guidelines that every team will follow.
  7. Development Process. Finally, we must streamline our internal development processes and progress along the DevOps Big 4 to these levels: Deployment frequency: Elite (Daily for web and backend services and up to weekly for native apps), Lead time for changes: Elite (Less than one hour), Mean time to restore service: Elite (Less than one hour), and Change failure rate: Elite (0-15%).

Applying these principles to the problems that were outlined in the previous post, we thought about the following solutions to them:

  1. Our monolith became a bottleneck to our developer velocity and overall site reliability and scalability. We need to decouple our monolith into smaller microservices that can evolve and scale independently. This is a similar trend that many other companies have followed as they grow, and based on our professional experience prior to Eventbrite, we know it works.
  2. Our initial partial attempt to move to a Services Oriented Architecture (SOA) compounded the problem. In our prior attempt, we lacked a clear vision of what moving to SOA meant and how to accomplish it. We moved business logic out but not data, compounding the problem. This time around, we’ve prioritized this architecture transition at a company level, focusing first on the core business logic, including segregating and migrating the underlying data with every service.
  3. Our performance became suboptimal leading to a poor utilization of our hardware resources. We planned to fix this in two ways: by moving to managed services, letting cloud providers deal with this responsibility, and choosing technologies that would autoscale properly based on our traffic patterns, which are spiky by nature due to large onsale events.
  4. Our SDLC process was ad hoc and lacked sufficient controls in a few places. We’ve defined and set ownership boundaries between services and logical components. We’ve also enacted Architecture Review Committees to review designs to ensure we are building extensible services that don’t become monolithic themselves.
  5. Given all the intricate moving parts to release the monolith, we trust our Site Reliability Engineers (SREs) to be the only ones who can coordinate all that infrastructure. We are transitioning to DevOps where each team is the owner of the end-to-end lifecycle of their services. Similar to an earlier point, we’ve implemented this successfully in the past at other companies and we know it works.
  6. We lack automation in how we test, deploy, monitor and roll back our code. Our vision document has sections specifically addressing deployments, testing and operations, indicating that we should aspire to full automation and minimize (and remove, if possible) any manual intervention.
  7. Our core “eb” database is not only monolithic but also mutable, and capturing historical changes has been challenging. We see this as an architectural issue where our data boundaries were never established and we had many different services reading from the same tables and writing to them. We also used the same database technology for all of our use cases which has proven to be inefficient.
  8. We also built homegrown tools such as our own RPC protocol, PySOA. We are no longer  investing our time in areas that are not business critical and where we can’t build competitive differentiation. For everything else where we need a commoditized solution; evaluate buying instead of building whenever possible. This allows us to focus on providing customer value.

As we can see, we’re trying to move ownership from a centralized SRE team and monolithic architecture to empower teams to build and own their systems. But moving from a situation where the technology set for building features is very limited to another one which is much more open has its risks as well, and we didn’t want to end up with a technology spectrum so wide that it would be difficult to maintain. This is why we wrote our Golden Path, a living document that details the technologies that teams are allowed to use in production for their services, and covers areas such as RPC protocols, storage layers or programming languages. We say it’s a living document because teams are still encouraged to evaluate other technologies when designing their systems, and, if proven they’re the right choice, we’ll update our Golden Path to reflect these. We’ll write another post with more details about this Golden Path.

From an architecture perspective we also depicted a high level view of how we’d design our end system, starting from the client-facing applications and APIs:


And then describing the set of components that we would have in our internal network:

Our 3-year technical vision was a collaborative effort where the entire engineering team was involved. We reviewed the proposal multiple times with different stakeholders, including all engineers, data scientists, product managers and other roles in the company. We received hundreds of comments that enriched and made the whole proposal better. We hosted several Q&As to ensure that all aspects of the vision were clear and there were no outstanding items to be resolved. We also presented it to our CEO and the board of directors. We needed the entire company to become owners of this vision, and leaders in achieving it. After our 3-year technical vision was finalized, a few subsequent long-term thinking proposals were driven by our engineering organization, such as:

  • Operational Model. We describe the infrastructure and networking that we’ll have to support our shift from centrally-owned infrastructure to a distributed mindset where each team owns the end-to-end lifecycle of their services.
  • Data. We describe our future internal and external reporting capabilities, and how these will work with a service oriented architecture where each service has its own storage layer, and not limited to a centralized MySQL DB. It also covers how to have a centralized data lake our data scientists can rely on to build their ML models.
  • Frontend. We propose how to unify our frontend stack and extract our server side rendering from our monolith to Backend-for-Frontends for each application.
  • Mobile. We are rethinking our integration with our core services and how to share logic between the different applications that we have today.

Apart from this, the roadmaps from all of our teams have been adapted to align with our vision and now include areas of focus such as moving away from the monolith into their own service, having their own storage layer, or moving from in-house technology to industry standards. This is also reflected in all recent technical proposals that have been written, all of which start by clarifying that what’s outlined in the proposal is in alignment with our 3-year technical vision and the Golden Path.

But writing a document and sharing that proposal was just the seed of the vision. We’re now making tangible progress to get there, such as:

  • We have deprecated our in-house RPC protocol PySOA in favor of gRPC (tenet: we will choose conforming over creating/reforming). We did an initial evaluation where we compared PySOA with gRPC, and did a proof-of-concept to understand which would better suit our use cases. We decided to move to gRPC because it allows us to focus on our business needs instead of maintaining our own RPC protocol, and gRPC is superior since it supports HTTP2 (while PySOA relies on Redis), has a smaller payload size since it relies on protocol buffers and binary serialization, supports multiple programming languages instead of just Python, and has TLS/SSL support, among other advantages. We have also started writing new services using this new protocol.
  • We are enabling self-service AWS account provisioning and defining our networking and security layers so that teams can own their service’s infrastructure (tenet: teams will have end-to-end ownership of their systems and services).
  • We are migrating our unmanaged MySQL database to AWS Aurora (tenet: we favor cloud managed services or serverless for commoditized systems and components).
  • We have worked on several long-term designs for some of our key components such as Ordering and Event Management instead of focusing on shorter-term and incremental improvements (tenet: we will favor long-term maintainability and scale over short-term deliveries for strategic solutions). We have also started their implementation and expect our initial deliveries later this year.
  • We are writing new designs that break the previous limitation/guidance to only use Python/Django and MySQL and consider databases such as DynamoDB or QLDB, Kotlin or Go, and SNS or Kinesis, as a few examples (tenet: we will standardize on a few stacks but also empower teams to choose the right tool for the job).
  • We have recently launched an Operational Readiness Review process to analyze the reliability of our current codebase, as well as new designs, are overhauling our Security Review process, moving to full CI/CD, Dockerizing our monolith, raising the bar in our testing and quality processes, and several other initiatives that we have in place (tenet: we will strive for continuous improvement and will ask why not instead of why?).

These are just a few examples that show how having a clearly outlined long-term technical direction can have a significant impact on an organization’s architecture and processes. We will detail many more of these examples for actual impact in upcoming posts.

We are excited about this new, long-term thinking technical vision that will provide the right guidance to our teams, indicate how the different pieces in our system should fit together, and help our every-day decision-making process. And what’s even more exciting is that the whole company participated in its definition and have embraced it with energy and passion.