Eventbrite is a global company. We have employees in many countries around the globe, and we believe in the future of remote work. This is no different in Engineering. With offices in San Francisco, Nashville, Mendoza and Madrid, and many of our team members working remotely, our global team has a rich and diverse culture. We have grown organically and also by acquiring several companies such as Ticketfly, ToneDen, Ticketea or Eventioz.
Our tech stack grew rapidly as we scaled and we ended up becoming a Python/Django monolith, called Core, with a centralized MySQL database. Starting with a code base that over time becomes monolithic is a common occurrence in startups that scale since they need to deliver fast and provide business value. However, in the long run this approach impacts the speed at which the company can operate and innovate due to the number of hard dependencies between the code, which is now fractionally owned by many different teams.
When I came onboard, approximately nine months ago, there were multiple initiatives underway to reduce our technical debt. However, we did not have a unifying 3-year technical vision that would act as a guiding principle, our north star, to keep us on the right path and enable us to deliver against our business strategy.
Having that vision for the future is paramount to the success of our team, our company, and the event creators and attendees we build for.
This is the first of many posts from our team on how we built our 3-year technical vision and are executing against it to increase our code quality and development velocity while reducing infrastructure costs.
Our intent in sharing this is twofold. For anyone considering a similar exercise, hopefully some or all of this resonates and will help you on your journey. We also acknowledge there are better ways to accomplish this and hope to learn these methods from you through your comments and feedback below.
The first step in creating a technical vision is to have a shared understanding of the problems with your current architecture. The following were our problems.
- Our monolith became a bottleneck to our developer velocity and overall site reliability and scalability. A monolithic architecture is a software pattern where all the codebase and infrastructure are tightly coupled, live in the same artifact, and have the same development and deployment lifecycle. This contrasts with distributed architectures where each component or service has its own set of artifacts and lifecycle. A monolithic architecture leads to teams having unnecessary dependencies and being forced to move at the speed of the monolith. They are often blocked by other changes or the release schedule of the monolith.
- Our initial partial attempt to move to a SOA architecture compounded the problem. We retained a single data store for most services and writes continued to happen from the monolith. We also had issues where multiple data stores were being updated based on a single action and had multiple code paths that CRUD data, leading to consistency issues. In effect we created a complex distributed monolith because of the depth of dependencies, circular calls, data coupling and Eventbrite’s specific architecture. This also increases our blast radius exposure, meaning that a failure in a given part of the system could potentially affect many others, increasing the severity of the issue and the customer impact.
- Our performance became suboptimal leading to a poor utilization of our hardware resources. There are two main reasons for this:
- Relying on a relational MySQL database allows us to scale vertically but not horizontally, impacting the overall performance and scalability of our architecture. It is also not the ideal solution for many scenarios like reporting, data science model development, etc. We are also affected by inconsistent data model and query design, which requires a lot of human effort to overcome..
- Our Python code is inherently single-threaded, therefore unable to use most of the capacity that our hosts have, leading to overprovisioning in some cases and requiring mitigation features such as waiting rooms to handle very spiky traffic patterns when we have large events on sale. We do not autoscale the core monolith given all its complexities. There’s a lack of clear code, service or data ownership, and we have some orphaned services. Side effects between service interactions is also a common problem.
- Our SDLC process was adhoc and lacked sufficient controls in a few places. Software Engineers (SWEs) make code contributions to different repositories without a consistent review or approval process.
- Given all the intricate moving parts to release the monolith, we trust our Site Reliability Engineers (SREs) to be the only ones who can coordinate all that infrastructure. That has led to SREs being the only team with production access, which inadvertently leads to them being perceived as a bottleneck even as they try to do what we asked them to do. In addition, our architecture tends to be limited to the tools that SREs use, instead of being able to choose the best technology.
- We lack automation in how we test, deploy, monitor and roll back our code, placing an undue burden on some of our engineering teams who need to spend their time on these mundane tasks instead of delivering value.
- The core “eb” database is not only monolithic but also mutable, and capturing historical changes has been challenging. We’ve introduced new ledger style datastores to capture history that have led to consistency challenges.
- We also built homegrown tools such as our own RPC protocol, PySOA, and an open source library because of some unique needs in our business and also because there were no off the shelf tools that did the job well a decade ago. As the industry has evolved we now have off the shelf solutions that offer similar or better functionality. Maintaining undifferentiated home grown systems, which consume time from our engineering team and make it difficult to integrate with other industry standards is no longer prudent. We need to start looking at buy vs build options for all of our non-differentiated technical needs.
The next step was to describe our end goals. After we executed on our technical vision what would we want to accomplish?
- Deliver reliable, high quality, cost effective software solutions to our creators and consumers that allows the business to grow revenue 5x by 2023.
- Enable autonomous dev teams that own their code and architecture. Provide these teams the platform, tooling, and access required to own end-to-end production support for their services.
- Improve dev team accountability to deliver against high level OKRs while giving them autonomy to decide on the path to get there.
- Drive automation and reduce toil. All feature dev teams should be able to apply 60% of their capacity to deliver new business value by 2023. This balance is an estimate based on best performing mature product teams that we have seen in our past experience.
- Establish an operational excellence bar. Deliver 99.99% uptime across all customer facing services.
Defining aspirational goals was great but then we needed a set of tenets that would guide our decisions going forward. Here was the set of tenets we came up with.
- We will choose conforming over creating/reforming. We will research industry standards and will favor using them instead of building our own standard(s). By default we choose to not build wrappers on top of such standards. We will only build our own custom standard when it gives us a competitive advantage.
- Teams will have end-to-end ownership of their systems and services. This includes the business proposal, system design, implementation, testing, documentation, deploying to production, monitoring and maintenance and responding to on-call for any production issues. We believe this ownership leads to more productive teams and higher quality code. We will minimize the delegation of any of these activities to other teams or roles, although we expect some to have a central responsibility such as SREs being responsible for the overall health of our site and apps.
- We favor cloud managed services or serverless for commoditized systems and components. By default we choose not to maintain and scale our own hosts, databases or other infrastructure, and rely on cloud computing. We will focus on creating business value over non-value added commoditized tasks.
- We will favor long-term maintainability and scale over short-term deliveries for strategic solutions. We will have a long-term vision for strategic solutions that are a core component of our 3-year strategy. We value having a maintainable and scalable solution, and accept having short-term and iterative steps but always with the long-term in mind.
- We will standardize on a few stacks but also empower teams to choose the right tool for the job. We will choose a preferred logging, metrics, alerting, front end development and issue management stack. However teams are encouraged to pick appropriate programming languages, data stores, frameworks for their use cases after a thorough analysis. We will not allow an unreasonable growth in the number of technologies that we use, though.
- We will strive for continuous improvement and will ask why not instead of why? We will be bold in our choices and will constantly seek new and better ways of solving problems. We will be rigorous in our analysis and will take risks in pursuit of excellence.
This is where our 3-Year technical vision started to take shape. Daniel Micol, our tech fellow, will shed more light on that process in our next blog post.