Design System Wednesday: A Supportive Professional Community

Design systems provide a lot of value by offering an effective shared solution for both design and engineering. Yet they take considerable time and work to set up and maintain. Often, only a few people are tasked with this mammoth undertaking, and knowing where to begin is hard.

Design System Wednesday is a monthly community event where we welcome anyone working on, or wanting to learn about, design systems. These events provide a much-needed place to show off your system or tooling, or to pose a burning question to the group. You get a room of incredible product designers, front-end engineers, and product managers, whose insightful answers and battle stories directly apply to the work you’re doing.

Keep reading to learn about Design System Wednesdays. Our design system community meetings promote learning, cross-discipline partnership, and systems thinking.

Get input from other design system experts

As a design systems developer/designer, surrounding yourself with others facing the same challenge is incredibly beneficial. Most likely, you are one of a handful of designers and engineers dedicated to this vast undertaking. How daunting! Where do you begin? Have you found the most effective solution? How do you manage the balance between being too design or engineering centric? Design System Wednesday provides a space to bounce ideas off of others, ask for advice, or even crack some hilarious systems jokes!

We once had the pleasure of meeting a new design system lead whose company had charged her with starting their design system. She asked her design system questions and got advice from people at over 10 companies: how to get buy-in, which tech stacks to use, and which design tools to choose. What better way to learn than from peers working on similar things? I remember everyone’s willingness to answer her questions and help steer her in the right direction.

Grow and collaborate

I attended my very first Design System Wednesday on my second day at a new job. It was exciting meeting everyone and, at the same time, a little intimidating. Still, I remember people’s welcoming and open spirit. I now look forward to attending these every month. A different group of people joins us, and a different company graciously hosts us, every session. The open dialog, hospitality, and open-day structure foster a space for growth and collaboration.

Become part of a community

As a front-end engineer, I seem to always be around other engineers. How refreshing to meet people from other roles and responsibilities! A diverse group of people from companies of all sizes and disciplines comprises the Design System Wednesday community. You can usually find product designers, front-end engineers, and product managers all sitting around the same table. I get to hear how they approach problems and how they solve them.

I even get to form new friendships over silly easter eggs in their products that I didn’t know about. One Design System Wednesday, some Atlassian designers showed me Jira Board Jr., a Jira board for kids so they don’t miss out on the joy of building a Jira board (their April Fools’ joke!). I find it very refreshing to step out of my bubble and build connections with peers outside my company and discipline.

Design System Wednesday at Zendesk, Aug 2018

Design System Wednesdays is a community event for the community, by the community. I love being part of this community and helping plan these events, the same way I love helping other design system-ers come together, collaborate, and inspire each other.

We enjoy community events here at Eventbrite, what about you? What are some ways you help your community come together and inspire each other? Drop us a comment below or ping me on Twitter @mbeguiluz.

Featured Image: Design System Wednesday at Zendesk – August 8, 2018

BriteBytes: Nam-Chi Van

An Eventbrite original series, BriteBytes features interviews with Eventbrite’s growing global engineering team, shining a light on the individuals whose jobs are to build the technology that powers live experiences.

Nam-Chi Van is a Senior Software Engineer who works out of Eventbrite’s San Francisco office. She has been a part of the Eventbrite team for six years, writing code while taking photos of skateboarders on the side. In this interview, she tells us about a critical point in her career and why she loves working at Eventbrite.

Delaine Wendling: What brought you to Eventbrite and engineering?

Nam-Chi Van: Well, I went to art school for web design and interactive media. My first job after school was with a web agency based out of San Diego. While I was there, a recruiter from Eventbrite reached out to me. They flew me to San Francisco to visit the office, and I was blown away. The agency in San Diego was business casual, and Eventbrite was a more casual environment where I felt like I could be myself. My first role at Eventbrite was as a content developer. I worked on building WordPress themes and hacking together landing pages like this one. Technically, I was on the engineering team, but I wasn’t doing anything super heavy.

At the time, I was also doing a lot of photography on the side. I would take time off and go to events like the X Games to photograph skateboarders. It didn’t take long for me to get burned out on this schedule so I sat down with my manager to figure out my life. I told him I didn’t love what I was doing at Eventbrite and was thinking about maybe pursuing photography full time. If I was going to stay at Eventbrite, I wanted to move into a more traditional engineering role so that I would be challenged. He helped me talk through my options and made an offer for me to move into a more challenging engineering role. I decided to take it and have been really happy with my decision. I still do photography on the side but not as intensely.

Samarria Brevard, Street League 2017

D: What has kept you at Eventbrite?

N: I love that I’m surrounded by a lot of smart and supportive people. When I first switched into a more traditional engineering role, I had a lot of impostor syndrome. My teammates were amazing though and never made me feel like I couldn’t do the work. They encouraged me and helped me learn the things I needed to learn. I love being in a supportive environment like that. Eventbrite offers a lot of opportunity for growth and working with new technologies, so I don’t feel like I’ll ever get bored. I also love working at a place that encourages me to be my authentic self.

D: What project has been your favorite at Eventbrite? What made it so great?

N: Eventbrite used to have some ugly landing pages, like the career listings page, about page, etc. During a hackathon at Eventbrite, a coworker and I decided to redesign all of those pages. I took a lot of photos to make these pages more welcoming and reflective of the Eventbrite culture. Many of these pages are still being used today, like the about page.

I enjoyed this project so much because it made a real impact on the company and was something I came up with on my own. I was also able to use my photography skills for the project, which was fun.

D: What is the most complex problem you have had to solve recently?

N: I guess I haven’t actually solved this problem, but tech debt is probably the most complex problem I’ve had to face. It’s something that’s always on my mind and can feel overwhelming. We are constantly trying to find a balance between writing code that is reusable and extensible and meeting deadlines. It’s a difficult balance to find and something I will continue to try to improve.

D: Do you have a role model? Who is it and why are they your role model?

N: Yeah, my mom. She is an amazing woman who taught me the importance of being myself and being independent. I wanted to skateboard when I was younger, but there weren’t a lot of girls doing that. My mom didn’t care and encouraged me to do it anyway. She said I could do anything I put my mind to.

D: What advice would you give a new female engineer starting out?

N: Have confidence in yourself, don’t be afraid to fail, learn constantly, challenge yourself, and keep going. You’ve got this.

D: And, just because it’s fun: if you were a wrestler, what would be your theme song?

N: (laughs) Hmmm…probably some heavy metal Megadeth song. It would need to have something with a super heavy guitar riff.

Congratulations to Nam-Chi for her recent promotion to Senior Software Engineer! We are thankful to have her on the team. How has your experience been in the engineering world? Who inspires you? Share your comments with us, we would love to get to know you.

Why Would Webpack Stop Re-compiling? (The Quest for Micro-Apps)

Eventbrite is on a quest to convert our “monolith” React application, with 30+ entry points, into individual “micro-apps” that can be developed and deployed individually. We’re documenting this process in a series of posts entitled The Quest for Micro-Apps. You can read the full Introduction to our journey as well as Part 1 – Single App Mode outlining our first steps in improving our development environment.

Here in Part 2, we’ll take a quick detour to a project that occupied our time after Single App Mode (SAM), but before we continued towards separating our apps. We were experiencing an issue where Webpack would mysteriously stop re-compiling and provide no error messaging. We narrowed it down to a memory leak in our Docker environment and discovered a bug in the implementation for cache invalidation within our React server-side rendering system. Interest piqued? Read on for the details on how we discovered and plugged the memory leak!

A little background on our frontend infrastructure

Before embarking on our quest for “micro-apps,” we first had to migrate our React apps to Webpack. Our React applications originally ran on requirejs because that’s what our Backbone / Marionette code used (and still does to this day). To limit the scope of the initial switch to React from Backbone, we ran React on the existing infrastructure. However, we quickly hit the limits of what requirejs could do with modern libraries and decided to migrate all of our React apps over to Webpack. That migration deserves a whole post in itself.

During our months-long migration in 2017 (feature development never stopped by the way), the Frontend Platform team started hearing sporadic reports about Webpack “stopping.” With no obvious reproduction steps, Webpack would stop re-compiling code changes. In the beginning, we were too focused on the Webpack migration to investigate the problem deeply. However, we did find that turning off editor autosave seemed to decrease the occurrences dramatically. Problem successfully punted.

The migration to Webpack also allowed us to change our React server-side rendering solution (we call it react-render-server, or RRS) in our development environment. With requirejs, react-render-server used Babel to transpile modules on demand with babel-register:

if (argv.transpile) {
  // When the `transpile` flag is turned on, all future modules
  // imported (using `require`) will get transpiled. This is
  // particularly important for the React components written in JSX.
  require('babel-register')({
    stage: 0,
  });

  reactLogger('Using Babel transpilation');
}

This code is how we were able to import React files to render components. It was a bit slow but effective. However, because Node caches all of its imports, we needed to invalidate the cache each time we made changes to the React app source code. We accomplished this by using supervisor to restart the server every time a source file changed:

#!/usr/bin/env bash

./node_modules/.bin/supervisor \
  --watch /path/to/components \
  --extensions js \
  --poll-interval 5000 \
  -- ./node_modules/react-render-server/server.js \
    --port 8991 \
    --address \
    --verbose \
    --transpile \
    --gettext-locale-path /srv/translations/core \
    --gettext-catalog-domain djangojs

This addition, unfortunately, resulted in a poor developer experience because it took several seconds for the server to restart. During that time, our Django backend was unable to reach RRS, and the development site would be unavailable.

With the switch, Webpack was already creating fully-transpiled bundles for the browser to consume, so we had it create Node bundles as well. Then react-render-server no longer needed to transpile on demand.

Around the same time, the helper react-render library we were using for server-rendering also provided a new --no-cache option which solved our source code caching problem. We no longer needed to restart RRS! It seemed like all of our problems were solved, but little did we know that it created one massive problem for us.

The Webpack stopping problem

In between the Webpack migration and the Single Application Mode (SAM) projects, more and more Britelings were having Webpack issues; their Webpack re-compiling would stop. We crossed our fingers and hoped that SAM would fix it. Our theory was that before SAM we were running 30+ entry points in Webpack. Therefore if we reduced that down to only one or two, we would reduce the “load” on Webpack dramatically.

Unfortunately, we were not able to kill two birds with one stone. SAM did accomplish its goals, including reducing memory usage, but it didn’t alleviate the Webpack stoppages. Instead of continuing to the next phase of our quest, we decided to take a detour to investigate and fix this Webpack stoppage issue once and for all. Any benefits we added in the next project would be lost due to the Webpack stoppages. Eventbrite developers are our users so we shouldn’t build new features before fixing major bugs.

The Webpack stoppage investigations

We had no idea what was causing the issue, so we tried many different approaches to discover the root problem. We were still running on Webpack 3 (v3.10.0 specifically), so why not see if Webpack 4 had some magic code to fix our problem? Unfortunately, Webpack 4 crashed and wouldn’t even compile. We chose not to investigate further in that direction because we were already dealing with one big problem. Our team will return to Webpack 4 later.

Sanity check

First, our DevTools team joined in on the investigations because they are responsible for maintaining our Docker infrastructure. We observed that when Webpack stopped re-compiling, we could still see the source file changes reflected within the Docker container. So we knew it wasn’t a Docker issue.

Reliably reproducing the problem

Next, we knew we needed a way to reproduce the Webpack stoppage quickly and more reliably. Because we had observed that editor autosave was a way to cause the stoppage, we created a “rapid file saver” script. It updated a dummy file, changing its contents at random intervals between 200 and 300 milliseconds. This script would update the file before Webpack finished re-compiling, just like editor autosave, and it enabled us to reproduce the issue within five minutes. Running this script essentially became a stress test for Webpack and the rest of our system. We didn’t have a fix, but at least we could verify one when we found it!

var fs = require('fs');
var path = require('path');

const TEMP_FILE_PATH = path.resolve(__dirname, '../../src/playground/tempFile.js');

// Recommendation: Do not set lower than 200ms.
// File changes that happen more quickly will not allow webpack to finish compiling.
const REWRITE_TIMEOUT_MIN = 200;
const REWRITE_TIMEOUT_MAX = 300;

const getRandomInRange = (min, max) => (Math.random() * (max - min) + min);
const getTimeout = () => getRandomInRange(REWRITE_TIMEOUT_MIN, REWRITE_TIMEOUT_MAX);

const FILE_VALUES = [
    {name: 'add', content: 'export default (a, b) => (a + b);'},
    {name: 'subtract', content: 'export default (a, b) => (a - b);'},
    {name: 'divide', content: 'export default (a, b) => (a / b);'},
    {name: 'multiply', content: 'export default (a, b) => (a * b);'},
];

let currentValue = 1;
const getValue = () => {
    const value = FILE_VALUES[currentValue];

    // Cycle through the file contents so every write is a real change
    if (currentValue === FILE_VALUES.length - 1) {
        currentValue = 0;
    } else {
        currentValue += 1;
    }

    return value;
};

const writeToFile = () => {
    const {name, content} = getValue();

    console.log(`${new Date().toISOString()} -- WRITING (${name}) --`);
    fs.writeFileSync(TEMP_FILE_PATH, content);

    setTimeout(writeToFile, getTimeout());
};

writeToFile();


With the “rapid file saver” at our disposal and a stroke of serendipity, we noticed the Docker container’s memory steadily increasing while the files were rapidly changing. We thought that we had solved the Docker memory issues with the Single Application Mode project. However, this did give us a new theory: Webpack stopped re-compiling when the Docker container ran out of memory.

Webpack code spelunking

The next question we aimed to answer was why Webpack 3 wasn’t throwing any errors when it stopped re-compiling. It was failing silently, leaving the developer to wonder why their app wasn’t updating. We began “code spelunking” into Webpack 3 to investigate further.

We found out that Webpack 3 uses chokidar through a helper library called watchpack (v1.4.0) to watch files. We added console.log debug statements to all of the event handlers within (transpiled) node_modules and noticed that when chokidar stopped firing its change event handler, Webpack also stopped re-compiling. But why weren’t there any errors? It turns out that the underlying watcher didn’t pass along chokidar’s error events, so Webpack wasn’t able to log anything when chokidar stopped watching.

The latest version of Webpack 4 still uses watchpack, which still doesn’t pass along chokidar’s error events, so it’s likely that Webpack 4 suffers from the same silent failure. Sounds like an opportunity for a pull request!

For those wanting to nerd out, here is the full rabbit hole:

This whole process was an interesting discovery and a super fun exercise, but it still wasn’t the solution to the problem. What was causing the memory leak in the first place? Was Webpack even to blame or was it just a downstream consequence?


We began looking into our react-render-server and the --no-cache implementation within react-render, the dependency that renders the components server-side. We discovered that react-render uses decache for its --no-cache implementation to clear the require cache on every request for our app bundles (and their node module dependencies). This successfully allowed new bundles with the same path to be required; however, decache was not enabling garbage collection of the references to the raw source text of the bundles.

Whether or not the source code changed, each server-side rendering request resulted in more orphaned app bundle text in memory. With app bundle sizes in the megabytes, and our Docker containers already close to maxing out memory, it was very easy for the React Docker container to run out of memory completely.

We found the memory leak!


We needed a way to clear the cache, and also reliably clear out the memory. We considered trying to make decache more robust, but messing around with the require cache is hairy and unsupported.

So we returned to our original solution of running react-render-server (RRS) with supervisor, but this time being smarter with when we restart the server. We only need to take that step when the developer changes the source files and has already rendered the app. That’s when we need to clear the cache for the next render. We don’t need to keep restarting the server on source file changes if an app hasn’t been rendered because nothing has been cached. That’s what caused the poor developer experience before, as the server was unresponsive because it was always restarting.

Now, in the Docker container for RRS, when in “dynamic mode”, we only restart the server if a source file changes and the developer has a previous version of the app bundle cached (by rendering the component prior). This rule is a bit more sophisticated than what supervisor could handle on its own, so we had to roll our own logic around supervisor. Here’s some code:

// misc setup stuff (path constants and helpers such as safeReadJSONFile,
// shouldServeDynamic, and the various *_PATH values are defined above and omitted here)
const {writeFileSync} = require('fs');
const {extname} = require('path');
const {spawn} = require('child_process');
const chokidar = require('chokidar');

const createRequestInfoFile = () => (
    writeFileSync(
        RRS_REQUEST_INFO_PATH,
        JSON.stringify({start: new Date()}),
    )
);

const touchRestartFile = () => writeFileSync(RESTART_FILE_PATH, new Date());

const needsToRestartRRS = async () => {
    const rrsRequestInfo = await safeReadJSONFile(RRS_REQUEST_INFO_PATH);

    if (!rrsRequestInfo.lastRequest) {
        return false;
    }

    const timeDelta = Date.parse(rrsRequestInfo.lastRequest) - Date.parse(rrsRequestInfo.start);

    return Number.isNaN(timeDelta) || timeDelta > 0;
};

const watchSourceFiles = () => {
    let isReady = false;

    // SOURCE_FILES_PATH points at the React app source (defined in the omitted setup)
    chokidar.watch(SOURCE_FILES_PATH)
        .on('ready', () => (isReady = true))
        .on('all', async () => {
            if (isReady && await needsToRestartRRS()) {
                touchRestartFile();
            }
        });
};

const isDynamicMode = shouldServeDynamic();
const supervisorArgs = [
    '--extensions', extname(RESTART_FILE_PATH).slice(1),

    ...(isDynamicMode ? ['--watch', RESTART_FILE_PATH] : ['--ignore', '.']),
];
const rrsArgs = [
    '--port', '8991',
    '--address', '',
    '--request-info-path', RRS_REQUEST_INFO_PATH,
];

if (isDynamicMode) {
    createRequestInfoFile();
    watchSourceFiles();
}

spawn(
    SUPERVISOR_PATH,
    [...supervisorArgs, '--', RRS_PATH, ...rrsArgs],
    {
        // make the spawned process run as if it's in the main process
        stdio: 'inherit',
        shell: true,
    },
);

In short we:

  1. Create a __request.json file and initialize it with a start timestamp.
  2. Pass the __request.json file to RRS so it can update the lastRequest timestamp every time an app bundle is rendered.
  3. Use chokidar directly to watch the source files.
  4. When the source files change, check whether the lastRequest timestamp is after the start timestamp, and touch the restart file if it is. That means an app bundle has been rendered (and therefore cached) since the server was last restarted.
  5. Set up supervisor to watch only the restart file, so the server restarts only when all of our conditions are met.
  6. Recreate and reinitialize the __request.json file when the server restarts, and start the process again.

All of our server-side rendering happens through our Django backend, which is where we had been seeing timeout errors when react-render-server was unreachable. So, in development only, we also added five retry attempts, separated by 250 milliseconds, for requests that fail because Django can’t connect to react-render-server.
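The retry logic itself is straightforward. Here is a minimal sketch of the idea in Python, using the requests library; the render_component helper, URL, and payload shape are hypothetical, not our actual Django code:

import time

import requests

RRS_URL = 'http://localhost:8991/render'  # hypothetical react-render-server address
MAX_ATTEMPTS = 5
RETRY_DELAY_SECONDS = 0.25


def render_component(payload):
    """Ask react-render-server to render a component, retrying if it is unreachable."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = requests.post(RRS_URL, json=payload, timeout=2)
            response.raise_for_status()
            return response.json()
        except requests.ConnectionError:
            # RRS is most likely restarting under supervisor; wait briefly and retry
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(RETRY_DELAY_SECONDS)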

The results are in

Because we already had the “rapid file saver,” we could use it to verify all of the fixes. We ran the “rapid file saver” for hours, and Webpack kept humming along without a hiccup. We analyzed Docker’s memory over time as we reloaded pages and re-rendered apps and saw that the memory remained constant, as expected. The memory issues were gone!

Even though we were once again restarting the server on file changes, the react-render-server connection issues were gone. There were some corner cases where the site would automatically refresh and not be able to connect, but those situations were few and far between.

Coming up next

Now that we’ve finished our detour to fix a major bug, we’ll return to the next milestone on the way to apps that can be developed and deployed independently.

The next step in our goal towards “micro-apps” is to give each application autonomy and control with its own package.json and dependencies. The benefit is that upgrading a dependency with a breaking change doesn’t require fixing all 30+ apps at once; now each app can move at its own pace.

We need to solve two main technical challenges with this new setup:

  • how to prevent each app from having to manage its infrastructure, and
  • what to do with the massive, unversioned, shared common/ folder that all apps use

We’re actively working on this project right now, so we’ll share how it turns out when we’re finished. In the meantime, we’d love to hear if you’ve had any similar challenges and how you tackled the problem. Let us know in the comments or ping me directly on Twitter at @benmvp.

Photo by Sasikan Ulevik on Unsplash

How To Move From Customer Support to Engineering in 5 Steps

When I explain that I made a career move from customer support to full-time software engineer at Eventbrite, I’m often met with dubious looks: “Wait, what? How is that even possible? How did you do that?” They are even more surprised to learn that I didn’t go back to school or even take a coding boot camp.

With the right strategy, you don’t need a technical degree to become a software engineer. Read on to learn about several steps you can take to move from a customer facing role at your company into engineering.

A pipeline to Engineering within Eventbrite

I’m not the first to do a career move from customer support into software engineering. At Eventbrite alone, eight people have moved to technical roles in engineering from our customer support team. This pipeline has also been beneficial for our dev teams in many ways. For instance, we’ve seen an increase in customer empathy when a former customer support representative joins, which usually helps to boost quality in our product development. In fact, roles in quality assurance (QA) are an especially good fit for those coming from customer support. This step in the pipeline can be a good choice for those looking to later move to full-time software engineering roles. (For more info on Eventbrite’s QA philosophy, check out Andrew’s post on rethinking quality).

As a high performer in customer support, you too can move from a customer facing role at your company into software engineering. However, you won’t get there by continuing to do only your assigned role. You need to take some actions to put yourself into a position to succeed.

A step by step approach

Imagine this conversation: an engineering manager is chatting with her team about a new role she’s opening up for a QA engineer to join. What if at the moment she announced this, her team immediately piped up with “We should hire {insert name} for that role, {he/she} would be fantastic at that!”? How do you guarantee that your name is the one brought up?

For me, the five steps outlined below were crucial to making sure I’d be recognized when a hiring opportunity came up for a QA engineer position. I was later able to make another transition to a full-time software engineering position because I continued these practices of putting myself into a position to succeed.

Step 1: Be a top performer in your day job

Before everything else, dedicate yourself to excellence in your core role. You want to be recognized as a highly qualified individual. Maintain a high customer satisfaction rating while still answering a high number of customer queries. Your company will likely be more willing to provide you with new opportunities in engineering if you are a top performer in your current role. Top performers are smaller risks for lateral moves, and no company wants to lose high-potential talent to another company.

Step 2: Build relationships in engineering

You’ll need to get friendly with engineering so that your name is top of mind when new opportunities are available. Grab a 1:1 lunch with individual developers and ask them about their path to software engineering. I talked to a mix of engineers: QA engineers, senior software engineers, and junior software engineers who had gone through coding boot camps.

Gain some name recognition by leading a hackathon team and presenting your team’s work to the company. You don’t need engineering experience to do this. In fact, despite having no technical expertise at all, I led a project with a cross-functional team of support members, engineers, and marketers. It was a small project, but it allowed me to work with engineers and to show the company my interest in engineering projects. Plus, there were plenty of engineering leaders watching the project demos who recognized my name afterward.

Step 3: Leverage your product expertise

Your product and customer expertise are invaluable to product and engineering. Leverage this knowledge by sharing it with your engineering teams and advocating for your customers. Reach out to engineers to ask for help when a customer encounters a bug. Alternatively, tell a product manager about your ideas for small product improvements that would enhance the customer experience.

The first time I did this was intimidating, but I was surprised to find that the engineers on the other side were more than happy to help. By doing this, you’ll establish yourself as a trusted customer expert. Engineers and product managers will begin to turn to you when they have questions or ideas for how to build the product, and later they will want to have your expertise on their teams full-time.

Step 4: Invest in your technical learning

Prepare yourself for a transition to engineering by learning the basics of whatever programming language your company uses. There are abundant resources for you to learn new technical skills. I started with Python, JavaScript, and SQL by taking free Codecademy classes online. If your company already has a good learning culture (check out Randall’s post on supporting junior engineers), ask to attend peer-led training or to participate in a mentorship program to supplement your learning. Show everyone around you that you invest in your learning by spending time outside of work developing these new skills. Even a consistent 30 minutes per day can be very effective. If you demonstrate a growth mindset by dedicating time to improving yourself, you will also build trust with engineering leaders, who will be more willing to look past your lack of formal technical education.

Step 5: Advocate for yourself

Carefully look for situations that might help you, today or later on. Even bite-sized opportunities can be beneficial in the long run, but you must advocate for yourself to take them on and reap the rewards.

While I was still in customer support, I looked for an opportunity to get involved with our Support Triage team. That team’s responsibility is to investigate incoming bug reports and send them to engineering. It wasn’t an official position, but I saw that they were overwhelmed with their workload and I volunteered my help. I was able to contribute to the team by investigating bugs, but I also got to learn about our bug process, try out new tools, and talk to engineers. Through this work, I built a reputation for submitting well-investigated and detailed bug tickets. That helped me stand out when a QA position was later opened up.

Another example of advocating for myself happened after I was in QA for a few months. I asked for help from my manager to learn how to fix small bugs I reported, resulting in dedicated pair programming time. After that, I asked for small feature projects that I could pair on with my team’s developers to continue building programming skills. Some time later, I asked my leaders in engineering to move me to a full-time software engineering position. They helped me make the transition with little hesitation. Even though I have no formal education in computer science, I had proven to them that I was invested in my learning and capable of being a software engineer.

Final thoughts

My collegiate track and field coach’s favorite piece of advice to me was to “Put yourself in a position to succeed.”

In the running world, this meant pushing hard during practice sessions to get a little bit faster, stronger, and better every day. This way, you allow yourself to be successful on race day, when it matters the most, because you have already put in miles of effort and hours of mental practice to support a personal best at the finish line. This same strategy applies to lateral career moves as well. Take the time now to prepare and put yourself in a position to succeed so that you are ready for new opportunities when they arise.

I hope that the steps above will help you get closer to achieving your career dreams of moving to software engineering from a customer support position. Of course, beyond these five steps, there are many other details to discuss such as communication strategies, technical learning tips, and how to create a support system.

If you have any questions or want to chat more about how I made this career move, feel free to reach out. Leave a comment below, or reach me at @snazbala on Twitter or through my website.

P.S.: Engineering leaders, keep an eye out for a follow-up post. I’ll cover why you should hire QA engineers from customer support and how you can create a supportive culture for these lateral career moves.

Boosting Big Data workloads with Presto Auto Scaling

The Data Engineering team at Eventbrite recently completed several significant improvements to our data ecosystem. In particular, we focused on upgrading our data warehouse infrastructure and improving the tools used to drive Eventbrite’s data-driven analytics.

Here are a few highlights:

  • Transitioned to a new Hadoop cluster. The result is a more reliable, secure, and performant data warehouse environment.
  • Upgraded to the latest version of Tableau and migrated our Tableau servers to the same  AWS infrastructure as Presto. We also configured Tableau to connect via its own dedicated Presto cluster. The data transfer rates, especially for Tableau extracts, are 10x faster!
  • Upgraded Presto and fine-tuned resource allocation (via AWS Auto Scaling) to make the environment optimal for Eventbrite’s analysts. Presto is now faster and more stable, and our daily Tableau dashboards, as well as our ad-hoc SQL queries, run 2 to 4 times faster.

This post focuses on how Eventbrite leverages AWS Auto Scaling for Presto using Auto Scaling Groups, Scaling Policies, and Launch Configurations. This update has allowed us to meet the data exploration needs of our Engineers, Analysts, and Data Scientists by providing better throughput at a fraction of the cost.

High level overview

Let’s start with a high-level view of our data warehouse environment running on AWS.

Auto Scale Overview

Analytics tools: Presto, Superset and Tableau

We’re using Presto to access data in our data warehouse. Presto is a tool designed to query vast amounts of data using distributed queries. It supports the ANSI SQL standard, including complex queries, aggregations, and joins. The Presto team designed it as an alternative to tools that query HDFS using pipelines of MapReduce jobs. It connects to a Hive Metastore allowing users to share the same data with Hive, Spark, and other Hadoop ecosystem tools.

We’re also using Apache Superset packaged alongside Presto. Superset is a data exploration web application that enables users to process data in a variety of ways including writing SQL queries, creating new tables and downloading data in CSV format. Among other tools, we rely heavily on Superset’s SQL Lab IDE to explore and preview tables in Presto, compose SQL queries, and save output files as CSV.

We’re exploring the use of Superset for dashboard prototyping although currently the majority of our data visualization requirements are being met by Tableau. We use Tableau to represent Eventbrite’s data in dashboards that are easily digestible by the business.

The advantage of Superset is that it’s open-source and cost-effective, although we have performance concerns due to lack of caching and it’s missing some features (triggers on charts, tool-tips, support for non-SQL functions, scheduling) that we would like to see. We plan to continue to leverage Tableau as our data visualization tool, but we also plan to adopt more Superset usage in the future.

Both Tableau and Superset connect to Presto,  which retrieves data from Hive tables located on S3 and HDFS commonly stored as Parquet.

Auto scaling overview

Amazon EC2 Auto Scaling enables us to follow the demand curve for our applications, and thus reduces the need to manually provision Amazon EC2 capacity in advance. For example, we can use target tracking scaling policies to scale on a load metric for our application, such as CPU utilization, or on custom Presto metrics.

It’s critical to understand the terminology for AWS Auto Scaling. Components such as “Launch Configuration,” “Auto Scaling Group,” and “Auto Scaling Policy” are vital, and the diagram below shows the relationship between them. As an old-school data modeler, I tend to think in terms of entities and relationships via the traditional ERD model 😀

Auto Scaling ERD

Presto auto scaling

We’re using AWS Auto Scaling for our Presto “spot” instances based on (I) CPU usage and (II) number of queries (only used for scaledown). Here is an overview of our EC2 auto-scaling setup for Presto.

Auto Scaling with Presto

Here are some sample policies:

Policy type:  Simple scaling (I)

Execute policy when: CPU Utilization >= 50 for 60 seconds for the metric dimensions.

Take the action:  Add 10 instances (provided by EC2).

Policy type: Simple scaling (II)

Execute policy when: running Queries <= 0 for 2 consecutive periods of 300 seconds for the metric dimensions.

Take the action: Set to 0 instances.
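As a rough illustration, the first policy and its CloudWatch alarm could be expressed with boto3 along these lines; the group and policy names are made up, and the values simply mirror the description above:

import boto3

autoscaling = boto3.client('autoscaling')
cloudwatch = boto3.client('cloudwatch')

# Simple scaling policy (I): add 10 instances when triggered
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName='presto-spot-workers',  # hypothetical group name
    PolicyName='presto-scale-up-on-cpu',
    PolicyType='SimpleScaling',
    AdjustmentType='ChangeInCapacity',
    ScalingAdjustment=10,
    Cooldown=300,
)

# Alarm: CPU Utilization >= 50 for 60 seconds triggers the policy above
cloudwatch.put_metric_alarm(
    AlarmName='presto-cpu-high',
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'presto-spot-workers'}],
    Statistic='Average',
    Period=60,
    EvaluationPeriods=1,
    Threshold=50.0,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=[policy['PolicyARN']],
)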

Note: Eventbrite’s Data Engineering team developed a custom Python script to handshake with CloudWatch during scaledown. It handles the race condition where another query comes in while scaledown is in progress. We’ve added “termination protection,” which leverages this Python script (running as a daemon) on each Presto worker node: if the script detects a query currently running on the node, the node won’t scale down.
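We won’t reproduce the full script here, but a simplified sketch of the idea, a daemon that enables scale-in protection while queries are running, could look like the following. The coordinator URL, group name, and polling details are assumptions for illustration, not Eventbrite’s actual implementation:

import time

import boto3
import requests

COORDINATOR_URL = 'http://presto-coordinator:8080'  # assumption: coordinator address
ASG_NAME = 'presto-spot-workers'                    # assumption: Auto Scaling group name
POLL_SECONDS = 60

autoscaling = boto3.client('autoscaling')

# The EC2 instance metadata service tells this worker its own instance id
INSTANCE_ID = requests.get(
    'http://169.254.169.254/latest/meta-data/instance-id', timeout=2).text


def has_running_queries():
    """Return True if the coordinator reports any query in a RUNNING state."""
    queries = requests.get(COORDINATOR_URL + '/v1/query', timeout=5).json()
    return any(query.get('state') == 'RUNNING' for query in queries)


def set_scale_in_protection(protected):
    """Toggle scale-in protection for this instance in the Auto Scaling group."""
    autoscaling.set_instance_protection(
        InstanceIds=[INSTANCE_ID],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=protected,
    )


while True:
    set_scale_in_protection(has_running_queries())
    time.sleep(POLL_SECONDS)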

Tableau scheduled actions

We’re using “Scheduled Scaling” features for our Tableau Presto instances as well as our “base” instances used for Presto. We scale up the instances in the morning and scale down at night. We’ve set up scheduled scaling based on predictable workloads such as Tableau.

“Scheduled Scaling” requires the configuration of scheduled actions, which tell Amazon EC2 Auto Scaling to act at a specific time. For each scheduled action, we’ve specified the start time and the new minimum, maximum, and desired size of the group. Here is a sample setup for scheduled actions:

Auto scale actions
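As a programmatic point of reference, a morning scale-up and evening scale-down pair of scheduled actions can be defined with boto3 like this; the group name, times, and sizes are illustrative only:

import boto3

autoscaling = boto3.client('autoscaling')

# Scale the Tableau Presto workers up on weekday mornings (UTC cron syntax)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='presto-tableau-workers',
    ScheduledActionName='tableau-morning-scale-up',
    Recurrence='0 14 * * 1-5',
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=6,
)

# ...and back down at night
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='presto-tableau-workers',
    ScheduledActionName='tableau-evening-scale-down',
    Recurrence='0 4 * * 2-6',
    MinSize=0,
    MaxSize=2,
    DesiredCapacity=0,
)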


We’ve enabled Auto Scaling Group Metrics to identify capacity changes via CloudWatch alarms. These alarms cause the Auto Scaling groups to execute the relevant policy when a threshold is breached. In some cases we’re using EC2 alerts, and in others we’re pushing custom metrics to CloudWatch through Python scripts.

Sample CloudWatch alarms:

Multiple Presto clusters

We’ve separated Tableau connections from ad-hoc Presto connections by giving each its own Presto cluster. This separation keeps Tableau usage from competing with ad-hoc query usage.


Our Presto workers read data that is written by our persistent EMR clusters.  Our ingestion and ETL jobs run on daily and hourly scheduled EMR clusters with access to Spark, Hive and Sqoop. Using EMR allows us to decouple storage from computation by using a combination of S3 and a custom HDFS cluster. The key is we only pay for computation when we use it!

We have multiple EMR clusters that write the data to Hive tables backed by S3 and  HDFS. We launch EMR clusters to run our ETL processes that load our data warehouse tables daily/hourly. We don’t currently tie our EMR clusters to auto-scaling.

By default, EMR stores Hive Metastore information in a MySQL database on the master node. It is the central repository of Apache Hive metadata and includes information such as schema structure, location, and partitions. When a cluster terminates, we lose the local data because the node file systems use ephemeral storage. We need the Metastore to persist, so we’ve created an external Metastore that exists outside the cluster.

We’re not using the AWS Glue Data Catalog. The Data Engineering team at Eventbrite is happy managing our Hive Metastore on Amazon Aurora. If something breaks, like we’ve had in the past with Presto race conditions writing to the Hive Metastore, then we’re comfortable fixing it ourselves.

The Data Engineering team created a persistent single-node EMR “cluster” used by Presto to access Hive. Presto is configured to read from this cluster to access the Hive Metastore. The Presto workers communicate with it to learn where the data lives, how it is partitioned, and how the tables are structured.

The end

In summary, we’ve focused on upgrading our data warehouse infrastructure and improving the tools used to drive Eventbrite’s data-driven analytics.  AWS Auto Scaling has allowed us to improve efficiency for our analysts while saving on cost.  Benefits include:

Decreased Costs

AWS Auto Scaling allows us to only pay for the resources we need. When demand drops, AWS Auto Scaling removes any excess resource capacity, so we avoid overspending.

Improved Elasticity

AWS Auto Scaling allows us to dynamically increase and decrease capacity as needed. We’ve also eliminated lost productivity due to non-trivial error rates caused by failed queries due to capacity issues.

Improved Monitoring

We use metrics in Amazon CloudWatch to verify that our system is performing as expected. We also send metrics to CloudWatch that can be used to trigger AWS Auto Scaling policies we use to manage capacity.

All comments are welcome, or you can message me directly. Thanks to Eventbrite’s Data Engineering crew (Brandon Hamric, Alex Meyer, Beck Cronin-Dixon, Gray Pickney, and Paul Edwards) for executing on the plan to upgrade Eventbrite’s data ecosystem. Special thanks to Rainu Ittycheriah, Jasper Groot, and Jeremy Bakker for contributing to and reviewing this blog post.

You can learn more about Eventbrite’s data infrastructure by checking out my previous post, Looking under the hood of the Eventbrite data pipeline.

Automated Cross-Browser Testing for WebGL — It’s Not Going to Happen

Apologies to the folks who found this post while searching for “automated WebGL testing,” “how to write cross-browser WebGL tests,” or similar. I’ve been there, and it is not my favorite part of the job. Sadly I do not know a magic recipe for writing cross-browser acceptance tests for web apps that integrate WebGL canvas interactions as part of a larger user flow. This post offers a look into how the Reserved squad at Eventbrite uses Rainforest QA to test complex WebGL flows.

I’m a frontend software engineer on the Reserved squad, which recently (at the time of writing) launched an end-to-end experience for reserving seats within Eventbrite’s embedded checkout flow. While we were developing this feature, we ran into a roadblock: how could we write reliable acceptance tests for our WebGL-dependent flows? Furthermore, how could we reliably test our user flows without sinking hundreds of additional engineering hours into coercing Selenium to click on the precise canvas coordinates necessary to reserve a seat? We decided to try testing some of our user flows with a crowdsourced quality assurance (QA) platform called Rainforest QA, and have been quite happy to ship the results.

WebGL: What it’s good at, and one unfortunate consequence

WebGL is useful for rendering complex 2D and 3D graphics in the client’s web browser. It’s natively supported by all major browsers and, under the hood, interfaces with the OpenGL API to render content in the canvas element. Because it allows code to run on the client’s GPU, there are significant performance benefits when you need to render and listen to actions on hundreds or thousands of elements.

My squad at Eventbrite uses WebGL (with help from Three.js, which you can learn more about in an earlier blog post) to render customizable venue maps that allow organizers to determine seat selling order. Once the organizer publishes the event, we allow attendees to choose the location of their seat on the rendered venue map. Because WebGL draws the venue maps in the canvas element rather than needlessly generating DOM elements for every seat, we can provide a relatively performant experience, even for maps with tens of thousands of seats. The only major drawback is that there is no DOM element to target in our acceptance tests when we want to test what happens when a user clicks on a seat.

The code to render a seat map using Three.js looks roughly like this:

// Initialize scene, camera values based on client browser width
const {scene, camera} = getSceneAndCamera();
const element = document.getElementById('canvas');
const renderer = new THREE.WebGLRenderer();
element.appendChild(renderer.domElement);

// Add objects like seats, stage, etc. to the scene, then render it
renderer.render(scene, camera);

This code renders content in the canvas element:

But when we inspect the generated markup, this is all that we see:

<canvas width="719" height="656"></canvas>

Because the canvas element does not contain targetable DOM elements, simulating a seat click using WebDriver or other test scripting frameworks requires specifying exact coordinates within the canvas element.

How did Rainforest solve our testing problem?

For several months, my squad had been working in a green pasture of unreleased code as we made steady progress on new pick-a-seat features. Throughout the development process, we maintained test coverage with unit tests, integration tests, and page-level JS functional tests using enzyme and fetch-mock. However, our test coverage contained a glaring hole: we had not yet written tests that fully verified our user stories.

Acceptance tests are black-box tests that formally describe a user story and that we run at the system level. An acceptance test script might load a URL in a virtual machine (VM), automate some user actions, and confirm that the user can complete a flow (such as checkout) as expected. Eventbrite engineers rely on acceptance tests to ensure that our user interfaces don’t break when squads across the organization push code to our shared, continuously deployed repositories. Most acceptance tests at Eventbrite are written using Selenium WebDriver and often look something like this:

    def test_checkout_widget_free_event(self):
        """Verify it is possible to purchase a free ticket."""
        # Go to the test page
        self.checkout_widget.go_to_test_page()

        # Select a ticket and click the checkout button
        self.checkout_widget.select_ticket_quantity(self.free_ticket_id, 1)
        self.checkout_widget.click_checkout_button()

        # Verify the purchase confirmation page is displayed
        self.checkout_widget.verify_purchase_confirmation_page_is_displayed()

But when targeting a canvas element, clicking on a seat looks a bit more like this:

   action = ActionChains(webdriver_instance)
   action.move_by_offset(seat_px, seat_py)
   action.click()
   action.perform()

In other words, we need to know the exact x and y coordinates of the seat within the canvas element. Even after the chore of automating clicks on precise coordinates within the canvas, we knew that minor style changes might require us to revisit each test and hunt down updated coordinates.

As the projected release date loomed near, we considered our options and determined that it would require several dedicated sprints to write the tests needed to thoroughly cover all of our new features. What if, instead of wrangling data and coordinates, we could write out test plans that could be quickly verified by human QA testers?

Enter Rainforest! Rainforest is a crowdsourced QA solution that puts our flow in front of real users. Because testers access sessions through a VM, we can specify which browsers they need to test, and they can run the tests against our staging environment. The Rainforest app runs the test suite on a customizable schedule, and the entire test run is parallelized and completed in less than 30 minutes. We wrote out all of our as-yet-untested user story test cases (in plain English) and got the system up and running.

Our Rainforest tests look like this:

We write each step of the test as a direction, followed by a yes-or-no question for the tester to answer. During a testing session, the tester follows the instructions, such as: “Click ‘Buy on Map’ located on the right-hand side.” Next, they mark the step as passed if the click caused the rendered map to zoom to the two highlighted seats.

Our key to Rainforest success: one-step event creation

Once we decided to proceed with this approach, our squad invested some time into developing an API that would allow us to automate a critical step of this workflow. When Rainforest testers log into their VMs, we provide them a URL that will, upon load, create a new QA user account with an event that is in the exact state needed to test the features covered by the test. A tester loading this URL is analogous to an acceptance test run instantiating the factory classes that generate test data for our WebDriver tests.

The endpoint accepts URL parameters that define relevant features of the event:


Loading this URL creates a new QA user with restricted permissions, builds an event with a medium-sized seat map and four ticket types (authored by the new user), and then redirects to the embedded checkout test URL for the given event.
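The endpoint itself is conceptually simple. Here is a stripped-down, Django-style sketch of the idea; the helper functions, parameter names, and redirect path are hypothetical, not our production code:

from django.http import HttpResponseRedirect

# Hypothetical factories; the real ones live in our internal QA tooling
from qa_factories import create_qa_user, create_seated_event


def create_qa_event(request):
    """One-step setup: build a QA user and event, then drop the tester into checkout."""
    # URL parameters describe the event this test needs
    map_size = request.GET.get('map_size', 'medium')
    ticket_types = int(request.GET.get('ticket_types', 4))

    user = create_qa_user()  # restricted-permission QA account
    event = create_seated_event(
        organizer=user,
        map_size=map_size,
        ticket_type_count=ticket_types,
    )

    # Redirect straight to the embedded checkout test page for the new event
    return HttpResponseRedirect('/checkout-test/{}/'.format(event.id))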

Without this tool, Rainforest testing would require dozens of clicks and page refreshes from a manual tester to create an event, design a venue map, publish the event, and then finally reach the checkout flow. Eventbrite engineers have already covered all of these actions with automated acceptance tests elsewhere—when we are testing the seat reservation flow, we want to focus on precisely that. One-step event creation has allowed us to get testers into the correct state to access our flow with a single keystroke.

Additionally, because we have configured Rainforest to run against our staging environment, Rainforest QA testers catch bugs for us before they are released. While unit and integration tests give us confidence that our code works at a more granular level, Rainforest has given us an additional layer of security, assuring that the features we already built are still working so that we can move on to the next challenge.

Universal takeaways

Yes, Rainforest does cost money, and I’m not here to tell you how your company should spend its money. (If you’re curious about Rainforest, you can always request a demo). It’s also not the only solution in this space. Rainforest works very well for us, but a related platform such as Testlio, GlobalAppTesting, TestingBot, or UseTrace may be a better fit for your team.

Here are some takeaway learnings from our case study that might still come in handy:

  • Cross-browser testing pays off. If your current acceptance suite only runs tests against one browser, it might be worth re-evaluating. (If you’re doing your own cross-browser QA, Browserstack is indispensable.)
  • When you automate testing user stories as part of your continuous integration (CI) flow, you ensure that your system reliably meets product requirements.
  • Don’t stop writing automated tests, but do consider how much time you are spending writing and maintaining tests that could be more reliably tested by a human QA tester.
  • You can get the most out of your testing and QA by automating critical steps of the process.

For my squad, Rainforest has been an excellent solution and has helped us catch many browser-specific and complex multi-page bugs before they made their way to the release branch. While we are still working on improving its visibility in our CI flow so that newly introduced bugs are surfaced earlier in the development cycle, automated test runs assure us that our features remain stable across all major browsers. As a developer, I love that I get to spend my time building new features rather than writing and maintaining fussy WebDriver tests.

Have you found another way to save time writing acceptance tests for complex WebGL flows? Do you have questions about our Rainforest experience that I didn’t cover? Do you want to have a conversation about the ethics of crowdsourcing QA work? Let me know what you think in the comments below or on Twitter.

Varnish and A-B Testing: How to Play Nice

Here at Eventbrite, we love building sites that are fast, delightful, and reliable. Caching HTML responses using edge caches, such as Varnish, ensures a lighter load on your servers and a performant experience for the end user. However, doing so can often cause A/B testing frameworks to fail in a sneaky fashion.

Read on to learn some key things to know if you find yourself running an A/B test on a page served via Varnish.

First, a quick overview

What is A/B testing? The Wikipedia page covers the topic well, but here’s a quick TL;DR: A/B testing allows us to expose our users to two slightly different experiences, a control and a variant, where the variant differs in a single controlled way. We then track each variant against pre-determined performance metrics, such as conversion rate to purchase, to decide whether the variant provides a real lift over the control.

A/B testing is one of the most useful tools developers and product managers can use to determine what best engages their audience. Often these tests need to live on pages that must be reliable and performant. That’s where Varnish comes in.

Varnish is an open-source caching HTTP reverse proxy: essentially, a super fast cache that sits in front of any server that understands HTTP. It receives requests from the client and attempts to serve an HTTP response from the cache. If it cannot, it forwards the request to the backend server, stores the server’s response, and passes it along to the client.

Varnish sounds great! Why is it troublesome with A/B Testing?

Varnish caches an entire HTML response, so some requests from the client never hit any server-side application code. If the A/B testing framework assigns variants on the server or relies on any server-side logic, then a person enrolled in variant A may be served a cached response of variant B (and vice versa). This is bad. The experiment data becomes corrupt, and any potential insights are useless. If the A/B test is entirely separate from any backend logic code, there may not be any problem at all!

What is the Solution?

Utilizing Edge-Side Includes (ESI) with our Varnish layer!

ESI is a small markup language that allows for dynamic assembly of web content. It gives an edge server (like our Varnish cache) the ability to mix and match content (or fragments) from multiple cached URLs into a single response.

Let’s look at a simple example with a global header we want included via ESI on multiple pages:

//HTML file with ESI Include
        <esi:include src="/my_global_header.html" />
         <div>Lots of other content</div>

What is happening here?

The Varnish server understands how to parse the <esi:include tag and checks whether it has the path given in the src value cached.

On a hit (the asked-for item is in the cache): Varnish inserts that cached fragment into the response our system returns to the client. The server did not have to do any additional work to recreate our global header; Varnish simply inserted the cached global header directly into the response.

On a miss (the asked-for item is not in the cache): Varnish checks back with the server and asks for the content represented by the provided path. It then inserts that response into the cache using the src value as the key, inserts the fragment into the response, and passes it along to the client.

Why not Varnish the whole page?

This way we can reuse the global header component across any number of templates, including those that contain user-specific information which we should not serve via Varnish. It allows us to be surgical about which content we cache and which we do not.

Applying ESI to our use case

We can utilize ESI to include an entire view, rather than just a fragment of a view, in such a way that we don’t impact performance negatively. Let’s run through an example.

Say we have a complicated homepage. Our server resolves incoming requests for /home to our view handler HomePageView, which returns an HTML response. HomePageView does a massive amount of logic and heavy lifting to provide a great experience to our users. It receives heavy traffic regularly, so we naturally serve it with Varnish to avoid repeating that heavy lifting for every request.

However, our team has been asked to run an experiment on the homepage which would display a picture of a cool cat to users with an odd-numbered guest_id. Here guest_id is a semi-permanent identifier stored in a cookie for a logged out user.

We then can do the following:

  1. Remove any “standard” Varnish configuration that may have been implemented on the homepage to ensure that every single request hits the server. When a request comes from the client for /home, every single one should resolve to HomePageView.

  2. Move all of the heavy logic that HomePageView was previously doing to a new view titled HomePageViewESI. We’ll come back to this in step 5.

  3. Now instead of the normal heavy logic our HomePageView previously did, we only parse the guest_id from the request. For purposes of the example, let’s say the guest_id is odd. The view then creates an ESI specific path that represents a homepage covered in cats:
    esi_path = my_esi/home/?my_experiment_variant=show_cats

    Aside: The esi_path here acts as our unique cache key.

  4. Then the response which HomePageView returns from our application server is just the following:
    <esi:include src="my_esi/home/?my_experiment_variant=show_cats" />

    That’s it. We don’t include anything else in the server response. Our Varnish server understands how to parse the <esi:include, and if it is a hit, it inserts the cached cat-covered homepage specified by the provided esi_path. No application logic was necessary beyond parsing the guest_id to serve the correct content to the end user.

  5. However, what if the esi_path is a miss? Varnish will look back to our server and request the content represented by the provided esi_path, which looks like:

    /my_esi/home/?my_experiment_variant=show_cats

    This means that the server needs to resolve incoming requests for /my_esi/home/ in addition to /home.

    This is where we use HomePageViewESI. We configure the server to resolve incoming requests for /my_esi/home/ with HomePageViewESI.

    HomePageViewESI understands how to parse experiment variants encoded into the path, does the heavy lifting, and returns a full, complex, HTML response.

    Varnish consumes this rich HTML content, inserts the returned content into the <esi:include tag that HomePageView initially returned as a fragment, and stores it in the cache under the key:

    /my_esi/home/?my_experiment_variant=show_cats

    This process guarantees that even cache hits serve the expected variant to a given user. The variant is encoded into the esi_path, guaranteeing a unique cache key for each version of the content to be served. (A minimal sketch of both views follows this list.)
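Here is a minimal sketch of the two views in a Django-like style. The class names match the example above, but the render_homepage helper, the second variant name, and the URL patterns are illustrative assumptions rather than actual Eventbrite code.

    # views.py -- a sketch; everything other than HomePageView/HomePageViewESI is assumed
    from django.http import HttpResponse
    from django.urls import path
    from django.views import View

    def render_homepage(show_cats):
        # Stand-in for the expensive template rendering the real homepage does.
        cats = '<img src="/static/cool_cat.jpg" />' if show_cats else ""
        return f"<html><body>{cats}<div>Lots of other content</div></body></html>"

    class HomePageView(View):
        # Light outer view: runs on every single request, so keep it cheap.
        # It only reads guest_id, picks the variant, and returns an <esi:include>
        # whose src encodes that variant -- our unique cache key.
        def get(self, request):
            guest_id = int(request.COOKIES.get("guest_id", "0"))
            variant = "show_cats" if guest_id % 2 == 1 else "no_cats"
            esi_path = f"my_esi/home/?my_experiment_variant={variant}"
            return HttpResponse(f'<esi:include src="{esi_path}" />')

    class HomePageViewESI(View):
        # Heavy inner view: only runs when Varnish misses on a given esi_path.
        def get(self, request):
            variant = request.GET.get("my_experiment_variant", "no_cats")
            return HttpResponse(render_homepage(show_cats=(variant == "show_cats")))

    urlpatterns = [
        path("home", HomePageView.as_view()),             # the page users request
        path("my_esi/home/", HomePageViewESI.as_view()),  # only Varnish requests this
    ]

Note that any experiment enrollment (discussed in the gotchas below) would also belong in the light outer view, with the result encoded into esi_path.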


This approach allows for A/B testing of heavily trafficked pages while keeping them performant. Listed below are some “gotchas” to avoid!

Keep any logic done before returning the initial <esi:include very light.

This logic runs for every request. To hold onto the benefits that our cache provides us, be sure not to bloat this with extraneous logic.

The URL path in the browser does not match the path of the request itself.

On a cache miss, the server now receives a URL prefixed with some ESI-specific identifier; in our example, my_esi. This means it doesn’t match the URL shown in the browser.

For example, the browser’s URL may read:

/home

However, the URL path that the server receives is:

/my_esi/home/?my_experiment_variant=show_cats

This can quickly cause downstream issues. Many error loggers and other forms of reporting rely on the request path server side, but that will no longer be an accurate representation of the request put forward by the user. Instead, it will be the constructed ESI URL. Additionally, if the frontend stack relies on the request path or query params, it will no longer be in sync with what is in the browser for these same reasons.

Solutions? There are many. The core of each comes down to two things:

  1. Communication
  2. Abstraction

Which seem pretty counter to each other, huh?

The communication is inward.

It is easy for issues to arise when implementing complex caching solutions, so it is necessary to use verbose logging on any page that serves its response via ESI. Doing so makes it much easier to track down bugs that could otherwise be incredibly cryptic to decipher.

Always be sure to include the full path, with query params, in the backend logs for pages served via ESI. The query params provide necessary information as to exactly what response we served to the client.

The abstraction is outward.

It should never become apparent to the user that the request path is different from what the browser shows, as that would erode their trust in the application.

How do we solve for this? If possible, remove any reliance on the request path in your client code and rely on window.location instead. However, if your application is tightly tied to the request query params and path hydration, another option is to abstract the request on the server in an ESI-aware way, so that the critical elements reflect the original request rather than the rewritten ESI path.
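As a sketch of that second option, a Django-style middleware could keep the full ESI path available for logging while exposing the path the browser actually shows to the rest of the stack. The my_esi prefix matches the earlier example; the original_path attribute and other names are assumptions.

    import logging

    logger = logging.getLogger(__name__)

    ESI_PREFIX = "/my_esi"  # matches the prefix used in the example above

    class EsiAwareRequestMiddleware:
        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            if request.path.startswith(ESI_PREFIX):
                # Log the full constructed ESI path, query params and all (the "communication").
                logger.info("ESI request: %s", request.get_full_path())
                # Expose the path the browser shows to downstream code (the "abstraction").
                request.original_path = request.path[len(ESI_PREFIX):]
            else:
                request.original_path = request.path
            return self.get_response(request)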

On a cache miss: Do not enroll a user when building the full view.

Often it is necessary to enroll users based on a specific set of conditions; those conditions, however, must be evaluated outside of the ESI layer. Attempting to enroll users from within the view built behind the ESI layer quickly makes your data unreliable, as there is no guarantee that the server will be hit for anything encapsulated within that view.

The solution is to perform any user enrollments in the outermost layer, which we call on every request, before returning the <esi:include src={} /> response, and to encode the resulting value into the path provided to src, as that is the only way to ensure the data is correct.

All in all, implementing an ESI layer to solve for A/B testing Varnish-cached pages can be difficult and cause confusion; however, it is often the only way to test critical flows in a given application.

Have you ever had issues A/B testing with a cache? Let us know below! You can also ping me on Twitter @VincentBudrovic.

Photo by Christopher Burns on Unsplash

The Fundamental Problem of Search

There is a fundamental problem with search relevance. Users are unaware of their own internal processing as they search, modern search interfaces glean only sparse information from users, and ultimately it is impossible to definitively know what a user really needs. Nevertheless, search applications are expected to act upon this sparse information and provide users with results that best match their intent.

In this post, you will learn about these challenges through several examples. By understanding the blind side of search you can accommodate these challenges and provide your users with a better search experience.

The search problem space

A couple years back I co-authored “Relevant Search,” where I described the mechanics of information retrieval and how to build search applications that match users with the information they seek. But even as I wrote the book something at the back of my mind was weighing me down. Only now has the problem taken shape so that I can describe it. I call it the fundamental problem of search. Consider the following:

  • Modern search interfaces are minimalistic, and users don’t have much opportunity to tell you what they want – usually just a text box.
  • Users have lots of criteria in mind when making a decision. If they want to find an event, then they are considering the type of event, location, time, date, overall quality, and probably many other things as well.
  • Different users have different criteria in mind when making decisions and different weighting of said criteria.
  • Users often don’t know they have all of these criteria in mind. They believe you can “simply” find a set of matching documents and return them in some “simple” specified order.
  • Users believe that deciding whether or not documents match their search criteria is a binary decision, and users believe that the order of the results can be exact. In truth, both the match and the ordering are naturally “fuzzy.”
  • Despite uncertain user intent and the fuzzy nature of matching and ordering, the relevance engineer has to make both matching and ordering concrete and absolute.

The fundamental problem of search, then, is the fact that relevance engineers are required to perform a fool’s errand. With missing information, with ambiguous information, and with high user expectations, we have to coerce a search engine to somehow return documents that match the user’s intent. Further, the documents are expected to be ordered by hopelessly ill-defined notions of quality and relevance.

In the sections below I’ll delve into several examples of the fundamental problem of search as well as some ideas for rising above the problem. Here’s a hint though: it won’t be easy.

Matching and sorting by relevance

In the simplest possible scenario, the user enters text into a search box, and we match documents and sort the results based solely on relevance. That is, we find all the documents that contain at least one of the user’s keywords, and we score the documents based upon Term Frequency * Inverse Document Frequency (TF*IDF) scoring. This process seems pretty straightforward, right? But fuzziness is already starting to creep in.

First, let’s consider how to determine which set of documents match. If your user searches for “green mile” then we as humans recognize that the user is probably talking about the movie called The Green Mile. The search engine is just going to return all documents that match the term green or the term mile. However, documents that match only one of these terms probably aren’t going to be very relevant. One option is to require both terms to match. But this strategy is ill-advised because there are plenty of times where an either/or match might be relevant. If the user searches for “comedy romance” then they might prefer a comedy romance, but toward the end of the list, a comedy or romance film might be just fine.

In principle, another option would be to return every document with a score above some cutoff value X. In practice, this isn’t possible because TF*IDF scoring is not absolute; you don’t score 0 for perfectly irrelevant documents and 1 for perfectly relevant documents. Consider two queries, one for “it” (as in the movie It) and another query for “shawshank” (as in The Shawshank Redemption). The term “it” is very common and so the best matching document – the movie It – will likely get a relatively low TF*IDF score. However, in the case of “shawshank,” let’s say that we don’t actually have a document for The Shawshank Redemption, and the only document that matches is because of a footnote in the description stating “from the director of The Shawshank Redemption.” Though this is a poor match, the score will be quite high because the word “shawshank” is so rare. In this example, we have a low scoring document that is a great match and a high scoring document that is a terrible match. It is just not possible to delineate between matching and non-matching documents based upon score alone.

We see that even in the most basic text search scenario we already begin running into situations where we can’t know the user’s intent and where the hope of perfect matching and perfect sorting break down. But wait, it gets worse!

Matching and sorting by relevance and quality

“I want to find a documentary about Abraham Lincoln.” Seems simple enough. So, we retrieve all documents that match either abraham or lincoln, and we sort by the default scoring algorithm so that documents that match both terms appear at the top. However, there’s a problem: The user told you they want Abraham Lincoln documents but implicit in the request is that they really want just the high-quality results.

If you’re used to databases and you’ve just started working with search, then the answer seems obvious – just sort by quality (or popularity or whatever field serves as a proxy for quality). If you do this, you’ll immediately find yourself with a new issue: when sorted by quality, the top results will contain documents that are high quality but only match one of the terms and aren’t very relevant at all. If you had a documentary on the life and times of the Biblical Abraham, for example, and it was a really high-quality documentary, then it would jump above the documents that are actually about Lincoln.

So again, for someone new to search, the next “answer” is clear: just set the minimum_should_match parameter to 100% to ensure that we only return documents containing all the terms the user queried. But this doesn’t really fix the problem. Consider a high-quality documentary about Ulysses S. Grant which merely mentions Abraham Lincoln – a high-quality result, but nevertheless irrelevant to the user. What’s more, minimum_should_match=100% can get you in trouble when the user searches by dumping lots of words in the search box and hoping that some of them match. For example “civil war abraham lincoln” – a documentary entitled “President Lincoln and the Civil War” would be entirely relevant yet would not be a match!
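For reference, the minimum_should_match approach looks roughly like the Elasticsearch query below, written as a Python dict. The title field and the movies index are assumptions for illustration.

    # Require every term in the user's query to match.
    query = {
        "query": {
            "match": {
                "title": {
                    "query": "civil war abraham lincoln",
                    "minimum_should_match": "100%",
                }
            }
        }
    }
    # es.search(index="movies", body=query)
    # "President Lincoln and the Civil War" fails this query because its title
    # never contains the term "abraham", even though it is entirely relevant.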

The best thing to do here is to boost by quality rather than use absolute sorting. By default, the score of the document is based solely upon how well the text matches according to TF*IDF. You can incorporate quality into the score by adding in some multiplier times the quality: total_score = text_score + k * quality. With this approach, you can in principle adjust k so that the total score is the appropriate balance between text score sorting (k = 0) and absolute quality sorting (k = inf).
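In Elasticsearch, one common way to express total_score = text_score + k * quality is a function_score query with boost_mode set to "sum". A minimal sketch, assuming a numeric quality field; the field names and the value of k are illustrative.

    k = 0.5  # the balance knob between text relevance and quality
    query = {
        "query": {
            "function_score": {
                "query": {"match": {"title": "abraham lincoln"}},
                "functions": [
                    {"field_value_factor": {"field": "quality", "factor": k, "missing": 0}}
                ],
                "boost_mode": "sum",  # total score = text score + k * quality
            }
        }
    }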

Though this approach of linearly adding in quality is a very commonly used approach and is often effective, it can come with some nasty problems of its own. Ideally, you would be able to find some k that works best in all cases. In practice, you can not. Refer back to the example of the “it” search and the “shawshank” search. In the “it” search, the best matching document will have a much lower text score than a typical query. And in the “shawshank” query, even average matching documents will have potentially high scores. In both of these cases, if we calculate total_score as text_score + k * quality, then in the “it” search the quality component will have a much greater effect on sorting than it will for the “shawshank” query. It would be nice if we could somehow automatically scale k so that it was proportional to the typical text scores for a given search. More on this in a future post!

Sidebar: multiple objective optimization

A big part of the underlying theme here is that search is a multiple-objective optimization problem. That is, we are trying to optimize the scoring function so that multiple objectives are optimized simultaneously. However, we do not know – and cannot know – how important the objectives are relative to one another.

The issue is perhaps most evident in applications like Yelp where the different objectives are called out in the application: You’re looking for a restaurant – how would you like to organize the results? Distance? Price? The number of stars? If you’ve selected a food category or typed in a search, then how important should that be? From Yelp’s perspective, the answer cannot entirely be known. The best we can do is to find some balance between the various dimensions that empirically tends to maximize conversion rates. In modern implementations of search, Learning-to-Rank is a machine learning approach that does precisely this.

Matching and sorting by relevance, quality, and date

Things get even worse when we involve precise quantities like date or price. Often users want to sort their results by date, and this sounds like a perfectly reasonable thing to do. However, you will encounter some terrible side effects when sorting exactly by things like date. Here’s why: the Pareto principle drives the world, and your inventory is no different. If you are in e-commerce, then 20% of your inventory is where you get 80% of your sales, and 80% of your inventory is where you get 20% of your sales.

Let’s say our users are searching for “beer events,” but they want to sort the results by date. Rather than showing the most relevant events such as beer festivals, brewing classes, and beer tastings, we’re going to show them irrelevant, date-ordered events such as business dinners or speed dating events that merely mention beer in their descriptions. Effectively, we are scooping way down into the 80% of less desirable events simply because they happen sooner than the more relevant events that we should be returning.

Solving this is quite a challenge. Consider some alternatives:

  • Boost by date: As presented in the last section, you can boost by date so that the best documents are right at the top of the search results, roughly sorted by date. But when users choose to sort by a precise quantity like the date, they will see any deviation from date order as evidence that search is broken and not to be trusted.
  • Re-sort the most relevant documents by date: You can use the Elasticsearch rescore feature to find a set of the N most relevant documents and then re-sort them by date. But how do you find a good value for N? If N is too low, then users may page past all N results and you’ll have to either tell them there are no more results OR you’ll have to “show omitted results” and start over by date. On the other hand, if N is too high, then the returned set will dip past the most relevant document and pull up some of the 80% of less desirable results. Sorting by this group means that some of these irrelevant or low-quality documents end up at the top of the search results.
  • Sort by date then sort by relevance: If you think this is a good idea, then you haven’t put your thinking cap on yet today. Nevertheless, I hear this tossed around as an option quite a bit. The problem is that if the date includes a timestamp, then it is effectively a continuous value. If your documents have timestamps with granularity down to the second, then sorting by date followed by relevance is no different than just sorting by date, because the secondary sort never gets a chance to kick in.
  • Bucket by date and sort by relevance within each bucket: As a variant on the previous idea, you do have the option of discretizing the date, chunking documents into buckets of a day or a week, and sorting by relevance (or a static quality score) within each bucket – see the sketch just after this list. This might be a great solution. If the user doesn’t expect exact date/time ordering, then they will be more forgiving when the documents don’t appear in exact date order down to the second within the buckets. However, there are still problems – within each bucket, there are fewer documents to draw from. Nevertheless, the search engine will faithfully provide the best documents it has for each bucket. This means that as your bucket size gets smaller, the chances of a bucket getting filled with irrelevant documents become higher. It would be better not to return such a bucket at all, but per our Matching and Sorting by Relevance section, scoring is not absolute, so it might still be difficult to decide which buckets we should omit from the search results.
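Here is what the bucketing variant might look like in Elasticsearch, expressed as a Python dict. The event_week field is a hypothetical precomputed bucket indexed alongside each event.

    # Sort by a coarse date bucket first, then by relevance within each bucket.
    query = {
        "query": {"match": {"description": "beer events"}},
        "sort": [
            {"event_week": {"order": "asc"}},   # hypothetical precomputed week bucket
            {"_score": {"order": "desc"}},      # relevance breaks ties inside a bucket
        ],
    }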

No hard and fast solutions

As you can see in the sections above, I’m doing an excellent job of outlining a huge problem, but I’m not providing any easy solutions. That’s because there aren’t any!

By its very nature, search and recommendation is and forever will be filled with nasty corner cases. Human language is dirty and imprecise, and your users’ information needs will be uncertain and highly varied. However, don’t lose hope. Despite the many corner cases, search technology is still an excellent tool for helping users to satisfy their information needs in the vast majority of use cases.

What’s more, search is getting better. Learning to Rank is a machine learning technique for scoring documents that can automatically find the best balance between features like text relevance, static quality, and innumerable other things. Similarly, there has been lots of conversation in the search community about embedding vectors into search so that traditional inverted-index search can be used in conjunction with recent developments with machine learning (check it out!).

Finally, I would expect the user experience to continue to develop and improve. The dominant search experience for the past 15 years has been a text box at the top of the screen. However, we see this giving way to more conversational search experiences like when you ask Siri to look up a phone number or when you ask Alexa to play a particular song. The visual experiences are changing too. Take a look at Google’s image search. There the left-nav faceted search has been replaced with a very intuitive tag-based slice-and-dice experience that allows you to quickly narrow down the small set of results that fit your information needs. I expect we will continue to get better and better experiences.

You can learn more about building a relevant search by checking out my previous post on understanding the ideas of precision and recall as they relate to search.

Have you run into the fundamental problem of search in your own search applications? What have you done to overcome it? I’d like to hear from you! Ping me on Twitter @JnBrymn or add a response at the bottom of the page. If you’d like, we can jump onto a hangout and share some war stories.

Photo by Andrew Neel on Unsplash

How to Make Swift Product Changes Using a Design System

Redesigning an entire site is a daunting challenge for a frontend team, and developers approach extensive visual changes with caution. You might have to go through hundreds of stylesheets updating everything from hex values to custom spacing. Did you use the same name for colors in all your files? No typos? Do your colors have accessible contrast ratios? What a nightmare!

At Eventbrite, our design system helps our developers make those sweeping changes all while saving time and money. Keep reading to see how a design system can help your team with consistency, accessibility, and lightning-fast redesigns.

The Key to Consistency

A design system is a library of components that developers across teams can use as building blocks for their projects. A shared library allows everyone to use components, or reusable chunks of styling and code, that look and work the same way. You don’t want ten similar but different copies of the same thing, do you? Take custom file uploader components, for example. If each team builds their custom version of the component, not only does it create a confusing user experience, but it also means that developers across teams have to maintain and test all of them. No, thank you!

As part of the Frontend Platform team here at Eventbrite, my team and I maintain the Eventbrite Design System (EDS). Because we wrote EDS in React, some of our apps use EDS while legacy apps built on other JS frameworks do not. As we move more of our products over to React, adoption of our design system is increasing. Our user experiences across all of our platforms look and feel more cohesive than ever before. Every EDS file uploader looks and behaves the same way (with minor variations).

Accessibility for All

When everyone uses the same component, you can build accessibility features in one place, and others can inherit them for free. Furthermore, you or a dedicated team can now thoroughly test each component to ensure it works for users of all abilities and needs. The result? People who navigate your site using screen readers or keystrokes can now use your product!

We love taking advantage of this benefit here at Eventbrite. We ensure the colors in our design system components have the right contrast ratios, which means that all Eventbrite pages are usable by people with color blindness. Our color documentation page uses Chroma.js to help calculate the ratios for our text and color combinations. We also use WCAG AA as our contrast standard.
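For the curious, the WCAG contrast ratio comes down to a small calculation over the two colors’ relative luminance. The sketch below implements the standard WCAG 2.0 formula in Python for illustration; it is not necessarily what our documentation page does under the hood.

    def _linearize(channel):
        # Linearize one sRGB channel (0-255) per the WCAG 2.0 definition.
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    def relative_luminance(hex_color):
        r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4))
        return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)

    def contrast_ratio(fg, bg):
        lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
        return (lighter + 0.05) / (darker + 0.05)

    # WCAG AA requires at least 4.5:1 for normal-sized body text.
    print(round(contrast_ratio("#F05537", "#FFFFFF"), 2))  # ~3.47, so this pairing fails AA for body text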

A sample of one of our colors on the Eventbrite Design System colors documentation page. It includes the color name, hex, RGB, and Luma values along with the WCAG score.

We also strive for our components and our pages to work well with keyboards and screen readers. EDS has a Keyboard higher-order component (HOC) where we use react-hotkeys to help us set up our React pages for optimal keyboard accessibility. Eventbrite works towards having all our components be accessible to all. Thanks to our design system, when Frontend Platform doubles down on accessibility, all teams that use EDS inherit the accessibility improvements by keeping up with our latest version.

Quick Turn-Arounds and Fast Redesign

Now, back to the redesign scenario. If you’ve defined all your colors and variables in one place, your team no longer has to hunt down definitions for each component. One developer can change a hex value (say, from #DB5A2C to #F05537), and every app that uses your design system inherits all changes right away.

In spite of all our planning and prep work, every once in a while our team needs to set a tight deadline. In our latest redesign, we made sweeping typography and color changes. While it seemed like a massive task, EDS enabled us to make many of these changes very quickly. We spent most of our time and energy making these changes to our products that don’t yet use EDS and thus require specific updates and quality assurance.  Check out the results of the transformation below!

Search Results Page Before the Rebrand

Eventbrite Search Results Page Before Redesign
Search Results Page After the Rebrand

Eventbrite Search Results Page After Redesign

Home Page Before Rebrand

Eventbrite Home Page Before Redesign

Home Page After the Rebrand

Eventbrite Home Page After Redesign

While adopting, implementing, and maintaining a new design system took serious work, the benefits have been well worth it. A design system might save your team a lot of time and work, too. However, it is not a magic bullet, and it takes time to get right. Don’t despair if yours doesn’t look as fleshed out as some of the more popular and well-staffed design systems, like Google’s Material UI or Airbnb’s Design Language System. Start saving time and money by having a shared library to increase consistency, improve the accessibility of your product, and make broad changes safely. Create a design system as unique as your product and start reaping the benefits.

What about you? Is your team using a design system? Is it a custom built one? Drop us some lines in the comments below or ping me directly on Twitter @mbeguiluz.

BriteBytes: Diego “Kartones” Muñoz

An Eventbrite original series, BriteBytes features interviews with Eventbrite’s growing global engineering team, shining a light on the individuals whose jobs are to build the technology that powers live experience.

One of my favorite things about Eventbrite is getting to work with engineers from all over the world. In September, I had the pleasure of sitting down with Diego “Kartones” Muñoz, a Principal Engineer visiting Eventbrite’s headquarters in San Francisco from our Spain office. He joined Eventbrite through our Ticketea acquisition in May and works out of Madrid with the Ticketing and Registration Business Unit (TRBU) Mapache team. In this interview, he tells us about his path, what it’s like onboarding onto a larger company, and what he likes most about working at Eventbrite.

Tamara Chu: How did you come to work for Ticketea/Eventbrite? What was your path as a software engineer?

Diego “Kartones” Muñoz: I started early in development and computers, so before entering university I already knew a bit and wasn’t sure if I wanted to study it or not. I started studying, then I quit after a few years because I thought it was boring [laughs]. I started working, and I felt I was learning way more by working. Since then I’ve switched a lot: I started consulting with .NET, then switched to PHP and more open-source stacks, then I switched to Ruby, and since 2015, Python, which I’m in love with.

In 2009, I switched from consulting for other companies to product development, and since then I have been in multiple different areas: social networks, web gaming portals, mapping tools, video generation tools, and now ticketing.

T: How long had you been at Ticketea before Eventbrite?

D: I joined March 2017, so one year. In total it’s now been one year and a half between Ticketea and Eventbrite.

T: And did you like the culture of Ticketea compared to the other companies you’ve worked at?

D: Yes, that was probably the deciding factor. A friendlier company, not willing to jump on the startup unicorn hype but preferring to focus on a single product; not so worried about growing a lot, but keeping the product stable when adding new features. Also, while Ticketea had taken investment, it was a small amount, and the company was profitable, so it was nice that we weren’t in such a hurry to always be generating lots of new users or lots of new revenue, just growing steadily but at a slower pace than other startups.

It’s not that that’s bad in itself, but other places I’ve been were just growing, growing, growing, and they didn’t care about quality as much.

T: Mm, like growth for growth’s sake, no matter what happens to the team or what kind of culture you’re building.

D: Yes, exactly, or when things are failing often because the platform is not stable enough.

T: Has the transition to Eventbrite felt natural? Or what was that shift like?

D: I think for us it has been quite natural, also because our stack at Ticketea was more or less similar; we already used most of the tech stack. [The shift] has been learning a new platform, adjusting to mostly everything in English, and the time difference.

T: Yeah, [the time difference] is a big one the teams are still figuring out. Was there anything about Eventbrite that surprised you when you joined?

D: The size and the scale of some things, like the size of some big events that [Eventbrite] has might be more than the total of what Ticketea sells in one year. And some parts of the technology, you can actually look at it and see that it has years of experience put into there, and [years of] thought evolving those parts. That’s something I appreciate a lot, spending time improving and making things better.

T: Was there something that excited you, like “oh cool, this is something new that I can look into?” Something specific?

D: Yes, for example, the way the APIs work — the internals of how to build and expand them and how they communicate between themselves — it was a problem that I’ve seen in the past but never solved as cleanly as here. I’m not an expert on API development, but here I think we have a good and elegant solution.

T: How were you doing it at Ticketea versus here?

D: For example, regarding API design, ours were less advanced, built in a more classical way of “load data, fetch all related entities, and return everything.” It was more manual work, without the EB API magic. We also didn’t have the same scale as Eventbrite, so performance usually wasn’t a problem; things would go slower, but they would still work. At Ticketea we were also just two technical teams, so there’s been a big jump to now being part of a company with hundreds of engineers.

T: Was there anything from Ticketea that you wish had come over to Eventbrite?

D: The automated deployment, the quicker release cycles. As we didn’t have Ops, we were all a tiny part DevOps, mostly developers. We handled our own infrastructure. That’s also why we were switching from AWS to GCP [Google Cloud Platform], because it removes an additional layer of complexity, so we could self-deploy without systems or release engineers. We had automatic deploys, canary releases, simple traffic splitting, all automatic with a slider and one button. Those things, here with so many people and so many services, are not as quick.

T: What has been your favorite thing about working at Eventbrite?

D: Probably being able to work on such a big project. Because we’re thinking, you build something, it’s not something that three or four people are going to use, but it’s a thing that millions of people are going to use. But still, I don’t know what else, because it has just been a few months [laughs].

T: [laughs] I’ll ask you again in another 6 months.

D: Yeah, let’s do that!

T: How about your least favorite thing?

D: Adapting, maybe, to the way of releasing things. We have lots of services with complex interactions, so you have to be careful and take additional steps to deploy services. Every change takes extra effort to update and release, etcetera, which I wasn’t used to due to our smaller scale and mostly automated platform.

T: Do you see opportunities to change that?

D: I think yes. I don’t know what the future is for our team, but yes, of course, I feel there are opportunities to improve the way things are done. There’s PySOA (Eventbrite’s Python library for writing microservices and their clients), there are tools in place to migrate services, and there’s probably going to be more alignment between product and tech: is this important, or are there more pressing issues, or can we take advantage of doing something with the service to also separate it?

T: What are you most excited about?

D: All the things that I can learn from the platform. I am just grasping the tip of the iceberg, how everything works: the backend parts, learning React, how the tools we use work (internally), DevOps, the infrastructure that we have, the general learning opportunity of the architecture, and the platform.

Diego has been an active part of Spain’s tech scene for many years, and it’s fantastic having him on the team. A big thank you to Diego for sharing his background and experience. We’re looking forward to hearing more from him and the rest of the team in the future, so stay tuned for more BriteBytes!