Here at Eventbrite we’re focused on providing the best event organizing experience. We believe it should be super simple for organizers to create events and sell tickets faster, and that our users should be able to find the most relevant tickets quickly. But in order to satisfy this basic goal we must solve many challenges behind the scenes.
Perhaps you saw IBM’s Watson computer on Jeopardy the other week. Watson used artificial intelligence to defeat two of the most successful contestants in the show’s history. And while we’re not competing on game shows, A.I. plays an important role in what we do too.
We collect a significant amount of data while publishing and selling tickets. We store our transactional data in MySQL and our social graphs in MongoDB, and we capture a lot of user and site engagement data in logs. We also have systems that automatically create metadata about events and users in real time, such as event categorization and spam and fraud signals. To create new services and improve existing ones on top of this dataset, we have to use similar kinds of data over and over again, and we have to make lots of connections between datasets that may live in different data stores such as MySQL, MongoDB or flat files. We hate moving data from one system to another every time we need it. We also avoid joining large tables for analytics, and we aim to have the least possible impact on production systems while doing data science. This is why we rely on Hadoop to store all the data we need for data science. (In my next post I will go into detail on our data pipeline architecture and data processing systems.)
Beyond processing this large social graph, we also have to calculate lots of metadata about users and events. This metadata is calculated using machine learning; specifically, we use supervised learning to categorize events as technology, conference, art and so on. This helps us map interests between users and organizers. We have similar systems for spam and fraud detection to protect our users and keep our site safe. Finally, we use all this information to better serve events to our users via recommendations, search and personalized notifications. To learn more about our recommendation engine, please check out Brian Zambrano’s post.
The scale of our dataset is also growing rapidly. We believe that Eventbrite’s social graph might be the fastest-growing social graph. For us, everyone you attend an event with is a part of your social graph. When you attend an event, you tell us that there is a shared interest between you, all the other attendees and the organizer of that event. So every time you attend an event with hundreds or even thousands of other attendees, your social graph grows by that many nodes.
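To make the idea concrete, here is a minimal sketch of how a co-attendance graph like this could be built. It is an illustration only, not our production code: the `build_co_attendance_graph` function, the shape of the `events` dictionary and all of the ids in it are made up for this example, and it assumes every participant of an event (attendees plus the organizer) is connected to every other participant.

```python
from collections import defaultdict
from itertools import combinations


def build_co_attendance_graph(events):
    """Build an undirected graph linking everyone who shared an event.

    `events` is a hypothetical structure for this sketch:
    event id -> (organizer id, list of attendee ids).
    """
    graph = defaultdict(set)
    for organizer, attendees in events.values():
        # The organizer shares the interest too, so treat them as a participant.
        participants = set(attendees) | {organizer}
        # Connect every pair of participants in both directions.
        for a, b in combinations(sorted(participants), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy data, invented for illustration.
events = {
    "pycon": ("org_1", ["alice", "bob", "carol"]),
    "art_walk": ("org_2", ["alice", "dave"]),
}
graph = build_co_attendance_graph(events)
print(sorted(graph["alice"]))
```

Note that an event with n participants adds on the order of n² connections in this naive form, which is one reason a graph like this grows so quickly.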
We also use Facebook’s social graph and are planning to add other graphs to your Eventbrite social graph. By our estimates, if we go two levels deep into this social graph to recommend events to our users, we process more than a trillion nodes. And the good news is that Eventbrite is growing rapidly, bringing in new users and events faster than ever and making data processing even more challenging and fun.
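For a sense of what "a depth of 2" means, here is a small sketch of a breadth-first walk that collects everyone within two hops of a user. It is an assumption-laden toy, not a description of our pipeline: `neighbors_within`, `toy_graph` and the user names are hypothetical, and a dataset of the size described above would need a distributed approach rather than an in-memory dictionary.

```python
from collections import deque


def neighbors_within(graph, start, max_depth=2):
    """Return {node: depth} for every node within `max_depth` hops of `start`.

    `graph` maps each node to the set of nodes it is connected to.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen[nxt] = depth + 1
                queue.append(nxt)
    del seen[start]  # the starting user is not their own neighbor
    return seen


# Toy graph, invented for illustration.
toy_graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob", "erin"},
    "erin": {"dave"},
}
reachable = neighbors_within(toy_graph, "alice", max_depth=2)
print(sorted(reachable.items()))
```

If each user has roughly d direct connections, a depth-2 walk can touch up to about d² nodes per user, so the total work grows very quickly as events and attendees are added.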