I had the pleasure of participating in a Big Data Cost Optimization event in San Francisco (October 17) at the AWS Loft location. I was a panelist in a round-table discussion focusing on the use of AWS spot instances. It was a terrific opportunity to share how Eventbrite’s Data Engineering team is smartly using spot instances for Big Data & EMR to drive down costs. Thanks to Jasper Groot from the Data Engineering team for helping to co-author this blog post!
At Eventbrite, we’re using spot instances to leverage AWS auto scaling and I’ve published a blog entry on this topic (Big Data workloads with Presto Auto Scaling). Two key focus areas for the data engineering team are ephemeral computing (something that lasts for a very short time) and idempotence (repeatable behaviour with the same outcome). Making a commitment in these areas has allowed us to leverage spot instances to effectively manage costs.
Spot instances are a cost-effective choice if you can be flexible about when your applications run and if your applications can be interrupted. They are well-suited for data analysis, batch jobs, background processing, and non-mission-critical tasks. Below are some of the panel questions, along with my answers and what I learned from the conversation.
How does Eventbrite use EMR and spot instances today? Could you give a quick sense of business use case and your overall big data environment?
At Eventbrite, we use spot instances primarily for reporting and data warehouse tasks:
- AWS Auto Scaling for our Presto/Tableau spot instances
  - Scaling Policies
    - Scale up: CPU Utilization >= 50 for 60 seconds for the metric dimensions
    - Scale down: running Queries <= 0 for 2 consecutive periods of 300 seconds for the metric
- Extending our compute layer with EMR to support data ingestion/ETL and other non-mission-critical services
- Databricks/Delta Lake
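The two scaling rules listed above can be sketched in plain Python to make the thresholds concrete. This is only an illustration of the decision logic, not Eventbrite's actual Auto Scaling configuration. The function names and the input format (one CPU reading for the last 60-second period, one running-query count per 300-second period) are assumptions made for the example.

```python
from typing import Sequence

CPU_THRESHOLD = 50.0       # percent, taken from the scale-up rule
IDLE_PERIODS_REQUIRED = 2  # consecutive 300-second periods, from the scale-down rule


def should_scale_up(cpu_percent_last_60s: float) -> bool:
    """Scale up when CPU utilization over the last 60 seconds is >= 50%."""
    return cpu_percent_last_60s >= CPU_THRESHOLD


def should_scale_down(running_queries_per_period: Sequence[int]) -> bool:
    """Scale down when the last two 300-second periods had no running queries."""
    recent = running_queries_per_period[-IDLE_PERIODS_REQUIRED:]
    # Require a full window of data before deciding the cluster is idle.
    return len(recent) == IDLE_PERIODS_REQUIRED and all(q <= 0 for q in recent)


# Hypothetical readings: a busy period followed by two idle periods.
print(should_scale_up(72.5))
print(should_scale_down([3, 0, 0]))
```

Keying scale-down to running queries rather than CPU means an instance is only released once nothing is executing on it.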
Quick note on Eventbrite’s big data environment:
- Infrastructure: AWS
- Cloud compute: AWS EMR, Apache Spark, Databricks
- Cloud storage: Hive backed by S3 and HDFS
- Visualization: Tableau/Superset
- Databases & Data warehouses: MySQL, Presto, Redis, Elasticsearch
More details can be found in a previous blog post: Looking under the hood of the Eventbrite data pipeline!
What has been your best experience in delivering business value/ cost optimization with spot for EMR?
We leverage spot instances to save ~65% compared with on-demand prices. We generally don't request the latest and greatest instance types for spot (we ask for m4.2xlarge rather than m5.2xlarge) because the older types give us better availability. One area where spot instances are key is auto-scaling our Presto/Tableau cluster, where instances can go offline and queries are cancelled. We're willing to tolerate that because it happens very infrequently.
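To show what a ~65% saving looks like in practice, here is a small back-of-the-envelope calculation. The hourly prices and the 720-hour month are hypothetical numbers picked only so the result lands near 65%. They are not actual AWS prices or Eventbrite's bill.

```python
def spot_savings(on_demand_hourly: float, spot_hourly: float, hours: float) -> dict:
    """Return the on-demand cost, spot cost, and fractional savings for a workload."""
    on_demand_cost = on_demand_hourly * hours
    spot_cost = spot_hourly * hours
    savings = 1 - spot_cost / on_demand_cost
    return {"on_demand": on_demand_cost, "spot": spot_cost, "savings": savings}


# Hypothetical prices (USD per hour) for one instance running a full month.
result = spot_savings(on_demand_hourly=0.40, spot_hourly=0.14, hours=720)
print(f"on-demand: ${result['on_demand']:.2f}")
print(f"spot:      ${result['spot']:.2f}")
print(f"savings:   {result['savings']:.0%}")
```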
Why did you pick EMR and spot over the alternatives? Are there specific use cases or scenarios where it's best suited?
We picked AWS EMR because it fits in with current expertise at Eventbrite since we have a large footprint on AWS. EMR is a flexible compute layer for running applications such as Hive, Spark, Hadoop, and Pig. We’re also using Apache Spark stand-alone for compute but that is limited to *just* Spark.
The Data Engineering team is responsible for managing their own AWS infrastructure, and spot instances work well for the many data analytics type jobs that we run.
Note: Other cloud vendors provide similar spot instance options:
- Low-Priority VM instances in Azure
- Preemptible VM Instances in Google Cloud
What was your biggest challenge during the implementation?
We moved our Presto/Tableau cluster to m5.2xlarge instances (from m4.2xlarge) and we experienced a pattern of losing spot instances due to lack of capacity. We learned from this experience and started requesting older instances to allow for more capacity (i.e. more instances in the pool).
It’s worth noting that on-demand instances take priority over spot instances, which run on overstock (hardware not otherwise being used). We experience situations where spot instances go offline from time to time. The error message is “Instance terminated capacity oversubscribed” and queries running on that instance are canceled/terminated.
Note: Interestingly enough, we hit capacity oversubscription (m5.2xlarge instances) on the Presto/Tableau cluster the week prior to the AWS panel.
Did you have to convince others in the company with a business case justification?
Transparent communication with the business is key. We get spot instances at a fraction of the cost of on-demand instances, but the risk is oversubscription.
We’ve had to educate our end users (Data Analysts, Product Analysts, and Data Scientists). The cost savings from spot instances are real, but they do require buy-in from the business.
Any spot best practices that you have applied and found especially valuable (e.g., diversification, instance flexibility, using instance fleet)?
We have several best practices that we follow at Eventbrite with regard to spot instances:
- Our standard practice is to request older instance types (m4.2xlarge rather than m5.2xlarge). We’re not looking for the latest and greatest instances. This gives us the best bang for the buck!
- We’re constantly monitoring the demand for instance types per availability zone. This allows us to choose the correct availability zone for spot instances at the best price.
- We find the availability zones that have the most excess hardware. We look at pricing trends per availability zone.
- Some availability zones have higher prices (because demand is higher or because the data center has less hardware).
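The practice of comparing pricing trends per availability zone can be sketched as follows. This is a minimal illustration, assuming the price history has already been collected (for example from the EC2 DescribeSpotPriceHistory API) into a dictionary. The `cheapest_zone` helper, the zone names, and the prices are all made up for the example.

```python
from statistics import mean
from typing import Dict, List


def cheapest_zone(price_history: Dict[str, List[float]]) -> str:
    """Return the availability zone with the lowest average spot price."""
    # Skip zones with no samples so mean() is never called on an empty list.
    averages = {zone: mean(prices) for zone, prices in price_history.items() if prices}
    if not averages:
        raise ValueError("no price data available")
    return min(averages, key=averages.get)


# Hypothetical hourly spot prices (USD) for a single instance type.
history = {
    "us-east-1a": [0.13, 0.14, 0.15],
    "us-east-1b": [0.12, 0.12, 0.13],
    "us-east-1c": [0.16, 0.18, 0.17],
}
print(cheapest_zone(history))  # prints "us-east-1b"
```

Averaging over a window rather than taking the latest price smooths out short spikes. Price alone says nothing about how often instances are reclaimed, so interruption history is worth tracking alongside it.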
What advice would you give someone in the audience starting on the spot adoption journey for EMR/ big data?
Be ready for things to break with AWS spot instances.
- You need to prepare for interruptions.
- Think carefully about what your workloads will be and which of them are more resilient. The more resilient ones are strong candidates to run on spot instances.
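One way to prepare for interruptions is to make each job idempotent and simply re-run it when an instance disappears. The sketch below shows that pattern in isolation. `SpotInterruption`, `run_with_retries`, and `flaky_job` are hypothetical names invented for this example, and the interruption is simulated rather than detected from AWS.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SpotInterruption(Exception):
    """Hypothetical error raised when the instance running a task is reclaimed."""


def run_with_retries(task: Callable[[], T], max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> T:
    """Re-run an idempotent task until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except SpotInterruption:
            if attempt == max_attempts:
                raise  # give up and surface the last interruption
            time.sleep(delay_seconds)
    raise ValueError("max_attempts must be at least 1")


# Simulated job that is interrupted twice before completing.
attempts = {"count": 0}


def flaky_job() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise SpotInterruption("instance reclaimed")
    return "done"


print(run_with_retries(flaky_job))  # prints "done"
```

Retrying blindly is only safe because the task is idempotent: running it again after a partial run must produce the same outcome as running it once.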
What do you see in the future for big data in the cloud from a cost optimization perspective? E.g., serverless/ other technologies?
A few big data trends that offer huge cost savings are:
- Containers: portable, reproducible, and cost-effective
- Apache Mesos, Google Kubernetes, and Amazon Elastic Container Registry