Data Scientist Happy Hour: Machine Learning in Apache Spark and Apache Hadoop
Thursday, August 14, 2014 from 6:00 PM to 8:00 PM (PDT)
Palo Alto, CA
Please join us in Cloudera's Palo Alto offices on the evening of Aug. 14 for a couple hours of pizza, beer, and machine learning!
6:00 - 6:30pm - Networking with pizza and beer
6:30 - 7:00pm - An Introduction to Apache Spark through Clustering for Anomaly Detection (Sandy Ryza, Cloudera)
Cloudera's open source platform, CDH, recently added Apache Spark, a general-purpose distributed execution engine with a set of properties that make it ideal for advanced analytics. The talk will introduce the power of the Spark programming model and Spark's machine learning library, MLlib, through a use case in anomaly detection with k-means clustering.
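The talk material itself isn't part of this announcement, but the general technique is straightforward: cluster the "normal" data, then score each point by its distance to the nearest cluster center, flagging the farthest points as anomalies. A minimal plain-Python sketch (in practice you would use MLlib's KMeans on an RDD; all names and data here are illustrative):

```python
def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean(pts):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=20):
    # Initialize from the first k points (fine for a sketch; MLlib
    # uses the smarter k-means|| initialization).
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster emptied.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def anomaly_score(point, centroids):
    # Distance to the nearest centroid; large values flag likely anomalies.
    return min(dist(point, c) for c in centroids)

# Two tight, well-separated blobs of "normal" points, interleaved so the
# first two points seed different clusters.
blob_a = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
blob_b = [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(3) for j in range(3)]
points = [p for pair in zip(blob_a, blob_b) for p in pair]

centroids = kmeans(points, k=2)
print(anomaly_score((0.1, 0.1), centroids))    # small: near a centroid
print(anomaly_score((10.0, 10.0), centroids))  # large: likely anomaly
```

The key design choice is that k-means needs no labels: anything far from every learned center is suspicious, which is what makes the approach attractive for network or systems data where labeled attacks are rare.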
Sandy Ryza is a data scientist at Cloudera. He recently led Cloudera's Apache Spark development, and is a member of the Apache Hadoop project management committee.
7:00 - 7:30pm - Extending ML with Apache Spark (DB Tsai, Alpine Data Labs)
Spark is rapidly catching fire with the machine learning and data science community for a number of reasons. Chief among them, it makes it possible to extend and enhance machine learning algorithms to a degree we’ve never seen before. In this talk, we’ll give examples from two areas where Alpine Data Labs has contributed to the Spark project:
1. Sequoia Forest
Random Forest is a popular workhorse of machine learning. However, the individual decision trees that aggregate into a Random Forest are difficult to train on big data, because (1) decision trees are non-parametric models, so their complexity and size tend to grow with the size of the data, and (2) in Random Forests, trees are rarely pruned or regularized. Here, we present Sequoia Forest, a Spark-based distributed implementation of Random Forest. We show that a Sequoia Forest consisting of hundreds of trees with millions of nodes can be trained on billions of rows with thousands of features in a reasonable amount of time. Additionally, to justify building fully grown trees on big data, we show that Sequoia Forest usually outperforms Random Forests built from smaller, pruned trees.
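Sequoia Forest itself is a distributed Spark implementation, but the core Random Forest recipe it scales up can be shown in a few lines: train many trees, each on a bootstrap sample and a random subset of features, then take a majority vote. A toy single-machine sketch using one-split trees (stumps) in place of fully grown trees, with all names illustrative:

```python
import random
from collections import Counter

def fit_stump(X, y, feats):
    """Best one-split decision tree over the given candidate features."""
    best = None
    for f in feats:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lbl for row, lbl in zip(X, y) if row[f] <= thr]
            right = [lbl for row, lbl in zip(X, y) if row[f] > thr]
            # Misclassifications if each side predicts its majority label.
            err = (len(left) - max(Counter(left).values()) +
                   len(right) - max(Counter(right).values()))
            if best is None or err < best[0]:
                best = (err, f, thr,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no split possible: predict the majority label everywhere
        maj = Counter(y).most_common(1)[0][0]
        return (feats[0], float("inf"), maj, maj)
    return best[1:]  # (feature, threshold, left_label, right_label)

def stump_predict(tree, row):
    f, thr, left_label, right_label = tree
    return left_label if row[f] <= thr else right_label

def random_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    m = max(1, int(d ** 0.5))  # features considered per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(d), m)             # random feature subset
        forest.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx], feats))
    return forest

def forest_predict(forest, row):
    votes = Counter(stump_predict(t, row) for t in forest)
    return votes.most_common(1)[0][0]

# Two linearly separable classes in two features.
X = ([(0.1 * i, 0.2 * i) for i in range(8)] +
     [(3 + 0.1 * i, 3 + 0.2 * i) for i in range(8)])
y = [0] * 8 + [1] * 8
forest = random_forest(X, y)
```

The abstract's point is that the hard part is not this recipe but growing deep, unpruned trees over billions of rows, which is what forces a distributed implementation like Sequoia Forest.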
2. Multinomial Logistic Regression
Logistic Regression can be used to model not only binary outcomes but also, with some extension, multinomial outcomes. In this talk, DB will walk through the basic idea of binary logistic regression step by step, and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm, using the in-memory RDD cache to scale horizontally (in the number of training examples). However, there are mathematical limitations on scaling vertically (in the number of training features), while many recent applications, such as document classification and computational linguistics, are of exactly this type. He will discuss how to address this problem with an L-BFGS optimizer in place of a Newton optimizer.
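To make the iterative structure concrete, here is the binary case in plain Python, fit by batch gradient descent. This is a sketch, not the talk's implementation: the talk advocates L-BFGS (and Spark parallelizes the per-example gradient sums across the cluster), while the multinomial extension replaces the sigmoid with a softmax over classes. All names and data below are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=300):
    """Binary logistic regression fit by batch gradient descent.
    Each epoch sums per-example gradients -- the step Spark distributes."""
    w = [0.0] * len(X[0])  # one weight per feature
    b = 0.0                # bias term
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for row, label in zip(X, y):
            # Prediction error drives the gradient of the log-loss.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for j, xj in enumerate(row):
                gw[j] += err * xj
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict_prob(model, row):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)

# Toy 1-D data: the label flips from 0 to 1 around x = 1.5.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0, 0, 1, 1]
model = train_logistic(X, y)
```

The vertical-scaling problem the abstract mentions is visible here: a Newton step would require building and inverting a matrix that grows quadratically with the number of features, whereas L-BFGS approximates that curvature from a short history of gradients, keeping memory linear in the feature count.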
Finally, we’ll tie all of this together with an open dialogue led by our Head of Engineering, Chester Chen, who will describe our best-known methods and key learnings from development. We’ll also have our Head of Product Marketing, Joel Horwitz, who will provide insight into how customers are starting to see the benefits and value of Spark in the enterprise.
DB Tsai is a Machine Learning Engineer at Alpine Data Labs. His current focus is on Big Data, Data Mining, and Machine Learning. He uses Hadoop, Spark, Mahout, and a range of machine learning algorithms to build powerful, scalable, and robust cloud-driven applications. His favorite programming languages are Java, Scala, and Python. DB is a Ph.D. candidate in Applied Physics at Stanford University (currently on a leave of absence). He holds a Master’s degree in Electrical Engineering from Stanford University and a Master’s degree in Physics from National Taiwan University.