Address: Classrooms 4/5, Building C, Yahoo Sunnyvale campus
Start time: 6:30pm (food and drinks thanks to Yahoo), talks start at 7pm
This meetup will cover multiple topics related to Spark MLlib:
- Spark MLlib Overview
- Scalable Distributed Decision trees in MLlib
- Model search at scale
- Scalable machine learning at Yahoo
Spark MLlib Overview: MLlib is Apache Spark’s scalable machine learning library. This talk will provide a brief overview of MLlib’s functionality, its interplay with other Spark components, and its future directions.
Scalable Distributed Decision trees in MLlib: Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Their popularity stems in part from their interpretability, strong empirical accuracy, and ability to capture non-linearities and feature interactions. In this talk, we will describe MLlib's implementation of tree algorithms, which can handle massive datasets. We will also demonstrate how the decision tree implementation can be used as a building block for ensemble methods like boosting and random forests, both of which will soon be added to MLlib.
Model search at scale: MLlib is a terrific library for fitting large-scale machine learning models. However, translating a vague problem statement like “learn a classifier” into a working model presently requires significant manual effort (via ad hoc parameter tuning) and computational resources (to fit several models). We present our work on the MLbase optimizer, a system built on top of Spark to quickly and automatically search through a hyperparameter space and find a good model. By leveraging performance enhancements, better search algorithms, and statistical heuristics, our system offers an order-of-magnitude speedup over standard methods.
Scalable machine learning at Yahoo: Yahoo’s emerging business demands machine learning that is massively scalable (millions of examples and features), has low latency (model training in seconds or minutes), and supports a variety of algorithms. This talk highlights our latest efforts on scalable machine learning through two use cases: Flickr auto-tagging and Search Ranking. We leverage big-data technologies (Hadoop, Spark and Storm), develop novel solutions for robust and scalable learning, and collaborate with the open-source community.
Manish Amde is a software engineer at Origami Logic developing machine learning and information retrieval algorithms for their marketing intelligence platform. Prior to Origami Logic, he worked at two other startups focusing on large-scale signal processing problems. His past research spans multiple fields in electrical and computer engineering and has led to several papers, patents and a book chapter. He holds a bachelor’s degree from IIT Bombay and a doctorate from UC San Diego.
Hirakendu Das is a research scientist at Yahoo Labs. His research centers on the development and application of scalable machine learning algorithms for advertising systems and web data analytics at large. Prior to joining Yahoo, he received his Ph.D. from the University of California, San Diego, and his B.Tech. from the Indian Institute of Technology Madras.
Andy Feng is a Distinguished Architect at Yahoo leading the architecture and design of next-generation Big Data platforms as well as machine learning initiatives. He is a PPMC member and committer of the Apache Storm project and a contributor to the Apache Spark project. He served as a track chair and program committee member at the Hadoop Summit and Spark Summit in both 2013 and 2014.
Evan Sparks is a PhD student in the Computer Science Division in the UC Berkeley AMPLab. His research focuses on the design and implementation of distributed systems for large-scale data analysis and machine learning. Prior to Berkeley, he spent several years in industry tackling large-scale data problems as a Quantitative Financial Analyst at MDT Advisers and as a Product Engineer at Recorded Future. He holds a bachelor’s degree from Dartmouth College.
Ameet Talwalkar is a postdoctoral fellow in the UC Berkeley AMPLab and a consultant at Databricks. His research addresses scalability and ease-of-use issues in the field of statistical machine learning, with applications related to large-scale genomic sequencing. He started the MLlib project in Apache Spark and is also a co-author of the graduate-level textbook entitled “Foundations of Machine Learning” (2012, MIT Press). Next year he will join UCLA’s Computer Science Department as an Assistant Professor.