Main talk (50-mins):
Stream processing with R in AWS
by Gergely Daróczi
Abstract: R is rarely mentioned among the big data tools, although it scales fairly well for most data science problems and ETL tasks. This talk presents an open-source R package to interact with Amazon Kinesis via the MultiLangDaemon bundled with the Amazon KCL, starting multiple R sessions on a machine or cluster of nodes to process data from theoretically any number of Kinesis shards. Besides the technical background and a quick introduction to how Kinesis works, this talk will feature some stream processing use-cases at CARD.com, and will also provide an overview and hands-on demos of the related data infrastructure built on top of Docker, Amazon ECS, ECR, KMS, Redshift and a bunch of third-party APIs.
Bio: Gergely Daróczi is an enthusiastic R user and package developer, founder of an R-based web application at rapporter.net, Ph.D. candidate in Sociology, and Director of Analytics at CARD.com with a strong interest in designing a scalable data platform built on top of R, AWS and a dozen APIs. He maintains several CRAN packages, mainly dealing with reporting and API integrations, co-authored a number of journal articles in the social and medical sciences, and recently published the book "Mastering Data Analysis with R" with Packt.
Lightning talk (10-mins):
Opening the black box: Attempts to understand the results of machine learning models
by Michael Tiernay
Abstract: Sophisticated machine learning models (like GBMs and Neural Networks) produce better predictions than simpler models (like linear or logistic regression), but sophisticated models do not produce interpretable 'effects' that specify the relationship between predictors and an outcome. This is because sophisticated models can learn non-linear, interactive, or even higher-level relationships between the predictors and outcome without these being explicitly specified.
In many settings it is important to understand, as best as possible, how 'black box' models are producing their predictions, because:
1. If users do not understand how a prediction is being made, they may not trust the model/prediction enough to act upon the model's suggestions
2. Significant business value can be derived from understanding what drives an outcome of interest (e.g. purchase or churn) in order to make product changes that accentuate desired effects or minimize undesired ones
3. Understanding how predictors relate to an outcome can inform subsequent feature generation that can improve a model's predictive power
This talk will discuss two methods that have been proposed to better understand machine learning models: simulating changes in input variables (the R ICEbox package) and building a simpler model locally around specific predictions (the Python LIME package).
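To give a flavor of the second approach, here is a minimal, self-contained sketch of the local-surrogate idea behind LIME. This is a hypothetical toy, not the actual LIME package API: it probes an opaque model around one point and fits a weighted linear approximation, so the fitted slope acts as an interpretable local "effect" of the feature.

```python
import math

def black_box(x):
    # Stand-in for an opaque model's prediction as one feature varies
    # (in practice this would be e.g. a GBM or neural network).
    return x ** 2

def local_slope(f, x0, radius=0.5, n=51, kernel_width=0.25):
    """Fit a weighted least-squares line to f on a grid around x0.

    Points closer to x0 get larger (Gaussian) weights, so the returned
    slope describes the model's behavior locally, not globally.
    """
    xs = [x0 - radius + 2 * radius * i / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    return cov / var

# Near x0 = 1, the quadratic black box behaves like a line with slope 2.
print(round(local_slope(black_box, x0=1.0), 4))
```

The real LIME package does the analogous thing in many dimensions, with random perturbations and sparse linear surrogates; the ICEbox approach instead varies one input across its range for each observation and plots the resulting prediction curves.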
Bio: Mike is a data scientist on the R&D team at Edmunds.com (currently hiring!). In a previous life, Mike earned a Ph.D. from NYU in Political Science with a focus on examining civil conflict throughout the world from an econometric perspective.
– 6:30pm arrival, food/drinks and networking
– 7:30pm talks
– 9:00pm more networking
You must have a confirmed RSVP, and please arrive by 7:25pm at the latest. Please RSVP here on Eventbrite.
Venue: Edmunds, 2401 Colorado Avenue (this is Edmunds' NEW LOCATION, don't go to the old one)
Park underneath the building (Colorado Center), Edmunds will validate.