
Practical Methods for Overcoming the Machine Learning Data Bottleneck (Aust...
Event Information
Description
Based on feedback from the community, we are announcing a new course for ML practitioners. The course was designed by and will be taught by Jonathan Mugan.
Practical Methods for Overcoming the Machine Learning Data Bottleneck
Course Abstract
Machine learning is powerful, but it can be hard to reap its benefits without large amounts of labeled training data. Labeling data by hand can be time-consuming, expensive, and impractical, and sometimes you don’t even have sufficient examples to label, especially of the rare events that are most important. This class will provide practical methods to overcome this data bottleneck. You will learn how to use heuristics to label data automatically, and you will learn how to generate synthetic training examples of rare events using generative adversarial networks (GANs). You will also learn other data augmentation approaches and methods for training models when the training data is imbalanced. The class will also cover how to use machine learning when you only have one or a few examples.
Course Outline
A lack of high-quality training data is a common impediment to applying machine learning to practical problems. This deficiency can arise because we lack labels for classification, or because we lack sufficient examples of what we care about. You will leave this class with knowledge and access to code that addresses both problems. The methods learned in this class will be applicable to text, images, and other domains, both structured and unstructured.
1. Data programming with Snorkel. We will show how to automatically generate training labels by encoding domain knowledge. In addition, we will show how to apply Snorkel to standard classification problems, which, to our knowledge, is not covered anywhere else.
2. Learning from imbalanced classes with the scikit-learn contrib package imbalanced-learn. We will show how to do machine learning even when the case you care about is drowned out by the rest of the data.
3. Creating synthetic training data using Generative Adversarial Networks (GANs). We will show how to use deep learning to generate more variations of rare examples.
4. Classifying examples into many classes, with only a few examples of each class, using Siamese networks. We will show how to learn to compare entities for similarity, where the similarity space itself is learned. This method is used for automatic face recognition, but it can be applied to many domains.
5. Other topics including unsupervised learning, semi-supervised learning, and using pre-trained models.
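To give a flavor of item 1, here is a minimal sketch of the idea behind data programming: instead of hand-labeling every example, you write heuristic labeling functions and combine their noisy votes. The real Snorkel library learns a model of each labeling function's accuracy; this sketch substitutes a simple majority vote, and the labeling functions and labels below are invented for illustration.

```python
# Sketch of data programming with heuristic labeling functions.
# Snorkel itself learns labeling-function accuracies; here we take
# a plain majority vote over the functions that do not abstain.
from collections import Counter

ABSTAIN, SPAM, HAM = -1, 1, 0

def lf_contains_free(text):
    """Heuristic: 'free' in the message suggests spam."""
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_money(text):
    """Heuristic: a dollar sign suggests spam."""
    return SPAM if "$" in text else ABSTAIN

def lf_short_message(text):
    """Heuristic: very short messages tend to be legitimate."""
    return HAM if len(text.split()) < 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_money, lf_short_message]

def weak_label(text):
    """Majority vote over the non-abstaining labeling functions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

print(weak_label("Claim your FREE prize, send $10 now"))  # → 1 (SPAM)
print(weak_label("see you at lunch"))                     # → 0 (HAM)
```

The weakly labeled output can then be used to train an ordinary discriminative classifier, which is the workflow the course covers with Snorkel proper.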
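For item 2, the simplest rebalancing strategy in imbalanced-learn is random oversampling (its `RandomOverSampler`): duplicate minority-class examples until every class is the size of the largest one. The stdlib sketch below reimplements that idea on an invented toy dataset rather than calling the library.

```python
# Sketch of random oversampling for imbalanced classes: duplicate
# minority-class examples (sampled with replacement) until all classes
# match the size of the majority class.
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Return (X, y) resampled so every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            i = rng.choice(idx)  # pick a minority example to duplicate
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.4], [5.0]]  # four "normal" points, one rare event
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # → Counter({0: 4, 1: 4})
```

Oversampling is only one option; the course also covers model-side remedies such as class weighting, which imbalanced-learn and scikit-learn support directly.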
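For item 4, the core of the Siamese idea is that the *same* embedding function is applied to both inputs, and a distance in the resulting space decides similarity. In the sketch below the "embedding" is a fixed toy character-frequency function rather than a trained network; in the course, that function would be a neural network trained on pairs so that the similarity space itself is learned.

```python
# Sketch of the Siamese-network decision rule: embed both inputs with a
# shared function, then threshold the distance between the embeddings.
# The embedding here is a hypothetical stand-in, not a trained model.
import math

def embed(s):
    """Toy shared embedding: normalized character-frequency vector over a-z."""
    vec = [0.0] * 26
    for ch in s.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def distance(a, b):
    """Euclidean distance between the shared embeddings of two inputs."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(embed(a), embed(b))))

def same_class(a, b, threshold=0.5):
    """One-shot-style decision: 'same' if the embeddings are close enough."""
    return distance(a, b) < threshold

print(same_class("kitten", "kitten"))  # → True
print(same_class("aaaa", "zzzz"))      # → False
```

Because the comparison needs only one stored example per class, this setup extends to new classes without retraining, which is what makes it attractive for face recognition and other few-shot problems.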
Requirements
The only prerequisite for this course is some programming or scripting experience. We will be using Python with Jupyter Notebook.
About the Instructor
Jonathan Mugan (LinkedIn) is a researcher specializing in artificial intelligence, machine learning, and natural language processing. His current research focuses on deep learning for natural language generation and understanding. Dr. Mugan received his Ph.D. in Computer Science from the University of Texas at Austin. His thesis centered on developmental robotics, an area of research that seeks to understand how robots can learn about the world in the same way that human children do. Dr. Mugan also held a post-doctoral position at Carnegie Mellon University, where he worked at the intersection of machine learning and human-computer interaction. One of the most requested speakers at the Data Day conferences, he also recently spoke on the topic of NLP at the O’Reilly AI conference, and he is the creator of the O’Reilly video course Natural Language Text Processing with Python. Dr. Mugan is also the author of The Curiosity Cycle: Preparing Your Child for the Ongoing Technological Explosion.
Registration at:
https://overcoming-ml-bottlenecks.eventbrite.com