2025 ASA Traveling Course: Tree-Based Machine Learning Methods for Prediction and Variable Selection
Format: Virtual
Instructors:
Hemant Ishwaran, Professor, Biostatistics, University of Miami
Min Lu, Research Associate Professor, Biostatistics, University of Miami
Abstract:
Tree-based machine learning methods offer several benefits in data analysis: they capture non-linear relationships, are robust, scale well, and handle mixed data types. This course emphasizes practical learning through hands-on code examples and interpretation of results, both essential for understanding and applying these techniques. Using the widely used R package "randomForestSRC", we will present methods for computing predicted outcomes, variable importance indices and other inference estimates. In addition, we will introduce a new model-independent variable selection method, called rule-based variable priority, and present its implementation in the R package "varPro". These analyses will cover several types of outcomes, including continuous, categorical, multivariate, survival and competing risk outcomes. Each topic will use real-world datasets from medicine and public health and will include hands-on code, working examples and result interpretations. We will also provide code for visualizing model results and constructing coefficient tables for interpretation, and address scenarios such as imbalanced classes, unsupervised problems, fast implementation on big data and protection of confidential data.
Agenda
Morning (8:00AM - 12:00PM)
- 8:00AM - 8:05AM Introduction
- 8:05AM - 9:15AM Part I – Training: a brief overview, a quick start, and training (grow) examples in regression, classification, and survival.
- 9:15AM - 10:00AM Part II – Inference and Prediction: inference (OOB), prediction error, prediction, restore, and partial plots.
- 10:00AM - 10:15AM Break
- 10:15AM - 11:10AM Part III – Variable Selection: VIMP, subsampling (confidence intervals), minimal depth, and VarPro.
- 11:10AM - 12:00PM Part IV – Advanced Examples: class-imbalanced data, competing risks, multivariate forests, and missing data imputation.