Thursday, May 15

Large Language Model Inference

By Research Computing Center

Date and time

Thursday, May 15 · 2 - 4pm CDT

Location

John Crerar Library - Kathleen A. Zar Room

5730 South Ellis Avenue Chicago, IL 60637

About this event

Event lasts 2 hours

In recent years, large language models (LLMs) like GPT, BERT, and Llama have dramatically advanced natural language processing (NLP), enabling tasks ranging from text generation to sophisticated semantic understanding. Central to using these models effectively is inference: the act of generating predictions from a trained LLM. Because LLMs carry substantial computational demands, efficient inference is critical to applying them in real-world settings.

This workshop provides an in-depth introduction to large language model inference, beginning with fundamental concepts such as the transformer architecture, attention mechanisms, and decoding strategies, including top-k and top-p sampling. Participants will explore how inference operates at scale, learn methods to optimize model speed and efficiency, and examine best practices for deploying models in production environments. We will highlight popular frameworks, including Hugging Face Transformers, and inference optimization techniques such as model quantization.
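
For readers unfamiliar with the decoding strategies named above, the following is a minimal, dependency-free sketch of top-k and top-p (nucleus) filtering over a single next-token distribution. It is an illustration only, not workshop material: the function names `top_k_top_p_filter` and `sample_token`, and the choice to apply top-k before top-p, are assumptions made for this example.

```python
import random


def top_k_top_p_filter(probs, k=0, p=1.0):
    """Keep the k most likely tokens, then the smallest prefix of those
    whose renormalized cumulative probability reaches p. Returns a
    renormalized distribution with zeros for discarded tokens.
    k=0 disables top-k; p=1.0 disables top-p."""
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k > 0:
        order = order[:k]

    # Probability mass that survives top-k, used to renormalize for top-p.
    mass_after_k = sum(probs[i] for i in order)

    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass_after_k
        if cumulative >= p:
            break

    kept_mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_mass
    return filtered


def sample_token(probs, k=0, p=1.0, rng=None):
    """Draw one token index from the filtered distribution."""
    rng = rng or random.Random()
    filtered = top_k_top_p_filter(probs, k=k, p=p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


# Toy distribution over a hypothetical four-token vocabulary.
next_token_probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_top_p_filter(next_token_probs, k=3, p=0.8)
token = sample_token(next_token_probs, k=3, p=0.8, rng=random.Random(0))
```

With these toy values, only the two most likely tokens survive filtering, so sampling can never return the low-probability tail.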

The workshop emphasizes hands-on applications, showcasing practical inference scenarios such as text generation and question-answering. Attendees will engage with live demonstrations using Hugging Face Transformers and interactive coding sessions designed to illustrate inference optimization strategies, including GPU acceleration, quantization, and batching for maximum throughput.
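
As a rough illustration of what quantization means in this context, the sketch below applies symmetric 8-bit quantization to a short list of weights in plain Python. It is not the method used in the workshop or in any particular library: the helper names `quantize_int8` and `dequantize` and the single per-list scale factor are assumptions chosen to keep the example small.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale.
    Returns the integer values and the scale needed to recover floats."""
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer values."""
    return [q * scale for q in quantized]


# Hypothetical weights; real models quantize whole tensors at once.
weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per weight.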

By the end of this workshop, participants will have:

- A solid grasp of inference fundamentals and optimization techniques for LLMs.

- Hands-on experience deploying and accelerating inference tasks using popular tools.

- Insights into practical considerations for deploying LLM inference solutions at scale.

Prerequisites: Basic knowledge of machine learning (ML), deep learning (DL), PyTorch, and Transformers.

Length: 2 Hours

Level: Intermediate/Advanced

Organized by

Research Computing Center