Tokenization in NLP: From Basics to Advanced Techniques


Explore tokenization's pivotal role in language model training.

By Data Science Dojo

Date and time

Wednesday, June 26 · 10am - 12pm PDT

Location

Online

About this event

  • 2 hours

Have you ever wondered how machines understand the nuances of human language? It all starts with tokenization, the foundational step in training language models to grasp our complex languages.

In this live session, you will unravel the mysteries behind tokenization in natural language processing (NLP). From the initial challenges of segmenting text into manageable pieces to the sophisticated techniques that enable deeper language understanding, this talk is tailored for enthusiasts eager to deepen their knowledge and refine their skills in NLP.

Whether you're just starting out or looking to brush up on the latest in NLP, this session promises a blend of foundational knowledge and advanced insights, all presented in an accessible and engaging format.


  1. Understand tokenization's impact on language models.
  2. Learn how to split text into tokens for deeper analysis.
  3. Explore Byte Pair Encoding's efficiency.
  4. Discover sliding windows for better training data.
  5. Learn about converting tokens into vectors.
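
For a rough preview of three of the topics listed above (text splitting, Byte Pair Encoding, and sliding windows), here is a minimal Python sketch. It is not material from the session itself: the function names, the regex, and the toy inputs are illustrative assumptions, and production tokenizers handle many more edge cases.

```python
import re
from collections import Counter


def split_text(text):
    """Split text into word and punctuation tokens (a simple regex splitter)."""
    return re.findall(r"\w+|[^\w\s]", text)


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Apply one Byte Pair Encoding merge: fuse every occurrence of `pair`."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


def sliding_windows(token_ids, context_size, stride=1):
    """Build (input, target) pairs where the target is the input shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - context_size, stride):
        inputs = token_ids[start:start + context_size]
        targets = token_ids[start + 1:start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs


if __name__ == "__main__":
    print(split_text("Hello, world!"))

    # Toy corpus (hypothetical): each word starts as a list of characters.
    words = [list("low"), list("lower"), list("lowest")]
    pair = most_frequent_pair(words)
    print(pair, merge_pair(words, pair))

    # Token IDs here are made up purely to show the windowing.
    print(sliding_windows([1, 2, 3, 4, 5], context_size=3))
```

Repeating the pair-counting and merge steps grows a subword vocabulary one merge at a time, while the sliding window turns one long token sequence into many overlapping next-token training examples.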

Organized by

At Data Science Dojo, we're extremely passionate about data science. We've helped educate and train 10,000+ employees from more than 2,500 companies globally, including many leaders in tech like Microsoft, Apple, and Facebook.