Distributed Data Processing in the RCC HPC Platform

By Research Computing Center

This workshop will provide an overview of distributed data processing techniques and how to apply them with commonly used Python packages.

Date and time

Location

John Crerar Library - Kathleen A. Zar Room

5730 South Ellis Avenue, Chicago, IL 60637

Good to know

Highlights

  • 2 hours
  • In person

About this event

The need to process large datasets and perform computationally intensive calculations efficiently is common across disciplines. To scale up your calculations to datasets larger than tens of GB and/or comprising thousands of data files, it is necessary to employ distributed data parallel approaches, in which individual processing units handle separate data chunks and communicate results only when necessary. Understanding the key concepts of data parallel processing and the modern software tools available on the RCC system will help users take full advantage of our HPC platforms.

This workshop will provide an overview of distributed data processing techniques and how to apply them with commonly used Python packages. We will go through typical use cases and discuss how efficient parallelization helps speed up your calculations substantially.
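As a small taste of the data parallel pattern described above, here is a minimal sketch using Python's standard multiprocessing module. It is an illustration only, not workshop material: the function names (split_into_chunks, partial_sum_of_squares, parallel_sum_of_squares), the toy workload, and the worker count are all hypothetical. Each worker handles its own chunk independently, and only the small partial results are communicated back.

```python
from multiprocessing import Pool


def split_into_chunks(data, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """Work done independently by one worker on its own chunk (toy workload)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Distribute chunks to worker processes and combine the partial results."""
    chunks = split_into_chunks(data, n_workers)
    with Pool(processes=n_workers) as pool:
        # Each worker receives one chunk; only one number per chunk comes back.
        partials = pool.map(partial_sum_of_squares, chunks)
    return sum(partials)


if __name__ == "__main__":
    data = list(range(1000))
    print(parallel_sum_of_squares(data, n_workers=4))
```

The same split, process, and combine structure carries over to multi-node tools such as mpi4py, where the chunks would be assigned to MPI ranks instead of local processes.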

Objectives:

After the workshop, attendees will be able to:

  • Understand the key concepts of data parallel processing, such as multithreading and multiprocessing, strong scaling performance, and data parallelism in machine learning model training.
  • Become familiar with commonly used Python packages, such as multiprocessing, h5py, and mpi4py, for distributed data parallel processing on a single node and across multiple nodes.
  • Apply the PyTorch DistributedDataParallel module in illustrative examples.
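For readers new to the term, strong scaling measures how the run time of a fixed-size problem changes as more workers are added. The sketch below shows the standard definitions of speedup and parallel efficiency, plus the Amdahl's law upper bound. The function names and all timings in the example are hypothetical and are not measurements from RCC systems.

```python
def speedup(t_serial, t_parallel):
    """Speedup: run time on one worker divided by run time on n workers."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_workers):
    """Efficiency: speedup divided by worker count (1.0 is ideal scaling)."""
    return speedup(t_serial, t_parallel) / n_workers


def amdahl_speedup(parallel_fraction, n_workers):
    """Amdahl's law: upper bound on speedup when only a fraction
    of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    # Hypothetical timings: 100 s on 1 worker, 25 s on 8 workers.
    print(speedup(100.0, 25.0))
    print(parallel_efficiency(100.0, 25.0, 8))
    # Bound for a code that is 90% parallelizable, run on 10 workers.
    print(amdahl_speedup(0.9, 10))
```

An efficiency well below 1.0 usually indicates that communication or serial sections are starting to dominate the run time.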

Please bring your laptop. Attendees should have basic familiarity with Python programming. An RCC account is helpful, but not required.

Level: Intermediate


Free
Oct 23 · 1:00 AM CDT