Massively Parallel Algorithms - SS 2024

About this Course

The era of performance gains from faster single cores and hand-optimized single-core programs has ended. Instead, highly parallel GPUs, initially developed for shading, can now run hundreds or thousands of threads in parallel. Consequently, they are increasingly being adopted to offload and augment conventional (albeit multi-core) CPUs. And the technology is getting better, faster, and cheaper. It will probably even become the general-purpose computing processor on mobile devices, because it offers more processing power per unit of energy.

The high number of parallel cores, however, poses a great challenge for software and algorithm design: to benefit from the new hardware architecture, programs must expose massive parallelism. The main purpose of this lecture is to teach practical algorithm design for such parallel hardware.

Simulation is widely regarded as the third pillar of science (in addition to experimentation and theory), and it has an ever-increasing demand for high-performance computing. The latter has received a boost with the advent of GPUs, and it is even becoming -- to some extent -- a commodity.

There are many scientific areas where the knowledge you will gain in this course can be valuable and useful to you.

In this course, you will get hands-on experience in developing software for massively parallel computing architectures. For the first half of the lecture, there will be supervised exercises to familiarize yourself with the CUDA parallel programming model and environment. The exercises will comprise, for instance, image processing algorithms such as you might find in Photoshop. In the second half, you can decide whether to continue with the exercises or whether to work on a small self-assignment for the rest of the course.
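
To give you a first impression of the programming model, here is a minimal CUDA sketch (a simplified illustration, using unified memory to keep it short) that adds two vectors with one thread per element:

    #include <cstdio>
    #include <cuda_runtime.h>

    // each thread computes one element of the sum
    __global__ void vecAdd( const float *a, const float *b, float *c, int n )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n )
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged( &a, n * sizeof(float) );   // unified memory keeps the
        cudaMallocManaged( &b, n * sizeof(float) );   // example short; cudaMalloc +
        cudaMallocManaged( &c, n * sizeof(float) );   // cudaMemcpy works just as well
        for ( int i = 0; i < n; i++ ) { a[i] = 1.0f; b[i] = 2.0f; }
        vecAdd<<< (n + 255) / 256, 256 >>>( a, b, c, n );   // grid of 256-thread blocks
        cudaDeviceSynchronize();
        printf( "c[0] = %f\n", c[0] );                // expect 3.0
        cudaFree( a );  cudaFree( b );  cudaFree( c );
        return 0;
    }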

Prerequisites are:

  1. A little bit of experience with C/C++; note that we will need just plain old C during this course, but the concept of pointers should be familiar (see the little snippet below this list).
  2. A liking for algorithmic thinking.
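
If this minimal C snippet holds no surprises for you, your pointer knowledge is sufficient:

    #include <stdio.h>

    int main( void )
    {
        int  x = 42;
        int *p = &x;            /* p holds the address of x */
        *p = 7;                 /* writing through p changes x */
        printf( "%d\n", x );    /* prints 7 */
        return 0;
    }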

Useful, but not required, is a computer/notebook with an Nvidia GPU that is CUDA-capable. You can find a list of all supported GPUs here. If you don't have access to such a computer, you are welcome to work on your assignments in our lab.

Not required are

Some of the envisioned topics (these are subject to change during the semester):

  1. Simple parallel program models and laws
  2. Introduction to CUDA and massively parallel programming
  3. Dense matrix algorithms
  4. Parallel prefix-sum
  5. Parallel sorting
  6. Dynamic parallelism
  7. Introduction to Thrust
  8. Parallel sphere packing
  9. Parallel hashing

All topics are accompanied and illustrated by lots of application examples.

News

Discord: https://discord.gg/YGUZFxf
I encourage you to use this for discussions and Q&A amongst yourselves. (Rest assured that I don't have time to "listen in" on your chatting on Discord ;-) )

Slides

The following table contains the topics in more detail, the accompanying slides, and the assignments (it will be filled step-by-step after the respective lectures).

Week 1
  Topics: Orga. Introduction: More Moore, stream programming model, Flynn's taxonomy, definition of speedup, Amdahl's law, Gustafson's law (sketch below the table). Fundamental Concepts of Parallel Programming 1: terminology, control flow, transferring data, blocks, data flow in general.
  Slides: PDF1, PDF2, PDF3
  Assignments: Assignment 1, Organisational Remarks
  Frameworks: CUDA_examples_cmake.zip, cudaMallocAndMemcpy.zip

Week 2
  Topics: Fundamental Concepts 2: multi-dimensional blocks and grids, classes in CUDA, constant memory, a simplistic raytracer, warps, thread divergence, the GPU architecture, measuring performance, GPU/CPU synchronization, shared memory, parallel computation of the dot product (sketch below the table), barrier synchronization, race conditions, document similarity, utilizing lockstepped threads instead of explicit synchronization.
  Slides: PDF
  Assignments: Assignment 2
  Frameworks: dotProduct.zip

Week 3
  Topics: Fundamental Concepts 3: heat transfer simulation, double buffering pattern, texture memory, the GPU's memory hierarchy, parallel histogram computation (sketch below the table), atomic operations, pipelining host-GPU calls.
  Slides: PDF

Week 4
  Topics: Matrix algorithms: matrix-vector product, column-major storage, AoS versus SoA, coalesced memory access, auto-tuning, arithmetic intensity, matrix-matrix multiplication.
  Slides: PDF
  Assignments: Assignment 3
  Frameworks: matrixVectorMul.zip

Week 5
  Topics: Matrix algorithms: tiled matrix multiplication (sketch below the table), all-pairs shortest paths as matrix multiplication, tensor cores.
  Holiday: May 1
  Slides: PDF

Week 6
  Topics: Parallel prefix-sum: definition, simple examples, Hillis-Steele algorithm (sketch below the table), depth & work complexity, Blelloch's algorithm, optimization, Brent's theorem & its application to prefix-sum, theoretical speedup based on Brent.
  Slides: PDF

Week 7
  Topics: Parallel prefix-sum: overview of sequential radix sort, parallel radix sort on the GPU, the split operation, stream compaction, summed area tables, complexity of computing SATs, face detection with Viola-Jones. Parallel sorting: comparator networks, the 0-1 principle, odd-even mergesort, bitonic sorting (sketch below the table).
  Slides: PDF

Week 8
  Topics: Parallel sorting: work and depth complexities, adaptive bitonic sorting. Applications of parallel sorting: BVH construction using space-filling curves and Morton codes; deformable collision detection by a sweep-plane approach.
  Lab meeting.
  Slides: PDF
  Assignments: Assignment 4
  Frameworks: LineOfSight

Week 9
  Topics: Dynamic parallelism: general usage, execution model, simple example, caveats, Mariani-Silver algorithm for the Mandelbrot set, the "W" pattern for multigrids, fine-grained dynamic load balancing (task parallelism on the GPU), example application: the Reyes rendering pipeline. Random forests: classification problem, simple solutions, concept of decision trees, information, entropy, information gain, conditional entropy, construction of decision trees.
  Slides: PDF1, PDF2

Week 10
  Topics: Random forests: problems of decision trees, wisdom of the crowds, random forests, bootstrapping/subsampling, randomization during construction, variants of RFs, out-of-bag error estimation, parallel construction of RFs, handwritten digit recognition, body tracking using depth images.
  Slides: PDF2
  Assignments: Assignment 5
  Frameworks: heat_transfer

Week 11
  Topics: Parallel hashing: intersection of point clouds, geometric hashing, cuckoo hashing: principle, proof of correctness, performance experiments.
  Slides: PDF
  Assignments: Assignment 6, Hints
  Frameworks: interactive_raytracer
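
The following sketches illustrate a few of the topics from the table; they are simplified illustrations under explicitly stated assumptions, not the reference solutions for the assignments. First, Amdahl's law from week 1: if a fraction p of a program's sequential run time can be perfectly parallelized, then N cores yield a speedup of S(N) = 1 / ((1 - p) + p/N), which can never exceed 1/(1 - p):

    #include <stdio.h>

    /* Amdahl's law: speedup with n cores when fraction p of the
       sequential run time is perfectly parallelizable. */
    static double amdahl( double p, double n )
    {
        return 1.0 / ( (1.0 - p) + p / n );
    }

    int main( void )
    {
        printf( "speedup = %.2f\n", amdahl(0.95, 1024) );  /* ~ 19.6 */
        /* even with infinitely many cores: at most 1/(1-p) = 20 */
        return 0;
    }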
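
The dot product from week 2 combines shared memory, barrier synchronization, and an atomic operation in a single kernel. A minimal sketch, assuming a power-of-two block size of at most 256 threads and a zero-initialized result (this is not the code from dotProduct.zip):

    // Dot product of a and b (n elements); *result must be initialized to 0.
    __global__ void dotProduct( const float *a, const float *b, float *result, int n )
    {
        __shared__ float cache[256];
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        cache[threadIdx.x] = (tid < n) ? a[tid] * b[tid] : 0.0f;
        __syncthreads();                       // all partial products are in place
        // tree reduction within the block
        for ( int s = blockDim.x / 2; s > 0; s >>= 1 )
        {
            if ( threadIdx.x < s )
                cache[threadIdx.x] += cache[threadIdx.x + s];
            __syncthreads();
        }
        if ( threadIdx.x == 0 )                // one atomic add per block
            atomicAdd( result, cache[0] );
    }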
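
The parallel histogram from week 3 needs little more than one atomic increment per input element; a grid-stride loop makes the kernel independent of the launch configuration. A sketch, assuming 256 zero-initialized bins:

    // 256-bin histogram of n bytes; bins must be zero-initialized.
    __global__ void histogram( const unsigned char *data, int n, unsigned int *bins )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for ( ; i < n; i += gridDim.x * blockDim.x )   // grid-stride loop
            atomicAdd( &bins[ data[i] ], 1 );          // serializes only on collisions
    }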
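
Tiled matrix multiplication from week 5 stages square tiles of both factors in shared memory, so every matrix element is read from global memory only n/TILE times instead of n times. A sketch, assuming row-major n x n matrices with n a multiple of TILE, launched with dim3 block(TILE, TILE) and dim3 grid(n/TILE, n/TILE):

    #define TILE 16

    // C = A * B for row-major n x n matrices; n must be a multiple of TILE.
    __global__ void matMulTiled( const float *A, const float *B, float *C, int n )
    {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;
        for ( int t = 0; t < n / TILE; t ++ )
        {
            // each thread loads one element of the A and B tiles (coalesced)
            As[threadIdx.y][threadIdx.x] = A[ row * n + t * TILE + threadIdx.x ];
            Bs[threadIdx.y][threadIdx.x] = B[ (t * TILE + threadIdx.y) * n + col ];
            __syncthreads();                   // both tiles are fully loaded
            for ( int k = 0; k < TILE; k ++ )
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                   // done with this pair of tiles
        }
        C[ row * n + col ] = sum;
    }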
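
The Hillis-Steele scan from week 6 takes only O(log n) steps, at the price of O(n log n) work. A single-block sketch of the inclusive variant, double-buffered in shared memory, assuming n == blockDim.x:

    // Inclusive Hillis-Steele scan of one block, double-buffered in shared memory.
    // Launch: scan<<< 1, n, 2*n*sizeof(float) >>>( data, n )  with n == blockDim.x.
    __global__ void scan( float *data, int n )
    {
        extern __shared__ float temp[];        // 2*n floats: two n-sized buffers
        int tid  = threadIdx.x;
        int pout = 0, pin = 1;
        temp[tid] = data[tid];
        __syncthreads();
        for ( int offset = 1; offset < n; offset *= 2 )
        {
            pout = 1 - pout;  pin = 1 - pin;   // swap the two buffer halves
            if ( tid >= offset )
                temp[pout*n + tid] = temp[pin*n + tid] + temp[pin*n + tid - offset];
            else
                temp[pout*n + tid] = temp[pin*n + tid];
            __syncthreads();
        }
        data[tid] = temp[pout*n + tid];
    }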
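
Bitonic sorting from week 7 suits the GPU well because every phase is a fixed, data-independent pattern of compare-exchange operations. One phase as a kernel, assuming n is a power of two and one thread per element; the host loops over k = 2, 4, ..., n and, for each k, over j = k/2, k/4, ..., 1:

    // One compare-exchange phase of bitonic sort over n = gridDim.x * blockDim.x keys.
    __global__ void bitonicStep( float *d, int j, int k )
    {
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int ixj = i ^ j;              // index of the partner element
        if ( ixj > i )                         // each pair is handled exactly once
        {
            bool ascending = ( (i & k) == 0 ); // direction of this bitonic subsequence
            if ( ( ascending && d[i] > d[ixj] ) || ( ! ascending && d[i] < d[ixj] ) )
            {
                float t = d[i];  d[i] = d[ixj];  d[ixj] = t;
            }
        }
    }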

If you are interested in doing a thesis with us, please check our topics on offer. Also, be sure to check our research projects. If you are interested in any of these, just drop me (or anybody else in the CGVR group) a line.

The frameworks linked in the table above are very simple examples that can serve as a starting point. They contain a makefile and should compile out-of-the-box, at least under Mac OS X and Linux.

Textbooks

The following textbooks can help review the material covered in class:

Please note that the course is not based on one single textbook! Some topics might not even be covered in any current textbook! So, I suggest you first look at the books in the library before purchasing a copy.

If you plan on buying one of these books, you might want to consider a used copy -- they can often be purchased for a fraction of the price of a new one. Two good online used-book shops are Abebooks and BookButler.

Additional Literature

Other Interesting Bits and Pieces

Gabriel Zachmann
Last modified: Thu Jun 13 16:41:33 MDT 2024