Massively Parallel Algorithms  SS 2018
News
About this Course
There are big changes afoot. The era of increased performance from faster single cores and optimized single core programs has ended. Instead, highly parallel GPU cores, initially developed for shading, can now run hundreds or thousands of threads in parallel. Consequently, they are increasingly being adopted to offload and augment conventional (albeit multicore) CPUs. And the technology is getting better, faster, and cheaper. It will probably even become a general computing processor on mobile devices, because it offers more processing power per energy amount.
The high number of parallel cores, however, poses a great challenge for software and algorithm design that must expose massive parallelism to benefit from the new hardware architecture. The main purpose of the lecture is to teach practical algorithm design for such parallel hardware.
Simulation is widely regarded as the third pillar of science (in addition to experimentation and theory). Simulation has an ever increasing demand for highperformance computing. The latter has received a boost with the advent of GPUs, and it is even becoming  to some extent  a commodity.
There are many scientific areas where the knowledge you will gain in this course can be very valuable and useful to you:
 Computer science (e.g., visual computing, database search)
 Computational material science (e.g., molecular dynamics simulation)
 Bioinformatics (e.g., alignment, sequencing, ...)
 Economics (e.g., simulation of financial models)
 Mathematics (e.g., solving large PDEs)
 Mechanical engineering (e.g., CFD and FEM)
 Physics (e.g., ab initio simulations)
 Logistics (e.g. simulation of traffic, assembly lines, or supply chains)
In this course, you will get handson experience in developing software for massively parallel computing architectures. For the first half of the lecture, there will be supervised exercises to familiarize yourself with the CUDA parallel programming model and environment. The exercises will comprise, for instance, image processing algorithms, such as you might find in Photoshop. In the second half, you can decide whether to continue with exercises, or whether to work on a small selfassignment for the rest of the cource.
Prerequisites are:
 A little bit of experience with C/C++ ; note that we will need just plain old C during this course, but the concept of pointers should be familiar.
 Liking for algorithmic thinking.
Useful, but not required, is a computer/notebook with an Nvidia GPU that is CUDA capable. You can find a list of all supported GPU's here. If you don't have access to such a computer, you are welcome to work on your assignments in our lab.
Not required are
 experience with parallel programming,
 experience with computer graphics (although we will study several algorithms and applications of massively parallel algorithms in the area of visual computing)
Some of the envisioned topics (these can change during the semester):
 Simple parallel program models and laws
 Introduction to CUDA and massively parallel programming
 Dense matrix algorithms
 Parallel prefixsum
 Parallel sorting
 Dynamic Parallelism
 Introduction to Thrust
 Parallel sphere packing
 Parallel Hashing
Slides
The following table contains the topics in more detail, the accompanying slides, and the assignments (it will be filled stepbystep after the respective lectures).
Week  Topics 

1. 
Introduction: More Moore, stream programming model, Flynn's taxonomy,
overall speedup, Amdahl's law, Gustafson's law. Fundamental Concepts of Parallel Programming 1: terminology, control flow, transfering data, 
2.  Fundamental Concepts of Parallel Programming 2: blocks, data flow in general multidimensional blocks and grids, classes in CUDA, constant memory, simplistic raytracer, warps, thread divergence, more details on the GPU architecture, warps, measuring performance, GPU/CPU synchronization, shared memory, parallel computation of the dot product 
3.  Fundamental Concepts of Parallel Programming 3: barrier synchronization, race conditions, document similarity, utilizing lockstepped threads instad of explicit synchronization, heat transfer simulation, double buffering pattern, texture memory, edge detection (Sobel operator), GPU's memory hierarchy, parallel histogram computation, atomic operations, pipelining hostGPU calls 
4.  Dense matrix algorithms: matrix vector product, column major storage, AoS versus SoA, coalesced memory access, autotuning, arithmetic intensity, matrixmatrix multiplication, blocked matrix multiplication, All Pairs Shortest Path as matrix multiplication, 
5.  Parallel prefixsum 1: definition, simple examples, HillisSteele algorithm, depth & work complexity, Blelloch's algorithm, optimization, Brent's theorem & application to prefixsum, theoretical speedup based on Brent, introduction to sequential radix sort, radix sort on the GPU, split operation, stream compaction, compressed sparse row (CSR) storage, sparse matrixvector multiplication 
6.  Parallel prefixsum 2: summed area tables, increased percision for SATs, complexity of computing SAT's, depthoffield rendering (gathering & scattering), anisotropic texture filtering, face detection with ViolaJones. 
7.  Parallel sorting: comparator networks, the 01 principle, oddeven mergesort, bitonic sorting, work and depth complexities, (digression: adaptive bitonic sorting) 
8.  Applications of parallel sorting: BVH construction (space filling curves, Morton codes); deformable collision detection by sweep plane approach, PCA transformation, and clustering; faster raytracing by coherent ray packets. 
9. 
Dynamic Parallelism: general usage,
execution model, simple example, caveats,
MarianiSilver algorithm for the Mandelbrot set,
the "W" pattern for multigrids. Random Forests 1: classification problem, simple solutions, concept of decision trees, information, entropy, information gain, 
10.  Random Forests 2: conditional entropy, construction of decision trees, problems of decision trees, wisdom of crowds, random forests, bootstrapping/subsampling, randomization during construction, variants of RF's, outofbag error estimation, parallel construction of RFs, handwritten digit recognition, body tracking using depth images. 
11. 
Introduction to thrust. Lab meeting 
12. 
Finegrained dynamic load balancing, example application: Reyes rendering pipeline. Parallel sphere packings: collision detection basics, inner sphere trees, bacht neural gas, parallel bounding volume hierarchy construction, 
13. 
Parallel Hashing: application intersection of point clouds,
application geometric hashing,
cuckoo hashing. Lab meeting 
Some very simple examples that can serve as a starting point. They contain a makefile and should compile outofthebox at least under Max OS X and Linux.
Textbooks
The following textbooks can help review the material covered in class:
 Jason Sanders, Edward Kandort: CUDA by Example. AddisonWesley, Pearson Education.
Very easy reading, a very gentle introduction into CUDA, requires minimal C knowledge.  WenMei W. Hwu: GPU Computing Gems Jade Edition. Morgan Kaufmann.
If you are in the Uni Bremen campus LAN/WLAN, you can also read the ebook.  David B. Kirk, WenMei W. Hwu: Programming Massively Parallel Processors. Morgan Kaufmann.
 NVidia: CUDA C Programming Guide. (There is also a PDF version.)
 Russ Miller, Laurence Boxer: Algorithms, Sequenetial & Parallel. Cengage Learning.
Doesn't talk about technical details of implementing parallel algorithms, but takes more the theoretical, purely algorithmic perspective (uses PRAMs as algorithmic model).
Please note that the course is not based on one single textbook! Some topics might even not be covered in any current textbook! So, I suggest you first look at the books in the library before purchasing a copy.
If you plan on buying one of these books, you might want to consider buying a used copy  they can often be purchased for a fraction of the price of a new one. Two good internet used book shops are Abebooks and BookButler.
Grades and Points achieved by the Assignments
For taking part in a socalled "Fachgespräch" (mini oral exam), you need a grade from the assignments >= 4.0 . You can get this by successfully completing at least 40% of all asignments.
Additional Literature
 Reevaluating Amdahl's Law and Gustafson's Law (Source)
 Herb Sutter's Welcome to the Jungle from 2012 gives a number of inspiring insights into future trends  and past trends that have ceased to continue. (Herb Sutter is widely respected in the C++ software engineering community.) (Source)

What Every Programmer Should Know About Memory by Ulrich Drepper, 2007
(Source)
Well, I don't think that every programmer really has to know everything the paper explains, but I hope you all are interested in broadening your horizon  who wants to know only that stuff she/he needs?  Dominik Goeddeke's GPGPU::Basic Math Tutorial, in dem die einfachen Prinzipien anhand der "saxpy"Operation erklärt werden (Quelle).
 Performance with constant memory (excerpt from CUDA by Example).
 CUDA Documents from NVidia
 Introduction to CUDA 5.0 (Source)
 Chapter 39. Parallel Prefix Sum (Scan) with CUDA by Mark Harris, Shubhabrata Sengupta, John D. Owens. In GPU Gems 3. (Source)
 A nice handout about PRAM algorithms and Brent's theorem by Siddhartha Chatterjee and Jan Prins. (Source)
 My article about Adaptive Bitonic Sorting. (A slightly updated version of my chapter in the Encyclopedia of Parallel Computing, David Padua, Ed., pages 146–157. Springer, 2011.)
 A very good explanation of BVH construction in this blog article by Tero Karras, with an application to collision detection on the first pages (Source)
 Here is a pretty comprehensive Introduction to Random Forests by Criminisi, Shotton, and Konukoglu. (Source)

Literature for the chapter on parallel hashing:
 Geometric Hashing, An Overview by Wolfson and Rigoutsos, 1997.
 A lecture given by Wolfson and Rigoutsos.
Other Interesting Bits and Pieces
 If you would like to see Hillis himself: a talk of Hillis at TED (not on his massively parallel algorithms, but on the perilous turn the internet might take) (Source)
 A very entertaining talk by Gunter Dueck on swarm stupidity (it also conveys a number of deep insights into how big companies work or, rather, don't work :) ). (Source) Another great, similar talk given by Gunter Dueck is Schwarmdummheit.
 Not exactly about massively parallel programming, but here is a clip from an interview with Linus Torvalds, where he speaks about tasteful code. And although he does not explicitely mention it, I strongly believe that tasteful code is what makes robust code. (Source: Linus Torvalds: The mind behind Linux, February 2016 at TED2016.)
Last modified: Wed Jul 04 15:25:42 CEST 2018