Synopsis:
Quick recap *
Data transfer optimizations (pinned memory, zero copy, CUDA managed memory; see the first sketch below) *
Concurrent execution (streams, events, levels of synchronization across warps/blocks) *
Kernel optimizations (warps, impact of branches, global/constant/shared memory in detail, especially bank conflicts)
Overall GPU efficiency (occupancy, roofline model)
Hardware-specific behaviours (Kepler, Pascal, Volta): key differences for the programmer
Compilation of CUDA code in detail (execution model)
Multi-GPU (device management, CUDA contexts, peer-to-peer in CUDA, NVLink, CUDA + MPI with gdrcopy, Multi-Process Service (MPS); see the second sketch below)
Advanced profiling (nvidia-smi, nvprof, nvvp)
* = overlap with the "CUDA basics" training
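
First sketch (data transfer and concurrent execution): a minimal, self-contained example, not taken from the training material, showing a pinned (page-locked) host allocation and an asynchronous copy/kernel/copy pipeline on a user-created stream. The buffer size, the scale kernel, and the CHECK macro are illustrative choices, not prescribed by the course.

// Pinned host memory plus asynchronous transfers on a stream.
// Error handling is reduced to a single macro for brevity.
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do {                                              \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
        return 1;                                                     \
    }                                                                 \
} while (0)

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data;                              // pinned host buffer
    float *d_data;                              // device buffer
    CHECK(cudaMallocHost(&h_data, bytes));      // page-locked allocation
    CHECK(cudaMalloc(&d_data, bytes));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // H2D copy, kernel launch, and D2H copy are all enqueued on the same
    // stream; they execute in order but do not block the host thread.
    CHECK(cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream));
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 2.0f, n);
    CHECK(cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream));

    CHECK(cudaStreamSynchronize(stream));       // wait for the whole pipeline
    printf("h_data[0] = %f\n", h_data[0]);      // expect 2.0

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return 0;
}

Pinned memory is what makes the cudaMemcpyAsync calls truly asynchronous; with pageable host memory the runtime falls back to staging copies.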
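
Second sketch (multi-GPU): a minimal example, again only illustrative, that checks for and enables peer-to-peer access between two GPUs and copies a buffer directly between them (over NVLink or PCIe, depending on the hardware). Device IDs 0 and 1 and the buffer size are placeholder assumptions; error checks are omitted for brevity.

// Enable P2P access between device 0 and device 1 and copy device-to-device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    if (!canAccess01 || !canAccess10) {
        printf("P2P not supported between devices 0 and 1\n");
        return 0;
    }

    const size_t bytes = 1 << 20;
    float *d0, *d1;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);   // let device 0 reach device 1

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);
    cudaDeviceEnablePeerAccess(0, 0);   // let device 1 reach device 0

    // Direct device-to-device copy; with P2P enabled this bypasses host memory.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaDeviceSynchronize();

    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}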