CUDA
Compute Unified Device Architecture is a parallel computing platform and application programming interface (API) developed by NVIDIA.
1. Graphics Processing Unit
- SIMD
- low clock frequency (1.4 GHz), but many cores (8k+)
- powerful: 29.7 TFLOPS
- high DRAM bandwidth: 760 GB/s (total)
- mem. hierarchy (see the sketch below):
    - register
    - shared mem. (PBSM: per-block shared mem.)
    - global mem.
    - constant mem.
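
A minimal sketch (not from the notes) of where each level of the hierarchy shows up in device code; the kernel name `scaleArray`, the symbol `scale`, and the 256-thread block size are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

__constant__ float scale;   // constant mem.: read-only on the device, set from the host

__global__ void scaleArray(const float *in, float *out, int n) {
    __shared__ float tile[256];                        // per-block shared mem. (PBSM)
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // local variable -> register

    if (i < n) tile[threadIdx.x] = in[i];              // global mem. -> shared mem.
    __syncthreads();                                   // tile now visible block-wide

    if (i < n) out[i] = scale * tile[threadIdx.x];     // shared -> register -> global
}

// Host side (assumes d_in/d_out already allocated on the device):
//   float s = 2.0f;
//   cudaMemcpyToSymbol(scale, &s, sizeof(float));
//   scaleArray<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```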
GPU
- Streaming Multiprocessor
- Warp: threads execute in lockstep across the cores
- Core
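
The SM/warp numbers above can be read off the device at runtime. A small sketch using the CUDA runtime call `cudaGetDeviceProperties`; the fields printed are just the ones these notes mention.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                         // device 0

    printf("SMs:               %d\n",  prop.multiProcessorCount);
    printf("warp size:         %d\n",  prop.warpSize);         // threads per warp
    printf("max threads/block: %d\n",  prop.maxThreadsPerBlock);
    printf("clock rate:        %.2f GHz\n", prop.clockRate / 1e6);    // reported in kHz
    printf("global mem.:       %.1f GB\n",  prop.totalGlobalMem / 1e9);
    printf("shared mem./block: %zu KB\n",   prop.sharedMemPerBlock / 1024);
    return 0;
}
```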
Host: CPU + CPU Mem.
Device: GPU + GPU Mem.
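
A sketch of the host/device split: the host allocates and fills CPU memory, copies it into GPU memory, and copies results back; kernels can only touch device pointers. Error checking is omitted for brevity.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);            // host (CPU) memory
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x = nullptr;
    cudaMalloc(&d_x, bytes);                        // device (GPU) memory
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // host -> device

    // ... launch kernels that operate on d_x ...

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_x);
    free(h_x);
    return 0;
}
```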
Kernel function
kernel = grid -> thread blocks -> threads
- thread blocks should be independent; they are parallel
- thread blocks are mapped to SMs
- grid: how is it mapped onto the hardware?
- limited number of threads in each thread block (1024), so use multiple blocks (see the sketch below)
- limited total number of threads
- `__device__`: can only be called from code running on the GPU
- `__global__`: called from the host, runs on the GPU
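
A minimal sketch tying the two qualifiers to the grid -> thread blocks -> threads hierarchy; `square`, `squareAll`, and the 256-thread block size are illustrative choices, not from the notes.

```cuda
#include <cuda_runtime.h>

// __device__: callable only from code already running on the GPU.
__device__ float square(float x) { return x * x; }

// __global__: launched from the host, executed on the GPU.
__global__ void squareAll(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // block offset + thread offset
    if (i < n)                                       // the grid may overshoot n
        out[i] = square(in[i]);
}

// Host-side launch: a single block holds at most 1024 threads, so a grid of
// many 256-thread blocks is used to cover all n elements.
void launchSquareAll(const float *d_in, float *d_out, int n) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    squareAll<<<grid, block>>>(d_in, d_out, n);
}
```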
Compilation Process (by nvcc)
- separates host code and device code
- generic: generates PTX (a virtual-ISA intermediate representation)
- specialized: compiles PTX into machine code (SASS) for a specific GPU architecture (see the sketch below)
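
A small sketch of how the separate host/device compilation shows up in source code: during the device pass nvcc defines `__CUDA_ARCH__` to the target architecture (e.g. 700 for sm_70), while the host pass leaves it undefined; `compiledForArch` is an illustrative name.

```cuda
// Compiled twice by nvcc: once for the host, once per target GPU architecture.
__host__ __device__ int compiledForArch(void) {
#ifdef __CUDA_ARCH__
    return __CUDA_ARCH__;   // device pass: specialized per architecture (e.g. 700)
#else
    return 0;               // host pass: handled by the host C++ compiler
#endif
}
```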
Coalesced mem. access: threads of a warp should access consecutive addresses so the hardware can merge the warp's loads/stores into a few wide memory transactions (see the sketch below).
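
A sketch contrasting a coalesced and a strided access pattern; `copyCoalesced`, `copyStrided`, and `stride` are illustrative names.

```cuda
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads read adjacent addresses, so a warp's 32 loads are
    // merged (coalesced) into a few wide memory transactions.
    if (i < n) out[i] = in[i];
}

__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    // Adjacent threads read addresses `stride` elements apart, so the warp
    // needs many separate transactions and wastes most of their bandwidth.
    if (i < n) out[i] = in[i];
}
```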