全局内存与 coalescing

GPU 的 global memory(也就是 cudaMalloc 出来那块显存)物理上是 HBM/GDDR DRAM——访问延迟几百 cycle,跟主存差不多。但 GPU 有一个救星:memory controller 能把 warp 内 32 个 thread 的内存请求合并成 1 次事务。

合并(coalescing)是把”32 次慢访问”变成”1 次快访问”的硬件机制。满足合并条件 → 性能正常;不满足 → 慢 32 倍。

1. 合并的三条规则

满足这三条,warp 的 32 次访问被合并成 1 个 128-byte 内存事务:

地址连续——warp 内 lane i 访问 base + i * sizeof(T)
起始地址对齐——对 4-byte 元素,起始 128-byte 对齐
同一个 cache line——128 byte 内

简单说:让 warp 内 32 个相邻 lane 访问相邻的 32 个元素。

2. 一个对比题

下面两段代码,哪个跑得快?

int tid = blockIdx.x * blockDim.x + threadIdx.x;

// 版本 A:每个 thread 处理 4 个连续元素
for (int i = 0; i < 4; i++)
    out[tid * 4 + i] = in[tid * 4 + i] * 2.0f;

// 版本 B:每个 thread 处理 4 个跳跃元素
for (int i = 0; i < 4; i++)
    out[tid + i * blockDim.x] = in[tid + i * blockDim.x] * 2.0f;

B 快。原因:

A 的 tid * 4 让相邻 thread 间隔 16 字节。每次循环,warp 内 32 个 thread 的访问跨度 32 × 4 = 128 字节——勉强 1 个 cache line,但 4 次循环各跨一个 cache line,总共 4 次内存事务。
B 的 tid + i * blockDim.x 让 warp 内 32 个 thread 每次访问连续的 32 个 float(128 字节)——完美合并,4 次循环 = 4 次内存事务(看似一样),但每次都是合并好的事务,带宽用满。

更关键的是 GPU 喜欢 stride-1 访问(thread index 走最快的内存维度)。口诀:让 thread index 沿数据的最内层维度走。

3. 常见访问模式分类

模式	合并吗	怎么改
`out[tid]`	✓	已经是最佳
`out[tid * N]`(N > 1)	✗	跨度过大 → 改 thread 映射 / 转置数据
列主序矩阵按行遍历	✗	thread 间跨度 = 行数
列主序矩阵按列遍历	✓	thread 间跨度 = 1
行主序矩阵按列遍历	✗	跨度 = 列数
`struct AoS[tid].field`	中等	不同 field 不连续,有时合并不完美
`SoA.field[tid]`	✓	Structure of Arrays 始终连续

最后一行是个高频对比——AoS(Array of Structs)vs SoA(Struct of Arrays)。深度学习数据通常用 SoA,就是为了 coalescing。

4. 用 vector load 一次拉 4 个

如果每个 thread 要连续读 4 个 float,用 float4:

// 标量版:4 次 load
float a = in[tid * 4 + 0];
float b = in[tid * 4 + 1];
float c = in[tid * 4 + 2];
float d = in[tid * 4 + 3];

// 向量版:1 次 load
float4 v = *reinterpret_cast<float4*>(&in[tid * 4]);

向量版生成 PTX ld.global.v4.f32 一条指令——一条事务、16 字节、跑得快。前提:in + tid * 4 必须 16-byte 对齐。

类似的还有 float2、int4、half2。

5. 怎么验证

用 Nsight Compute 看:

指标	健康范围
`Memory > L1/TEX Cache > Sector Hit Rate`	高 = cache 用得好
`Memory > Global Load Throughput`	接近峰值带宽 = 合并好
`Source > Memory Access Patterns`	直接告诉你哪些 load/store 是 uncoalesced

跑命令:

ncu --set full -k your_kernel ./your_app

打开报告找 Source 标签 → 看 Sectors/Req(每次 request 实际触发多少 sector)。理想值 1。如果是 8、16、32,说明大量未合并。

6. 跟 bank conflict 的关系

CUDA 性能两大支柱:

	global memory	shared memory
问题	未合并访问	bank conflict
机制	warp 内 32 个 lane 的 global 访问被合并成 1 次 DRAM 事务	shared memory 分 32 个 bank,同 bank 同时被多 lane 访问会串行化
优化方向	让 thread index 沿最内维度走	tile 加 padding `[TILE][TILE+1]` / swizzle
诊断指标	Sectors/Req、Global Load Throughput	Shared Memory Bank Conflicts

两个一起决定 memory bound kernel 的性能。先看 coalescing(全局),再看 bank conflict(共享)。

7. 一句话

让 warp 内 32 个相邻 lane 访问 32 个相邻地址 = 合并 = 快 32 倍。这一条贯穿 CUDA 数据布局设计——SoA、转置、向量 load 全是它的延伸。

评论区

评论功能即将上线, 敬请期待。