cuda-oxide: hello-constant teardown 04: taking over at the codegen_crate entry point

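The previous post (how the rustc front end turns source into MIR) ended with rustc having lowered the source to MIR and about to call the backend's codegen_crate(tcx). This post looks at how the backend picks up from there: dlopen loading, the entry symbol, the delegation pattern, and the main logic of codegen_crate, so that the moment "our code starts running" inside process B is clear.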

1. How rustc loads our `.so`

In a normal build, rustc uses its built-in LLVM backend. Because RUSTFLAGS contains -Z codegen-backend=...so, rustc takes the plugin path instead:

rustc starts
  │
  ├──▶ dlopen("librustc_codegen_cuda.so")           load the shared library
  │
  ├──▶ dlsym("__rustc_codegen_backend")             look up the entry symbol
  │
  └──▶ call __rustc_codegen_backend()               obtain the backend object
           │
           ▼
       Box<dyn CodegenBackend>                       a trait object

__rustc_codegen_backend is the entry symbol name rustc looks up by convention, in the same spirit as _start for an executable or DllMain on Windows.

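For the details of the -Z codegen-backend protocol, see post 02 of the compiler series; for how the .so is built, see side story 08; for how to verify the exported symbol, see the nm check in 02.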
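The load, look up, call sequence above can be modeled without any real dynamic loading. The sketch below is a toy: the `Backend` trait, `CudaBackend`, `load_library`, and `load_backend` are all hypothetical stand-ins, and a `HashMap` plays the role of the shared library's symbol table. It only shows the shape of the handshake (find a function by name, call it, get back a boxed trait object), not how rustc actually implements it.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily reduced stand-in for rustc's `CodegenBackend` trait.
trait Backend {
    fn name(&self) -> &'static str;
}

struct CudaBackend;

impl Backend for CudaBackend {
    fn name(&self) -> &'static str {
        "cuda"
    }
}

/// Shape of the entry point: no arguments, returns a boxed trait object.
type EntryFn = fn() -> Box<dyn Backend>;

/// Plays the role of the exported `__rustc_codegen_backend` symbol.
fn entry() -> Box<dyn Backend> {
    Box::new(CudaBackend)
}

/// Toy "shared library": a table from symbol name to function pointer.
/// A real loader would get this from `dlopen`; here it is just a map.
fn load_library() -> HashMap<&'static str, EntryFn> {
    let mut symbols: HashMap<&'static str, EntryFn> = HashMap::new();
    symbols.insert("__rustc_codegen_backend", entry);
    symbols
}

/// Toy `dlsym` followed by the call: look the symbol up by name, then invoke it.
fn load_backend(symbols: &HashMap<&'static str, EntryFn>) -> Option<Box<dyn Backend>> {
    symbols.get("__rustc_codegen_backend").map(|f| f())
}

fn main() {
    let lib = load_library();
    let backend = load_backend(&lib).expect("entry symbol not found");
    println!("loaded backend: {}", backend.name());
}
```

From this point on the caller only talks to the trait object, which is why the loader never needs to know the concrete backend type.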

2. The entry function: three things

The entry function in crates/rustc-codegen-cuda/src/lib.rs:

#[unsafe(no_mangle)]                  // ← the symbol name must survive (no mangling)
pub fn __rustc_codegen_backend() -> Box<dyn CodegenBackend> {
    init_tracing_once();              // ① install the log subscriber (this is what makes RUST_LOG work)
    let config = CudaCodegenConfig::from_env();
    let llvm_backend = rustc_codegen_llvm::LlvmCodegenBackend::new();  // ② the key step: the stock LLVM backend
    Box::new(CudaCodegenBackend { config, llvm_backend })              // ③ wrap it
}

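Step	What it does
①	Initializes the `tracing` subscriber; only after this can later `tracing::info!` calls be seen with `RUST_LOG=info`
②	Constructs an `LlvmCodegenBackend` instance and stores it as a field of our own struct
③	Returns a `Box<dyn CodegenBackend>`; every later call from rustc lands on this object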

The key design decision: LlvmCodegenBackend is the complete LLVM backend that ships with rustc. We hold it as an ordinary field and do not reimplement it.
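The source does not show the body of init_tracing_once. One common way to get run-at-most-once behavior in Rust is `std::sync::Once`; the sketch below assumes that approach and uses a hypothetical `init_logging_once` with a counter in place of a real subscriber, purely to show that repeated calls are harmless.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

static INIT: Once = Once::new();
/// Counts how many times the initializer body actually executed.
static INIT_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `init_tracing_once`: the closure runs at most
/// once per process, no matter how many times this function is called.
fn init_logging_once() {
    INIT.call_once(|| {
        // A real backend would install its tracing subscriber here.
        INIT_RUNS.fetch_add(1, Ordering::SeqCst);
    });
}

fn main() {
    init_logging_once();
    init_logging_once();
    println!("initializer ran {} time(s)", INIT_RUNS.load(Ordering::SeqCst));
}
```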

3. The delegation pattern

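CudaCodegenBackend implements the CodegenBackend trait, which has a dozen or so methods. The strategy is to do real work in only one of them, codegen_crate, and forward the rest to llvm_backend (name returns our own string, and join_codegen forwards after adding the PTX artifacts):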

impl CodegenBackend for CudaCodegenBackend {
    fn name(&self) -> &'static str { "cuda" }            // ← our own

    fn init(&self, sess: &Session) {
        self.llvm_backend.init(sess);                     // ← forwarded
    }
    fn print_version(&self) {
        self.llvm_backend.print_version();                // ← forwarded
    }
    fn target_cpu(&self, sess: &Session) -> String {
        self.llvm_backend.target_cpu(sess)                // ← forwarded
    }
    fn target_config(&self, ...) -> TargetConfig {
        self.llvm_backend.target_config(sess)             // ← forwarded
    }
    fn provide(&self, providers: &mut Providers) {
        self.llvm_backend.provide(providers);             // ← forwarded
    }

    fn codegen_crate(&self, tcx: TyCtxt, crate_info: &CrateInfo) -> Box<dyn Any> {
        // ⬇ the only method we really write ourselves
        // ... intercept device code → run the cuda-oxide pipeline
        // ... host code → self.llvm_backend.codegen_crate(...)
    }

    fn join_codegen(&self, ...) -> ... {
        self.llvm_backend.join_codegen(...)               // ← almost a pure forward, only adds the PTX artifacts
    }
    // ...
}

This "wrap and forward" pattern is discussed in detail in 03, composition over inheritance.
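Stripped of rustc types, the pattern looks like the sketch below. Everything in it is a hypothetical miniature: the three-method `Backend` trait, `HostBackend`, and `WrapperBackend` are illustrative names, and the artifact lists are plain strings. It is meant to show composition plus forwarding, not the real CodegenBackend API.

```rust
/// Hypothetical miniature of a backend trait with three methods.
trait Backend {
    fn name(&self) -> &'static str;
    fn target_cpu(&self) -> String;
    fn codegen_crate(&self, crate_name: &str) -> Vec<String>;
}

/// Plays the role of the stock LLVM backend.
struct HostBackend;

impl Backend for HostBackend {
    fn name(&self) -> &'static str {
        "llvm"
    }
    fn target_cpu(&self) -> String {
        "x86-64".to_string()
    }
    fn codegen_crate(&self, crate_name: &str) -> Vec<String> {
        vec![format!("{crate_name}.o")]
    }
}

/// Holds the host backend as a plain field instead of reimplementing it.
struct WrapperBackend {
    inner: HostBackend,
}

impl Backend for WrapperBackend {
    // Our own identity.
    fn name(&self) -> &'static str {
        "cuda"
    }
    // Pure forward: no logic of our own.
    fn target_cpu(&self) -> String {
        self.inner.target_cpu()
    }
    // The one method with real behavior: run the host backend, then add
    // a device artifact on top of whatever it produced.
    fn codegen_crate(&self, crate_name: &str) -> Vec<String> {
        let mut artifacts = self.inner.codegen_crate(crate_name);
        artifacts.push(format!("{crate_name}.ptx"));
        artifacts
    }
}

fn main() {
    let backend = WrapperBackend { inner: HostBackend };
    println!("{} on {}", backend.name(), backend.target_cpu());
    println!("{:?}", backend.codegen_crate("demo"));
}
```

Because the wrapper owns the inner backend, any method it does not care about stays a one-line forward, and new behavior is confined to the methods it chooses to intercept.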

4. Call frequency: once per process vs once per crate

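__rustc_codegen_backend() and codegen_crate() are counted on different scales, which is the point newcomers most often mix up. The entry function runs once per rustc process, right after the .so is loaded; the trait methods run for the crate that process is compiling. Since cargo starts a separate rustc process for each crate, a full build repeats the whole sequence for every crate: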

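Function	How often	When
`__rustc_codegen_backend()`	Once per rustc process	Called as soon as rustc has dlopen'ed the `.so`, only to obtain the backend object
`init()` / `target_cpu()` / `provide()` and others	Once per crate	Called before rustc compiles each crate
`codegen_crate(tcx)`	Once per crate	rustc hands each crate's MIR to the backend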

A single cargo build involves many crates:

hello_constant
   ├── cuda-core        ──┐
   ├── cuda-device      ──┤  each one is
   ├── cuda-host        ──┤  compiled by
   ├── tracing          ──┤  rustc separately
   ├── tracing-subscriber │  → each one triggers
   └── ... dozens more    │     codegen_crate
                          ▼
              rustc compiles these crates
              codegen_crate is called once per crate

So the eprintln!("codegen_crate() called for crate '{}'", ...) line in lib.rs prints dozens of times: once for every crate in the dependency graph.

A comment above init() in lib.rs makes exactly this point:

// Note: Don't log here - init() is called for ALL crates including dependencies.
// We log in codegen_crate() only when there are kernels to compile.

5. Main logic of `codegen_crate`

The implementation of codegen_crate in crates/rustc-codegen-cuda/src/lib.rs:

fn codegen_crate(&self, tcx: TyCtxt<'_>, crate_info: &CrateInfo) -> Box<dyn Any> {
    tracing::info!(">>>>>>>>>>>>> codegen_crate()");
    eprintln!("codegen_crate() called for crate '{}'", tcx.crate_name(LOCAL_CRATE));

    with_no_trimmed_paths!({
        // ──── Step 1: collect all mono items after monomorphization ────
        let mono_partitions = tcx.collect_and_partition_mono_items(());

        // ──── Step 2: count #[kernel] functions and other device functions ────
        let kernel_count = collector::count_kernels_in_cgus(tcx, mono_partitions.codegen_units);
        let device_fn_count = collector::count_device_fns_in_cgus(tcx, mono_partitions.codegen_units);
        let has_device_code = kernel_count > 0 || device_fn_count > 0;

        // ──── Step 3: device code present → run the cuda-oxide pipeline ────
        if has_device_code {
            tracing::info!(
                kernels = kernel_count,
                device_fns = device_fn_count,
                "[PHASE 3/9] rustc_codegen_cuda::codegen_crate — device code detected"
            );

            // Device path: collect → mir_importer::run_pipeline → PTX
            let collection_result = collector::collect_device_functions(tcx, ...);
            tracing::info!("[PHASE 4/9] collector: walked call graph from kernels");

            device_codegen::generate_device_code(tcx, ...);   // jumps into mir-importer
        }

        // ──── Step 4: the host path always goes through LLVM ────
        let host_result = self.llvm_backend.codegen_crate(tcx, crate_info);

        Box::new(CudaOngoingCodegen { host: host_result, artifact_objects })
    })
}

Four steps: collect the mono items → count kernels → if there are any, take the device path → host always runs through LLVM.
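The branching in those four steps can be reduced to a few lines. The sketch below is a simplified model with a hypothetical `CodegenResult` struct and a free function named `codegen_crate` that takes the two counts directly; the real method works on rustc's TyCtxt and produces actual artifacts. What it preserves is the rule that the device path is conditional and the host path is not.

```rust
/// Hypothetical record of which paths ran for one crate.
#[derive(Debug, PartialEq)]
struct CodegenResult {
    device_ran: bool,
    host_ran: bool,
}

/// Simplified model of the decision inside `codegen_crate`.
/// The two counts stand in for the results of scanning the CGUs.
fn codegen_crate(kernel_count: usize, device_fn_count: usize) -> CodegenResult {
    // Either kind of device item is enough to trigger the device path.
    let has_device_code = kernel_count > 0 || device_fn_count > 0;

    // Device path: only taken when there is something to compile for the GPU.
    let device_ran = has_device_code;

    // Host path: unconditional, every crate goes through the host backend.
    let host_ran = true;

    CodegenResult { device_ran, host_ran }
}

fn main() {
    // A dependency with no device code, then a crate with two kernels.
    println!("{:?}", codegen_crate(0, 0));
    println!("{:?}", codegen_crate(2, 0));
}
```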

5.1 `tcx.collect_and_partition_mono_items()`

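This is work rustc does itself: monomorphization (instantiating generics), followed by grouping all the monomorphized items into CGUs (codegen units, the granularity of parallel compilation).

The output is already in the state where "every Vec<i32>::push is its own instance", so we do not have to do that ourselves.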
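As a reminder of what monomorphization means in practice, here is a small standalone example. The `describe` function is made up for illustration; each call with a different concrete type gets its own instance, and that per-type instance is what a backend sees.

```rust
use std::any::type_name;
use std::mem::size_of;

/// One generic definition in the source.
fn describe<T>() -> String {
    format!("{} ({} bytes)", type_name::<T>(), size_of::<T>())
}

fn main() {
    // Two concrete uses, so the compiler instantiates `describe` twice:
    // once for i32 and once for u64. A backend receives those two
    // instances as separate items, not the generic definition.
    println!("{}", describe::<i32>());
    println!("{}", describe::<u64>());
}
```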

5.2 `count_kernels_in_cgus`

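It scans every CGU for functions whose names follow the kernel naming convention (the cuda_oxide_kernel_* prefix, injected by the #[kernel] macro).

For the vast majority of dependency crates (tracing, serde, and so on) this count is 0, so they skip the device path and only go through host LLVM.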
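A prefix scan of this kind can be sketched as follows. The `CodegenUnit` struct with a list of item names and the `count_kernels` function are hypothetical simplifications: the real collector walks rustc's mono items, and the exact matching rule it applies is an assumption here (a plain `starts_with` on the item name).

```rust
/// Prefix named in the text; the matching rule below is an assumption.
const KERNEL_PREFIX: &str = "cuda_oxide_kernel_";

/// Hypothetical stand-in for a codegen unit: just a list of item names.
struct CodegenUnit {
    items: Vec<String>,
}

/// Counts the items across all CGUs whose name starts with the kernel prefix.
fn count_kernels(cgus: &[CodegenUnit]) -> usize {
    cgus.iter()
        .flat_map(|cgu| cgu.items.iter())
        .filter(|name| name.starts_with(KERNEL_PREFIX))
        .count()
}

fn main() {
    let cgus = vec![
        CodegenUnit {
            items: vec!["main".to_string(), "cuda_oxide_kernel_hello".to_string()],
        },
        CodegenUnit {
            items: vec!["helper".to_string()],
        },
    ];
    println!("kernels found: {}", count_kernels(&cgus));
}
```

A crate whose CGUs contain no matching name yields 0, which is the common case for ordinary dependencies.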

5.3 The host path always runs

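Note that host_result = self.llvm_backend.codegen_crate(...) runs whether or not has_device_code is true. Even a crate that contains only device code (cuda-device, for example) still gets one pass through the host LLVM backend, because it may contain host-callable helper functions.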

6. Real run.log data

Run hello_constant once and these key log lines show up on stderr (docs/run.log):

INFO rustc_codegen_cuda: >>>>>>>>>>>>> codegen_crate()           ← from tracing::info! at entry
codegen_crate() called for crate 'hello_constant'                ← from eprintln! at entry
INFO rustc_codegen_cuda: [PHASE 3/9] rustc_codegen_cuda::codegen_crate — device code detected
    crate_name=hello_constant cgus=16 kernels=2 device_fns=0
[rustc_codegen_cuda] Compiling crate 'hello_constant': 16 CGUs, 2 kernel(s), 0 device fn(s)

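A few numbers worth noting:

Field	Value	Meaning
`cgus`	16	The hello_constant crate was split into 16 codegen units (rustc decides this automatically)
`kernels`	2	The example has two `#[kernel]` functions: `hello_constant` and `hello_kernel`
`device_fns`	0	There are no standalone `#[device]` functions; all device code is a kernel entry point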
7. Timeline summary

rustc starts
  │
  ▼
dlopen + dlsym
  │
  ▼
__rustc_codegen_backend()      ← called once per rustc process
  ├── install the tracing subscriber
  ├── construct LlvmCodegenBackend
  └── return CudaCodegenBackend
  │
  ▼
init() / target_cpu() / provide() / ...   ← forwarded to LLVM for every crate
  │
  ▼
codegen_crate(tcx, crate_info)            ← called once per crate
  ├── crate "tracing"
  │   └── kernel_count == 0 → host LLVM only
  ├── crate "cuda-device"
  │   └── kernel_count == 0 → host LLVM only
  │       (it has #[inline(never)] stub functions, but nothing marked #[kernel])
  ├── ...dozens of other dependencies...
  └── crate "hello_constant"               ← ★ this one
      ├── kernel_count == 2 → has_device_code = true
      ├── ▶ [PHASE 3/9] log line
      ├── ▶ collector::collect_device_functions  ← covered in the next post
      ├── ▶ [PHASE 4/9] log line
      ├── ▶ device_codegen::generate_device_code  ← covered in the next post
      │     └── this is where mir_importer::run_pipeline is called
      └── ▶ self.llvm_backend.codegen_crate     ← host main() and the rest are compiled by LLVM

8. One-sentence summary

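__rustc_codegen_backend is the entry symbol rustc calls after dlopen. It builds a CudaCodegenBackend that wraps an LlvmCodegenBackend; the other trait methods forward to LLVM, and codegen_crate is the one place that intercepts: if there are #[kernel] functions, the crate goes through the cuda-oxide pipeline, while the host side is always delegated to LLVM. This is where the host + device dual-track design is actually implemented.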

The next post takes apart the two functions codegen_crate calls on the device path: collector::collect_device_functions (a BFS over the call graph) and device_codegen::generate_device_code (which enters the stable MIR context and feeds the mir-importer pipeline).

Previous in the series: cuda-oxide: hello-constant teardown 03: how the rustc front end turns source into MIR
