Conversation
- Remove the `contextKey` parameter from `execute()` and `reset()` across the C++ core, Python bindings, and Python API. Revert `AlgorithmCtxKey` to the simple flat struct and remove the `customKeyMap_` caching logic. (Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>)
- …small sizes: Replace the standalone script in `examples/torch-integration/` with a proper pytest test under `python/test/`. Also scale `nblocks` for `rsag_zero_copy` based on data size to avoid hangs with small buffers (e.g. `size=1024` with 8 ranks).
- Add multi-algorithm parametrization (`packet`, `nvls_packet`, `fullmesh`, `rsag_zero_copy`) and float32 accumulation to `test_fp8_e4m3_accum`, matching the coverage of `test_fp8_e4m3b15_accum`. This exercises the new x8/x16 decomposed conversion specializations used by the `fullmesh` and `rsag_zero_copy` kernels (one `int4` = 16 fp8 elements).
Pull request overview
This PR extends MSCCL++’s datatype and allreduce execution pipeline to support a new software-defined FP8 format (fp8_e4m3b15) and introduces configurable mixed-precision accumulation (accumDtype) that is threaded end-to-end from public API to kernel dispatch, plus adds FP8 accumulation correctness tests and wires them into CI.
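The numerical motivation for configurable accumulation can be seen off-GPU: when a reduction accumulates directly in a narrow float type, the running sum stalls once it outgrows the type's spacing. A minimal pure-Python illustration (not MSCCL++ code) using float16 via the `struct` module as a stand-in for FP8:

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest float16 value (stand-in for a narrow accumulator dtype)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_narrow(values):
    # Round after every addition, as a kernel accumulating in the storage dtype would.
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)
    return acc


def sum_wide(values):
    # Accumulate in full precision, round only once at the end.
    acc = 0.0
    for v in values:
        acc += v
    return to_half(acc)


values = [1.0] * 4096
print(sum_narrow(values))  # 2048.0: above 2048 the float16 spacing is 2, so +1.0 rounds away
print(sum_wide(values))    # 4096.0
```

This stagnation effect is exactly what reducing FP8 inputs in a wider `accumDtype` (float16 or float32) mitigates.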
Changes:
- Add `DataType::FLOAT8_E4B15` (`__fp8_e4m3b15`) with vector conversions and reduction support infrastructure.
- Add `accumDtype`, plumbed through the C++/Python APIs and allreduce algorithm variants to enable FP8 accumulation in FP16/FP32.
- Add `test_fp8_accum.py` and run it in Azure Pipelines; convert NCCL shim logging to streaming-style macros.
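For intuition about the new datatype's value range, here is a rough pure-Python decoder for an e4m3 layout with bias 15. It assumes an IEEE-like interpretation in which the top exponent field (15) is reserved, which makes `0x77` the largest finite code and reproduces the 0.9375 maximum quoted below; this is an illustrative model, not the library's exact bit semantics.

```python
def fp8_e4m3b15_to_float(byte: int) -> float:
    """Decode one fp8_e4m3b15 byte: 1 sign bit, 4 exponent bits (bias 15), 3 mantissa bits.

    Illustrative model only: the top exponent field (15) is assumed reserved,
    so exponent field 14 with mantissa 0b111 gives the maximum finite value 0.9375.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:
        # Subnormal: no implicit leading 1, minimum exponent.
        return sign * (man / 8.0) * 2.0 ** (1 - 15)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 15)


print(fp8_e4m3b15_to_float(0x77))  # 0.9375 (exponent field 14, mantissa 0b111)
print(fp8_e4m3b15_to_float(0x70))  # 0.5
print(fp8_e4m3b15_to_float(0xF7))  # -0.9375
```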
Reviewed changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/ext/nccl/nccl.cc` | Converts NCCL shim WARN/INFO logging to streaming-style logging. |
| `src/ext/nccl/datatype_conversion.hpp` | Adds size mapping and NCCL-datatype conversion handling for FLOAT8_E4B15. |
| `src/ext/nccl/algorithm_selector.cc` | Updates NVLS support heuristics to treat FLOAT8_E4B15 as FP8. |
| `src/ext/collectives/include/allreduce/common.hpp` | Refactors allreduce dispatch to support FP8 accumulation type selection via accumDtype. |
| `src/ext/collectives/include/allreduce/allreduce_rsag.hpp` | Threads accumDtype into RSAG allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_rsag_zero_copy.hpp` | Threads accumDtype into RSAG zero-copy allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_rsag_pipeline.hpp` | Threads accumDtype into RSAG pipeline allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_packet.hpp` | Threads accumDtype into packet allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp` | Threads accumDtype into NVLS zero-copy allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_nvls_warp_pipeline.hpp` | Threads accumDtype into NVLS warp-pipeline allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_nvls_packet.hpp` | Threads accumDtype; caches NVLS switch channels on the builder. |
| `src/ext/collectives/include/allreduce/allreduce_nvls_block_pipeline.hpp` | Threads accumDtype into NVLS block-pipeline allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_fullmesh.hpp` | Threads accumDtype into fullmesh allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_allpair_packet.hpp` | Threads accumDtype into allpair-packet allreduce kernel function signature. |
| `src/ext/collectives/allreduce/allreduce_rsag.cu` | Updates RSAG adapter/dispatch to accept AccumT and use accumDtype. |
| `src/ext/collectives/allreduce/allreduce_rsag_zero_copy.cu` | Implements mixed-precision accumulation via upcast/accumulate/downcast vector helpers. |
| `src/ext/collectives/allreduce/allreduce_rsag_pipeline.cu` | Updates RSAG pipeline adapter/dispatch to accept AccumT and accumDtype. |
| `src/ext/collectives/allreduce/allreduce_packet.cu` | Adds mixed-precision accumulation path for packet allreduce; updates logging include. |
| `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` | Adds AccumT template param and rejects software FP8 type for NVLS. |
| `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu` | Adds AccumT template param and rejects software FP8 type for NVLS. |
| `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` | Adds mixed-precision accumulation; reuses pre-setup switch channels; updates logging include. |
| `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu` | Adds AccumT template param and rejects software FP8 type for NVLS. |
| `src/ext/collectives/allreduce/allreduce_fullmesh.cu` | Adds mixed-precision accumulation in fullmesh reduction loop. |
| `src/ext/collectives/allreduce/allreduce_allpair_packet.cu` | Updates adapter dispatch to accept accumDtype. |
| `src/ext/collectives/allgather/allgather_fullmesh.cu` | Extends builder callback signature to accept (unused) accumDtype. |
| `src/ext/collectives/allgather/allgather_fullmesh_2.cu` | Extends builder callback signature to accept (unused) accumDtype. |
| `src/core/include/reduce_kernel.hpp` | Adds mixed-precision vector upcast/downcast/accumulation helpers and f32x2 specializations. |
| `src/core/include/execution_kernel.hpp` | Adds FLOAT8_E4B15 and AUTO handling in HIP execution-kernel dispatch. |
| `src/core/executor/execution_kernel.cu` | Adds FLOAT8_E4B15 and AUTO handling in CUDA execution-kernel dispatch. |
| `src/core/algorithm.cc` | Resolves accumDtype==AUTO at runtime and threads it through NativeAlgorithm kernel launch. |
| `include/mscclpp/gpu_data_types.hpp` | Introduces `__fp8_e4m3b15`, vector types, conversions, and arithmetic/min support; adds DataType::AUTO. |
| `include/mscclpp/algorithm.hpp` | Extends Algorithm::execute / NativeAlgorithm::KernelFunc to accept accumDtype with AUTO default. |
| `python/mscclpp/_core/algorithm.py` | Adds accum_dtype parameter to Python Algorithm.execute and forwards it to bindings. |
| `python/csrc/core_py.cpp` | Exposes DataType.float8_e4m3b15 to Python. |
| `python/csrc/algorithm.cpp` | Extends Python binding execute signature to pass accum_dtype (default AUTO). |
| `python/csrc/gpu_utils_py.cpp` | Adds DLPack dtype handling for Torch FP8 dtypes and raw-byte representation for fp8_e4m3b15. |
| `python/test/test_fp8_accum.py` | Adds correctness tests comparing native FP8 vs FP16/FP32 accumulation across algorithms. |
| `examples/torch-integration/customized_allgather.cu` | Updates example callback signature to accept (unused) accumDtype. |
| `examples/customized-collective-algorithm/customized_allgather.cu` | Updates example callback signature to accept (unused) accumDtype. |
| `docs/guide/mscclpp-torch-integration.md` | Updates documentation example callback signature to accept (unused) accumDtype. |
| `.azure-pipelines/templates/ut.yml` | Adds FP8 accumulation pytest invocation in CI. |
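The runtime defaulting noted for `src/core/algorithm.cc` (resolving `accumDtype == AUTO` to the input dtype before kernel dispatch) amounts to a small fallback step. A minimal Python sketch with a hypothetical enum mirroring `DataType` (names for illustration only):

```python
from enum import Enum, auto


class DataType(Enum):
    # Hypothetical mirror of mscclpp::DataType, for illustration only.
    FLOAT16 = auto()
    FLOAT32 = auto()
    FLOAT8_E4M3 = auto()
    FLOAT8_E4B15 = auto()
    AUTO = auto()


def resolve_accum_dtype(input_dtype: DataType, accum_dtype: DataType) -> DataType:
    # AUTO (the default) falls back to accumulating in the input dtype,
    # so existing callers keep the old behavior unless they opt in.
    if accum_dtype is DataType.AUTO:
        return input_dtype
    return accum_dtype


print(resolve_accum_dtype(DataType.FLOAT8_E4B15, DataType.AUTO))     # DataType.FLOAT8_E4B15
print(resolve_accum_dtype(DataType.FLOAT8_E4B15, DataType.FLOAT32))  # DataType.FLOAT32
```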
/azp run

Azure Pipelines successfully started running 4 pipeline(s).
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##            main     #765     +/-  ##
=========================================
+ Coverage   59.53%   68.26%   +8.73%
=========================================
  Files          57       57
  Lines        5281     5288       +7
=========================================
+ Hits         3144     3610     +466
+ Misses       2137     1678     -459
```
Flags with carried forward coverage won't be shown.
|
Summary
- `fp8_e4m3b15` datatype: a software-defined FP8 type with 4 exponent bits, 3 mantissa bits, and bias = 15 (max finite value: 0.9375). Implemented entirely in software with no HW dependency, using Triton-style bit manipulation through fp16 as an intermediate for efficient conversion.
- Mixed-precision accumulation via the new `accumDtype` parameter, enabling FP8 inputs to be reduced in float16 or float32 for higher accuracy.
- `accumDtype` threaded through the full API: the new parameter flows from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc` → dispatch → CUDA kernels, with `DataType::AUTO` as the default (resolves to the input dtype at runtime).
- `test_fp8_accum.py` validates that higher-precision accumulation produces results at least as accurate as native FP8 accumulation across multiple algorithms and sizes. Skipped on CUDA SM < 89 (pre-Ada); runs on HIP/ROCm.
- `test_fp8_accum.py` added to CI: the Azure Pipeline `ut.yml` now runs the FP8 accumulation tests alongside the existing pytests.
- Converted printf-style `WARN`/`INFO` calls to streaming-style logging.

Key files

- `include/mscclpp/gpu_data_types.hpp`, `src/core/include/reduce_kernel.hpp`
- `accumDtype` plumbing: `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc`
- `src/ext/collectives/allreduce/*.cu`, `src/ext/collectives/include/allreduce/common.hpp`
- `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py`
- `python/test/test_fp8_accum.py`
- `.azure-pipelines/templates/ut.yml`

Test plan
- Existing `test_mscclpp.py` tests continue to pass
- `accumDtype` defaults

🤖 Generated with Claude Code