
Support E4M3B15 datatype #765

Open
Binyang2014 wants to merge 33 commits into main from binyli/handle-fix

Conversation

Contributor

@Binyang2014 commented Mar 26, 2026

Summary

  • Add fp8_e4m3b15 datatype: A software-defined FP8 type with 4 exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375). Implemented entirely in software with no HW dependency, using Triton-style bit manipulation through fp16 as intermediate for efficient conversion.
  • Add mixed-precision accumulation for allreduce: All allreduce algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and others) now support a configurable accumDtype parameter, enabling FP8 inputs to be reduced in float16 or float32 for higher accuracy.
  • Propagate accumDtype through the full API: The new parameter is threaded from Algorithm::execute() → NativeAlgorithm::KernelFunc → dispatch → CUDA kernels, with DataType::AUTO as the default (resolved to the input dtype at runtime).
  • Add FP8 accumulation correctness tests: New test_fp8_accum.py validates that higher-precision accumulation produces results at least as accurate as native FP8 accumulation across multiple algorithms and sizes. Skipped on CUDA SM < 89 (pre-Ada, no FP8 hardware support); runs on HIP/ROCm.
  • Add test_fp8_accum.py to CI: Azure Pipeline ut.yml now runs FP8 accumulation tests alongside existing pytests.
  • NCCL shim logging cleanup: Migrated printf-style WARN/INFO calls to streaming-style logging.
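The e4m3b15 layout shares fp16's exponent bias of 15, which is what makes the Triton-style conversion through fp16 cheap: widening is a pure bit shift, and even subnormals line up. A minimal Python sketch of the idea (illustrative only; the actual kernels operate on packed vectors in CUDA):

```python
import struct

def e4m3b15_to_fp16(b: int) -> float:
    """Widen one e4m3b15 byte (1 sign, 4 exponent, 3 mantissa bits, bias 15)
    to an fp16 value.

    fp16 uses the same exponent bias (15), so the exponent and mantissa
    fields map onto fp16's 5-bit/10-bit fields by a plain 7-bit left shift.
    """
    bits16 = ((b & 0x80) << 8) | ((b & 0x7F) << 7)
    return struct.unpack("<e", struct.pack("<H", bits16))[0]
```

For example, 0x77 (sign 0, exponent field 14, mantissa 7) widens to 2^(14-15) * (1 + 7/8) = 0.9375, the format's stated max finite value.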

Key files

| Area | Files |
| --- | --- |
| New datatype + vector ops | include/mscclpp/gpu_data_types.hpp |
| Accumulation reduce helpers | src/core/include/reduce_kernel.hpp |
| Algorithm API (accumDtype) | include/mscclpp/algorithm.hpp, src/core/algorithm.cc |
| Allreduce kernels | src/ext/collectives/allreduce/*.cu |
| Dispatch + common | src/ext/collectives/include/allreduce/common.hpp |
| Python bindings | python/csrc/algorithm.cpp, python/mscclpp/_core/algorithm.py |
| Tests | python/test/test_fp8_accum.py |
| CI | .azure-pipelines/templates/ut.yml |

Test plan

  • CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15 accumulation tests
  • CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped
  • CI passes on MI300X (ROCm) — FP8 tests run via HIP
  • Existing test_mscclpp.py tests continue to pass
  • NCCL shim builds and runs correctly with new accumDtype defaults
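The property the tests check — that accumulating in a wider type is at least as accurate as chained FP8 rounding — can be seen with a toy round-to-nearest-even quantizer for the e4m3b15 grid (a hypothetical model, not the library's conversion code): once the running sum's grid spacing exceeds the addend, native-precision accumulation stalls, while a wide accumulator rounds only once at the end.

```python
import math

MAX_E4M3B15 = 0.9375  # largest finite value: 2**-1 * (1 + 7/8)

def quantize(x: float) -> float:
    """Round x to the nearest e4m3b15 value (toy model, ties-to-even)."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    a = min(abs(x), MAX_E4M3B15)            # saturate at the max finite value
    e = max(math.floor(math.log2(a)), -14)  # -14 = subnormal exponent
    return s * (round(a / 2.0**e * 8) / 8) * 2.0**e  # 3 mantissa bits

u = 2.0**-9              # exactly representable in e4m3b15
native = 0.0
for _ in range(64):      # chained FP8 accumulation of 64 * u
    native = quantize(native + u)

wide = quantize(64 * u)  # accumulate in wide precision, round once at the end
```

Here `native` stalls at 2**-5 = 0.03125 (once the grid step reaches 2*u, each add rounds back down), while `wide` yields the exact sum 0.125.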

🤖 Generated with Claude Code

Binyang2014 and others added 18 commits March 14, 2026 04:09
Remove the contextKey parameter from execute() and reset() across
the C++ core, Python bindings, and Python API. Revert AlgorithmCtxKey
to the simple flat struct and remove customKeyMap_ caching logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Binyang2014 changed the title from "Binyli/handle fix" to "Support E4M3B15 datatype" on Mar 26, 2026
Binyang2014 and others added 11 commits March 26, 2026 03:50
…small sizes

Replace the standalone script in examples/torch-integration/ with a proper
pytest test under python/test/. Also scale nblocks for rsag_zero_copy based
on data size to avoid hangs with small buffers (e.g. size=1024 with 8 ranks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add multi-algorithm parametrization (packet, nvls_packet, fullmesh,
rsag_zero_copy) and float32 accumulation to test_fp8_e4m3_accum,
matching the coverage of test_fp8_e4m3b15_accum. This exercises the
new x8/x16 decomposed conversion specializations used by fullmesh
and rsag_zero_copy kernels (int4 = 16 fp8 elements).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Binyang2014 marked this pull request as ready for review March 30, 2026 21:35
@Binyang2014 requested a review from Copilot March 30, 2026 21:35
Contributor

Copilot AI left a comment


Pull request overview

This PR extends MSCCL++'s datatype and allreduce execution pipeline with a new software-defined FP8 format (fp8_e4m3b15) and a configurable mixed-precision accumulation parameter (accumDtype) threaded end-to-end from the public API to kernel dispatch, and adds FP8 accumulation correctness tests wired into CI.

Changes:

  • Add DataType::FLOAT8_E4B15 (__fp8_e4m3b15) with vector conversions and reduction support infrastructure.
  • Add accumDtype plumbed through C++/Python APIs and allreduce algorithm variants to enable FP8 accumulation in FP16/FP32.
  • Add test_fp8_accum.py and run it in Azure Pipelines; convert NCCL shim logging to streaming-style macros.
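The DataType::AUTO default resolves to the input dtype at runtime, which keeps existing callers unchanged. A small Python model of that resolution rule (names are illustrative, not the actual binding API):

```python
from enum import Enum, auto

class DataType(Enum):
    # Subset of dtypes relevant here; AUTO is the new sentinel default.
    AUTO = auto()
    FLOAT16 = auto()
    FLOAT32 = auto()
    FLOAT8_E4M3B15 = auto()

def resolve_accum_dtype(input_dtype: DataType,
                        accum_dtype: DataType = DataType.AUTO) -> DataType:
    # AUTO falls back to the input dtype, preserving pre-accumDtype behavior.
    return input_dtype if accum_dtype is DataType.AUTO else accum_dtype
```

So `resolve_accum_dtype(DataType.FLOAT8_E4M3B15)` reduces natively in FP8, while passing `accum_dtype=DataType.FLOAT32` upgrades the accumulation precision without changing the input format.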

Reviewed changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/ext/nccl/nccl.cc | Converts NCCL shim WARN/INFO logging to streaming-style logging. |
| src/ext/nccl/datatype_conversion.hpp | Adds size mapping and NCCL-datatype conversion handling for FLOAT8_E4B15. |
| src/ext/nccl/algorithm_selector.cc | Updates NVLS support heuristics to treat FLOAT8_E4B15 as FP8. |
| src/ext/collectives/include/allreduce/common.hpp | Refactors allreduce dispatch to support FP8 accumulation type selection via accumDtype. |
| src/ext/collectives/include/allreduce/allreduce_rsag.hpp | Threads accumDtype into RSAG allreduce kernel function signature. |
| src/ext/collectives/include/allreduce/allreduce_rsag_zero_copy.hpp | Threads accumDtype into RSAG zero-copy allreduce kernel function signature. |
| src/ext/collectives/include/allreduce/allreduce_rsag_pipeline.hpp | Threads accumDtype into RSAG pipeline allreduce kernel function signature. |
| src/ext/collectives/include/allreduce/allreduce_packet.hpp | Threads accumDtype into packet allreduce kernel function signature. |
| src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp | Threads accumDtype into NVLS zero-copy allreduce kernel function signature. |
| src/ext/collectives/include/allreduce/allreduce_nvls_warp_pipeline.hpp | Threads accumDtype into NVLS warp-pipeline allreduce kernel function signature. |
| src/ext/collectives/include/allreduce/allreduce_nvls_packet.hpp | Threads accumDtype; caches NVLS switch channels on the builder. |
| src/ext/collectives/include/allreduce/allreduce_nvls_block_pipeline.hpp | Threads accumDtype into NVLS block-pipeline allreduce kernel function signature. |
| src/ext/collectives/include/allreduce/allreduce_fullmesh.hpp | Threads accumDtype into fullmesh allreduce kernel function signature. |
| src/ext/collectives/include/allreduce/allreduce_allpair_packet.hpp | Threads accumDtype into allpair-packet allreduce kernel function signature. |
| src/ext/collectives/allreduce/allreduce_rsag.cu | Updates RSAG adapter/dispatch to accept AccumT and use accumDtype. |
| src/ext/collectives/allreduce/allreduce_rsag_zero_copy.cu | Implements mixed-precision accumulation via upcast/accumulate/downcast vector helpers. |
| src/ext/collectives/allreduce/allreduce_rsag_pipeline.cu | Updates RSAG pipeline adapter/dispatch to accept AccumT and accumDtype. |
| src/ext/collectives/allreduce/allreduce_packet.cu | Adds mixed-precision accumulation path for packet allreduce; updates logging include. |
| src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu | Adds AccumT template param and rejects software FP8 type for NVLS. |
| src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu | Adds AccumT template param and rejects software FP8 type for NVLS. |
| src/ext/collectives/allreduce/allreduce_nvls_packet.cu | Adds mixed-precision accumulation; reuses pre-setup switch channels; updates logging include. |
| src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu | Adds AccumT template param and rejects software FP8 type for NVLS. |
| src/ext/collectives/allreduce/allreduce_fullmesh.cu | Adds mixed-precision accumulation in fullmesh reduction loop. |
| src/ext/collectives/allreduce/allreduce_allpair_packet.cu | Updates adapter dispatch to accept accumDtype. |
| src/ext/collectives/allgather/allgather_fullmesh.cu | Extends builder callback signature to accept (unused) accumDtype. |
| src/ext/collectives/allgather/allgather_fullmesh_2.cu | Extends builder callback signature to accept (unused) accumDtype. |
| src/core/include/reduce_kernel.hpp | Adds mixed-precision vector upcast/downcast/accumulation helpers and f32x2 specializations. |
| src/core/include/execution_kernel.hpp | Adds FLOAT8_E4B15 and AUTO handling in HIP execution-kernel dispatch. |
| src/core/executor/execution_kernel.cu | Adds FLOAT8_E4B15 and AUTO handling in CUDA execution-kernel dispatch. |
| src/core/algorithm.cc | Resolves accumDtype==AUTO at runtime and threads it through NativeAlgorithm kernel launch. |
| include/mscclpp/gpu_data_types.hpp | Introduces __fp8_e4m3b15, vector types, conversions, and arithmetic/min support; adds DataType::AUTO. |
| include/mscclpp/algorithm.hpp | Extends Algorithm::execute / NativeAlgorithm::KernelFunc to accept accumDtype with AUTO default. |
| python/mscclpp/_core/algorithm.py | Adds accum_dtype parameter to Python Algorithm.execute and forwards it to bindings. |
| python/csrc/core_py.cpp | Exposes DataType.float8_e4m3b15 to Python. |
| python/csrc/algorithm.cpp | Extends Python binding execute signature to pass accum_dtype (default AUTO). |
| python/csrc/gpu_utils_py.cpp | Adds DLPack dtype handling for Torch FP8 dtypes and raw-byte representation for fp8_e4m3b15. |
| python/test/test_fp8_accum.py | Adds correctness tests comparing native FP8 vs FP16/FP32 accumulation across algorithms. |
| examples/torch-integration/customized_allgather.cu | Updates example callback signature to accept (unused) accumDtype. |
| examples/customized-collective-algorithm/customized_allgather.cu | Updates example callback signature to accept (unused) accumDtype. |
| docs/guide/mscclpp-torch-integration.md | Updates documentation example callback signature to accept (unused) accumDtype. |
| .azure-pipelines/templates/ut.yml | Adds FP8 accumulation pytest invocation in CI. |

@Binyang2014
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@Binyang2014 requested a review from a team March 30, 2026 22:12

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.26%. Comparing base (93f6eea) to head (9ca31c1).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/core/algorithm.cc | 0.00% | 7 Missing ⚠️ |
| src/core/include/execution_kernel.hpp | 0.00% | 5 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #765      +/-   ##
==========================================
+ Coverage   59.53%   68.26%   +8.73%
==========================================
  Files          57       57
  Lines        5281     5288       +7
==========================================
+ Hits         3144     3610     +466
+ Misses       2137     1678     -459
```

| Flag | Coverage | Δ |
| --- | --- | --- |
| cuda-90 | 69.88% <0.00%> | ? |
| rocm-gfx942 | 63.36% <0.00%> | -0.08% ⬇️ |

Flags with carried forward coverage won't be shown.
