Conversation
- Remove the `contextKey` parameter from `execute()` and `reset()` across the C++ core, Python bindings, and Python API. Revert `AlgorithmCtxKey` to the simple flat struct and remove the `customKeyMap_` caching logic. (Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>)
- …small sizes: Replace the standalone script in `examples/torch-integration/` with a proper pytest test under `python/test/`. Also scale `nblocks` for `rsag_zero_copy` based on data size to avoid hangs with small buffers (e.g. `size=1024` with 8 ranks).
- Add multi-algorithm parametrization (`packet`, `nvls_packet`, `fullmesh`, `rsag_zero_copy`) and float32 accumulation to `test_fp8_e4m3_accum`, matching the coverage of `test_fp8_e4m3b15_accum`. This exercises the new x8/x16 decomposed conversion specializations used by the `fullmesh` and `rsag_zero_copy` kernels (one `int4` = 16 fp8 elements).
Pull request overview
This PR extends MSCCL++’s datatype and allreduce execution pipeline to support a new software-defined FP8 format (fp8_e4m3b15) and introduces configurable mixed-precision accumulation (accumDtype) that is threaded end-to-end from public API to kernel dispatch, plus adds FP8 accumulation correctness tests and wires them into CI.
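The numerical motivation for configurable accumulation can be seen off-GPU: when a reduction accumulates directly in a narrow float type, the running sum stalls once it outgrows the type's spacing. A minimal pure-Python illustration (not MSCCL++ code) using float16 via the `struct` module as a stand-in for FP8:

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest float16 value (stand-in for a narrow accumulator dtype)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_narrow(values):
    # Round after every addition, as a kernel accumulating in the storage dtype would.
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)
    return acc


def sum_wide(values):
    # Accumulate in full precision, round only once at the end.
    acc = 0.0
    for v in values:
        acc += v
    return to_half(acc)


values = [1.0] * 4096
print(sum_narrow(values))  # 2048.0: above 2048 the float16 spacing is 2, so +1.0 rounds away
print(sum_wide(values))    # 4096.0
```

This stagnation effect is exactly what reducing FP8 inputs in a wider `accumDtype` (float16 or float32) mitigates.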
Changes:
- Add `DataType::FLOAT8_E4B15` (`__fp8_e4m3b15`) with vector conversions and reduction support infrastructure.
- Add `accumDtype`, plumbed through the C++/Python APIs and allreduce algorithm variants to enable FP8 accumulation in FP16/FP32.
- Add `test_fp8_accum.py` and run it in Azure Pipelines; convert NCCL shim logging to streaming-style macros.
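For intuition about the new datatype's value range, here is a rough pure-Python decoder for an e4m3 layout with bias 15. It assumes an IEEE-like interpretation in which the top exponent field (15) is reserved, which makes `0x77` the largest finite code and reproduces the 0.9375 maximum quoted below; this is an illustrative model, not the library's exact bit semantics.

```python
def fp8_e4m3b15_to_float(byte: int) -> float:
    """Decode one fp8_e4m3b15 byte: 1 sign bit, 4 exponent bits (bias 15), 3 mantissa bits.

    Illustrative model only: the top exponent field (15) is assumed reserved,
    so exponent field 14 with mantissa 0b111 gives the maximum finite value 0.9375.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:
        # Subnormal: no implicit leading 1, minimum exponent.
        return sign * (man / 8.0) * 2.0 ** (1 - 15)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 15)


print(fp8_e4m3b15_to_float(0x77))  # 0.9375 (exponent field 14, mantissa 0b111)
print(fp8_e4m3b15_to_float(0x70))  # 0.5
print(fp8_e4m3b15_to_float(0xF7))  # -0.9375
```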
Reviewed changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/ext/nccl/nccl.cc` | Converts NCCL shim WARN/INFO logging to streaming-style logging. |
| `src/ext/nccl/datatype_conversion.hpp` | Adds size mapping and NCCL-datatype conversion handling for FLOAT8_E4B15. |
| `src/ext/nccl/algorithm_selector.cc` | Updates NVLS support heuristics to treat FLOAT8_E4B15 as FP8. |
| `src/ext/collectives/include/allreduce/common.hpp` | Refactors allreduce dispatch to support FP8 accumulation type selection via accumDtype. |
| `src/ext/collectives/include/allreduce/allreduce_rsag.hpp` | Threads accumDtype into RSAG allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_rsag_zero_copy.hpp` | Threads accumDtype into RSAG zero-copy allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_rsag_pipeline.hpp` | Threads accumDtype into RSAG pipeline allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_packet.hpp` | Threads accumDtype into packet allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp` | Threads accumDtype into NVLS zero-copy allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_nvls_warp_pipeline.hpp` | Threads accumDtype into NVLS warp-pipeline allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_nvls_packet.hpp` | Threads accumDtype; caches NVLS switch channels on the builder. |
| `src/ext/collectives/include/allreduce/allreduce_nvls_block_pipeline.hpp` | Threads accumDtype into NVLS block-pipeline allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_fullmesh.hpp` | Threads accumDtype into fullmesh allreduce kernel function signature. |
| `src/ext/collectives/include/allreduce/allreduce_allpair_packet.hpp` | Threads accumDtype into allpair-packet allreduce kernel function signature. |
| `src/ext/collectives/allreduce/allreduce_rsag.cu` | Updates RSAG adapter/dispatch to accept AccumT and use accumDtype. |
| `src/ext/collectives/allreduce/allreduce_rsag_zero_copy.cu` | Implements mixed-precision accumulation via upcast/accumulate/downcast vector helpers. |
| `src/ext/collectives/allreduce/allreduce_rsag_pipeline.cu` | Updates RSAG pipeline adapter/dispatch to accept AccumT and accumDtype. |
| `src/ext/collectives/allreduce/allreduce_packet.cu` | Adds mixed-precision accumulation path for packet allreduce; updates logging include. |
| `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` | Adds AccumT template param and rejects software FP8 type for NVLS. |
| `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu` | Adds AccumT template param and rejects software FP8 type for NVLS. |
| `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` | Adds mixed-precision accumulation; reuses pre-setup switch channels; updates logging include. |
| `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu` | Adds AccumT template param and rejects software FP8 type for NVLS. |
| `src/ext/collectives/allreduce/allreduce_fullmesh.cu` | Adds mixed-precision accumulation in fullmesh reduction loop. |
| `src/ext/collectives/allreduce/allreduce_allpair_packet.cu` | Updates adapter dispatch to accept accumDtype. |
| `src/ext/collectives/allgather/allgather_fullmesh.cu` | Extends builder callback signature to accept (unused) accumDtype. |
| `src/ext/collectives/allgather/allgather_fullmesh_2.cu` | Extends builder callback signature to accept (unused) accumDtype. |
| `src/core/include/reduce_kernel.hpp` | Adds mixed-precision vector upcast/downcast/accumulation helpers and f32x2 specializations. |
| `src/core/include/execution_kernel.hpp` | Adds FLOAT8_E4B15 and AUTO handling in HIP execution-kernel dispatch. |
| `src/core/executor/execution_kernel.cu` | Adds FLOAT8_E4B15 and AUTO handling in CUDA execution-kernel dispatch. |
| `src/core/algorithm.cc` | Resolves accumDtype==AUTO at runtime and threads it through NativeAlgorithm kernel launch. |
| `include/mscclpp/gpu_data_types.hpp` | Introduces `__fp8_e4m3b15`, vector types, conversions, and arithmetic/min support; adds DataType::AUTO. |
| `include/mscclpp/algorithm.hpp` | Extends Algorithm::execute / NativeAlgorithm::KernelFunc to accept accumDtype with AUTO default. |
| `python/mscclpp/_core/algorithm.py` | Adds accum_dtype parameter to Python Algorithm.execute and forwards it to bindings. |
| `python/csrc/core_py.cpp` | Exposes DataType.float8_e4m3b15 to Python. |
| `python/csrc/algorithm.cpp` | Extends Python binding execute signature to pass accum_dtype (default AUTO). |
| `python/csrc/gpu_utils_py.cpp` | Adds DLPack dtype handling for Torch FP8 dtypes and raw-byte representation for fp8_e4m3b15. |
| `python/test/test_fp8_accum.py` | Adds correctness tests comparing native FP8 vs FP16/FP32 accumulation across algorithms. |
| `examples/torch-integration/customized_allgather.cu` | Updates example callback signature to accept (unused) accumDtype. |
| `examples/customized-collective-algorithm/customized_allgather.cu` | Updates example callback signature to accept (unused) accumDtype. |
| `docs/guide/mscclpp-torch-integration.md` | Updates documentation example callback signature to accept (unused) accumDtype. |
| `.azure-pipelines/templates/ut.yml` | Adds FP8 accumulation pytest invocation in CI. |
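The runtime defaulting noted for `src/core/algorithm.cc` (resolving `accumDtype == AUTO` to the input dtype before kernel dispatch) amounts to a small fallback step. A minimal Python sketch with a hypothetical enum mirroring `DataType` (names for illustration only):

```python
from enum import Enum, auto


class DataType(Enum):
    # Hypothetical mirror of mscclpp::DataType, for illustration only.
    FLOAT16 = auto()
    FLOAT32 = auto()
    FLOAT8_E4M3 = auto()
    FLOAT8_E4B15 = auto()
    AUTO = auto()


def resolve_accum_dtype(input_dtype: DataType, accum_dtype: DataType) -> DataType:
    # AUTO (the default) falls back to accumulating in the input dtype,
    # so existing callers keep the old behavior unless they opt in.
    if accum_dtype is DataType.AUTO:
        return input_dtype
    return accum_dtype


print(resolve_accum_dtype(DataType.FLOAT8_E4B15, DataType.AUTO))     # DataType.FLOAT8_E4B15
print(resolve_accum_dtype(DataType.FLOAT8_E4B15, DataType.FLOAT32))  # DataType.FLOAT32
```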
/azp run

Azure Pipelines successfully started running 4 pipeline(s).
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##            main     #765     +/-  ##
=========================================
+ Coverage   59.53%   68.26%   +8.73%
=========================================
  Files          57       57
  Lines        5281     5288       +7
=========================================
+ Hits         3144     3610     +466
+ Misses       2137     1678     -459
```
Flags with carried forward coverage won't be shown.
|
Summary
- `fp8_e4m3b15` datatype: a software-defined FP8 type with 4 exponent bits, 3 mantissa bits, and bias = 15 (max finite value: 0.9375). Implemented entirely in software with no HW dependency, using Triton-style bit manipulation through fp16 as an intermediate for efficient conversion.
- Mixed-precision accumulation via the new `accumDtype` parameter, enabling FP8 inputs to be reduced in float16 or float32 for higher accuracy.
- `accumDtype` threaded through the full API: the new parameter flows from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc` → dispatch → CUDA kernels, with `DataType::AUTO` as the default (resolves to the input dtype at runtime).
- `test_fp8_accum.py` validates that higher-precision accumulation produces results at least as accurate as native FP8 accumulation across multiple algorithms and sizes. Skipped on CUDA SM < 89 (pre-Ada); runs on HIP/ROCm.
- `test_fp8_accum.py` added to CI: the Azure Pipeline `ut.yml` now runs the FP8 accumulation tests alongside the existing pytests.
- Converted printf-style `WARN`/`INFO` calls to streaming-style logging.

Key files

- `include/mscclpp/gpu_data_types.hpp`, `src/core/include/reduce_kernel.hpp`
- `accumDtype` plumbing: `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc`
- `src/ext/collectives/allreduce/*.cu`, `src/ext/collectives/include/allreduce/common.hpp`
- `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py`
- `python/test/test_fp8_accum.py`
- `.azure-pipelines/templates/ut.yml`

Test plan
- Existing `test_mscclpp.py` tests continue to pass
- `accumDtype` defaults

🤖 Generated with Claude Code