[Request] Optimize gloo with oneCCL for CPUs #168

@jianan-gu

Description

Summary

To speed up LLM inference across multiple Xeon CPUs, we request that oneCCL be used to optimize gloo, the existing native PyTorch communication backend for CPUs.

Use case

Tensor Parallel is key for running LLM inference on multiple CPUs, and its performance relies on the efficiency of the communication backend's implementations of collective ops such as allreduce, allgather, and gather (a minimal sketch follows the links below).

PyTorch communication backends:

PyTorch + Transformers Tensor Parallel use case and status on CPUs:

As the links listed above show, the current PyTorch native communication backend for CPU (gloo) is far from performant for this use case and needs to be optimized.
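
For context, here is a minimal sketch of the kind of collective call that dominates tensor-parallel inference, run on the current gloo backend. The tensor shape and values are illustrative, not from the issue; each rank holds a partial result and combines it with allreduce:

```python
import torch
import torch.distributed as dist

# Illustrative sketch: sum per-rank partial outputs with allreduce,
# the dominant collective in tensor-parallel inference.
# Launch with: torchrun --nproc_per_node=2 this_script.py
dist.init_process_group(backend="gloo")  # current native CPU backend
rank = dist.get_rank()

partial = torch.full((4,), float(rank))          # stand-in for a per-rank partial result
dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # combine partials across ranks

print(f"rank {rank}: {partial.tolist()}")
dist.destroy_process_group()
```

Every matmul sharded across ranks in a tensor-parallel layer ends in a collective like this, so the latency of the backend's allreduce directly bounds end-to-end inference throughput.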

Proposed solution

Replace the existing collective implementations in PyTorch gloo with the optimized communication ops from oneCCL, such as allreduce, allgather, and gather.
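
For reference, oneCCL collectives are already reachable from PyTorch today through the separate oneccl_bindings_for_pytorch package, which registers a "ccl" backend. The sketch below is not the change proposed here, but it illustrates the call path whose performance this request would bring to unmodified gloo users:

```python
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401 -- importing registers the "ccl" backend

# Same collective as the gloo sketch above, but dispatched to oneCCL.
# Launch with: torchrun --nproc_per_node=2 this_script.py
dist.init_process_group(backend="ccl")
rank = dist.get_rank()

partial = torch.full((4,), float(rank))
dist.all_reduce(partial, op=dist.ReduceOp.SUM)

print(f"rank {rank}: {partial.tolist()}")
dist.destroy_process_group()
```

The difference with this proposal is that users would not need to install a separate package or change backend="gloo" in their scripts: gloo itself would call oneCCL's optimized collectives internally, so existing code gets the speedup out of the box.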

Justification

This feature will improve the out-of-the-box (OOB) performance of LLM inference on Xeon CPUs using the PyTorch + Transformers ecosystem.
