Description
Summary
To speed up LLM inference across multiple Xeon CPUs, we would like oneCCL to be used to optimize the existing PyTorch native communication backend for CPUs (gloo).
Use case
Tensor Parallel is key for running LLM inference on multiple CPUs, and its performance relies on the efficiency of the communication backend's implementations of collective ops such as allreduce, allgather, and gather (see the sketch after the links below).
PyTorch communication backends:
PyTorch + Transformers Tensor Parallel use case and status on CPUs:
As the links above show, the current PyTorch native communication backend for CPU (gloo) is far from performant for this use case and needs to be optimized.
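
For context, here is a minimal sketch of where these collectives sit in a tensor-parallel forward pass and how they hit the gloo backend today. It assumes a `torchrun` launch (which sets `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`); the hidden size and tensor shapes are purely illustrative.

```python
import os
import torch
import torch.distributed as dist


def tensor_parallel_step(rank: int, world_size: int) -> None:
    # In tensor parallelism each rank computes a partial result of a
    # sharded matmul; the partials are summed with all_reduce on every
    # layer, so this collective is on the critical path of inference.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    partial = torch.randn(1, 4096)  # illustrative hidden size
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # hot path discussed above

    dist.destroy_process_group()


if __name__ == "__main__":
    # Launch with: torchrun --nproc-per-node=2 this_script.py
    tensor_parallel_step(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))
```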
Proposed solution
Use oneCCL's optimized communication ops such as allreduce, allgather, and gather to replace the existing implementations in PyTorch gloo (a sketch of oneCCL-backed collectives follows).
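
For reference, oneCCL-backed collectives are already reachable from PyTorch through the `oneccl_bindings_for_pytorch` package, which registers a `"ccl"` process-group backend. The sketch below shows that path as a rough illustration of the same collective call sites running on oneCCL; it is not the gloo-internal replacement proposed here, and it assumes the bindings are installed and the process group is launched via `torchrun` or `mpirun` with the usual rendezvous environment variables set.

```python
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend

# Same collective call sites as with gloo; only the backend string changes.
dist.init_process_group(backend="ccl")

t = torch.ones(4096)
dist.all_reduce(t)  # dispatched to oneCCL's optimized allreduce

dist.destroy_process_group()
```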
Justification
This feature will improve the out-of-the-box (OOB) performance of LLM inference on Xeon CPUs with the PyTorch + Transformers ecosystem.