Description
Summary
To speed up LLM inference across multiple Xeon CPUs, we would like oneCCL to be used to optimize the existing PyTorch native communication backend for CPUs (gloo).
Use case
Tensor Parallel is key for running LLM inference on multiple CPUs, and its performance relies on the efficiency of the communication backend's implementations of collective ops such as allreduce, allgather, and gather (see the sketch after the links below).
PyTorch communication backends:
PyTorch + Transformers Tensor Parallel use case and status on CPUs:
As the links above show, the current PyTorch native communication backend for CPU (gloo) is far from performant for this use case and needs to be optimized.
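
For context, here is a minimal sketch of where these collectives sit in a tensor-parallel forward pass and how they hit the gloo backend today. It assumes a `torchrun` launch (which sets `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`); the hidden size and tensor shapes are purely illustrative.

```python
import os
import torch
import torch.distributed as dist


def tensor_parallel_step(rank: int, world_size: int) -> None:
    # In tensor parallelism each rank computes a partial result of a
    # sharded matmul; the partials are summed with all_reduce on every
    # layer, so this collective is on the critical path of inference.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    partial = torch.randn(1, 4096)  # illustrative hidden size
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # hot path discussed above

    dist.destroy_process_group()


if __name__ == "__main__":
    # Launch with: torchrun --nproc-per-node=2 this_script.py
    tensor_parallel_step(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))
```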
Proposed solution
Use oneCCL's optimized communication ops such as allreduce, allgather, and gather to replace the existing implementations in PyTorch gloo (a sketch of oneCCL-backed collectives follows).
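
For reference, oneCCL-backed collectives are already reachable from PyTorch through the `oneccl_bindings_for_pytorch` package, which registers a `"ccl"` process-group backend. The sketch below shows that path as a rough illustration of the same collective call sites running on oneCCL; it is not the gloo-internal replacement proposed here, and it assumes the bindings are installed and the process group is launched via `torchrun` or `mpirun` with the usual rendezvous environment variables set.

```python
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend

# Same collective call sites as with gloo; only the backend string changes.
dist.init_process_group(backend="ccl")

t = torch.ones(4096)
dist.all_reduce(t)  # dispatched to oneCCL's optimized allreduce

dist.destroy_process_group()
```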
Justification
This feature will improve the out-of-the-box (OOB) performance of LLM inference on Xeon CPUs with the PyTorch + Transformers ecosystem.