feat: enable HTTP completion endpoint to accept arrays of prompts and generate multiple completions per prompt #3953
base: main
Conversation
Enable the completion endpoint to accept arrays of prompts and generate `n` completions per prompt, matching vLLM behavior.

- Add utility functions to handle prompt arrays (`get_prompt_batch_size`, `extract_single_prompt`)
- Implement batch processing in the HTTP handler with proper choice index remapping
- Add validation for total choices (batch_size × n ≤ 128)
- Generate a unique request_id for each prompt to avoid conflicts
- Add comprehensive tests for batch prompts and `n` parameter combinations
- Maintain backward compatibility with single-prompt requests

The choice index formula matches vLLM: `final_index = prompt_idx * n + choice_idx`. Example: 3 prompts with n=2 yields indices 0,1 (prompt 0), 2,3 (prompt 1), 4,5 (prompt 2).
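The remapping formula above can be sketched as follows; this is a minimal illustration of the index math described in the PR, not the actual implementation, and the function name is hypothetical:

```rust
// Hypothetical sketch of the choice index remapping described in the PR:
// final_index = prompt_idx * n + choice_idx, where `n` is the number of
// completions requested per prompt and both indices are zero-based.
fn remap_choice_index(prompt_idx: usize, n: usize, choice_idx: usize) -> usize {
    prompt_idx * n + choice_idx
}

fn main() {
    // 3 prompts with n = 2 yield indices 0,1 / 2,3 / 4,5.
    let mut indices = Vec::new();
    for prompt_idx in 0..3 {
        for choice_idx in 0..2 {
            indices.push(remap_choice_index(prompt_idx, 2, choice_idx));
        }
    }
    assert_eq!(indices, vec![0, 1, 2, 3, 4, 5]);
    println!("{:?}", indices);
}
```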
Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
…eature-parity-testingllama-33
Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
Walkthrough
The pull request implements batch-aware handling for LLM completions by introducing detection logic that routes single-prompt and multi-prompt requests through dedicated code paths. Batch utilities extract and validate prompts, enforce a total-choices limit, and support per-prompt choice remapping with streaming and annotation handling.
Overview:
This PR enables the completion endpoint to accept arrays of prompts and generate multiple completions per prompt.
Details:
Where should the reviewer start?
lib/llm/src/protocols/openai/completions.rs - contains the new validation logic and utility functions
lib/llm/src/http/service/openai.rs - contains the batch processing implementation with choice index remapping
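The utilities and validation mentioned above could look roughly like the following sketch. The function names mirror the PR description, but the `Prompt` type, signatures, and the error type are assumptions for illustration:

```rust
// Hypothetical sketch of the batch utilities and the total-choices check
// described in the PR; not the actual code in completions.rs.
const MAX_TOTAL_CHOICES: usize = 128;

/// A completion prompt may be a single string or an array of strings.
enum Prompt {
    Single(String),
    Batch(Vec<String>),
}

/// Number of prompts in the request (1 for the single-prompt path).
fn get_prompt_batch_size(prompt: &Prompt) -> usize {
    match prompt {
        Prompt::Single(_) => 1,
        Prompt::Batch(prompts) => prompts.len(),
    }
}

/// Pull out the prompt at `idx` so each one can be dispatched separately.
fn extract_single_prompt(prompt: &Prompt, idx: usize) -> Option<&str> {
    match prompt {
        Prompt::Single(s) if idx == 0 => Some(s.as_str()),
        Prompt::Batch(prompts) => prompts.get(idx).map(String::as_str),
        _ => None,
    }
}

/// Enforce batch_size * n <= 128 before fanning out per-prompt requests.
fn validate_total_choices(batch_size: usize, n: usize) -> Result<(), String> {
    if batch_size * n > MAX_TOTAL_CHOICES {
        Err(format!(
            "batch_size ({batch_size}) * n ({n}) exceeds limit of {MAX_TOTAL_CHOICES}"
        ))
    } else {
        Ok(())
    }
}

fn main() {
    let single = Prompt::Single("hello".into());
    let batch = Prompt::Batch(vec!["a".into(), "b".into(), "c".into()]);
    assert_eq!(get_prompt_batch_size(&single), 1);
    assert_eq!(get_prompt_batch_size(&batch), 3);
    assert_eq!(extract_single_prompt(&batch, 1), Some("b"));
    assert!(validate_total_choices(3, 2).is_ok());
    assert!(validate_total_choices(64, 3).is_err());
    println!("ok");
}
```

A per-prompt `extract_single_prompt` keeps the downstream single-prompt engine path unchanged, which is how the PR preserves backward compatibility.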
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)