Technologies • Getting Started • Deployment • GGUF Converter • API Endpoints • Collaborators • Contribute
This is an open-source project that lets you compare two LLMs head to head on a given prompt. This section covers the backend of the project, which integrates the LLM APIs that are consumed by the front-end.
- Python 3.10+
- Modal serverless GPUs
- Poetry for dependency management
- llama.cpp
- HuggingFace Hub
The majority of this project runs via Modal services, which means all of the building and dependency installation is handled by Modal.
Prerequisites necessary for running this project (installation commands are sketched after the list):
- Python ^3.10
- Modal pip package
- Poetry for dependency management (optional)
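For reference, a minimal local setup covering these prerequisites might look like the sketch below; Poetry is optional if you only interact with the project through Modal:

```bash
# Install the Modal CLI/SDK into your environment
pip install modal

# Optional: install the project's Python dependencies locally with Poetry
poetry install
```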
Clone the project:

```bash
git clone https://github.com/Supahands/llm-comparison-backend.git
```

First, make sure your environments are set up. By default, Modal provides you with a main environment; you will need to add your secrets to it, since all deploy and serve commands run against it. If you plan to create a new environment, you will need to add your secrets to that environment as well.
To create or manage environments on Modal, please refer to this documentation: https://modal.com/docs/guide/environments#environments
Also, for the CLI, you will need to make sure you are logged into Modal before running any of the deploy or serve commands. This is done using the modal setup command; refer to: https://modal.com/docs/guide#:~:text=Getting%20started,modal%20setup
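Putting the above together, a first-time setup might look something like the following sketch. The environment name `dev` is just an example, and the secret mirrors the one used by the GGUF converter later in this README; the `--env` flag selects which environment the secret is created in:

```bash
# Authenticate the CLI with your Modal account
modal setup

# Optional: create a separate environment instead of using the default `main`
modal environment create dev

# Add the Hugging Face secret to whichever environment you deploy from
modal secret create my-huggingface-secret HUGGING_FACE_HUB_TOKEN="your_token" --env dev
```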
There are two components to this project: the ollama API server and the litellm server, which is what our frontend connects to in order to retrieve the different models.
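As a rough illustration of how a client talks to the litellm server, the LiteLLM proxy exposes an OpenAI-compatible API. The URL below is a placeholder for whatever endpoint your Modal deployment prints, and an API key header may be required depending on how the proxy is configured:

```bash
# Placeholder URL -- use the endpoint printed by `modal deploy` for the litellm app
LITELLM_URL="https://<your-workspace>--litellm.modal.run"

# List the models the proxy knows about
curl "$LITELLM_URL/v1/models"

# Send a prompt to one of those models
curl "$LITELLM_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "<model id from /v1/models>", "messages": [{"role": "user", "content": "Hello!"}]}'
```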
I have added both applications to a single deploy file, which allows both apps to be spun up at the same time using:
```bash
modal deploy --env <environment name> deploy
```

Production Deploy:

```bash
modal deploy --env <environment name> deploy
```

Local Testing:

```bash
modal serve --env <environment name> deploy
```
To install all relevant packages and authenticate with Modal:

```bash
make setup
```

To deploy the app:

```bash
make deploy
```

To serve the app locally:

```bash
make serve
```

- Create a Modal secret:
```bash
modal secret create my-huggingface-secret HUGGING_FACE_HUB_TOKEN="your_token"
```

- Run the converter:

```bash
modal run --detach hugging_face_to_guff.py \
  --modelowner tencent \
  --modelname Tencent-Hunyuan-Large \
  --quanttypes q8_0 \
  --username Supa-AI \
  --ollama-upload \
  --hf-upload \
  --clean-run
```
- The `--detach` flag allows the program to keep running even if your terminal disconnects from the Modal servers
- `--modelowner` is the owner of the repo that you are trying to get the model from
- `--modelname` is the exact name of the model from that model owner you want to convert
- `--quanttypes` is the quantization size; the default is `q8_0`, which is the largest this supports
- `--username` is used to determine which account it should upload to and create a repo for
- `--ollama-upload` is a boolean flag for whether it should also upload the newly created quantized models to Ollama under your username
  - Important note! Before uploading, make sure that the volume called `ollama` is created. Once created, you must run `ollama serve` on your own machine to retrieve the public and private ssh keys to add to Ollama; more details can be found here (a sketch of the volume commands follows this list)
- `--hf-upload` is another boolean flag for whether it should upload these models to your Hugging Face repo
- `--clean-run` is a boolean flag for whether it should clean up all the existing model files in your ollama volume before running; this can fix issues where Ollama won't let you re-run because the model already exists in your volume from a previous run
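As referenced in the note above, a sketch of the Modal volume commands; the volume name matches the one mentioned here, and the environment name is an example:

```bash
# Create the `ollama` volume the converter uploads through, if it doesn't exist yet
modal volume create ollama --env <environment name>

# Inspect what is currently stored in it
modal volume ls ollama --env <environment name>
```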
- Uses Modal volumes (model-storage)
- Persists between runs and reuses existing models on subsequent runs (partially completed downloads are resumed as well); see the volume check sketched after this list
- Supports large models (>10GB)
- Parallel downloads (8 connections) thanks to booday's Hugging Face downloader
- Progress tracking with ETA
- Two-step conversion:
- FP16 format
- Quantization (Q4_K_M, Q5_K_M etc)
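If you want to see what has already been cached between runs, you can list the contents of the volume directly (assuming the volume is named model-storage as above):

```bash
# List model files persisted in the shared volume from previous runs
modal volume ls model-storage
```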
Currently, we do not support Anthropic models (Claude) on the official site due to API costs. We are actively seeking sponsors to help integrate these models. If you have suggestions for implementing Anthropic models or would like to contribute, please open an issue!
We welcome any creative solutions or partnerships that could help bring Anthropic model support to this comparison platform.
- Uses llama.cpp for GGUF conversion
- Two-step process (a command-line sketch follows this list):
  - Convert to FP16 format
  - Quantize to the desired format (Q4_K_M, Q5_K_M, etc.)
- Supports importance matrix for optimized quantization
- Can split large models into manageable shards
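For context, the conversion the script performs is roughly equivalent to the following manual llama.cpp steps. Paths, file names, and the calibration text are illustrative only, and exact flags may differ between llama.cpp versions:

```bash
# Step 1: convert the downloaded Hugging Face model to an FP16 GGUF
python convert_hf_to_gguf.py ./path/to/hf-model --outtype f16 --outfile model-f16.gguf

# Optional: compute an importance matrix to improve low-bit quantization quality
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat

# Step 2: quantize to the desired format, e.g. Q4_K_M
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```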
Special thanks to everyone who has contributed to this project.
- Noah Rijkaard
- EvanZJ
Other developers can contribute by creating their own branches, following the existing commit patterns, and opening a pull request:
```bash
git clone https://github.com/Supahands/llm-comparison-backend
git checkout -b feature/NAME
```

- Follow the commit patterns
- Open a Pull Request explaining the problem solved or the feature added; if there are visual modifications, attach a screenshot and wait for the review!