A PowerShell automation to rebuild llama.cpp for a Windows environment. It automates the following steps:
- Fetching and extracting a specific release of OpenBLAS
- Fetching the latest version of llama.cpp
- Fixing OpenBLAS binding in the CMakeLists.txt
- Rebuilding the binaries with CMake
- Updating the Python dependencies
- Automatically detecting the best BLAS acceleration
This script currently supports OpenBLAS for CPU BLAS acceleration and CUDA for NVIDIA GPU BLAS acceleration.
Download and install the latest versions:
Tip
When installing Visual Studio 2022 it is sufficient to install just the Build Tools for Visual Studio 2022 package. Also make sure that the Desktop development with C++ workload is enabled in the installer.
Execute the following in a PowerShell terminal with Administrator privileges to enable the Hardware Accelerated GPU Scheduling feature:
```powershell
New-ItemProperty `
    -Path "HKLM:\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" `
    -Name "HwSchMode" `
    -Value "2" `
    -PropertyType DWORD `
    -Force
```

Then restart your computer to activate the feature.
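Before rebooting, you can optionally confirm that the value was written, for example with Get-ItemProperty:

```powershell
# Read back the HwSchMode value; 2 corresponds to the
# Hardware Accelerated GPU Scheduling feature being enabled.
(Get-ItemProperty `
    -Path "HKLM:\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" `
    -Name "HwSchMode").HwSchMode
```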
Clone the repository to a nice place on your machine via:
```powershell
git clone --recurse-submodules git@github.com:countzero/windows_llama.cpp.git
```

Create a new Conda environment for this project with a specific version of Python:
```powershell
conda create --name llama.cpp python=3.12
```

To make Conda available in your current shell execute the following:
```powershell
conda init
```

Tip
You can always revert this via conda init --reverse.
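Depending on your setup, you may also want to activate the new environment before running the scripts, so that the Python dependencies end up in it rather than in the base environment:

```powershell
# Activate the Conda environment created above.
conda activate llama.cpp
```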
To build llama.cpp binaries for a Windows environment with the best available BLAS acceleration execute the script:
```powershell
./rebuild_llama.cpp.ps1
```

Tip
If PowerShell is not configured to execute script files, allow it by executing the following in an elevated PowerShell: Set-ExecutionPolicy RemoteSigned
Download a large language model (LLM) with weights in the GGUF format into the ./vendor/llama.cpp/models directory. You can for example download the gemma-2-9b-it model in a quantized GGUF format:
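How you download the file is up to you; as a sketch, fetching a quantized GGUF file from a Hugging Face repository with Invoke-WebRequest could look like this (the repository URL and file name below are assumptions, adjust them to the quantization you actually want):

```powershell
# Hypothetical example: download a quantized GGUF model file from
# Hugging Face into the models directory used by the example scripts.
Invoke-WebRequest `
    -Uri "https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-IQ4_XS.gguf" `
    -OutFile "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf"
```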
Tip
See the 🤗 Open LLM Leaderboard and LMSYS Chatbot Arena Leaderboard for best in class open source LLMs.
You can easily chat with a specific model by using the .\examples\server.ps1 script:
```powershell
.\examples\server.ps1 -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"
```

Note
The script will automatically start the llama.cpp server with an optimal configuration for your machine.
Execute the following to get detailed help on further options of the server script:
```powershell
Get-Help -Detailed .\examples\server.ps1
```

You can now chat with the model:
```powershell
./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --reverse-prompt '[[USER_NAME]]:' `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
    --color `
    --interactive
```

You can start llama.cpp as a webserver:
```powershell
./vendor/llama.cpp/build/bin/Release/llama-server `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33
```

And then access llama.cpp via the web interface at http://127.0.0.1:8080 (the default address and port of the llama.cpp server).
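Besides the browser UI, the running server also answers HTTP API requests. As a rough sketch (assuming the default address above and the /completion endpoint of current llama.cpp server builds; adjust the prompt and parameters to your needs), a request from PowerShell could look like this:

```powershell
# Send a single completion request to the running llama.cpp server.
$body = @{
    prompt    = "Explain RoPE scaling in one sentence."
    n_predict = 128
} | ConvertTo-Json

Invoke-RestMethod `
    -Uri "http://127.0.0.1:8080/completion" `
    -Method Post `
    -ContentType "application/json" `
    -Body $body
```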
You can increase the context size of a model with minimal quality loss by setting the RoPE parameters. The formulas for the parameters are as follows:
context_scale = increased_context_size / original_context_size
rope_frequency_scale = 1 / context_scale
rope_frequency_base = 10000 * context_scale
Note
Increasing the context size of an openchat-3.6-8b-20240522 model from its original context size of 8192 to 32768 means that the context_scale is 4.0. The rope_frequency_scale will then be 0.25 and the rope_frequency_base 40000.
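The formulas above translate directly into a few lines of PowerShell (a sketch with hypothetical variable names, using the same 8192 to 32768 example):

```powershell
# Derive the RoPE parameters from the original and target context sizes.
$originalContextSize  = 8192
$increasedContextSize = 32768

$contextScale       = $increasedContextSize / $originalContextSize  # 4.0
$ropeFrequencyScale = 1 / $contextScale                             # 0.25
$ropeFrequencyBase  = 10000 * $contextScale                         # 40000

"--rope-freq-scale $ropeFrequencyScale --rope-freq-base $ropeFrequencyBase"
```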
To extend the context to 32k execute the following:
```powershell
./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 32768 `
    --rope-freq-scale 0.25 `
    --rope-freq-base 40000 `
    --threads 16 `
    --n-gpu-layers 33 `
    --reverse-prompt '[[USER_NAME]]:' `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
    --color `
    --interactive
```

You can enforce a specific grammar for the response generation. The following will always return a JSON response:
```powershell
./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --prompt "The scientific classification (Taxonomy) of a Llama: " `
    --grammar-file "./vendor/llama.cpp/grammars/json.gbnf" `
    --color
```
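The json.gbnf grammar referenced above is part of the llama.cpp repository, but you can also supply your own grammar file. As a minimal sketch (the file name and rule below are made up for illustration), a grammar that restricts the answer to a plain yes or no could be written like this:

```powershell
# Write a tiny, hypothetical GBNF grammar that only allows "yes" or "no".
@'
root ::= "yes" | "no"
'@ | Set-Content -Path "./yes_no.gbnf"
```

Passing this file via --grammar-file "./yes_no.gbnf" instead of the JSON grammar constrains the model to exactly one of the two answers.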
Execute the following to measure the perplexity of the GGUF formatted model:
```powershell
./vendor/llama.cpp/build/bin/Release/llama-perplexity `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --file "./vendor/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw"
```

You can easily count the tokens of a prompt for a specific model by using the .\examples\count_tokens.ps1 script:
```powershell
.\examples\count_tokens.ps1 `
    -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
    -file ".\prompts\chat_with_llm.txt"
```

To inspect the actual tokenization result you can use the -debug flag:
```powershell
.\examples\count_tokens.ps1 `
    -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
    -prompt "Hello World!" `
    -debug
```

Note
The script is a simple wrapper for the tokenize.cpp example of the llama.cpp project.
Execute the following to get detailed help on further options of the count_tokens script:
```powershell
Get-Help -Detailed .\examples\count_tokens.ps1
```

Every time there is a new release of llama.cpp you can simply execute the script to automatically rebuild everything:
| Command | Description | 
|---|---|
| ./rebuild_llama.cpp.ps1 | Automatically detects best BLAS acceleration | 
| ./rebuild_llama.cpp.ps1 -blasAccelerator "OFF" | Without any BLAS acceleration | 
| ./rebuild_llama.cpp.ps1 -blasAccelerator "OpenBLAS" | With CPU BLAS acceleration | 
| ./rebuild_llama.cpp.ps1 -blasAccelerator "CUDA" | With NVIDIA GPU BLAS acceleration | 
You can build a specific version of llama.cpp by specifying a git tag or commit:
| Command | Description | 
|---|---|
| ./rebuild_llama.cpp.ps1 | The latest release | 
| ./rebuild_llama.cpp.ps1 -version "b1138" | The tag b1138 | 
| ./rebuild_llama.cpp.ps1 -version "1d16309" | The commit 1d16309 |