AI starter kit chart #579
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: volatilemolotov. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Here is the initial PR, currently in draft state. I think we should be able to send it out for review.
I'm not sure if we want Cloud Build and Terraform as prerequisites. Suggest making this example more generic, like other AI examples. I'd like to focus on the Kubernetes manifests and make it customizable for different platforms.
Removed the example values and the ci folder. I hope the Makefile can stay; it can be useful.
```
    -f values.yaml
```
3. **Access JupyterHub:**
I will check which of these can run in Minikube exactly and note it accordingly.
All of them should work. The multi-agent Ray one needs Ray enabled, but we are not enabling it by default.
```bash
helm install ai-starter-kit . \
  --set huggingface.token="YOUR_HF_TOKEN" \
  -f values.yaml \
  -f values-gke.yaml
```
@janetkuo do you have concerns about including the GKE-specific setup in the example? Do you think we should remove all of it?
Yes, the example should be as generic as possible so that it's applicable to most Kubernetes clusters. It might be challenging at times for some platform-specific setup, and in that case we should call it out and mention alternatives.
In our case, there is some platform-specific setup, such as specifying the GPU in GKE with `cloud.google.com/gke-accelerator: nvidia-l4`. What is your suggestion for handling this?
Removed all GKE mentions and added a README entry that demonstrates how to work with GPUs.
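As a sketch of what such a README entry could show (the file name and key layout are illustrative; the chart's actual values schema may differ), a platform-specific GPU override can live in a separate values file so the base chart stays generic. The label shown is the GKE accelerator label from this PR's own snippet; other platforms would substitute their own node label:

```yaml
# values-gpu.yaml (illustrative override file, not necessarily the chart's real schema)
resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  # GKE-specific example; on other platforms use that platform's accelerator label
  cloud.google.com/gke-accelerator: nvidia-l4
```

It would then be layered on at install time with an extra `-f values-gpu.yaml` flag, keeping the default install free of platform assumptions.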
ai/ai-starter-kit/helm-chart/ai-starter-kit/files/multi-agent-ollama.ipynb
```
"source": [
    "import os, time, requests, json\n",
    "\n",
    "USE_WRAPPER = True\n",
```
Can this be auto-set, like what you did in cell 5?
This was resolved in a previous commit.
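A minimal sketch of what such auto-detection could look like (the `OLLAMA_WRAPPER_URL` variable name is an assumption for illustration, not necessarily what the notebook actually checks):

```python
import os

# Hypothetical: treat the wrapper as enabled when the chart injects its URL
# into the notebook's environment; fall back to the direct endpoint otherwise.
USE_WRAPPER = "OLLAMA_WRAPPER_URL" in os.environ
```

This keeps the notebook runnable on any deployment without the user flipping a flag by hand.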
ai/ai-starter-kit/helm-chart/ai-starter-kit/files/multi-agent-ollama.ipynb
```
@@ -0,0 +1,64 @@
.PHONY: check_hf_token check_OCI_target package_helm lint dep_update install install_gke start uninstall push_helm
```
What is the usage of the make commands?
Do you want me to document each one?
Just in general, in the README. Users can still follow the current README to install via Helm, so I'm not sure when these make commands should be used.
Documented in commit: 78a03d7
ai/ai-starter-kit/helm-chart/ai-starter-kit/files/multi-agent-ramalama.ipynb
ai/ai-starter-kit/helm-chart/ai-starter-kit/files/multi-agent-ramalama.ipynb
Applying fixes to resolve PR comments
### Delete GKE cluster
```bash
gcloud container clusters delete ${CLUSTER_NAME} --region=${REGION}
```
Please remove this
Removed in commit: ced46e9
```
"id": "0af596cf-5ba6-42df-a030-61d7a20d6f7b",
"metadata": {},
"source": [
    "### Cell 6 - MLFlow: connect to tracking server and list recent runs\n",
```
@alex-akv did you reproduce this issue? I'm still seeing it.
No, I get 4 recent runs as output.
```yaml
    nvidia.com/gpu: 1

nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
```
Let's call out in the description above that this is using GKE as an example
Described in commit: ced46e9
```
},
"outputs": [],
"source": [
    "!pip install numpy mlflow tensorflow \"ray[serve,default,client]\""
```
Specifying tensorflow==2.20.0 fixed some errors.
Specified in commit: ced46e9
```
"id": "8111d705-595e-4e65-8479-bdc76191fa31",
"metadata": {},
"source": [
    "### Cell 3 - Deploy model on Ray Serve with llama-cpp\n",
```
Running this cell doesn't output any error, but the corresponding Ray job failed with the logs below:
```
runtime_env setup failed: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 384, in _create_runtime_env_with_retry
runtime_env_context = await asyncio.wait_for(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
return await fut
^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 350, in _setup_runtime_env
await create_for_plugin_if_needed(
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/plugin.py", line 254, in create_for_plugin_if_needed
size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/pip.py", line 309, in create
pip_dir_bytes = await task
^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/pip.py", line 289, in _create_for_hash
await PipProcessor(
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/pip.py", line 191, in _run
await self._install_pip_packages(
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/pip.py", line 167, in _install_pip_packages
await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/utils.py", line 105, in check_output_cmd
raise SubprocessCalledProcessError(
ray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[13] failed with the following details.
Command '['/tmp/ray/session_2025-10-31_14-03-31_982555_1/runtime_resources/pip/8dc32a48ead56d51e7e1a0de9341332701cf7b2f/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2025-10-31_14-03-31_982555_1/runtime_resources/pip/8dc32a48ead56d51e7e1a0de9341332701cf7b2f/ray_runtime_env_internal_pip_requirements.txt']' returned non-zero exit status 1.
Last 50 lines of stdout:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.8/62.8 kB 179.2 MB/s eta 0:00:00
Downloading graphql_core-3.2.6-py3-none-any.whl (203 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 203.4/203.4 kB 196.9 MB/s eta 0:00:00
Downloading graphql_relay-3.2.0-py3-none-any.whl (16 kB)
Downloading greenlet-3.2.4-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (607 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 607.6/607.6 kB 184.8 MB/s eta 0:00:00
Downloading itsdangerous-2.2.0-py3-none-any.whl (16 kB)
Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 kB 217.8 MB/s eta 0:00:00
Downloading kiwisolver-1.4.9-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 188.7 MB/s eta 0:00:00
Downloading pillow-12.0.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (7.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 165.1 MB/s eta 0:00:00
Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Downloading werkzeug-3.1.3-py3-none-any.whl (224 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.5/224.5 kB 182.1 MB/s eta 0:00:00
Downloading zipp-3.23.0-py3-none-any.whl (10 kB)
Downloading mako-1.3.10-py3-none-any.whl (78 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 kB 181.9 MB/s eta 0:00:00
Downloading smmap-5.0.2-py3-none-any.whl (24 kB)
Building wheels for collected packages: llama-cpp-python
Building wheel for llama-cpp-python (pyproject.toml): started
Building wheel for llama-cpp-python (pyproject.toml): finished with status 'error'
error: subprocess-exited-with-error
× Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
*** scikit-build-core 0.11.6 using CMake 4.1.2 (wheel)
*** Configuring CMake...
loading initial cache file /tmp/tmpqiu0581x/build/CMakeInit.txt
CMake Error at /tmp/pip-build-env-6dhn2ys0/normal/lib/python3.12/site-packages/cmake/data/share/cmake-4.1/Modules/CMakeDetermineCCompiler.cmake:48 (message):
Could not find compiler set in environment variable CC:
gcc -pthread -B /home/ray/anaconda3/compiler_compat.
Call Stack (most recent call first):
CMakeLists.txt:3 (project)
CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
*** CMake configuration failed
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
```
We have tested on macOS and Linux desktop environments and were not able to reproduce the issue.
Were you testing using minikube on Linux?
Yes
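One possible direction for this failure, not verified against this chart: the CMake output shows `CC` set to `gcc -pthread -B /home/ray/anaconda3/compiler_compat`, which CMake rejects as a compiler path. Ray's `runtime_env` accepts an `env_vars` key alongside `pip`, so overriding `CC`/`CXX` for the build might help. The compiler paths below are assumptions for a typical Linux image, not something tested in this PR:

```python
# Hypothetical sketch: point the llama-cpp-python source build at plain
# compiler binaries inside the Ray runtime env, instead of the conda-wrapped
# CC value that CMake cannot parse. Paths are assumptions.
runtime_env = {
    "pip": ["llama-cpp-python"],
    "env_vars": {
        "CC": "/usr/bin/gcc",
        "CXX": "/usr/bin/g++",
    },
}
```

This dict would be passed where the notebook already configures its Ray runtime environment; whether the image actually ships gcc at those paths would need checking on the minikube node image.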
This PR adds an AI starter kit Helm chart that aims to provide an out-of-the-box development solution for AI workloads. It uses Ray Serve, Ollama, or Ramalama to run the LLMs, and JupyterHub for development.