Conversation

@volatilemolotov

This PR adds an AI starter kit Helm chart that aims to provide an out-of-the-box development solution for AI workloads. It uses Ray Serve, Ollama, or RamaLama to run the LLMs, and JupyterHub for development.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 24, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: volatilemolotov
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 24, 2025
@volatilemolotov
Author

Here is the initial PR, currently in draft state. I think we should be able to send it out for review.

@janetkuo @gongmax @fcabrera23

Member

@janetkuo janetkuo left a comment

I'm not sure if we want Cloud Build and Terraform as prerequisites. Suggest making this example more generic, like other AI examples. I'd like to focus on the Kubernetes manifests and make it customizable for different platforms.

@volatilemolotov
Author

> I'm not sure if we want Cloud Build and Terraform as prerequisites. Suggest making this example more generic, like other AI examples. I'd like to focus on the Kubernetes manifests and make it customizable for different platforms.

Removed the example values and the ci folder. I hope the Makefile can stay; it can be useful.

```
-f values.yaml
```

3. **Access JupyterHub:**

This comment was marked as resolved.

Author

I will check which ones can actually run on Minikube and note it accordingly.

Author

All should work. The multi-agent Ray one needs Ray enabled, but we are not enabling it by default.

Comment on lines 85 to 88
```bash
helm install ai-starter-kit . \
  --set huggingface.token="YOUR_HF_TOKEN" \
  -f values.yaml \
  -f values-gke.yaml
```

@janetkuo do you have any concerns about including the GKE-specific setup in the example? Do you think we should remove all of this?

Member

Yes, the example should be as generic as possible so that it's applicable to most Kubernetes clusters. It might be challenging at times for some platform-specific setup, and in that case we should call it out and mention alternatives.


In our case, there is some platform-specific setup, such as specifying the GPU on GKE with `cloud.google.com/gke-accelerator: nvidia-l4`. What is your suggestion for handling this?

Author

Removed all GKE mentions and added a README entry that demonstrates how to work with GPUs.
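
For illustration, a values override along these lines could request a GPU and pin scheduling to a GPU node pool. The key paths below are a sketch, not the chart's confirmed schema, and the GKE accelerator label is just one platform's example:

```yaml
# Sketch of a GPU values override; key paths are illustrative.
resources:
  limits:
    nvidia.com/gpu: 1

# Platform-specific: GKE uses this label to select the accelerator type.
# Other platforms expose their own node labels for GPU scheduling.
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
```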

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 15, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 15, 2025
```
"source": [
"import os, time, requests, json\n",
"\n",
"USE_WRAPPER = True\n",
```

Can this be auto-set like what you did in cell 5?


This was resolved in previous commit.

```
@@ -0,0 +1,64 @@
.PHONY: check_hf_token check_OCI_target package_helm lint dep_update install install_gke start uninstall push_helm
```

What is the usage of the make commands?

Author

Do you want me to document each one?


Just in general, in the README. Users can still follow the current README to install via Helm, so I'm not sure when these make commands should be used.


Documented in commit: 78a03d7
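
Going by the target names in the `.PHONY` line, typical usage would look something like the following (a sketch only; the exact variables each target expects are defined in the Makefile itself):

```shell
# Illustrative invocations of the chart's Makefile targets.
make lint         # lint the chart
make dep_update   # refresh chart dependencies
make install      # install the release into the current kube context
make uninstall    # remove the release
```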

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 28, 2025
Comment on lines 180 to 183
### Delete GKE cluster
```bash
gcloud container clusters delete ${CLUSTER_NAME} --region=${REGION}
```

Please remove this


Removed in commit: ced46e9

```
"id": "0af596cf-5ba6-42df-a030-61d7a20d6f7b",
"metadata": {},
"source": [
"### Cell 6 - MLFlow: connect to tracking server and list recent runs\n",
```

This comment was marked as outdated.


@alex-akv did you reproduce this issue? I'm still seeing it.


No, I get 4 recent runs as an output.

```yaml
nvidia.com/gpu: 1

nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
```

Let's call out in the description above that this is using GKE as an example


@alex-akv alex-akv Nov 4, 2025

Described in commit: ced46e9

```
},
"outputs": [],
"source": [
"!pip install numpy mlflow tensorflow \"ray[serve,default,client]\""
```

@gongmax gongmax Oct 31, 2025

Specifying `tensorflow==2.20.0` fixed some errors.
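
Applied to the install cell quoted above, the pin would look something like this (a sketch; the package list and extras are taken from the notebook cell):

```shell
pip install numpy mlflow "tensorflow==2.20.0" "ray[serve,default,client]"
```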


@alex-akv alex-akv Nov 4, 2025

Specified in commit: ced46e9

```
"id": "8111d705-595e-4e65-8479-bdc76191fa31",
"metadata": {},
"source": [
"### Cell 3 - Deploy model on Ray Serve with llama-cpp\n",
```

Running this cell doesn't output any error, but the corresponding Ray job failed with the logs below:

```
runtime_env setup failed: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 384, in _create_runtime_env_with_retry
    runtime_env_context = await asyncio.wait_for(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 350, in _setup_runtime_env
    await create_for_plugin_if_needed(
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/plugin.py", line 254, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/pip.py", line 309, in create
    pip_dir_bytes = await task
                    ^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/pip.py", line 289, in _create_for_hash
    await PipProcessor(
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/pip.py", line 191, in _run
    await self._install_pip_packages(
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/pip.py", line 167, in _install_pip_packages
    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runtime_env/utils.py", line 105, in check_output_cmd
    raise SubprocessCalledProcessError(
ray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[13] failed with the following details.
Command '['/tmp/ray/session_2025-10-31_14-03-31_982555_1/runtime_resources/pip/8dc32a48ead56d51e7e1a0de9341332701cf7b2f/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2025-10-31_14-03-31_982555_1/runtime_resources/pip/8dc32a48ead56d51e7e1a0de9341332701cf7b2f/ray_runtime_env_internal_pip_requirements.txt']' returned non-zero exit status 1.
Last 50 lines of stdout:
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.8/62.8 kB 179.2 MB/s eta 0:00:00
    Downloading graphql_core-3.2.6-py3-none-any.whl (203 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 203.4/203.4 kB 196.9 MB/s eta 0:00:00
    Downloading graphql_relay-3.2.0-py3-none-any.whl (16 kB)
    Downloading greenlet-3.2.4-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (607 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 607.6/607.6 kB 184.8 MB/s eta 0:00:00
    Downloading itsdangerous-2.2.0-py3-none-any.whl (16 kB)
    Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 kB 217.8 MB/s eta 0:00:00
    Downloading kiwisolver-1.4.9-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.5 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 188.7 MB/s eta 0:00:00
    Downloading pillow-12.0.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (7.0 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 165.1 MB/s eta 0:00:00
    Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
    Downloading werkzeug-3.1.3-py3-none-any.whl (224 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.5/224.5 kB 182.1 MB/s eta 0:00:00
    Downloading zipp-3.23.0-py3-none-any.whl (10 kB)
    Downloading mako-1.3.10-py3-none-any.whl (78 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 kB 181.9 MB/s eta 0:00:00
    Downloading smmap-5.0.2-py3-none-any.whl (24 kB)
    Building wheels for collected packages: llama-cpp-python
      Building wheel for llama-cpp-python (pyproject.toml): started
      Building wheel for llama-cpp-python (pyproject.toml): finished with status 'error'
      error: subprocess-exited-with-error
  
      × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [16 lines of output]
          *** scikit-build-core 0.11.6 using CMake 4.1.2 (wheel)
          *** Configuring CMake...
          loading initial cache file /tmp/tmpqiu0581x/build/CMakeInit.txt
          CMake Error at /tmp/pip-build-env-6dhn2ys0/normal/lib/python3.12/site-packages/cmake/data/share/cmake-4.1/Modules/CMakeDetermineCCompiler.cmake:48 (message):
            Could not find compiler set in environment variable CC:
      
            gcc -pthread -B /home/ray/anaconda3/compiler_compat.
          Call Stack (most recent call first):
            CMakeLists.txt:3 (project)
      
      
          CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
          CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
          -- Configuring incomplete, errors occurred!
      
          *** CMake configuration failed
          [end of output]
  
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for llama-cpp-python
    Failed to build llama-cpp-python
    ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
```
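
One workaround sketch for this class of failure: `llama-cpp-python` is built from source by pip, so the Ray runtime env needs working C/C++ compilers. Ray's `runtime_env` accepts an `env_vars` map, so `CC`/`CXX` can be pointed at real compilers in the image. The paths below are hypothetical, and whether the Ray image ships gcc at all is an assumption to verify:

```python
# Sketch (assumption): override CC/CXX inside the Ray runtime env so pip's
# source build of llama-cpp-python can find a usable compiler.
runtime_env = {
    "pip": ["llama-cpp-python"],
    "env_vars": {
        "CC": "/usr/bin/gcc",   # hypothetical compiler path in the Ray image
        "CXX": "/usr/bin/g++",
    },
}

# This dict would then be passed to ray.init(runtime_env=runtime_env)
# or attached to the Serve deployment that builds the model.
```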


We have tested on macOS and Linux desktop environments and were not able to reproduce the issue.


Were you testing using minikube on Linux?

Author

Yes
