
Having problems while running Open3D-ML on HPC #681

@Inshu0302

Steps to reproduce the issue

I have a custom dataset of very large point clouds that I want to process on an HPC cluster. The cluster has A100, H100, V100, and H200 GPUs.

This is my Singularity/Apptainer definition (.def) file:

Bootstrap: docker
From: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

%post
# Update package lists
apt-get update

# Install essential system libraries and Python3 (including OpenMP - fixes libgomp.so.1 error)
apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    python3-dev \
    python3-venv \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libgomp1 \
    libgcc-s1 \
    libstdc++6 \
    libomp-dev \
    git \
    ca-certificates \
    wget \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Ensure PyTorch 2.0.x (cu118) compatible with Open3D-ML
python3 -m pip install --no-cache-dir \
    torch==2.0.1+cu118 \
    torchvision==0.15.2+cu118 \
    torchaudio==2.0.2+cu118 \
    --index-url https://download.pytorch.org/whl/cu118

# Install Open3D and dependencies (exact versions from working SSH env)
python3 -m pip install --no-cache-dir \
    open3d==0.18.0 \
    numpy==1.26.4 \
    scikit-learn==1.7.1 \
    pyyaml==6.0.2 \
    laspy==2.5.4 \
    torchmetrics==1.4.0 \
    tensorboard \
    wandb==0.17.5

# Clone and install Open3D-ML from source (idempotent)
mkdir -p /opt
rm -rf /opt/Open3D-ML
git clone --depth 1 https://github.com/isl-org/Open3D-ML.git /opt/Open3D-ML
python3 -m pip install -e /opt/Open3D-ML

# Test installations during build to catch issues early
python3 -c "import torch; print(f'✓ PyTorch {torch.__version__} installed')"
python3 -c "import open3d; print('✓ Open3D imported successfully')"
python3 -c "import open3d.ml.torch; print('✓ Open3D-ML imported successfully')"
python3 -c "import numpy as np; print(f'✓ NumPy {np.__version__}')"
python3 -c "import sklearn; print(f'✓ scikit-learn {sklearn.__version__}')"

%environment
# Set environment variables for optimal performance
export OMP_NUM_THREADS=8
export MPLBACKEND=Agg
export PYTHONUNBUFFERED=1

# Ensure CUDA libraries are found (critical for GPU training)
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH"

# Python path for custom modules
export PYTHONPATH="/workspace/code:$PYTHONPATH"

%runscript
exec python3 "$@"
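
The error reported for this setup shows python3 looking for a module named `open3d.ml.torch.scripts.run_pipeline`, which does not exist in the installed package. In the Open3D-ML repository, `run_pipeline.py` sits in the top-level `scripts/` directory, so with the clone at `/opt/Open3D-ML` one option is to call the script by path. A minimal dry-run sketch follows; the config file, the dataset path `/data/my_dataset`, and the `SemanticSegmentation` pipeline name are placeholders chosen for illustration, not values from this report.

```shell
#!/bin/sh
# Location of the clone created in %post; run_pipeline.py is in its scripts/ directory.
O3DML_ROOT=/opt/Open3D-ML

# Print (not execute) the python3 command line for a config file and a dataset path.
# $1 = path to a pipeline config (.yml), $2 = dataset root (both hypothetical here).
pipeline_cmd() {
    printf 'python3 %s/scripts/run_pipeline.py torch -c %s --dataset.dataset_path %s --pipeline SemanticSegmentation\n' \
        "$O3DML_ROOT" "$1" "$2"
}

# Example with one of the configs shipped in the repository and a placeholder dataset path.
pipeline_cmd "$O3DML_ROOT/ml3d/configs/randlanet_semantickitti.yml" /data/my_dataset
```

Because `%runscript` forwards its arguments to `python3`, everything after `python3` in the printed command could also be passed straight to the container's run command. A custom dataset would most likely need its own config file rather than the SemanticKITTI one used here.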



Error message

```shell
The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  

Warning: Open3D was built with CUDA 11.7 butPyTorch was built with CUDA 11.8. Falling back to CPU for now.Otherwise, install PyTorch with CUDA 11.7.
/usr/bin/python3: Error while finding module specification for 'open3d.ml.torch.scripts.run_pipeline' (ModuleNotFoundError: No module named 'open3d.ml.torch.scripts')
```

Open3D, Python and System information

- Operating system: Ubuntu 22.04.3
- Python version: Python 3.10.12
- Open3D version: 0.18.0
- System type: HPC
- Is this remote workstation?: yes
- How did you install Open3D?: pip
- Compiler version (if built from source): not applicable (installed via pip)
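
The warning in the error message says the Open3D wheel was built with CUDA 11.7 while the installed PyTorch was built with CUDA 11.8. The sketch below is one way to confirm that from inside the container and to print the PyTorch wheel index that would match Open3D's CUDA version. It assumes that `open3d._build_config['CUDA_VERSION']` (a private attribute) reports the build-time CUDA version and that the PyTorch index URLs follow the `cu<major><minor>` pattern already used in the def file. Whether a CUDA 11.7 build covers every GPU listed above is not verified here.

```shell
#!/bin/sh
# Reduce a CUDA version string to its major.minor part, e.g. "11.8.0" -> "11.8".
cuda_major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed (exit status 0) only when both versions share the same major.minor.
cuda_versions_match() {
    [ -n "$1" ] && [ "$(cuda_major_minor "$1")" = "$(cuda_major_minor "$2")" ]
}

# Map a CUDA version to the PyTorch wheel index for it, e.g. "11.7" -> .../whl/cu117.
torch_index_url() {
    printf 'https://download.pytorch.org/whl/cu%s\n' "$(cuda_major_minor "$1" | tr -d '.')"
}

# Query the packages only when both import; otherwise this part is skipped.
if python3 -c "import torch, open3d" >/dev/null 2>&1; then
    torch_cuda=$(python3 -c "import torch; print(torch.version.cuda)")
    # _build_config is private to Open3D, so treat this lookup as an assumption.
    o3d_cuda=$(python3 -c "import open3d; print(open3d._build_config['CUDA_VERSION'])")
    if cuda_versions_match "$torch_cuda" "$o3d_cuda"; then
        echo "CUDA versions match: $torch_cuda"
    else
        echo "Mismatch: PyTorch=$torch_cuda, Open3D=$o3d_cuda"
        echo "Matching PyTorch index: $(torch_index_url "$o3d_cuda")"
    fi
fi
```

If the check reports a mismatch, pointing the `--index-url` of the PyTorch install step at the printed index (keeping the same `torch`, `torchvision`, and `torchaudio` release numbers) is one possible way to align the two builds.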

Additional information

No response
