[Zeusd] Better failure handling and testing #88

@jaywonchung

Description

Right now, zeusd assumes NVML operations will mostly succeed. To make it more robust, we want to handle more failure cases. NVML might hang for some unknown reason, and we don't want the management task in zeusd (and thus a blocking request) to also hang forever. Or a GPU might be lost, in which case NVML returns a specific error.

We want a timeout, a cancellation mechanism, and a way to mark the GPU as dead so that subsequent requests don't wait the full timeout. The failure should be reported to the caller, but we don't want zeusd threads to panic and die.
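To make the shape of this concrete, here is a rough sketch of the timeout plus dead-marker idea using only the standard library. It is not zeusd's actual code: `GpuHandle`, `GpuError`, and `run` are hypothetical names, the NVML call is stood in for by an arbitrary closure, and every driver error is treated as fatal for brevity (real code would probably only mark the GPU dead on specific errors such as a lost GPU). It also does not implement true cancellation: a hung worker thread is abandoned rather than stopped.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type returned to the request handler.
#[derive(Debug, PartialEq)]
enum GpuError {
    /// The operation did not finish within the timeout.
    Timeout,
    /// The GPU was already marked dead, so we fail fast without waiting.
    Dead,
    /// The driver call returned an error, or the worker thread panicked.
    Driver(String),
}

/// Hypothetical per-GPU handle that remembers whether the GPU is dead.
struct GpuHandle {
    dead: Arc<AtomicBool>,
    timeout: Duration,
}

impl GpuHandle {
    fn new(timeout: Duration) -> Self {
        Self {
            dead: Arc::new(AtomicBool::new(false)),
            timeout,
        }
    }

    /// Run `op` (a stand-in for an NVML call) on a worker thread and wait
    /// at most `self.timeout` for its result.
    fn run<T, F>(&self, op: F) -> Result<T, GpuError>
    where
        T: Send + 'static,
        F: FnOnce() -> Result<T, String> + Send + 'static,
    {
        // Fail fast: later requests should not wait the full timeout again.
        if self.dead.load(Ordering::SeqCst) {
            return Err(GpuError::Dead);
        }

        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            // If the caller already timed out, the receiver is gone and
            // the send fails; that is fine, so the error is ignored.
            let _ = tx.send(op());
        });

        let result = match rx.recv_timeout(self.timeout) {
            Ok(Ok(value)) => return Ok(value),
            Ok(Err(msg)) => Err(GpuError::Driver(msg)),
            Err(RecvTimeoutError::Timeout) => Err(GpuError::Timeout),
            // The sender was dropped without sending: the worker panicked.
            Err(RecvTimeoutError::Disconnected) => {
                Err(GpuError::Driver("worker thread panicked".to_string()))
            }
        };

        // Simplification: any failure marks the GPU as dead.
        self.dead.store(true, Ordering::SeqCst);
        result
    }
}
```

With a handle created as `GpuHandle::new(Duration::from_millis(50))`, a closure that sleeps longer than 50 ms yields `Err(GpuError::Timeout)`, and every later call returns `Err(GpuError::Dead)` immediately. Because the worker's panic surfaces as a dropped channel rather than propagating, the calling thread reports the failure instead of panicking itself.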

Labels: enhancement (New feature or request)
