[Zeusd] Better failure handling and testing

Right now, `zeusd` assumes that NVML operations will mostly succeed. To make it more robust, we want to handle more failure cases. NVML might hang for some unknown reason, and we don't want the management task in `zeusd` (and thus the blocking request waiting on it) to hang forever as well. A GPU might also be lost, in which case NVML raises a specific error.

We want a timeout, a cancellation mechanism, and a way to mark a GPU as dead so that subsequent requests don't wait out the full timeout. The failure should still be reported, but `zeusd` threads should not panic and die because of it.
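As a starting point for discussion, here is a minimal sketch of the timeout and dead-marking parts using only the standard library. Everything in it is hypothetical: `GuardedGpu`, `GpuError`, and `run` are illustrative names, not existing `zeusd` types, and the closure stands in for whatever NVML call the management task makes. It assumes the call runs on a helper thread so that the caller can stop waiting.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical error type; not part of zeusd or any NVML binding.
#[derive(Debug, PartialEq)]
pub enum GpuError {
    /// The operation did not finish within the timeout.
    Timeout,
    /// The operation itself reported that the GPU is lost.
    Lost,
    /// The GPU was marked dead by an earlier failure; nothing was attempted.
    Dead,
    /// Any other failure, including a panic inside the operation.
    Other(String),
}

/// Hypothetical per-GPU guard that remembers whether the GPU is dead.
pub struct GuardedGpu {
    dead: AtomicBool,
    timeout: Duration,
}

impl GuardedGpu {
    pub fn new(timeout: Duration) -> Self {
        Self { dead: AtomicBool::new(false), timeout }
    }

    pub fn is_dead(&self) -> bool {
        self.dead.load(Ordering::Acquire)
    }

    /// Runs `op` on a helper thread and waits at most `self.timeout`.
    pub fn run<T, F>(&self, op: F) -> Result<T, GpuError>
    where
        T: Send + 'static,
        F: FnOnce() -> Result<T, GpuError> + Send + 'static,
    {
        // Fail fast: requests after a failure do not wait for the timeout.
        if self.is_dead() {
            return Err(GpuError::Dead);
        }

        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            // The receiver may already have given up; ignore the send error.
            let _ = tx.send(op());
        });

        match rx.recv_timeout(self.timeout) {
            Ok(Ok(value)) => Ok(value),
            Ok(Err(GpuError::Lost)) => {
                self.dead.store(true, Ordering::Release);
                Err(GpuError::Lost)
            }
            Ok(Err(other)) => Err(other),
            Err(RecvTimeoutError::Timeout) => {
                // The helper thread may still be stuck; we stop waiting for it.
                self.dead.store(true, Ordering::Release);
                Err(GpuError::Timeout)
            }
            // The sender was dropped without sending: the operation panicked.
            Err(RecvTimeoutError::Disconnected) => {
                Err(GpuError::Other("operation panicked".to_string()))
            }
        }
    }
}
```

Note that this does not truly cancel a hung call: the helper thread is abandoned and may stay blocked, so real cancellation would need a separate mechanism. A panic inside the operation surfaces as an error to the caller instead of taking down the calling thread.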

[Zeusd] Better failure handling and testing #88
