-
Couldn't load subscription status.
- Fork 2.2k
Add a 5 min default timeout for deadlocks #16342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
When a process is running and another calls `uv cache clean` or `uv cache prune` we currently deadlock - sometimes until the CI timeout (astral-sh/setup-uv#588). To avoid this, we add a default 5 min timeout waiting for a lock. 5 min balances allowing in-progress builds to finish, especially with larger native dependencies, while also giving timely errors for deadlocks on (remote) systems.
2287b53 to
2a5b779
Compare
crates/uv-fs/src/locked_file.rs
Outdated
| timeout: Duration, | ||
| ) -> Option<Output> { | ||
| let (sender, receiver) = std::sync::mpsc::channel(); | ||
| thread::spawn(move || { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should happen rarely and already involves waiting, so we can spawn a thread. I quickly looked into making it generally async but it didn't seem worth the churn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, we only have a couple calls to the blocking / sync versions of the lock APIs. I'd be tempted to make them async.
I wonder if we should move this timeout handling to the API functions so we can run_with_timeout in the blocking / sync versions and just use an async timeout in the async versions? I'm wary of spawning a thread just for a timeout in the async case.
| /// Parsed value of `UV_LOCK_TIMEOUT`, with a default of 5 min. | ||
| static LOCK_TIMEOUT: LazyLock<Duration> = LazyLock::new(|| { | ||
| let default_timeout = Duration::from_secs(300); | ||
| let Some(lock_timeout) = env::var_os(EnvVars::UV_LOCK_TIMEOUT) else { | ||
| return default_timeout; | ||
| }; | ||
|
|
||
| if let Some(lock_timeout) = lock_timeout | ||
| .to_str() | ||
| .and_then(|lock_timeout| lock_timeout.parse::<u64>().ok()) | ||
| { | ||
| Duration::from_secs(lock_timeout) | ||
| } else { | ||
| warn!( | ||
| "Could not parse value of {} as integer: {:?}", | ||
| EnvVars::UV_LOCK_TIMEOUT, | ||
| lock_timeout | ||
| ); | ||
| default_timeout | ||
| } | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to have this to our standard environment variable parsing in EnvironmentOptions instead, I don't want to keep adding ad-hoc parsing like this.
If you want to defer it to reduce churn, that's okay — but we should add it the tracking issue and make sure it's moved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added it to #14720, do you want a separate tracking issue?
I had looked into parsing this centrally but the locks are called in a lot of locations including e.g. a LazyLock in a Default impl (
uv/crates/uv-auth/src/middleware.rs
Line 79 in 7f38a8a
| match TextCredentialStore::read(&path) { |
crates/uv-fs/src/locked_file.rs
Outdated
| #[derive(Debug, Error)] | ||
| pub enum LockedFileError { | ||
| #[error( | ||
| "Timeout ({}s) when waiting for lock on `{}` at `{}`, is another uv process running? Set `{}` to increase the timeout.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to say "You can set ... to increase the timeout" instead of "Set" which makes it sounds like you should do that as the solution.
crates/uv/tests/it/cache_clean.rs
Outdated
| // Write a test package that builds for a while | ||
| let child_pyproject_toml = context.temp_dir.child("pyproject.toml"); | ||
| child_pyproject_toml.write_str(indoc! {r#" | ||
| [project] | ||
| name = "child" | ||
| version = "0.1.0" | ||
| requires-python = ">=3.9" | ||
| [build-system] | ||
| requires = [] | ||
| backend-path = ["."] | ||
| build-backend = "build_backend" | ||
| "#})?; | ||
| // File to wait until the lock is acquired from starting the build. | ||
| let ready_file = context.temp_dir.child("ready_file.txt"); | ||
| let build_backend = context.temp_dir.child("build_backend.py"); | ||
| build_backend.write_str(&formatdoc! {r#" | ||
| import time | ||
| from pathlib import Path | ||
| Path(r"{}").touch() | ||
| # Make the test fail quickly if something goes wrong | ||
| time.sleep(10) | ||
| "#, | ||
| // Don't run tests in directories with double quotes, please. | ||
| ready_file.display(), | ||
| })?; | ||
|
|
||
| let mut child = context.pip_install().arg(".").spawn()?; | ||
|
|
||
| // Wait until we've acquired the lock in the first process. | ||
| while !ready_file.exists() { | ||
| std::thread::sleep(std::time::Duration::from_millis(1)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more complicated than it needs to be. We can just do
uv/crates/uv/tests/it/cache_clean.rs
Line 74 in 107d4e0
| let _cache = uv_cache::Cache::from_path(context.cache_dir.path()).with_exclusive_lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#16342 (comment) is my main remaining caveat.
We should probably also add a note in https://docs.astral.sh/uv/concepts/cache since that's the main place this will be relevant.
|
On the timing, I guess I might expect something like 60s rather than 5m? 5m is nice and conservative though, we could reduce it later once we see that 5m doesn't break anything |
| ) -> anyhow::Result<Vec<PathBuf>> { | ||
| let cache = Cache::from_path(temp_dir.child("cache").to_path_buf()).init()?; | ||
| let cache = Cache::from_path(temp_dir.child("cache").to_path_buf()) | ||
| .init_no_wait()? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bit more risky change because it assumes tests do not lock or spawn something in the background and then operate on Python versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative is making every integration test async
e15037e to
762fbf6
Compare
|
I rewrote it entirely async and removed the duplication between sync and async as well as shared and exclusive.
I can see some (e.g. Rust) build taking >60s, so I'd like to go with a higher timeout. |
When a process is running and another calls
uv cache cleanoruv cache prunewe currently deadlock - sometimes until the CI timeout (astral-sh/setup-uv#588). To avoid this, we add a default 5 min timeout waiting for a lock. 5 min balances allowing in-progress builds to finish, especially with larger native dependencies, while also giving timely errors for deadlocks on (remote) systems.Commit 1 is a refactoring.