Conversation

@elias-dbx elias-dbx commented Sep 26, 2025

The current clientv3 backoff behavior is a flat backoff with jitter. A backoff wait time that is too low can amplify cascading failures, since client requests may be retried many times with little delay between attempts. Operators of large etcd clusters can increase the backoff wait time, but for large clusters that wait time must be quite large to safely protect the cluster from a large number of retrying clients. A very high backoff time, in turn, means retries during a non-cascading failure wait longer than necessary. A better way to handle cascading failures while keeping retry times low in non-cascading failures is to implement exponential backoff in the etcd clients.

This commit implements the mechanism for exponential backoff in clients with two new parameters:

  1. BackoffExponent: configures the exponential backoff factor. For example, a BackoffExponent of 2.0 doubles the backoff time between each retry. The default value of BackoffExponent is 1.0, which disables exponential backoff for backward compatibility.
  2. BackoffMaxWaitBetween: configures the max wait time when performing exponential backoff. The default value is 5 seconds.

Related to: #20717

@k8s-ci-robot

Hi @elias-dbx. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

func TestBackoffExponent(t *testing.T) {
	backoffExponent := float64(2.0)
Member


Should we add another test case with backoffExponent = 1.0 to ensure that exponential backoff stays disabled for backward compatibility?

Author


sure, added a test case for backoffExponent = 1.0

defaultBackoffExponent = 1.0

// client-side retry backoff exponential max wait between requests.
defaultBackoffMaxWaitBetween = 5 * time.Second
Member


Could you please give some context on why 5 seconds?
I have seen defaultBackoffWaitBetween = 25 * time.Millisecond but I don't know how it relates to these 5 seconds. https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L53

Author

@elias-dbx elias-dbx Oct 1, 2025


I chose 5 seconds as it is roughly in line with other widely used client side libraries. For example the aws-sdk-go-v2 client libraries use a default max backoff of 20 seconds: https://github.com/aws/aws-sdk-go-v2/blob/main/aws/retry/standard.go#L31

Also, in my experience running distributed systems, a backoff in the range of 5-30 seconds is long enough to shed load but short enough that the system converges in a timely manner.

I tried to research if there are any white papers evaluating max backoff parameters, but I could not find any.

If someone has a strong reason why the default max backoff should be a different value I would be amenable to changing it.

{generation: 2, exponent: 1.0, minDelay: 100 * time.Millisecond, maxDelay: 500 * time.Millisecond, expectedBackoff: 100 * time.Millisecond},
{generation: 3, exponent: 1.0, minDelay: 100 * time.Millisecond, maxDelay: 500 * time.Millisecond, expectedBackoff: 100 * time.Millisecond},
{generation: math.MaxUint, exponent: 1.0, minDelay: 100 * time.Millisecond, maxDelay: 500 * time.Millisecond, expectedBackoff: 100 * time.Millisecond},
}
Member


I was also wondering whether it is worth having a test case for backoffExponent = 0 and/or exponent = 0:

Regarding this function:

http://github.com/etcd-io/etcd/pull/20731/files#diff-c4fbc528cdd146f7f307011a8ea0a5101017b892d3c1805679ef68143ad0bd8cR39-R42

func expBackoff(generation uint, exponent float64, minDelay, maxDelay time.Duration) time.Duration {
	delay := math.Min(math.Pow(exponent, float64(generation))*float64(minDelay), float64(maxDelay))
	return time.Duration(delay)
}

math.Pow(0, n) = 0 for n > 0, which means that the delay would always be 0 (or minDelay if there's a lower bound check)

Author

@elias-dbx elias-dbx Oct 1, 2025


I was also wondering whether it is worth having a test case for backoffExponent = 0 and/or exponent = 0

That is not a possible configuration since if the user passes in BackoffExponent=0 or BackoffMaxWaitBetween=0 the default values (1.0 and 5 seconds respectively) will be used.

Member


Makes sense.

Signed-off-by: Elias Carter <elias@dropbox.com>
@elias-dbx force-pushed the client_exponential_backoff branch from 8066e99 to f364725 on October 1, 2025 at 16:03
@elias-dbx
Author

@ronaldngounou thanks for the review, please see my comments and update.

@ronaldngounou
Member

/ok-to-test

@k8s-ci-robot

@elias-dbx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                   Commit   Required  Rerun command
pull-etcd-verify            f364725  true      /test pull-etcd-verify
pull-etcd-robustness-amd64  f364725  true      /test pull-etcd-robustness-amd64
pull-etcd-coverage-report   f364725  true      /test pull-etcd-coverage-report

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@codecov

codecov bot commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.19%. Comparing base (d3f136a) to head (f364725).
⚠️ Report is 27 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
client/v3/client.go 85.31% <100.00%> (+0.60%) ⬆️
client/v3/config.go 85.71% <ø> (ø)
client/v3/utils.go 100.00% <100.00%> (ø)

... and 28 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #20731      +/-   ##
==========================================
+ Coverage   69.16%   69.19%   +0.02%     
==========================================
  Files         420      422       +2     
  Lines       34817    34836      +19     
==========================================
+ Hits        24081    24104      +23     
+ Misses       9338     9335       -3     
+ Partials     1398     1397       -1     

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3f136a...f364725. Read the comment docs.


@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elias-dbx, ronaldngounou
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

3 participants