clientv3: implement exponential backoff mechanism #20731
Conversation
Hi @elias-dbx. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Once the patch is verified, the new status will be reflected. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
    }

    func TestBackoffExponent(t *testing.T) {
        backoffExponent := float64(2.0)
Should we add another test case with backoffExponent = 1.0 to ensure that backward compatibility is preserved (i.e., exponential backoff stays disabled)?
Sure, added a test case for backoffExponent = 1.0.
    defaultBackoffExponent = 1.0

    // client-side retry backoff exponential max wait between requests.
    defaultBackoffMaxWaitBetween = 5 * time.Second
Could you please give some context on why 5 seconds? I have seen defaultBackoffWaitBetween = 25 * time.Millisecond, but I don't know how it relates to these 5 seconds: https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L53
I chose 5 seconds as it is roughly in line with other widely used client-side libraries. For example, the aws-sdk-go-v2 client libraries use a default max backoff of 20 seconds: https://github.com/aws/aws-sdk-go-v2/blob/main/aws/retry/standard.go#L31
Also, from experience, a backoff in the range of 5-30 seconds is long enough to shed load, but short enough that the system will converge in a timely manner. This is only from my experience running distributed systems.
I tried to research whether there are any white papers evaluating max backoff parameters, but I could not find any.
If someone has a strong reason why the default max backoff should be a different value, I would be amenable to changing it.
    {generation: 2, exponent: 1.0, minDelay: 100 * time.Millisecond, maxDelay: 500 * time.Millisecond, expectedBackoff: 100 * time.Millisecond},
    {generation: 3, exponent: 1.0, minDelay: 100 * time.Millisecond, maxDelay: 500 * time.Millisecond, expectedBackoff: 100 * time.Millisecond},
    {generation: math.MaxUint, exponent: 1.0, minDelay: 100 * time.Millisecond, maxDelay: 500 * time.Millisecond, expectedBackoff: 100 * time.Millisecond},
    }
I was also thinking whether it is worth having a test case for backoffExponent = 0 and/or exponent = 0. Regarding this function:

    func expBackoff(generation uint, exponent float64, minDelay, maxDelay time.Duration) time.Duration {
        delay := math.Min(math.Pow(exponent, float64(generation))*float64(minDelay), float64(maxDelay))
        return time.Duration(delay)
    }

math.Pow(0, n) = 0 for n > 0, which means that the delay would always be 0 (or minDelay if there were a lower-bound check).
> I was also thinking if it is worth having a test case for backoffExponent = 0 and/or exponent = 0

That is not a possible configuration: if the user passes in BackoffExponent=0 or BackoffMaxWaitBetween=0, the default values (1.0 and 5 seconds respectively) will be used.
Makes sense.
The current clientv3 backoff behavior is a flat backoff with jitter. Too low a backoff wait time can amplify cascading failures, since client requests can be retried many times with only a short wait between attempts. Operators of large etcd clusters can increase the backoff wait time, but for large clusters that wait time needs to be quite large to safely protect the cluster from many clients retrying at once. A very high backoff time, however, means that retries during a non-cascading failure wait longer than needed. A better solution, which handles cascading failures while keeping retry times low in non-cascading failures, is to implement exponential backoff within the etcd clients.

This commit implements the mechanism for exponential backoff in clients with two new parameters:

1. BackoffExponent: configures the exponential backoff factor. For example, a BackoffExponent of 2.0 doubles the backoff time between each retry. The default value of BackoffExponent is 1.0, which disables exponential backoff for backward compatibility.
2. BackoffMaxWaitBetween: configures the max wait time when performing exponential backoff. The default value is 5 seconds.

Signed-off-by: Elias Carter <elias@dropbox.com>
Force-pushed 8066e99 to f364725 (compare).
@ronaldngounou thanks for the review, please see my comments and update.
/ok-to-test
@elias-dbx: The following tests failed, say

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Codecov Report

✅ All modified and coverable lines are covered by tests.

    @@            Coverage Diff             @@
    ##             main   #20731      +/-   ##
    ==========================================
    + Coverage   69.16%   69.19%   +0.02%
    ==========================================
      Files         420      422       +2
      Lines       34817    34836      +19
    ==========================================
    + Hits        24081    24104      +23
    + Misses       9338     9335       -3
    + Partials     1398     1397       -1

... and 28 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull request has been approved by: elias-dbx, ronaldngounou. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Related to: #20717