-
Notifications
You must be signed in to change notification settings - Fork 32
mctp: Add retry for one-time peer property queries on timeout #129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
25e6a14 to
a7f8281
Compare
|
Hi Daniel, Thanks for the contribution! I think the retries are a good plan there. I'll put a few comments inline, but as a first edit can you squash the fixes into the original patch? That means we don't have a point in history where tests are failing. Speaking of tests, can you add some for this behaviour? let me know if you need a hand doing so. |
src/mctpd.c
Outdated
| if (peer->ctx->verbose) | ||
| warnx("Retrying to get endpoint types for %s. Attempt %d", | ||
| peer_tostr(peer), i + 1); | ||
| rc = query_get_peer_msgtypes(peer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, we now have duplicate calls to query_get_peer_msgtypes(). If you include the first call in the retry loop you can eliminate this.
No need for including the first warnx warning in the timeout case.
Same with query_get_peer_uuid() below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to
for (unsigned int i = 0; i < max_retries; i++) {
rc = query_get_peer_msgtypes(peer);
// Success
if (rc == 0)
break;
// On timeout, retry
if (rc == -ETIMEDOUT) {
if (peer->ctx->verbose)
warnx("Retrying to get endpoint types for %s. Attempt %u",
peer_tostr(peer), i + 1);
continue;
}
// On other errors, warn and ignore
if (rc < 0) {
if (peer->ctx->verbose)
warnx("Error getting endpoint types for %s. Ignoring error %d %s",
peer_tostr(peer), -rc, strerror(-rc));
rc = 0;
break;
}
}Same with query_get_peer_uuid()
src/mctpd.c
Outdated
| static int query_peer_properties(struct peer *peer) | ||
| { | ||
| int rc; | ||
| const int max_retries = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super minor: can you reverse-christmas-tree these, so we would have:
const int max_retries = 4;
int rc;may as well make this unsigned int too, as well as the loop counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and we should make this configurable, but that would be best as a later change. No need to include that in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
a7f8281 to
2de7b87
Compare
The function `query_peer_properties()` is called once during peer initialization
to query basic information after the EID becomes routable.
To improve reliability, this change adds a retry mechanism when the query
fails with `-ETIMEDOUT`. Since these queries are one-time initialization steps,
a single successful attempt is sufficient, and retrying enhances stability
under transient MCTP bus contention or multi-master timing issues.
Testing:
add stress test for peer initialization under multi-master
```
while true; do
echo "Restarting mctpd.service..."
systemctl restart mctpd.service
# Wait a few seconds to allow service to initialize
sleep 20
done
```
After the 30 loops, the script checks mctpd.service journal for expected
retry messages to verify robustness under transient MCTP bus contention.
```
root@bmc:~# journalctl -xeu mctpd.service | grep Retrying
Oct 29 00:35:21 bmc mctpd[31801]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:00 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:01 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 2
Oct 29 00:45:08 bmc mctpd[32360]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
```
Signed-off-by: Daniel Hsu <Daniel-Hsu@quantatw.com>
2de7b87 to
18ab1b5
Compare
If we need to build a proper unit test for this behaviour, I’ll need some help — since the function is static in a .c file and not easy to isolate for unit testing. I did add more details in the PR description showing how I tested this manually in a multi-master environment (restarting mctpd.service repeatedly and checking the journal for retries). |
|
The test cases for mctpd are in I figure we'll need a fake endpoint implementation that drops the first Get Endpoint UUID command, for example. |
The function
query_peer_properties()is called once during peer initialization to query basic information after the EID becomes routable. To improve reliability, this change adds a retry mechanism when the query fails with-ETIMEDOUT. Since these queries are one-time initialization steps, a single successful attempt is sufficient, and retrying enhances stability under transient MCTP bus contention or multi-master timing issues.Testing:
add stress test for peer initialization under multi-master
After the 30 loops, the script checks mctpd.service journal for expected
retry messages to verify robustness under transient MCTP bus contention.