Skip to content

Conversation

@JunYe1993
Copy link

@JunYe1993 JunYe1993 commented Oct 30, 2025

The function query_peer_properties() is called once during peer initialization to query basic information after the EID becomes routable. To improve reliability, this change adds a retry mechanism when the query fails with -ETIMEDOUT. Since these queries are one-time initialization steps, a single successful attempt is sufficient, and retrying enhances stability under transient MCTP bus contention or multi-master timing issues.

Testing:
add stress test for peer initialization under multi-master

while true; do
    echo "Restarting mctpd.service..."
    systemctl restart mctpd.service

    # Wait a few seconds to allow service to initialize
    sleep 20
done

After the 30 loops, the script checks mctpd.service journal for expected
retry messages to verify robustness under transient MCTP bus contention.

root@bmc:~# journalctl -xeu mctpd.service | grep Retrying
Oct 29 00:35:21 bmc mctpd[31801]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:00 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:01 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 2
Oct 29 00:45:08 bmc mctpd[32360]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1

@jk-ozlabs
Copy link
Member

Hi Daniel,

Thanks for the contribution! I think the retries are a good plan there.

I'll put a few comments inline, but as a first edit can you squash the fixes into the original patch? That means we don't have a point in history where tests are failing.

Speaking of tests, can you add some for this behaviour? let me know if you need a hand doing so.

src/mctpd.c Outdated
if (peer->ctx->verbose)
warnx("Retrying to get endpoint types for %s. Attempt %d",
peer_tostr(peer), i + 1);
rc = query_get_peer_msgtypes(peer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, we now have duplicate calls to query_get_peer_msgtypes(). If you include the first call in the retry loop you can eliminate this.

No need for including the first warnx warning in the timeout case.

Same with query_get_peer_uuid() below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to

for (unsigned int i = 0; i < max_retries; i++) {
		rc = query_get_peer_msgtypes(peer);

		// Success
		if (rc == 0)
			break;

		// On timeout, retry
		if (rc == -ETIMEDOUT) {
			if (peer->ctx->verbose)
				warnx("Retrying to get endpoint types for %s. Attempt %u",
				      peer_tostr(peer), i + 1);
			continue;
		}

		// On other errors, warn and ignore
		if (rc < 0) {
			if (peer->ctx->verbose)
				warnx("Error getting endpoint types for %s. Ignoring error %d %s",
				      peer_tostr(peer), -rc, strerror(-rc));
			rc = 0;
			break;
		}
	}

Same with query_get_peer_uuid()

src/mctpd.c Outdated
static int query_peer_properties(struct peer *peer)
{
int rc;
const int max_retries = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor: can you reverse-christmas-tree these, so we would have:

	const int max_retries = 4;
	int rc;

may as well make this unsigned int too, as well as the loop counter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and we should make this configurable, but that would be best as a later change. No need to include that in this PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

The function `query_peer_properties()` is called once during peer initialization
to query basic information after the EID becomes routable.
To improve reliability, this change adds a retry mechanism when the query
fails with `-ETIMEDOUT`. Since these queries are one-time initialization steps,
a single successful attempt is sufficient, and retrying enhances stability
under transient MCTP bus contention or multi-master timing issues.

Testing:
add stress test for peer initialization under multi-master
```
while true; do
    echo "Restarting mctpd.service..."
    systemctl restart mctpd.service

    # Wait a few seconds to allow service to initialize
    sleep 20
done
```
After the 30 loops, the script checks mctpd.service journal for expected
retry messages to verify robustness under transient MCTP bus contention.
```
root@bmc:~# journalctl -xeu mctpd.service | grep Retrying
Oct 29 00:35:21 bmc mctpd[31801]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:00 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:01 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 2
Oct 29 00:45:08 bmc mctpd[32360]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
```

Signed-off-by: Daniel Hsu <Daniel-Hsu@quantatw.com>
@JunYe1993
Copy link
Author

Speaking of tests, can you add some for this behaviour? let me know if you need a hand doing so.

If we need to build a proper unit test for this behaviour, I’ll need some help — since the function is static in a .c file and not easy to isolate for unit testing.

I did add more details in the PR description showing how I tested this manually in a multi-master environment (restarting mctpd.service repeatedly and checking the journal for retries).

@jk-ozlabs
Copy link
Member

The test cases for mctpd are in tests/test_mctpd.py, and involve setting up a fake mctp environment (local kernel state, and remote MCTP endpoints) that mctpd can interact with.

I figure we'll need a fake endpoint implementation that drops the first Get Endpoint UUID command, for example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants