@weliang1
Contributor

@weliang1 weliang1 commented Oct 22, 2025

The test "[sig-network] pods should successfully create sandboxes by adding pod to network" failed in two recent Prow jobs. See https://issues.redhat.com/browse/OCPBUGS-63478

Prow job 1:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade/1980375618641989632

Test log:

namespace/openshift-etcd node/ci-op-tz2f8kj9-fbbf2-j6brz-master-0 pod/etcd-guard-ci-op-tz2f8kj9-fbbf2-j6brz-master-0 hmsg/b1ad8ffd50 - 297.49 seconds after deletion - firstTimestamp/2025-10-20T22:22:31Z interesting/true lastTimestamp/2025-10-20T22:22:31Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_etcd-guard-ci-op-tz2f8kj9-fbbf2-j6brz-master-0_openshift-etcd_4f678bb2-10f6-47f8-b080-36cbae9d50a4_0(09b24379f8fe8571b7a5eda9a668692a1f31d31d76772a7dabdb19c9778a43d0): error adding pod openshift-etcd_etcd-guard-ci-op-tz2f8kj9-fbbf2-j6brz-master-0 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"09b24379f8fe8571b7a5eda9a668692a1f31d31d76772a7dabdb19c9778a43d0" Netns:"/var/run/netns/4ab6fd45-b6fa-4709-bb88-a4e7a81b8b3e" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-etcd;K8S_POD_NAME=etcd-guard-ci-op-tz2f8kj9-fbbf2-j6brz-master-0;K8S_POD_INFRA_CONTAINER_ID=09b24379f8fe8571b7a5eda9a668692a1f31d31d76772a7dabdb19c9778a43d0;K8S_POD_UID=4f678bb2-10f6-47f8-b080-36cbae9d50a4" Path:"" ERRORED: error configuring pod [openshift-etcd/etcd-guard-ci-op-tz2f8kj9-fbbf2-j6brz-master-0] networking: Multus: [openshift-etcd/etcd-guard-ci-op-tz2f8kj9-fbbf2-j6brz-master-0/4f678bb2-10f6-47f8-b080-36cbae9d50a4]: error waiting for pod: Get "https://api-int.ci-op-tz2f8kj9-fbbf2.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:6443/api/v1/namespaces/openshift-etcd/pods/etcd-guard-ci-op-tz2f8kj9-fbbf2-j6brz-master-0?timeout=1m0s": context deadline exceeded
': StdinData: {"auxiliaryCNIChainName":"vendor-cni-chain","binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator,openshift-cnv","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}}

Prow job 2:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-upgrade-from-stable-4.20-e2e-metal-ipi-ovn-upgrade/1979889012780830720

Test log:

namespace/openshift-kube-apiserver node/master-2 pod/revision-pruner-9-master-2 hmsg/cde531b9d8 - 145.44 seconds after deletion - firstTimestamp/2025-10-19T15:01:26Z interesting/true lastTimestamp/2025-10-19T15:01:26Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-9-master-2_openshift-kube-apiserver_a08d5e14-b027-4a1a-ac4e-1e58d9103c01_0(a6bb2c3a3e4af84920d82a41b695d764a677a86e4acc11d80669f9351228444a): error adding pod openshift-kube-apiserver_revision-pruner-9-master-2 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"a6bb2c3a3e4af84920d82a41b695d764a677a86e4acc11d80669f9351228444a" Netns:"/var/run/netns/55eab7e4-f851-47ac-9317-af272dacca1b" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=revision-pruner-9-master-2;K8S_POD_INFRA_CONTAINER_ID=a6bb2c3a3e4af84920d82a41b695d764a677a86e4acc11d80669f9351228444a;K8S_POD_UID=a08d5e14-b027-4a1a-ac4e-1e58d9103c01" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/revision-pruner-9-master-2] networking: Multus: [openshift-kube-apiserver/revision-pruner-9-master-2/a08d5e14-b027-4a1a-ac4e-1e58d9103c01]: error setting the networks status: SetPodNetworkStatusAnnotation: failed to update the pod revision-pruner-9-master-2 in out of cluster comm: SetNetworkStatus: failed to update the pod revision-pruner-9-master-2 in out of cluster comm: status update failed for pod /: Get "https://api-int.ostest.test.metalkube.org:6443/api/v1/namespaces/openshift-kube-apiserver/pods/revision-pruner-9-master-2?timeout=1m0s": dial tcp: lookup api-int.ostest.test.metalkube.org on 192.168.111.1:53: no such host
': StdinData: {"auxiliaryCNIChainName":"vendor-cni-chain","binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator,openshift-cnv","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}

For deleted pods, the code never checks whether the failure occurred while an operator was Progressing. It only checks the time elapsed since deletion, and reports a hard failure when that exceeds 5 seconds. By the code's own logic, sandbox failures that occur while the etcd, dns, or network operator is Progressing should be treated more leniently (as flakes), but this check is missing from the deleted-pod branch.

The fix adds the operator Progressing checks to the deleted-pod branch.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Oct 22, 2025
@openshift-ci openshift-ci bot requested review from p0lyn0mial and sjenning October 22, 2025 19:40
@openshift-ci
Contributor

openshift-ci bot commented Oct 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weliang1
Once this PR has been reviewed and has the lgtm label, please assign smg247 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-trt

openshift-trt bot commented Oct 22, 2025

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New tests seen in this PR at sha: d1251ad

  • "Import the release payload "nightly-arm64" from an external source" [Total: 2, Pass: 2, Fail: 0, Flake: 0]

@weliang1
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.20-e2e-vsphere-runc-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Oct 23, 2025

@weliang1: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-vsphere-runc-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/cf40ee70-b01b-11f0-9e1f-9a80527f472e-0

@weliang1
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.21-e2e-vsphere-runc-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Oct 23, 2025

@weliang1: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.21-e2e-vsphere-runc-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/19cb6ba0-b01c-11f0-9215-b5d627521940-0

@weliang1
Contributor Author

weliang1 commented Oct 23, 2025

/payload-job periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-azure-ovn-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Oct 23, 2025

@weliang1: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c2dba620-b05c-11f0-9e18-0c6ac06f757c-0

@weliang1
Contributor Author

/test e2e-gcp-ovn

@weliang1
Contributor Author

/test e2e-metal-ipi-ovn-ipv6

@weliang1
Contributor Author

/test okd-scos-e2e-aws-ovn

@openshift-ci
Contributor

openshift-ci bot commented Oct 24, 2025

@weliang1: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | d1251ad | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Oct 26, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

