
Conversation


@filipcirtog filipcirtog commented Oct 31, 2025

Summary

If a deployment is moved to a different project, the automation agent password is re-generated, triggering a password change in the automation plan.

In a sharded cluster this causes a deadlock, because multiple components require automation; replica sets are not affected.

This is a blocker for migrating sharded deployments between projects.

Proof of Work

For SCRAM (the only auth mechanism that re-generates a password), we now save the automation agent's password in a secret. During migration, the stored secret is used to preserve the password, making project migration possible.
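As an illustration of that idea (not the operator's actual Go implementation), a read-or-create helper for the agent password might look like this; `secret_store` is a plain dict standing in for the Kubernetes Secret API, and the function name is hypothetical:

```python
import secrets

def ensure_agent_password(secret_store: dict, secret_name: str) -> str:
    # Reuse the previously stored password if the secret already exists;
    # this is what keeps the password stable across a project migration,
    # so no cluster-wide password rotation (and no sharded-cluster
    # deadlock) is triggered.
    existing = secret_store.get(secret_name)
    if existing is not None:
        return existing
    # First reconciliation: generate the password once and persist it.
    password = secrets.token_urlsafe(32)
    secret_store[secret_name] = password
    return password
```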

Observed problems

For LDAP (Sharded + Replica) and SCRAM (Sharded), the following tests are failing, even though the only modification is updating the MongoDB resource's project reference, with no other changes applied. To aid further investigation, I have commented out certain code in the tests (which can make them fail) so the issue can be reproduced consistently.

I’ve observed that when users are defined beforehand and the project is changed, there is a risk that the deployment does not automatically return to the "running" state. This appears to happen because one or more pods fail to receive the updated automation configuration.
Furthermore, for LDAP deployments, while the deployment may return to the "running" state, the users are missing from the automation configuration.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?


github-actions bot commented Oct 31, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.6.0 Release Notes

New Features

  • MongoDBCommunity: Added support for configuring a custom cluster domain via the newly introduced spec.clusterDomain resource field. If spec.clusterDomain is not set, the CLUSTER_DOMAIN environment variable is used as the cluster domain. If CLUSTER_DOMAIN is also not set, the operator falls back to cluster.local as the default.
  • Helm Chart: Introduced two new Helm fields, operator.podSecurityContext and operator.securityContext, which can be used to configure the securityContext for the Operator deployment through the Helm Chart.
  • MongoDBSearch: Switched to gRPC and mTLS for internal communication.
    Since MCK 1.4, the mongod and mongot processes have communicated using the MongoDB Wire Protocol with keyfile authentication. This release switches that to gRPC with mTLS authentication. gRPC will allow search queries to be load-balanced across multiple mongot processes in the future, and mTLS decouples the internal cluster authentication mode and credentials among mongod processes from the connection to the mongot process. The Operator will automatically enable gRPC for existing and new workloads, and will enable mTLS authentication if both the Database Server and the MongoDBSearch resource are configured for TLS.
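The cluster-domain fallback chain described in the MongoDBCommunity item above can be sketched as follows (an illustrative Python sketch; the operator itself is written in Go, and resolve_cluster_domain is a hypothetical name):

```python
import os

DEFAULT_CLUSTER_DOMAIN = "cluster.local"

def resolve_cluster_domain(spec_cluster_domain=None):
    # Precedence per the release note:
    # spec.clusterDomain > CLUSTER_DOMAIN env var > cluster.local
    if spec_cluster_domain:
        return spec_cluster_domain
    return os.environ.get("CLUSTER_DOMAIN") or DEFAULT_CLUSTER_DOMAIN
```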

Bug Fixes

  • Fixed parsing of the customEnvVars Helm value when values contain = characters.
  • ReplicaSet: Blocked disabling TLS and changing member count simultaneously. These operations must now be applied separately to prevent configuration inconsistencies.
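The customEnvVars fix above presumably comes down to splitting each entry on the first = only, so values that themselves contain = (base64 padding, connection strings) survive intact. A minimal sketch of that idea, not the actual Helm/operator code:

```python
def parse_env_var(entry: str):
    # Split on the first '=' only; everything after it is the value.
    key, sep, value = entry.partition("=")
    if not sep or not key:
        raise ValueError(f"not a KEY=VALUE entry: {entry!r}")
    return key, value
```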

Other Changes

  • Simplified MongoDB Search setup: Removed the custom Search Coordinator polyfill (a piece of compatibility code previously needed to add the required permissions), as MongoDB 8.2.0 and later now include the necessary permissions via the built-in searchCoordinator role.
  • kubectl-mongodb plugin: cosign, the signing tool used to sign kubectl-mongodb plugin binaries, has been updated to version 3.0.2. With this change, released binaries will be bundled with .bundle files containing both signature and certificate information. For more information on how to verify signatures with the new cosign version, see https://github.com/sigstore/cosign/blob/v3.0.2/doc/cosign_verify-blob.md

@filipcirtog filipcirtog marked this pull request as ready for review November 4, 2025 14:05
@filipcirtog filipcirtog requested a review from a team as a code owner November 4, 2025 14:05
// Configure will configure all the specified authentication Mechanisms. We need to ensure we wait for
// the agents to reach ready state after each operation as prematurely updating the automation config can cause the agents to get stuck.
-func Configure(conn om.Connection, opts Options, isRecovering bool, log *zap.SugaredLogger) error {
+func Configure(client kubernetesClient.Client, ctx context.Context, mdbNamespacedName *types.NamespacedName, conn om.Connection, opts Options, isRecovering bool, log *zap.SugaredLogger) error {
@lsierant (Contributor) commented Nov 7, 2025

nit: ctx should always be first arg

// Disable disables all authentication mechanisms, and waits for the agents to reach goal state. It is still required to provide
// automation agent username, password and keyfile contents to ensure a valid Automation Config.
-func Disable(conn om.Connection, opts Options, deleteUsers bool, log *zap.SugaredLogger) error {
+func Disable(client kubernetesClient.Client, ctx context.Context, mdbNamespacedName *types.NamespacedName, conn om.Connection, opts Options, deleteUsers bool, log *zap.SugaredLogger) error {
Contributor commented:

types.NamespacedName is usually not passed by pointer. Is there a reason it's a pointer here? Is passing nil here a valid case?

authentication:
agents:
# This may look weird, but without it we'll get this from OpsManager:
# Cannot configure SCRAM-SHA-1 without using MONGODB-CR in te Agent Mode","reason":"Cannot configure SCRAM-SHA-1 without using MONGODB-CR in te Agent Mode
Contributor commented:

could you please fix the typo "te" in the mentioned validation message, btw?

security:
authentication:
agents:
# This may look weird, but without it we'll get this from OpsManager:
@lsierant (Contributor) commented Nov 7, 2025

nit: remove "This may look weird" - let's state the why objectively.

Contributor commented:

btw. SCRAM-SHA-1 is deprecated and it requires some additional legacy enablement with that MONGODB-CR IIRC

server_certs: str,
namespace: str,
) -> MongoDB:
resource = MongoDB.from_yaml(find_fixture(f"switch-project/{MDB_FIXTURE_NAME}.yaml"), namespace=namespace)
Contributor commented:

could we use some basic fixture to avoid creating redundant YAMLs? I see you're configuring all the security and auth in the test anyway

Ensures test isolation in a multi-namespace test environment.
"""
return random_k8s_name(f"{namespace}-project-")
Contributor commented:

we don't need to randomize it - the namespace is already randomized in evg anyway. Randomizing it just makes local runs difficult and impossible to re-run.

Contributor commented:

if you want different project names for different deployments, just append the resource name: {namespace}-{mdb.name} - that suffices to make them unique within the test
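The reviewer's suggestion could look like the following hypothetical helper; the 63-character cap is added here only to respect the Kubernetes DNS-label length limit and is not part of the original suggestion:

```python
def project_name(namespace: str, resource_name: str) -> str:
    # Deterministic: stable across re-runs (unlike random_k8s_name),
    # yet unique per deployment because the resource name is included.
    name = f"{namespace}-{resource_name}"
    return name[:63]  # Kubernetes DNS labels are capped at 63 characters
```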



@pytest.fixture(scope="module")
def replica_set(namespace: str) -> MongoDB:
Contributor commented:

resource fixtures should be function-scoped and use the if try_load(resource) guard (look for it in other tests).

We don't have this pattern applied across the board, but we try to use it for newer tests.
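A minimal sketch of the function-scoped fixture pattern the reviewer describes, with a stub try_load standing in for the repo's real helper (here the _cluster dict plays the role of the Kubernetes API; all names besides try_load are hypothetical):

```python
import pytest

_cluster = {}  # stand-in for the Kubernetes API, keyed by resource name

def try_load(resource: dict) -> bool:
    # Stub of the repo's try_load helper: if the resource already exists,
    # refresh the local object from the "cluster" and report success.
    existing = _cluster.get(resource["name"])
    if existing is not None:
        resource.update(existing)
        return True
    return False

def build_replica_set() -> dict:
    resource = {"name": "my-replica-set"}
    if try_load(resource):
        return resource  # reuse the live state instead of a stale copy
    resource["spec"] = {"members": 3}
    _cluster[resource["name"]] = resource
    return resource

@pytest.fixture(scope="function")
def replica_set() -> dict:
    # Function scope + try_load: every test gets a fresh view of the
    # resource, so no manual .load() calls are needed inside tests.
    return build_replica_set()
```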

tester.assert_expected_users(0)

def test_create_secret(self):
print(f"creating password for MongoDBUser {self.USER_NAME} in secret/{self.PASSWORD_SECRET_NAME} ")
Contributor commented:

is this necessary? Normally the progress is already visible when running the test, i.e. which test step is currently executing. This adds nothing beyond that, so I think it's redundant.

},
)

replica_set.load()
Contributor commented:

when you use the function-scope pattern with try_load, you don't need to load the resource manually in the tests, avoiding flakiness due to a stale object


replica_set.load()
replica_set["spec"]["opsManager"]["configMapRef"]["name"] = new_project_configmap
replica_set.set_version(custom_mdb_version)
Contributor commented:

the version should only be set in the resource's fixture, unless changing the version is part of the test.


def test_moved_replica_set_connectivity(self):
"""
Verify connectivity to the replica set after switching projects.
@lsierant (Contributor) commented Nov 7, 2025

the docstring is redundant - just let the function name describe it (it's already good)


def test_ops_manager_state_correctly_updated_in_moved_replica_set(self, replica_set: MongoDB):
"""
Ensure Ops Manager state is correctly updated in the moved replica set after the project switch.
Contributor commented:

this docstring also doesn't add much over the already descriptive function name

MDB_RESOURCE_NAME = "replica-set-scram-sha-1-switch-project"
MDB_FIXTURE_NAME = MDB_RESOURCE_NAME

CONFIG_MAP_KEYS = {
Contributor commented:

is this necessary? Those keys won't ever change, so we can just inline them.



@pytest.mark.e2e_replica_set_x509_switch_project
class TestReplicaSetCreationAndProjectSwitch(KubernetesTester):
@lsierant (Contributor) commented Nov 7, 2025

are the three or four test classes here any different? Do you think we could extract them into a reusable test class parametrized with the resource and auth mechanism?

The way to do this is to have a generic test helper (important: without the pytest.mark annotation!):

class ReplicaSetCreationAndProjectSwitchTestHelper:
    def test_create_replica_set
    def test_ops_manager_state_correctly_updated_in_initial_replica_set
    [...]

And in the actual test files we could have only test functions that simply delegate to the common code:

@pytest.fixture
def test_helper() -> ReplicaSetCreationAndProjectSwitchTestHelper:
    ... configure and return test helper

@pytest.mark.e2e_replica_set_x509_switch_project
def test_create_replica_set(test_helper: ReplicaSetCreationAndProjectSwitchTestHelper):
    test_helper.test_create_replica_set()

... etc. 

I'm a bit worried that we've added just too much duplicated code. We already have a code duplication problem - let's try not to exacerbate it further.

Contributor commented:

an example of using a test helper to reduce duplication:

def test_search_restore_sample_database(mdb: MongoDB):

unfortunately, for now we cannot easily reuse whole test classes with the testing steps. The test functions/classes must be defined in each file separately, but we can organize the code to minimize the duplication
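Putting the reviewer's sketch together, a runnable minimal version of the delegation pattern might look like this (names follow the reviewer's sketch; the helper bodies are placeholders, not the real test steps):

```python
import pytest

class ReplicaSetCreationAndProjectSwitchTestHelper:
    # Carries no pytest.mark annotation, so each e2e file can reuse it
    # and apply its own marker to thin delegating test functions.
    def __init__(self, auth_mechanism: str):
        self.auth_mechanism = auth_mechanism
        self.steps_run = []

    def test_create_replica_set(self):
        # Placeholder for the shared step: create the resource and
        # assert it reaches the running state.
        self.steps_run.append("create_replica_set")

    def test_ops_manager_state_correctly_updated_in_initial_replica_set(self):
        self.steps_run.append("om_state_updated")

@pytest.fixture
def test_helper() -> ReplicaSetCreationAndProjectSwitchTestHelper:
    # Each test file configures the helper for its own auth mechanism.
    return ReplicaSetCreationAndProjectSwitchTestHelper(auth_mechanism="X509")

@pytest.mark.e2e_replica_set_x509_switch_project
def test_create_replica_set(test_helper: ReplicaSetCreationAndProjectSwitchTestHelper):
    test_helper.test_create_replica_set()
```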

# def test_create_secret(self):
# print(f"creating password for MongoDBUser {self.USER_NAME} in secret/{self.PASSWORD_SECRET_NAME} ")

# create_or_update_secret(
Contributor commented:

either remove or uncomment the commented code; if it's necessary to leave it, explain why it's commented out

@lsierant (Contributor) left a review comment

Awesome that you've added so many e2e tests! But let's think about how we could minimize the code duplication there - it looks like all the tests are almost identical.
