
Conversation


@filipcirtog filipcirtog commented Oct 31, 2025

Summary

If a deployment is moved to a different project, the automation agent password is re-generated, triggering a password change in the automation plan.

In a sharded cluster this causes a deadlock, because multiple components require automation; replica sets are not affected.

This is a blocker for migrating sharded deployments between projects.

Proof of Work

For SCRAM (the only auth mechanism that re-generates a password), we now save the automation agent's password in a secret. During migration, the stored secret is used to preserve the password, making project migration possible.
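As an illustration of that idea (not the operator's actual Go implementation), a read-or-create helper for the agent password might look like this; `secret_store` is a plain dict standing in for the Kubernetes Secret API, and the function name is hypothetical:

```python
import secrets

def ensure_agent_password(secret_store: dict, secret_name: str) -> str:
    # Reuse the previously stored password if the secret already exists;
    # this is what keeps the password stable across a project migration,
    # so no cluster-wide password rotation (and no sharded-cluster
    # deadlock) is triggered.
    existing = secret_store.get(secret_name)
    if existing is not None:
        return existing
    # First reconciliation: generate the password once and persist it.
    password = secrets.token_urlsafe(32)
    secret_store[secret_name] = password
    return password
```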

Observed problems

For LDAP (Sharded + Replica) and SCRAM (Sharded), the following tests are failing, even though the only modification is updating the MongoDB resource's project reference, with no other changes applied. To aid further investigation, I have commented out certain code in the tests (which can make them fail) so the issue can be reproduced consistently.

I’ve observed that when users are defined beforehand and the project is changed, there is a risk that the deployment does not automatically return to the "running" state. This appears to happen because one or more pods fail to receive the updated automation configuration.
Furthermore, for LDAP deployments, while the deployment may return to the "running" state, the users are missing from the automation configuration.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?


github-actions bot commented Oct 31, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.6.0 Release Notes

New Features

  • MongoDBCommunity: Added support for configuring a custom cluster domain via the newly introduced spec.clusterDomain resource field. If spec.clusterDomain is not set, the CLUSTER_DOMAIN environment variable is used as the cluster domain. If CLUSTER_DOMAIN is also not set, the operator falls back to cluster.local as the default.
  • Helm Chart: Introduced two new Helm fields, operator.podSecurityContext and operator.securityContext, which can be used to configure the securityContext for the Operator deployment through the Helm Chart.
  • MongoDBSearch: Switched to gRPC and mTLS for internal communication.
    Since MCK 1.4, the mongod and mongot processes have communicated using the MongoDB Wire Protocol with keyfile authentication. This release switches that to gRPC with mTLS authentication. gRPC will allow search queries to be load-balanced across multiple mongot processes in the future, and mTLS decouples the internal cluster authentication mode and credentials among mongod processes from the connection to the mongot process. The Operator will automatically enable gRPC for existing and new workloads, and will enable mTLS authentication if both the Database Server and the MongoDBSearch resource are configured for TLS.
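The cluster-domain fallback chain described in the MongoDBCommunity item above can be sketched as follows (an illustrative Python sketch; the operator itself is written in Go, and resolve_cluster_domain is a hypothetical name):

```python
import os

DEFAULT_CLUSTER_DOMAIN = "cluster.local"

def resolve_cluster_domain(spec_cluster_domain=None):
    # Precedence per the release note:
    # spec.clusterDomain > CLUSTER_DOMAIN env var > cluster.local
    if spec_cluster_domain:
        return spec_cluster_domain
    return os.environ.get("CLUSTER_DOMAIN") or DEFAULT_CLUSTER_DOMAIN
```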

Bug Fixes

  • Fixed parsing of the customEnvVars Helm value when values contain = characters.
  • ReplicaSet: Blocked disabling TLS and changing member count simultaneously. These operations must now be applied separately to prevent configuration inconsistencies.
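The customEnvVars fix above presumably comes down to splitting each entry on the first = only, so values that themselves contain = (base64 padding, connection strings) survive intact. A minimal sketch of that idea, not the actual Helm/operator code:

```python
def parse_env_var(entry: str):
    # Split on the first '=' only; everything after it is the value.
    key, sep, value = entry.partition("=")
    if not sep or not key:
        raise ValueError(f"not a KEY=VALUE entry: {entry!r}")
    return key, value
```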

Other Changes

  • Simplified MongoDB Search setup: Removed the custom Search Coordinator polyfill (a piece of compatibility code previously needed to add the required permissions), as MongoDB 8.2.0 and later now include the necessary permissions via the built-in searchCoordinator role.
  • kubectl-mongodb plugin: cosign, the signing tool used to sign kubectl-mongodb plugin binaries, has been updated to version 3.0.2. With this change, released binaries will be bundled with .bundle files containing both signature and certificate information. For more information on how to verify signatures with the new cosign version, see https://github.com/sigstore/cosign/blob/v3.0.2/doc/cosign_verify-blob.md

@filipcirtog filipcirtog marked this pull request as ready for review November 4, 2025 14:05
@filipcirtog filipcirtog requested a review from a team as a code owner November 4, 2025 14:05
// Configure will configure all the specified authentication Mechanisms. We need to ensure we wait for
// the agents to reach ready state after each operation as prematurely updating the automation config can cause the agents to get stuck.
-func Configure(conn om.Connection, opts Options, isRecovering bool, log *zap.SugaredLogger) error {
+func Configure(client kubernetesClient.Client, ctx context.Context, mdbNamespacedName *types.NamespacedName, conn om.Connection, opts Options, isRecovering bool, log *zap.SugaredLogger) error {
@lsierant (Contributor) commented Nov 7, 2025

nit: ctx should always be first arg

// Disable disables all authentication mechanisms, and waits for the agents to reach goal state. It is still required to provide
// automation agent username, password and keyfile contents to ensure a valid Automation Config.
-func Disable(conn om.Connection, opts Options, deleteUsers bool, log *zap.SugaredLogger) error {
+func Disable(client kubernetesClient.Client, ctx context.Context, mdbNamespacedName *types.NamespacedName, conn om.Connection, opts Options, deleteUsers bool, log *zap.SugaredLogger) error {
Contributor commented:

types.NamespacedName is usually not passed by pointer. Is there a reason it's a pointer here? Is passing nil here a valid case?

authentication:
agents:
# This may look weird, but without it we'll get this from OpsManager:
# Cannot configure SCRAM-SHA-1 without using MONGODB-CR in te Agent Mode","reason":"Cannot configure SCRAM-SHA-1 without using MONGODB-CR in te Agent Mode
Contributor commented:

could you please fix the typo "te" in the mentioned validation message, btw?

security:
authentication:
agents:
# This may look weird, but without it we'll get this from OpsManager:
@lsierant (Contributor) commented Nov 7, 2025

nit: remove "This may look weird" - let's state the why objectively.

Contributor commented:

btw. SCRAM-SHA-1 is deprecated and it requires some additional legacy enablement with that MONGODB-CR IIRC

server_certs: str,
namespace: str,
) -> MongoDB:
resource = MongoDB.from_yaml(find_fixture(f"switch-project/{MDB_FIXTURE_NAME}.yaml"), namespace=namespace)
Contributor commented:

could we use some basic fixture to avoid creating redundant YAMLs? I see you're configuring all the security and auth in the test anyway

Ensures test isolation in a multi-namespace test environment.
"""
return random_k8s_name(f"{namespace}-project-")
Contributor commented:

we don't need to randomize it - the namespace is already randomized in evg anyway. Randomizing it just makes local runs difficult and impossible to re-run.

Contributor commented:

if you want different project names for different deployments, just append the resource name: {namespace}-{mdb.name} - that suffices to make them unique within the test
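The reviewer's suggestion could look like the following hypothetical helper; the 63-character cap is added here only to respect the Kubernetes DNS-label length limit and is not part of the original suggestion:

```python
def project_name(namespace: str, resource_name: str) -> str:
    # Deterministic: stable across re-runs (unlike random_k8s_name),
    # yet unique per deployment because the resource name is included.
    name = f"{namespace}-{resource_name}"
    return name[:63]  # Kubernetes DNS labels are capped at 63 characters
```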



@pytest.fixture(scope="module")
def replica_set(namespace: str) -> MongoDB:
Contributor commented:

resource fixtures should be function-scoped and use the if try_load(resource) guard (look for it in other tests).

We don't have this pattern applied across the board, but we try to use it for newer tests.
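A minimal sketch of the function-scoped fixture pattern the reviewer describes, with a stub try_load standing in for the repo's real helper (here the _cluster dict plays the role of the Kubernetes API; all names besides try_load are hypothetical):

```python
import pytest

_cluster = {}  # stand-in for the Kubernetes API, keyed by resource name

def try_load(resource: dict) -> bool:
    # Stub of the repo's try_load helper: if the resource already exists,
    # refresh the local object from the "cluster" and report success.
    existing = _cluster.get(resource["name"])
    if existing is not None:
        resource.update(existing)
        return True
    return False

def build_replica_set() -> dict:
    resource = {"name": "my-replica-set"}
    if try_load(resource):
        return resource  # reuse the live state instead of a stale copy
    resource["spec"] = {"members": 3}
    _cluster[resource["name"]] = resource
    return resource

@pytest.fixture(scope="function")
def replica_set() -> dict:
    # Function scope + try_load: every test gets a fresh view of the
    # resource, so no manual .load() calls are needed inside tests.
    return build_replica_set()
```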

tester.assert_expected_users(0)

def test_create_secret(self):
print(f"creating password for MongoDBUser {self.USER_NAME} in secret/{self.PASSWORD_SECRET_NAME} ")
Contributor commented:

is this necessary? Normally the progress is already visible when running the test, i.e. which test step is currently executing. This adds nothing beyond that, so I think it's redundant.

},
)

replica_set.load()
Contributor commented:

when you use the function-scope pattern with try_load, you don't need to load the resource manually in the tests, avoiding flakiness due to a stale object


replica_set.load()
replica_set["spec"]["opsManager"]["configMapRef"]["name"] = new_project_configmap
replica_set.set_version(custom_mdb_version)
Contributor commented:

the version should only be set in the resource's fixture, unless changing the version is part of the test.


def test_moved_replica_set_connectivity(self):
"""
Verify connectivity to the replica set after switching projects.
@lsierant (Contributor) commented Nov 7, 2025

the docstring is redundant - just let the function name describe it (it's already good)


def test_ops_manager_state_correctly_updated_in_moved_replica_set(self, replica_set: MongoDB):
"""
Ensure Ops Manager state is correctly updated in the moved replica set after the project switch.
Contributor commented:

this docstring also doesn't add much over the already descriptive function name

MDB_RESOURCE_NAME = "replica-set-scram-sha-1-switch-project"
MDB_FIXTURE_NAME = MDB_RESOURCE_NAME

CONFIG_MAP_KEYS = {
Contributor commented:

is this necessary? Those keys won't ever change, so we can just inline them.



@pytest.mark.e2e_replica_set_x509_switch_project
class TestReplicaSetCreationAndProjectSwitch(KubernetesTester):
@lsierant (Contributor) commented Nov 7, 2025

are the three or four test classes here any different? Do you think we could extract them into a reusable test class parametrized with the resource and auth mechanism?

The way to do this is to have a generic test helper (important: without the pytest.mark annotation!):

class ReplicaSetCreationAndProjectSwitchTestHelper:
    def test_create_replica_set
    def test_ops_manager_state_correctly_updated_in_initial_replica_set
    [...]

And in the actual test files we could have only test functions that simply delegate to the common code:

@pytest.fixture
def test_helper() -> ReplicaSetCreationAndProjectSwitchTestHelper:
    ... configure and return test helper

@pytest.mark.e2e_replica_set_x509_switch_project
def test_create_replica_set(test_helper: ReplicaSetCreationAndProjectSwitchTestHelper):
    test_helper.test_create_replica_set()

... etc. 

I'm a bit worried that we've added just too much duplicated code. We already have a code duplication problem - let's try not to exacerbate it further.

Contributor commented:

an example of using a test helper to reduce duplication:

def test_search_restore_sample_database(mdb: MongoDB):

unfortunately, for now we cannot easily reuse whole test classes with the testing steps. The test functions/classes must be defined in each file separately, but we can organize the code to minimize the duplication
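Putting the reviewer's sketch together, a runnable minimal version of the delegation pattern might look like this (names follow the reviewer's sketch; the helper bodies are placeholders, not the real test steps):

```python
import pytest

class ReplicaSetCreationAndProjectSwitchTestHelper:
    # Carries no pytest.mark annotation, so each e2e file can reuse it
    # and apply its own marker to thin delegating test functions.
    def __init__(self, auth_mechanism: str):
        self.auth_mechanism = auth_mechanism
        self.steps_run = []

    def test_create_replica_set(self):
        # Placeholder for the shared step: create the resource and
        # assert it reaches the running state.
        self.steps_run.append("create_replica_set")

    def test_ops_manager_state_correctly_updated_in_initial_replica_set(self):
        self.steps_run.append("om_state_updated")

@pytest.fixture
def test_helper() -> ReplicaSetCreationAndProjectSwitchTestHelper:
    # Each test file configures the helper for its own auth mechanism.
    return ReplicaSetCreationAndProjectSwitchTestHelper(auth_mechanism="X509")

@pytest.mark.e2e_replica_set_x509_switch_project
def test_create_replica_set(test_helper: ReplicaSetCreationAndProjectSwitchTestHelper):
    test_helper.test_create_replica_set()
```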

# def test_create_secret(self):
# print(f"creating password for MongoDBUser {self.USER_NAME} in secret/{self.PASSWORD_SECRET_NAME} ")

# create_or_update_secret(
Contributor commented:

either remove or uncomment the commented code; if it's necessary to leave it, explain why it's commented out

@lsierant (Contributor) left a review comment

Awesome that you've added so many e2e tests! But let's think about how we could minimize the code duplication there - it looks like all the tests are almost identical.
