
Conversation

@shipilev
Member

@shipilev shipilev commented Oct 27, 2025

See the bug for more discussion.

We are seeing customer regressions in 21.0.9, notably on ECS Fargate. We root-caused them to JDK-8322420: that patch removed the handling of hierarchical_memory_limit; see this hunk.

But at least cgroup v1 still needs it in some conditions, notably on ECS. There is a way to reproduce it with local Docker as well. The key is to set up a host cgroup that is not visible to the container, so that the only way for the container to learn the memory limit is to look at the hierarchical_* values that the kernel computes itself.

Unfortunately, it is not easy to revert the offending hunks in 21.0.9, as there were follow-up refactoring backports. So, to make it work, this PR reinstates the hunks using the new cgroups support code. This also makes the code (subjectively) easier to read, and is in the spirit of past refactorings.

We are planning to pick this patch into 21.0.9, at least in the Corretto downstream, as soon as possible to unbreak users. Therefore, the patch is also kept as crisp as possible.

I tried to come up with a regression test for it, but could not: local reproducers require amending the host configuration, which requires superuser privileges, among other hassles it introduces.

Additional testing:

  • Reproducer with local Docker now passes
  • Reproducer with ECS Fargate now passes
  • Linux x86_64 server fastdebug, containers/ passes on cgroupsv1 host
  • Linux x86_64 server fastdebug, containers/ passes on cgroupsv2 host

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8370572: Cgroups hierarchical memory limit is not honored after JDK-8322420 (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28006/head:pull/28006
$ git checkout pull/28006

Update a local copy of the PR:
$ git checkout pull/28006
$ git pull https://git.openjdk.org/jdk.git pull/28006/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28006

View PR using the GUI difftool:
$ git pr show -t 28006

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28006.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Oct 27, 2025

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Oct 27, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Oct 27, 2025
@openjdk

openjdk bot commented Oct 27, 2025

@shipilev The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 27, 2025
@shipilev shipilev changed the title 8370572: cgroup v1 hierarchical memory limit is not honored after JDK-8322420 8370572: Cgroups hierarchical memory limit is not honored after JDK-8322420 Oct 27, 2025
@mlbridge

mlbridge bot commented Oct 27, 2025

Webrevs

@shipilev
Member Author

shipilev commented Oct 27, 2025

Logs from the local reproducer (see the bug for details): the setup asks for a 1G limit in the parent slice and 25% of container memory as heap size, so the expected heap size is 256M.

Current mainline (broken):

[0.001s][trace][os,container] OSContainer::init: Initializing Container Support
[0.001s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.001s][debug][os,container] OSContainer::init: is_containerized() = true because all controllers are mounted read-only (container case)
...
[0.001s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[0.001s][trace][os,container] Memory Limit is: 9223372036854771712
[0.001s][debug][os,container] container memory limit ignored: 9223372036854771712, upper bound is 264567476224
...
[0.141s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[0.141s][trace][os,container] Memory Limit is: 9223372036854771712
[0.141s][debug][os,container] container memory limit ignored: 9223372036854771712, upper bound is 264567476224
[0.141s][info ][gc,init     ] Memory: 246G
...
[0.141s][info ][gc,init     ] Heap Min Capacity: 32M
[0.141s][info ][gc,init     ] Heap Initial Capacity: 63104M
[0.141s][info ][gc,init     ] Heap Max Capacity: 63104M
...

This fix:

[0.001s][trace][os,container] OSContainer::init: Initializing Container Support
[0.001s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.001s][debug][os,container] OSContainer::init: is_containerized() = true because all controllers are mounted read-only (container case)
...
[0.001s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[0.001s][trace][os,container] Memory Limit is: 9223372036854771712
[0.001s][trace][os,container] Path to /memory.use_hierarchy is /sys/fs/cgroup/memory/memory.use_hierarchy
[0.001s][trace][os,container] Use Hierarchy is: 1
[0.001s][trace][os,container] Path to /memory.stat is /sys/fs/cgroup/memory/memory.stat
[0.001s][trace][os,container] Hierarchical Memory Limit is: 1073741824
...
[0.040s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[0.040s][trace][os,container] Memory Limit is: 9223372036854771712
[0.040s][trace][os,container] Path to /memory.use_hierarchy is /sys/fs/cgroup/memory/memory.use_hierarchy
[0.041s][trace][os,container] Use Hierarchy is: 1
[0.041s][trace][os,container] Path to /memory.stat is /sys/fs/cgroup/memory/memory.stat
[0.041s][trace][os,container] Hierarchical Memory Limit is: 1073741824
...
[0.041s][info ][gc,init     ] Memory: 1024M
[0.041s][info ][gc,init     ] Heap Initial Capacity: 256M
[0.041s][info ][gc,init     ] Heap Max Capacity: 256M
...

@shipilev
Member Author

The ECS team also reports positive results with this patch.

@jerboaa, take a look, please?

Member

@simonis simonis left a comment


I'm not an expert in this area, but after looking at JDK-8322420 and its two rather lengthy review threads for PR #17198 (which was abandoned) and PR #20646 (which was finally approved and contained the changes that led to this regression), it seems that @jerboaa's comment in the first PR:

It is the hope to no longer needing to do this hierarchical look-up since we know where in the hierarchy we ought to look for the memory limit.

which referred to this code:

    bool is_ok = reader()->read_numerical_key_value("/memory.stat", "hierarchical_memory_limit", &hier_memlimit);
    if (!is_ok) {
      return OSCONTAINER_ERROR;
    }
    log_trace(os, container)("Hierarchical Memory Limit is: " JULONG_FORMAT, hier_memlimit);
    if (hier_memlimit < phys_mem) {
      verbose_log(hier_memlimit, phys_mem);
      return (jlong)hier_memlimit;
    }
    log_trace(os, container)("Hierarchical Memory Limit is: Unlimited");

was a little bit too optimistic, and finally led to this regression.

I'm therefore inclined to approve this fix after sleeping on it one more night :)

@jerboaa
Copy link
Contributor

jerboaa commented Oct 29, 2025

I tried to come up with a regression test for it, but could not: local reproducers require amending host configuration, which requires superuser privileges, among other hassle it introduces.

Without a proper regression test this is bound to fall through the cracks again. So are you sure this cannot be tested? It should be fine if the test needs root privileges (we could skip it if not root). But it would be better than not having one.

@shipilev
Copy link
Member Author

Without a proper regression test this is bound to fall through the cracks again. So are you sure this cannot be tested? It should be fine if the test needs root privileges (we could skip it if not root). But it would be better than not having one.

Yes, I tried to write a test, but it was not simple at all. AFAICS, you need to configure the host in a particular way to get into the interesting configuration, where part of the hierarchy is hidden. So not only would it require root, it would also have to make changes to the host cgroup config (and properly revert them at the end of testing!). It would be better if we could come up with something like a Docker-in-Docker kind of test, but that is probably a headache as well.

Anyway, we are dealing with the real-world, customer-facing breakage here, so I reasoned it was unwise to delay the immediately deployable fix, just because it was unclear how to write a reliable regression test for it :)

3 participants