How do I elegantly and safely maximize the amount of heap space allocated to a Java application in Kubernetes?

I have a Kubernetes deployment that deploys a Java application based on the anapsix/alpine-java image. There is nothing else running in the container except for the Java application and the container overhead.

I want to maximise the amount of memory the Java process can use inside the Docker container and minimise the amount of RAM that is reserved but never used.

For example I have:

  1. Two Kubernetes nodes that have 8 gig of ram each and no swap
  2. A Kubernetes deployment that runs a Java process consuming a maximum of 1 gig of heap to operate optimally

How can I safely maximise the number of pods running on the two nodes while never having Kubernetes terminate my pods because of memory limits?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-deployment
  template:
    metadata:
      labels:
        app: my-deployment
    spec:
      containers:
      - name: my-deployment
        image: myreg:5000/my-deployment:0.0.1-SNAPSHOT
        ports:
        - containerPort: 8080
          name: http
        resources:
          requests:
            memory: 1024Mi
          limits:
            memory: 1024Mi

Java 8 update 131+ has an experimental flag, -XX:+UseCGroupMemoryLimitForHeap (enabled together with -XX:+UnlockExperimentalVMOptions), that makes the JVM respect the Docker memory limit that comes from the Kubernetes deployment.
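For illustration, these flags could be passed to the JVM in the deployment's container spec, assuming the image lets you override the command (the jar path here is a placeholder, not taken from the question):

      containers:
      - name: my-deployment
        image: myreg:5000/my-deployment:0.0.1-SNAPSHOT
        command: ["java",
                  "-XX:+UnlockExperimentalVMOptions",
                  "-XX:+UseCGroupMemoryLimitForHeap",
                  "-jar", "/app/my-app.jar"]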

My Docker experiments show me what is happening in Kubernetes

If I run the following in Docker:

docker run -m 1024m anapsix/alpine-java:8_server-jre_unlimited java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XshowSettings:vm -version

I get:

VM settings:
Max. Heap Size (Estimated): 228.00M

This low value is because Java sets -XX:MaxRAMFraction to 4 by default, so the JVM only considers about a quarter of the container's RAM for the heap.

If I run the same command with -XX:MaxRAMFraction=2 in Docker:

docker run -m 1024m anapsix/alpine-java:8_server-jre_unlimited java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XshowSettings:vm -XX:MaxRAMFraction=2 -version

I get:

VM settings:
Max. Heap Size (Estimated): 455.50M

Finally, setting -XX:MaxRAMFraction=1 quickly causes Kubernetes to kill my container.

docker run -m 1024m anapsix/alpine-java:8_server-jre_unlimited java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XshowSettings:vm -XX:MaxRAMFraction=1 -version

I get:

VM settings:
Max. Heap Size (Estimated): 910.50M


Solution 1:[1]

The reason Kubernetes kills your pods is the resource limit. It is difficult to calculate because of container overhead and the usual mismatches between decimal and binary prefixes in memory specifications. My solution is to drop the limit entirely and keep only the request (which is what your pod will have available in any case once it is scheduled). Rely on the JVM to limit its heap via a static specification, and let Kubernetes manage how many pods are scheduled on a node via the resource request.

First, determine the actual memory usage of your container when running with your desired heap size. Run a pod with -Xmx1024m -Xms1024m and connect to the Docker daemon of the host it is scheduled on. Run docker ps to find your pod and docker stats <container> to see its current memory usage, which is the sum of the JVM heap, other static JVM usage such as direct memory, and your container's overhead (Alpine with glibc). This value should only fluctuate within kibibytes because of some network usage that is handled outside the JVM. Add this value as the memory request to your pod template.
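Sketched against the container spec from the question, the result might look like this (1200Mi is a hypothetical measured value, so substitute whatever docker stats reports; the JAVA_OPTS environment variable is a common convention but how the flags reach the JVM depends on your image's entrypoint):

      containers:
      - name: my-deployment
        image: myreg:5000/my-deployment:0.0.1-SNAPSHOT
        env:
        - name: JAVA_OPTS
          value: "-Xmx1024m -Xms1024m"
        resources:
          requests:
            memory: 1200Mi
          # no limits section: the JVM's -Xmx acts as the effective cap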

Calculate or estimate how much memory the other components on your nodes need to function properly. There will at least be the Kubernetes kubelet, the Linux kernel and its userland, probably an SSH daemon, and in your case a Docker daemon running on them. You can choose a generous default like 1 gibibyte excluding the kubelet if you can spare the extra memory. Specify --system-reserved=1Gi and --kube-reserved=100Mi in your kubelet's flags and restart it. This adds those reserved resources to the Kubernetes scheduler's calculations when determining how many pods can run on a node. See the official Kubernetes documentation for more information.
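If your kubelet is driven by a KubeletConfiguration file rather than command-line flags, the equivalent settings would look roughly like this:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    systemReserved:
      memory: 1Gi
    kubeReserved:
      memory: 100Mi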

This way there will probably be five to seven pods scheduled on a node with eight gigabytes of RAM, depending on the values chosen and measured above. They will be guaranteed the RAM specified in the memory request and will not be terminated. Verify the memory usage via kubectl describe node under Allocated resources. As for elegance/flexibility, you only need to adjust the memory request and the JVM heap size if you want to increase the RAM available to your application.

This approach only works if the pod's memory usage does not explode; were it not limited by the JVM, a rogue pod could cause eviction. See out-of-resource handling.

Solution 2:[2]

What we do in our case is launch with a high memory limit on Kubernetes, observe over time under load, and then either tune memory usage to the level we want with -Xmx or adapt the memory limits (and requests) to the real memory consumption. Truth be told, we usually use a mix of both approaches. The key to this method is having decent monitoring enabled on your cluster (Prometheus in our case). If you want a high level of fine-tuning, you might also add something like a JMX Prometheus exporter for detailed insight into JVM metrics while tuning your setup.

Solution 3:[3]

Important concepts

  • The memory request is mainly used during (Kubernetes) Pod scheduling.
  • The memory limit defines a memory limit for that cgroup.

According to the article Containerize your Java applications the best way to configure your JVM is to use the following JVM args:

-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0

Note: there is a bug where you need to specify 75.0, not 75.

To simulate what happens in Kubernetes with limits, run in a Linux container:

docker run --memory="300m" openjdk:17-jdk-bullseye java -XX:+UseContainerSupport -XX:MinRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XshowSettings:vm -version

result:

VM settings:
    Max. Heap Size (Estimated): 218.50M
    Using VM: OpenJDK 64-Bit Server VM

It also works on old school Java 8:

docker run --memory="300m" openjdk:8-jdk-bullseye java -XX:+UseContainerSupport -XX:MinRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XshowSettings:vm -version

This way the container will read your memory limit from the cgroups (cgroups v1 or cgroups v2). Having a limit is extremely important to prevent evictions and noisy neighbours. I personally set the limit 10% over the request.
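As an illustration of that 10% headroom combined with the flags above (the numbers are examples, not measurements):

    resources:
      requests:
        memory: 1000Mi
      limits:
        memory: 1100Mi   # roughly 10% above the request

With -XX:MaxRAMPercentage=75.0, the JVM would then size the maximum heap at about 825 MiB of the 1100Mi limit, leaving the remainder for metaspace, thread stacks, direct memory, and other off-heap usage.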

Older Java versions such as Java 8 don't read cgroups v2, and Docker Desktop uses cgroups v2. To force Docker Desktop to use legacy cgroups v1, set {"deprecatedCgroupv1": true} in ~/Library/Group\ Containers/group.com.docker/settings.json

Solution 4:[4]

I think the issue here is that the Kubernetes memory limit applies to the whole container, while MaxRAMFraction applies only to the JVM heap. So if the JVM heap is the same size as the Kubernetes limit, there won't be enough memory left for the rest of the container.

One thing you can try is increasing the limit:

limits:
  memory: 2048Mi

while keeping the request the same. The fundamental difference between requests and limits is that a pod may use more memory than its request if memory is available at the node level, while the limit is a hard cap. This may not be the ideal solution, and you will have to figure out how much memory your pod consumes on top of the JVM heap, but as a quick fix increasing the limit should work.
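Put together, the resources section from the question would then read (2048Mi being the suggested quick-fix value, not a measured one):

    resources:
      requests:
        memory: 1024Mi
      limits:
        memory: 2048Mi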

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2: Radek 'Goblin' Pieczonka
Solution 3:
Solution 4: Tejas