More memory leaks: JVM and JRuby
The casual reader might get the impression that I do nothing these days but debug memory leaks. Luckily, that is quite far from the truth; it just happens that I haven't had the chance to write about other things recently.
(The above is a reference to this post, which was written a while ago. The preconditions there are different, though, and the root cause is also different.)
We have an app running on JRuby on the JVM which is leaking a lot of memory; it has to be restarted every few days. The app was previously running on Windows but was recently moved to Linux. On Windows, the server was scheduled to reboot every night, which "worked around" the memory leaks in a very silly and stupid (but working!) way.
Now, with the new setup, there are no such scheduled reboots any more, and so, the application is running out of memory.
The app was restarted yesterday. Right now, this is the memory usage:
root@microservice-jruby-xxxxxxxxxx:/# ps -ax -o pid,rss | grep 9762
9762 2451060
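Since ps reports rss in KiB on Linux, that corresponds to roughly 2.3 GiB of resident memory. As a small aid, a one-liner along these lines prints the value in MiB directly (a sketch assuming GNU ps and awk; the PID is the one from the example above):

$ ps -o pid,rss --pid 9762 | awk 'NR > 1 { printf "%s: %.0f MiB\n", $1, $2 / 1024 }'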
As a first step, I activated -XX:+UseConcMarkSweepGC in JAVA_OPTS, which doesn't help with the memory leak per se but makes sense in this environment anyway.
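For reference, the change itself is a one-liner; where JAVA_OPTS lives (an environment file, a systemd unit, a startup script) depends entirely on how the service is wired up, so treat the location as an assumption on my part:

# In whatever file defines the service's environment (deployment-specific):
export JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"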
Then I installed the openjdk-8-jdk-headless Debian package to get the jmap tool, which is useful for gathering heap dumps.
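If you need to do the same, the install is a single command (assuming a Debian or Ubuntu machine with the OpenJDK 8 packages available); besides jmap it also brings in jstack, jhat and the other JDK troubleshooting tools:

$ sudo apt-get install openjdk-8-jdk-headless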
Re-learning how to create a heap dump
I have made heap dumps before, but only on Windows. I now tried to make a dump on this Linux machine, but failed. It first complained about a non-aligned memory address (which might have been because the JRE version was upgraded when I installed the openjdk-8-jdk-headless package, so I restarted the process), and then complained about this:
root@microservice-jruby-xxxxxxxxxx:/home/ecraft/apps/deliveryplanner-backend# jmap -dump:format=b,file=/tmp/heapdump-20170926-0901 -F 7410
Attaching to process ID 7410, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.131-b11
Dumping heap to /tmp/heapdump-20170926-0901 ...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jmap.JMap.runTool(JMap.java:201)
at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: java.lang.InternalError: Metadata does not appear to be polymorphic
at sun.jvm.hotspot.types.basic.BasicTypeDataBase.findDynamicTypeForAddress(BasicTypeDataBase.java:278)
at sun.jvm.hotspot.runtime.VirtualBaseConstructor.instantiateWrapperFor(VirtualBaseConstructor.java:102)
at sun.jvm.hotspot.oops.Metadata.instantiateWrapperFor(Metadata.java:68)
at sun.jvm.hotspot.memory.DictionaryEntry.klass(DictionaryEntry.java:71)
at sun.jvm.hotspot.memory.Dictionary.classesDo(Dictionary.java:66)
at sun.jvm.hotspot.memory.SystemDictionary.classesDo(SystemDictionary.java:190)
at sun.jvm.hotspot.memory.SystemDictionary.allClassesDo(SystemDictionary.java:183)
at sun.jvm.hotspot.utilities.HeapHprofBinWriter.writeClasses(HeapHprofBinWriter.java:954)
at sun.jvm.hotspot.utilities.HeapHprofBinWriter.write(HeapHprofBinWriter.java:427)
at sun.jvm.hotspot.tools.HeapDumper.run(HeapDumper.java:62)
at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.HeapDumper.main(HeapDumper.java:83)
... 6 more
I started googling and found this SO post, which helped me. Of course, I was running the command as the wrong user, which caused this issue...
To solve the problem, I re-ran the command like this: sudo -u ecraft jmap -dump:format=b,file=/tmp/heapdump-20170926-0901 7410, which worked flawlessly. I then transferred the file to my local computer (with scp) and ran jvisualvm. Unfortunately, it refused to open the file; it turned out jvisualvm can only use a full core dump, not a heap dump...
So I tried with jhat instead. That worked much better; now all I had to do was learn yet another new tool. I also learned that kill -3 <pid> can give you a core dump, but only if the JVM has been specifically instructed to do so. By default, it prints some pretty interesting information to stdout or stderr (a list of all current threads, plus some heap data). It looked like this in my case (abridged form; a sketch of the commands involved follows after the log excerpt):
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: JNI global references: 15958
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend:
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: Heap
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: par new generation total 77440K, used 18221K [0x00000006c5400000, 0x00000006ca800000, 0x00000006eed90000)
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: eden space 68864K, 22% used [0x00000006c5400000, 0x00000006c6322050, 0x00000006c9740000)
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: from space 8576K, 31% used [0x00000006c9fa0000, 0x00000006ca249518, 0x00000006ca800000)
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: to space 8576K, 0% used [0x00000006c9740000, 0x00000006c9740000, 0x00000006c9fa0000)
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: concurrent mark-sweep generation total 274180K, used 193141K [0x00000006eed90000, 0x00000006ff951000, 0x00000007c0000000)
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: Metaspace used 80164K, capacity 109109K, committed 109280K, reserved 1144832K
Sep 26 06:36:11 microservice-jruby-xxxxxxxxxx yyyyyyyyyyyyyy-backend: class space used 10442K, capacity 14728K, committed 14816K, reserved 1048576K
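For completeness, here is a rough sketch of the commands involved; the host and file names are the ones from earlier in the post, so adjust them to your own environment (jhat serves its web UI on port 7000 by default):

$ kill -3 <pid>                                    # SIGQUIT: prints the thread list and heap summary shown above
$ scp ecraft@microservice-jruby-xxxxxxxxxx:/tmp/heapdump-20170926-0901 .   # pull the jmap dump to my local machine
$ jhat -J-Xmx4g heapdump-20170926-0901             # analyze the dump, then browse http://localhost:7000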
Remember, this was only a short time after the Java process had been started. Here is what the memory usage looked like:
root@microservice-jruby-xxxxxxxxxx:/# ps -ax -o pid,rss | grep 14302
14302 552424
I then gave it some time to run and got back later to see how it was doing. I also discussed the issue with a colleague and heard that Eclipse Memory Analyzer (MAT) is a quite decent tool for debugging these issues (it was also mentioned in one of the blog posts linked below).
This is what it looked like after a few days:
root@microservice-jruby-m3prdes201:/# ps -ax -o pid,rss | grep 14302
14302 2297204
It feels like a leak, but more analysis is needed before we know for sure. I took out the SIGQUIT (kill -3) data again, and made a heap dump as well:
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: JNI global references: 57283
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend:
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: Heap
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: par new generation total 77440K, used 5352K [0x00000006c5400000, 0x00000006ca800000, 0x00000006eed90000)
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: eden space 68864K, 6% used [0x00000006c5400000, 0x00000006c5889f80, 0x00000006c9740000)
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: from space 8576K, 8% used [0x00000006c9fa0000, 0x00000006ca050450, 0x00000006ca800000)
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: to space 8576K, 0% used [0x00000006c9740000, 0x00000006c9740000, 0x00000006c9fa0000)
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: concurrent mark-sweep generation total 2141628K, used 1864161K [0x00000006eed90000, 0x00000007718ff000, 0x00000007c0000000)
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: Metaspace used 98653K, capacity 149726K, committed 149936K, reserved 1181696K
Sep 29 05:56:51 microservice-jruby-m3prdes201 deliveryplanner-backend: class space used 11473K, capacity 17531K, committed 17612K, reserved 1048576K
We were now close to the weekend, so we restarted preventively.
Using Eclipse Memory Analyzer Tool to further dig into the issue
A few days later, I opened the heap dump in MAT. Since the heap dump was 2 GiB this time, it took a while for MAT to open and analyze it. Unfortunately, it eventually ran into an error.
By googling I found this StackOverflow post, which led me to tweak MemoryAnalyzer.ini as below and then restart MAT and reopen the file. (The default was to max out at 1024 MiB, which was obviously too little to load a file of this size.)
-startup
../Eclipse/plugins/org.eclipse.equinox.launcher_1.3.100.v20150511-1540.jar
--launcher.library
../Eclipse/plugins/org.eclipse.equinox.launcher.cocoa.macosx.x86_64_1.1.300.v20150602-1417
-vmargs
-Xms1024m
-Xmx4096m
-Dorg.eclipse.swt.internal.carbon.smallFonts
-XstartOnFirstThread
It actually felt a bit faster this time, perhaps because the JVM now had a bigger maximum heap size to play with.
When the file was done opening, I went straight to the "leak suspects" report. Maybe it would point me in the right direction?
org.jruby.MetaClass: our prime suspect
One gigabyte of MetaClass instances; that's quite a lot indeed. MetaClass does not seem to be a very big and complex class in itself, but its parent class is RubyClass, which is a much larger class.
Then I looked at the histogram, which basically shows the same picture: roughly 40,000 Ruby MetaClass instances, holding a total of about 468,000 IR methods. What are these methods, more specifically? That is, if I group the methods by name, which names are the most common? Let's try to find that out!
In the process of doing this, I learned that Eclipse MAT has an interesting feature called OQL (Object Query Language), which is essentially an SQL-like language that you can use to query the heap dump for data. Very interesting indeed! Unfortunately, I didn't manage to do any GROUP BY queries with it, which would be super cool since I could then see more specifically which methods are most commonly being defined.
Here was one of my initial OQL queries by the way:
SELECT realClass.baseName.toString() FROM org.jruby.MetaClass
I found out that there was a way to work around this: by exporting the results to CSV, I could then just pipe them through some neat Unix utilities and get a grouped view:
$ cat meta_class_base_names.csv | sort | uniq -c | sort -n -r
37852 Logger,
2527 Class,
436 Module,
239 PatternFormatter,
133 ConnectionOptions,
36 JavaPackage,
11 Function,
5 Object,
5 ExternalRuntime,
5 Cell,
1 realClass.baseName.toString(),
1 ThreadedConnectionPool,
1 String,
1 StdoutOutputter,
1 RootLogger,
1 Resources,
1 Resolv,
1 Proc,
1 Hash,
1 FeatureFlag,
1 Database,
1 Cors,
1 Configuration,
1 Command,
1 Array,
Almost 40,000 Logger classes. That's clearly not reasonable. The number of Class instances is also high, perhaps even unreasonably so, but there are more than 10 times as many Logger classes, so let's focus on those for starters.
Finding the root cause and final conclusions
The obvious next question is, of course: why do we have such an enormous number of Logger classes being defined? I looked further in the JRuby source code and found that the cachedName property would probably give us an even better name. Let's re-run the query above like this:
SELECT realClass.cachedName.toString() FROM org.jruby.MetaClass
$ cat real_class_cached_name.csv | sort | uniq -c | sort -n -r
37852 Log4r::Logger,
2527 Class,
436 Module,
239 Log4r::PatternFormatter,
133 Faraday::ConnectionOptions,
36 Java::JavaPackage,
11 FFI::Function,
5 Object,
5 ExecJS::ExternalRuntime,
5 Celluloid::Proxy::Cell,
1 realClass.cachedName.toString(),
1 String,
1 Sequel::ThreadedConnectionPool,
1 Sequel::JDBC::Database,
1 Resolv,
1 Rack::Cors::Resources,
1 Rack::Cors,
1 Proc,
1 Log4r::StdoutOutputter,
1 Log4r::RootLogger,
1 Hash,
1 Celluloid::Supervision::Configuration,
1 Bundler::Thor::Command,
1 Bundler::FeatureFlag,
1 Array,
Alright - I think we have found the root cause. log4r seems to be used in our application and/or framework in such a way that it creates a huge number of classes, eventually bringing the system to its knees. I don't think it's really log4r's fault; it's more likely that we use it in a way it was not designed for, causing huge amounts of memory to leak.
But this was strange. I know that we moved away from log4r a few months ago, to our home-grown logging_library instead. It does not use log4r under the hood, since log4r is more or less abandoned upstream. So I dug into the Gemfile.lock of my application to see where the log4r dependency was coming from.
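A low-tech but effective way to do that kind of check is just to grep the lockfile with a bit of context; in Gemfile.lock, a gem's dependencies are listed indented below it, so the surrounding lines reveal who pulls log4r in (the context size here is a guess, adjust as needed):

$ grep -n -B 5 log4r Gemfile.lock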
Ah. I see it now... This application uses a somewhat older version of the application server, where we hadn't yet gotten rid of log4r. There is actually an open pull request to upgrade it; we just haven't had time to test it thoroughly enough to be able to ship it to the customer. So, we will now focus on getting that PR merged and deployed ASAP, and then check whether that makes the leak go away permanently. If not, I will undoubtedly let you know...
As always, thanks for reading! I hope this gave you some ideas and inspiration for when you have to debug similar issues yourself.