CSI (Crash Scene Investigation) HotSpot: Common JVM Crash Causes and Solutions

The following is intended to outline our general product direction. It is intended for information purposes
only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code,
or functionality, and should not be relied upon in making purchasing decisions. The development,
release, timing, and pricing of any features or functionality described for Oracle’s products may change
and remains at the sole discretion of Oracle Corporation.
Statements in this presentation relating to Oracle’s future plans, expectations, beliefs, intentions and
prospects are “forward-looking statements” and are subject to material risks and uncertainties. A detailed
discussion of these factors and other risks that affect our business is contained in Oracle’s Securities and
Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q
under the heading “Risk Factors.” These filings are available on the SEC’s website or on Oracle’s website
at http://www.oracle.com/investor. All information in this presentation is current as of September 2019
and Oracle undertakes no duty to update any statement in light of new information or future events.
Safe Harbor
Copyright © 2019 Oracle and/or its affiliates.

CSI (Crash Scene Investigation) HotSpot:
Common JVM Crash Causes and Solutions
[DEV4421]
David Buck
Principal Member of Technical Staff
Java Platform Group
September 17, 2019

Who am I? David Buck (left)
JVM Sustaining Engineer
OpenJDK Update Project Maintainer
JavaOne Rock Star
Co-author of Oracle WebLogic Server 11g 構築・運用ガイド
@DavidBuckJP
https://blogs.oracle.com/buck/

Insurance Institute for Highway Safety [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)]

Motivation
Identify root cause
Prevent future occurrences
Collect information to help others debug further

JVM Crash
JVM process terminates abnormally
OS signals a fatal error (e.g. SIGSEGV, SIGFPE)
JVM detects internal unrecoverable error
Native-level manifestation
“Crash” is often used in other contexts for any process that ends in failure, but here we mean only the above

Why not try to recover?
Often, “unrecoverable” means continuing would be too risky
Fast fail is preferred
Integrity of data
Quicker detection and resolution of problem
Redundancy of JVM instances (clustering) maintains availability

Responding to a crash
Collect any necessary data
Restart JVM process
Ideally the above two will be automated
Analyze offline

Fatal Error Log
hs_err_pid<pid>.log
Output to
-XX:ErrorFile
JVM current working directory
Temporary directory (e.g. /tmp) if the JVM can’t write to the CWD
Useful for identifying known issues
Useful for identifying environmental / application issues
Not very useful for trying to identify new JVM bugs
Should not contain sensitive data
Avoid credentials on the command line or in environment variables
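
As a rough illustration of automating the "collect data" step, the sketch below looks for fatal error logs in the two default locations listed above. The class name HsErrFinder is hypothetical, and it assumes the default hs_err_pid<pid>.log naming; a path set with -XX:ErrorFile would need to be searched separately, and the crashed JVM's working directory may differ from that of the process running this check.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists fatal error logs that use the default file name.
class HsErrFinder {
    /** Returns hs_err_pid*.log files found directly inside the given directories. */
    static List<Path> find(List<Path> dirs) throws IOException {
        List<Path> found = new ArrayList<>();
        for (Path dir : dirs) {
            if (!Files.isDirectory(dir)) {
                continue; // skip locations that do not exist on this machine
            }
            try (DirectoryStream<Path> stream =
                         Files.newDirectoryStream(dir, "hs_err_pid*.log")) {
                for (Path p : stream) {
                    found.add(p);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        List<Path> dirs = new ArrayList<>();
        // Assumption: this tool is started from the same directory as the crashed JVM.
        dirs.add(Paths.get(System.getProperty("user.dir")));
        dirs.add(Paths.get(System.getProperty("java.io.tmpdir")));
        for (Path p : find(dirs)) {
            System.out.println(p);
        }
    }
}
```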

Fatal Error Log Audience
JVM Vendor
Identify known issues
Quicker core file analysis
Lots of JVM internal data
End Users
Anyone troubleshooting a crash
Lots of useful non-internal data

[Diagram: the JVM writes hs_err_4242.log, which serves both the JVM vendor and end users]

Core File
Memory dump of the JVM process
Large heap -> Large core file
May contain sensitive data (passwords, PII, etc.)
Truncation is very often an issue
May consume significant disk space
Automatic restart could result in disk space exhaustion
Make sure you have plenty of space on file system
Configure core file output to non-critical file system
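
The disk space advice above could be turned into a simple pre-flight check. The following is only a sketch: CoreSpaceCheck and its parameters are hypothetical, and using the maximum heap size as the estimated core size is an assumption that gives a lower bound at best, since a core file also contains thread stacks and other native memory.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check for the file system that will receive core files.
class CoreSpaceCheck {
    /** True if the usable space can hold dumpsToKeep cores of the estimated size. */
    static boolean hasRoom(long usableBytes, long estimatedCoreBytes, int dumpsToKeep) {
        if (dumpsToKeep <= 0) {
            throw new IllegalArgumentException("dumpsToKeep must be positive");
        }
        // Divide rather than multiply so large sizes cannot overflow a long.
        return usableBytes / dumpsToKeep >= estimatedCoreBytes;
    }

    public static void main(String[] args) throws IOException {
        Path coreDir = Paths.get(args.length > 0 ? args[0] : ".");
        FileStore store = Files.getFileStore(coreDir);
        // Assumption: this JVM runs with the same heap settings as the one being protected.
        long estimate = Runtime.getRuntime().maxMemory();
        // Keeping room for 3 dumps is an arbitrary margin against automatic restart loops.
        System.out.println("room for 3 core files: "
                + hasRoom(store.getUsableSpace(), estimate, 3));
    }
}
```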

Core File
[Diagram: core file layout, showing the Java heap, thread stacks, and “other stuff”]

Core File
Not enabled by default in many configurations
Linux: be sure to set “ulimit -c unlimited”
Disabled by default on non-Server Windows
-XX:+CreateCoredumpOnCrash (JDK >= 9)
-XX:+CreateMinidumpOnCrash (JDK <= 8)

Serviceability Agent
Platform-independent core file debugging
Built-in knowledge of JVM internals
Possibly able to recover JFR data from core file
Much easier to use in JDK >= 9

Other Important Data
Native libraries loaded by JVM
Copies of libraries (Linux / macOS / Solaris)
PDB files (Windows)
Any unexpected output in log files / stdout / stderr
OutOfMemoryError
StackOverflowError
Strange OS or native library output

Identifying native libraries
Linux: gdb “info shared”
Windows: windbg “lm”
Solaris: “pldd corefile” (yes... this works!)
macOS: lldb “image list”
Can be automated (e.g. pkgapp)
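
For a live Linux process, one way to automate the library list without a debugger is to read memory-map lines in the /proc/<pid>/maps format. The sketch below is an illustration only: LoadedLibraries is a hypothetical name, and the parsing assumes the usual six-column maps layout (address range, permissions, offset, device, inode, path).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: extracts shared library paths from /proc/<pid>/maps style lines.
class LoadedLibraries {
    /** Returns the distinct .so paths, in the order they first appear. */
    static Set<String> fromMaps(List<String> mapsLines) {
        Set<String> libs = new LinkedHashSet<>();
        for (String line : mapsLines) {
            // Columns: address perms offset dev inode [path]; the path may be absent.
            String[] parts = line.trim().split("\\s+", 6);
            if (parts.length < 6) {
                continue; // anonymous mapping, no path column
            }
            String path = parts[5];
            boolean looksLikeLibrary = path.endsWith(".so") || path.contains(".so.");
            if (path.startsWith("/") && looksLikeLibrary) {
                libs.add(path);
            }
        }
        return libs;
    }

    public static void main(String[] args) throws IOException {
        Path maps = Paths.get("/proc/self/maps"); // Linux only
        if (!Files.isReadable(maps)) {
            System.out.println("No readable /proc/self/maps on this platform");
            return;
        }
        for (String lib : fromMaps(Files.readAllLines(maps))) {
            System.out.println(lib);
        }
    }
}
```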

Stack Overflow
Way more dangerous than many people think
HotSpot is able to recover most of the time
Can silently corrupt memory
JVM behavior is considered undefined until the JVM is restarted
Very easy to handle while interpreting bytecode
Impossible to guarantee proper handling in native code

Stack Overflow
[Diagram: a thread stack (○ Read, ○ Write, × Execute) with a guard page (× Read, × Write, × Execute) beyond the stack pointer. When the stack pointer reaches the guard page, the access raises SIGSEGV.]

Stack Overflow
[Diagram: HotSpot's guard area consists of yellow pages and red pages, both initially × Read, × Write, × Execute, while the rest of the stack is ○ Read, ○ Write, × Execute. When the stack pointer reaches the yellow pages, the access raises SIGSEGV; the yellow pages are then made ○ Read, ○ Write and a StackOverflowError is thrown.]

Unwinding a Stack Overflow
If SOFE thrown in a critical section
Java-level data may be left in inconsistent state
Java-level lock may be left “held” by nobody (likely hang)
No JVM crash, but system unlikely to be able to continue running
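
A minimal sketch of the "inconsistent state" problem, using a hypothetical SofeDemo class: each call performs the first half of a two-step update before recursing, and the StackOverflowError unwinds every frame before the second half can run. The error is caught and the program keeps going, but the two counters no longer agree.

```java
// Hypothetical demo: a caught StackOverflowError leaves half-finished updates behind.
class SofeDemo {
    static int begun;
    static int finished;

    static void update() {
        begun++;     // first half of a two-step update
        update();    // unbounded recursion eventually overflows the stack
        finished++;  // second half: skipped for every frame unwound by the error
    }

    /** Triggers the overflow, recovers, and reports whether the counters disagree. */
    static boolean leavesInconsistentState() {
        begun = 0;
        finished = 0;
        try {
            update();
        } catch (StackOverflowError e) {
            // "Recovered": the thread continues, but no frame completed its update.
        }
        return begun != finished;
    }

    public static void main(String[] args) {
        boolean inconsistent = leavesInconsistentState();
        System.out.println("inconsistent after SOFE: " + inconsistent
                + " (begun=" + begun + ", finished=" + finished + ")");
    }
}
```

Wrapping the second step in a finally block would repair this toy case, but real critical sections in library code cannot all be written that way, which is why the bullets above treat a recovered SOFE as suspect.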

Unwinding a Stack Overflow
No way to unwind arbitrary native code
Must be executing Java when we “discover” the overflow

Stack Banging
[Diagram: the same stack with yellow and red pages (× Read, × Write, × Execute). Touching memory ahead of the stack pointer reaches the yellow pages early, so the StackOverflowError is thrown.]

Stack Banging
Can be controlled by StackShadowPages
A value that is too low makes you more vulnerable to unrecoverable stack overflow
A value that is too high could waste stack space

Stack Overflow
[Diagram: the same stack with yellow and red pages (× Read, × Write, × Execute); the outcome is marked “??????????????”]

Red / Yellow Pages
StackYellowPages
StackRedPages
Values that are too low make you more vulnerable to unrecoverable stack overflow
Values that are too high could waste stack space
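
Before tuning any of these, it helps to know the values a given JVM is actually using. The sketch below reads them from the running JVM through HotSpotDiagnosticMXBean.getVMOption; the class name StackGuardFlags is hypothetical, and the code assumes a HotSpot-based JVM (anything else is reported as "unavailable").

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints the stack guard settings of the current JVM.
class StackGuardFlags {
    /** Returns the flag's current value, or "unavailable" if this JVM does not expose it. */
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown flag name, or a JVM without the HotSpot diagnostic bean.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        String[] flags = {"StackShadowPages", "StackYellowPages", "StackRedPages"};
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```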

Stack Overflow
Even when we recover, the JVM should be restarted
Locks held at the time of SOFE may be left locked
Data structures may have been left in an inconsistent state
Other stack overflows may have silently corrupted native data

VirtualMachineError
Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating.
Subclasses: InternalError, OutOfMemoryError, StackOverflowError, UnknownError

If You See One Stack Overflow…
Public Domain, https://commons.wikimedia.org/w/index.php?curid=696464

Stack Overflow
OutOfMemory and StackOverflow Exception counts:
StackOverflowErrors=1
Error log (JDK >= 8) will record the number of SOFEs that were successfully handled
Many stack overflow crashes show no obvious sign of stack overflow
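
Checking that counter can be scripted. This sketch pulls the number out of fatal error log text using the "StackOverflowErrors=" label shown in the excerpt above; SofeCount is a hypothetical name, and treating a missing line as zero is an assumption about how the log is written.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the handled-StackOverflowError counter from hs_err text.
class SofeCount {
    private static final Pattern COUNT = Pattern.compile("StackOverflowErrors=(\\d+)");

    /** Returns the recorded count, or 0 if the log has no such line (assumption). */
    static int handledOverflows(String hsErrText) {
        Matcher m = COUNT.matcher(hsErrText);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    public static void main(String[] args) {
        String excerpt = "OutOfMemory and StackOverflow Exception counts:\n"
                + "StackOverflowErrors=1\n";
        System.out.println("handled SOFEs: " + handledOverflows(excerpt));
    }
}
```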

Stack Overflow
No StackOverflowError is benign
All SOFEs should be investigated for root cause and resolved
SOFE is very hard to eliminate as a possible root cause for many crashes

Use of Internal / Private APIs

sun.misc.Unsafe
Java code can directly access various JVM internal functionality
Used sparingly to implement parts of the Java SE Class Library
Never intended for use outside of Sun / Oracle

By Cbmeeks / processed by Pixel8 - Original uploader was Cbmeeks at en.wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3672924

BASIC Support for Direct Access
PEEK
Retrieve data from an arbitrary address
POKE
Write an arbitrary value to an arbitrary address

sun.misc.Unsafe
Allocate uninitialized memory on Heap
More flexible memory model
PEEK/POKE of JVM address space

Unsafe Usage
Reflection
Serialization
NIO
java.util.concurrent
Encryption / Decryption
BigDecimal / BigInteger
Java2D
CPU usage monitoring (JMX)

private static final Unsafe theUnsafe = new Unsafe();
public static Unsafe getUnsafe() {
    Class cc = sun.reflect.Reflection.getCallerClass(2);
    // A null class loader means the caller was loaded by the bootstrap class loader
    if (cc.getClassLoader() != null)
        throw new SecurityException("Unsafe");
    return theUnsafe;
}

Field f = Unsafe.class.getDeclaredField("theUnsafe");
f.setAccessible(true);
unsafe = (Unsafe) f.get(null);

Isn’t more flexibility a good thing?

No
Not Always

The problem with more flexibility
Without limits on what code is allowed to do, we lose the ability to reason about it.
Tradeoffs are sometimes reasonable, but only if you know you’re making them.
Most “users” of Unsafe are not aware that their systems depend on an unsupported and dangerous API.

Jigsaw: closing the loophole
By Jared Tarbell - Flickr: sky puzzle, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=31953973

Native Code
Native code can do anything (Unsafe on steroids!)
JNI used heavily within the JDK
HotSpot: ~1.1 MLOC (C and C++)
Class library / tools: ~0.9 MLOC (C and C++)
The vast majority of native-caused crashes are in 3rd party code

Native Code
Debugging / troubleshooting native code requires close familiarity with platform and native tools.
Native code can cause memory corruption that only manifests as a crash later in JVM code.

Native Code – strict JNI checking
-Xcheck:jni tells the JVM to sanity check arguments and other prerequisites during any JNI call.
The additional checking comes at a performance cost.
Can help identify mistakes in calling the JNI API, but not anything else.
Still, JNI usage mistakes are common and are often found with -Xcheck:jni
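
Because of the performance cost, it can be worth confirming whether a given JVM was really started with the option. The sketch below inspects the JVM's own input arguments via RuntimeMXBean.getInputArguments; JniCheckStatus is a hypothetical name, and the check only looks for the literal -Xcheck:jni spelling.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports whether strict JNI checking was requested at startup.
class JniCheckStatus {
    /** True if the literal option -Xcheck:jni appears among the JVM arguments. */
    static boolean enabled(List<String> jvmArgs) {
        return jvmArgs.contains("-Xcheck:jni");
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("-Xcheck:jni enabled: " + enabled(jvmArgs));
    }
}
```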

Native Code – Signal Handling
Native code may install its own signal handlers
HotSpot makes heavy use of signals internally
Error log will list any handlers installed for signals we care about:
Signal Handlers:
…
SIGILL: [libjvm.so+0x8c1cb0], sa_mask[0]=11111111011111111101111111111110, sa_flags=SA_RESTART|SA_SIGINFO
SIGUSR1: SIG_DFL, sa_mask[0]=00000000000000000000000000000000, sa_flags=none
…

Native Code – Signal Handling
Native code may install its own signal handlers
HotSpot makes heavy use of signals internally
Error log will list any handlers installed for signals we care about:
Signal Handlers:
…
SIGILL: [libyourmom.so+0x8c1cb0], sa_mask[0]=11111111011111111101111111111110, sa_flags=SA_RESTART|SA_SIGINFO
SIGUSR1: SIG_DFL, sa_mask[0]=00000000000000000000000000000000, sa_flags=none
…
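
Spotting a handler that belongs to some other library can be scripted. The sketch below scans text in the "Signal Handlers:" format shown above and reports any handler whose address is not inside libjvm.so. ForeignHandlers is a hypothetical name, the Linux library name libjvm.so is an assumption, and the library used in the usage example is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flags signal handlers installed by code outside libjvm.so.
class ForeignHandlers {
    // Matches entries such as "SIGILL: [libjvm.so+0x8c1cb0]".
    private static final Pattern HANDLER =
            Pattern.compile("(SIG\\w+): \\[([^+\\]]+)\\+0x[0-9a-fA-F]+\\]");

    /** Returns "SIGNAL -> library" for every handler that is not in libjvm.so. */
    static List<String> find(String signalHandlersSection) {
        List<String> foreign = new ArrayList<>();
        Matcher m = HANDLER.matcher(signalHandlersSection);
        while (m.find()) {
            String library = m.group(2);
            if (!library.equals("libjvm.so")) {
                foreign.add(m.group(1) + " -> " + library);
            }
        }
        return foreign;
    }

    public static void main(String[] args) {
        // libthirdparty.so is a made-up library name for illustration.
        String section = "SIGILL: [libthirdparty.so+0x8c1cb0], sa_flags=SA_RESTART|SA_SIGINFO\n"
                + "SIGUSR1: SIG_DFL, sa_flags=none\n";
        System.out.println(find(section));
    }
}
```

Entries shown as SIG_DFL carry no library address and are skipped by the pattern.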

Native Code – Signal Chaining
Prevents native code from overriding HotSpot handlers
Keeps track of any custom handler native code tries to install
HotSpot signal handler is called by the OS first
Signal originated (PC) from HotSpot code -> HotSpot handles it
Signal originated elsewhere -> HotSpot calls custom handler
[Diagram: OS -> HotSpot Handler -> Custom Handler]

Native Code – Signal Chaining
HotSpot signal chaining code needs to override OS-provided signal functions (e.g. sigaction).
Easiest way to force signal chaining is to preload the HotSpot signal chaining library:
export LD_PRELOAD=<libjvm.so dir>/libjsig.so
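
When the JVM is started by a Java launcher or wrapper rather than a shell script, the same preload can be applied through ProcessBuilder. This is a sketch under assumptions: SignalChainLauncher and its command-line arguments are hypothetical, the caller must supply the real libjsig.so path, and a java executable is assumed to be on the PATH.

```java
import java.util.Map;

// Hypothetical launcher: starts a child JVM with libjsig.so preloaded.
class SignalChainLauncher {
    /** Puts libjsig first in LD_PRELOAD, keeping any value that was already set. */
    static void preloadJsig(Map<String, String> env, String libjsigPath) {
        String existing = env.get("LD_PRELOAD");
        if (existing == null || existing.isEmpty()) {
            env.put("LD_PRELOAD", libjsigPath);
        } else {
            env.put("LD_PRELOAD", libjsigPath + ":" + existing);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: SignalChainLauncher <path to libjsig.so> <main class>");
            return;
        }
        // Assumption: "java" on the PATH is the JDK that libjsig.so belongs to.
        ProcessBuilder pb = new ProcessBuilder("java", args[1]).inheritIO();
        preloadJsig(pb.environment(), args[0]);
        System.exit(pb.start().waitFor());
    }
}
```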

Memory Exhaustion
Out of backing store
Address space layout issues

Memory Exhaustion
OS is out of backing store (RAM or swap space)

Memory Exhaustion
Address space exhaustion
32-bit JVMs have less than 4GB of address space
32-bit Windows defaults to 2GB!
64-bit platforms
Address space layout issues can prevent allocation
Most often seen on Solaris

Memory Exhaustion
Get Rid of OutOfMemoryError Messages [DEV3420]
Poonam Parhar
Thursday, September 19, 12:15 PM - 01:00 PM | Moscone South - Room 304
Troubleshooting Native Memory Leaks in Java Applications
CodeOne 2018
Slides available on-line
Poonam’s blog has great related content

ClassA
public class ClassA {
    public int doSomething(int i1, int i2, int i3)
    {
        return i1+i1+i3;
    }
}

ClassB
public class ClassB {
    public Integer doSomethingElse(int i1, int i2, int i3)
    {
        return new Integer(i1+i1+i3);
    }
}

ClassC
public class ClassC extends ClassA {}

Demo
public class Demo {
    public static void main(String[] args) {
        ClassA obj = new ClassC();
        System.out.println(obj.doSomething(1,2,3));
    }
}

[Class hierarchy: ClassA, ClassB, and Demo extend Object; ClassC extends ClassA]

Let’s do something bad…
public class ClassC extends ClassB {}

$ java Demo
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.VerifyError: Bad type on operand stack
Exception Details:
Location:
Demo.main([Ljava/lang/String;)V @15: invokevirtual
Reason:
Type 'ClassC' (current frame, stack[1]) is not assignable to 'ClassA'
Current Frame:
bci: @15
flags: { }
locals: { '[Ljava/lang/String;', 'ClassC' }
stack: { 'java/io/PrintStream', 'ClassC', integer, integer, integer }
Bytecode:
0x0000000: bb00 0259 b700 034c b200 042b 0405 06b6
0x0000010: 0005 b600 06b1
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)

As expected, the verifier protects us from ourselves.
What if we disable it…

We reap what we sow
[dbuck@dbuck02 demo1]$ java -Xverify:none Demo
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fa93be7991c, pid=22925, tid=140364857087744
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x46391c]
#
# Core dump written. Default location: /home/dbuck/BCV_TOI/demo/demo1/core or core.22925
#
# An error report file with more information is saved as:
# /home/dbuck/BCV_TOI/demo/demo1/hs_err_pid22925.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

Demo Takeaways
No obvious evidence that bad bytecode was root cause of crash
A class is only valid in the context of previously loaded classes
No malicious intent / 3rd party tools used

OS Issue (example)
Intel keeps adding new SIMD registers
Preexisting SIMD registers keep growing
128 bit -> 256 bit -> 512 bit
Using these registers helps avoid having to spill to local memory
By XMM_registers.png: Jonasmike; derivative work: Racecar56 - XMM_registers.png, Public Domain, https://commons.wikimedia.org/w/index.php?curid=8540155

XMM Corruption
Linux kernels have not always correctly saved / restored XMM register content on context switch
Can lead to virtually random memory corruption
Only hint that OS is a factor: recent kernel update
Has happened at least 3 times in the past decade
Code compiled with newer toolchains depends on XMM much more heavily than in the past

JVM Bug
Most “obvious” cause of JVM crashes
Very hard for end users to identify root cause
Often possible to work around many issues

Performance / Stability Tradeoff
It's easy to make it fast.
It's easy to make it correct.
It’s almost impossible to do both at the same time.

Garbage Collector Complexity
Single threaded is simpler than parallel
STW (Throughput) is simpler than Concurrent

Garbage Collector Complexity
Serial Parallel
Concurrent Mark and Sweep
G1

Bytecode Execution Complexity
Interpreter C1 JIT C2 JIT

JIT Crashes
Can happen anywhere
During JIT compilation
During execution of JITed code
Anywhere else (e.g. during GC of data corrupted by JITed code)

JIT Crash During Compilation
--------------- T H R E A D ---------------
Current thread (0x000000000061e800): JavaThread "C2 CompilerThread1" daemon [_thread_in_vm, id=12, stack(0xfffffd7ef87fe000,0xfffffd7ef88fe000)]
Current CompileTask:
C2: 15252 9024 b 4 com.sun.crypto.provider.CipherCore::update (609 bytes)

JIT Crash During Execution
Java Execution Thread (Not JIT compilation thread)
Stack: [0xfffffffcc8a00000,0xfffffffcc8b00000], sp=0xfffffffcc8afb780, free space=1005k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J oracle.j2ee.ws.wsdl.extensions.AbstractSerializer.startMarshall(Ljavax/wsdl/Definition;Loracle/j2ee/ws/wsdl/util/XMLWriter;Ljavax/wsdl/extensions/ExtensibilityElement;)V
j oracle.j2ee.ws.wsdl.extensions.addressing.EndpointReferenceSerializer.marshall(Ljava/lang/Class;Ljavax/xml/namespace/QName;Ljavax/wsdl/extensions/ExtensibilityElement;Ljava/io/PrintWriter;Ljavax/wsdl/Definition;Ljavax/wsdl/extensions/ExtensionRegistry;)V+41

Other JIT Crashes
Compilation events (10 events):
Event: 15.131 Thread 0x000000000061e800 nmethod 9019 0xfffffd7fee3ecdd0 code [0xfffffd7fee3ed020, 0xfffffd7fee3eddd0]
Event: 15.131 Thread 0x000000000061c000 9020 b 4 javax.crypto.Cipher$Transform::matches (48 bytes)
Event: 15.143 Thread 0x000000000061c000 nmethod 9020 0xfffffd7fee3f3210 code [0xfffffd7fee3f3420, 0xfffffd7fee3f4030]
Event: 15.143 Thread 0x000000000061e800 9021 b 4 javax.crypto.Cipher::checkOpmode (21 bytes)
Event: 15.143 Thread 0x000000000061e800 nmethod 9021 0xfffffd7fee3f2990 code [0xfffffd7fee3f2ae0, 0xfffffd7fee3f2b38]
Event: 15.144 Thread 0x000000000061c000 9022 b 4 com.sun.crypto.provider.CipherCore::init (552 bytes)
Event: 15.150 Thread 0x000000000061c000 nmethod 9022 0xfffffd7fee3ebc90 code [0xfffffd7fee3ebe80, 0xfffffd7fee3ec5e0]
Event: 15.151 Thread 0x0000000000621000 9023 b 2 com.sun.crypto.provider.CipherCore::init (552 bytes)
Event: 15.153 Thread 0x0000000000621000 nmethod 9023 0xfffffd7feea17010 code [0xfffffd7feea17460, 0xfffffd7feea18d48]
Event: 15.182 Thread 0x000000000061e800 9024 b 4 com.sun.crypto.provider.CipherCore::update (609 bytes)

JIT Bug Workarounds
Globally disable JIT
-Xint (Interpreter only)
-XX:TieredStopAtLevel=1 (Interpreter + C1 JIT only)
Disable JIT for a particular package / class / method
-XX:CompileCommand=exclude,oracle/j2ee/ws/wsdl/extensions/AbstractSerializer,startMarshall
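
Going from a "Current CompileTask" line in the error log to an exclude option is mechanical, so it can be sketched in code. The helper below is hypothetical (JitExclude is not a JDK tool); it assumes the Class::method (N bytes) layout seen in the excerpts above and emits the same comma-separated exclude form used on this slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns a CompileTask line into a CompileCommand exclude option.
class JitExclude {
    // Captures "package.Class::method" followed by "(N bytes)".
    private static final Pattern TASK =
            Pattern.compile("([\\w.$]+)::([\\w$<>]+)\\s+\\(\\d+ bytes\\)");

    /** Returns the exclude option, or null if the line has no recognizable method. */
    static String excludeOption(String compileTaskLine) {
        Matcher m = TASK.matcher(compileTaskLine);
        if (!m.find()) {
            return null;
        }
        String slashedClass = m.group(1).replace('.', '/');
        return "-XX:CompileCommand=exclude," + slashedClass + "," + m.group(2);
    }

    public static void main(String[] args) {
        String line = "C2: 15252 9024 b 4 com.sun.crypto.provider.CipherCore::update (609 bytes)";
        System.out.println(excludeOption(line));
    }
}
```

Excluding a method only sidesteps the crash; the method then runs in the interpreter, so the performance impact should be checked before relying on it.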

Always Use Up-to-date Runtime
1000s of stability fixes during lifetime of a major release
Tremendous effort to avoid regression / incompatibilities in update releases
Security vulnerabilities alone should justify staying up to date
Risk of known issues / vulnerabilities > Risk of updating

Conclusion
Most JVM crashes reported can be resolved or worked around by end users
Lots of end-user actionable data in hs_err log
A quick sanity check of the “usual suspects” can resolve most crash issues

Resources
Java SE Troubleshooting Guide
https://docs.oracle.com/en/java/javase/11/troubleshoot/index.html
Poonam’s CodeOne Native Leak slides
https://www.slideshare.net/PoonamBajaj5/troubleshooting-native-memory-leaks-in-java-applications
Poonam’s blog
https://blogs.oracle.com/poonam/

