The document outlines a general product direction from Oracle that is intended for informational purposes only and should not be relied upon for purchasing decisions. Any features or functionality described are subject to change at Oracle's sole discretion. Statements in the presentation relating to Oracle's future plans are forward-looking statements subject to risks and uncertainties detailed in Oracle's SEC filings. All information is current as of September 2019 and Oracle undertakes no duty to update any statements. The document is protected by copyright law.
3. JVM Sustaining Engineer
OpenJDK Update Project
Maintainer
JavaOne Rock Star
Co-author of Oracle WebLogic Server 11g 構築・運用ガイド (Construction and Operations Guide)
@DavidBuckJP
https://blogs.oracle.com/buck/
Who am I? David Buck (left)
7. JVM Crash
JVM process terminates abnormally
OS signals a fatal error (e.g. SIGSEGV, SIGFPE)
JVM detects internal unrecoverable error
Native-level manifestation
“Crash” is often used in other contexts for any process that ends in
failure, but here we mean the above only
8. Why not try to recover?
Often, “unrecoverable” means continuing would be too risky
Fast fail is preferred
Integrity of data
Quicker detection and resolution of problem
Redundancy of JVM instances (clustering) maintains availability
9. Responding to a crash
Collect any necessary data
Restart JVM process
Ideally the above two will be automated
Analyze offline
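A minimal sketch of automating the "collect, then restart" steps above. The class and method names are hypothetical, and it assumes that a non-zero exit code is a good enough proxy for a crash and that fatal error logs land in the working directory under the default hs_err_pid<pid>.log name.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: archive crash data, then restart the JVM a bounded number of times.
final class CrashSupervisor {

    // Moves every hs_err_pid*.log found in workDir into archiveDir and returns the new paths.
    static List<Path> collectFatalErrorLogs(Path workDir, Path archiveDir) throws IOException {
        Files.createDirectories(archiveDir);
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(workDir, "hs_err_pid*.log")) {
            for (Path log : logs) {
                found.add(log);
            }
        }
        List<Path> archived = new ArrayList<>();
        for (Path log : found) {
            Path target = archiveDir.resolve(log.getFileName());
            Files.move(log, target, StandardCopyOption.REPLACE_EXISTING);
            archived.add(target);
        }
        return archived;
    }

    // Runs the command; after each abnormal exit, archives the logs and restarts,
    // giving up once maxRestarts restarts have been attempted.
    static int supervise(List<String> command, Path workDir, Path archiveDir, int maxRestarts)
            throws IOException, InterruptedException {
        for (int restarts = 0; ; restarts++) {
            Process jvm = new ProcessBuilder(command)
                    .directory(workDir.toFile())
                    .inheritIO()
                    .start();
            int exitCode = jvm.waitFor();
            if (exitCode == 0) {
                return 0;
            }
            collectFatalErrorLogs(workDir, archiveDir);
            if (restarts >= maxRestarts) {
                return exitCode;
            }
        }
    }
}
```

Bounding the number of restarts keeps a crash loop from running unattended forever; the archived logs can then be analyzed offline.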
13. Fatal Error Log
hs_err_pid<pid>.log
Output to
-XX:ErrorFile
JVM current working directory
Temporary directory (e.g. /tmp) if the JVM cannot write to the CWD
Useful for identifying known issues
Useful for identifying environmental / application issues
Not very useful for trying to identify new JVM bugs
Should not contain sensitive data
Avoid credentials on the command line or in environment variables
14. Fatal Error Log Audience
JVM Vendor
Identify known issues
Quicker core file analysis
Lots of JVM internal data
End Users
Anyone troubleshooting a crash
Lots of useful non-internal data
16. Fatal Error Log Audience
JVM Vendor
Identify known issues
Quicker core file analysis
Lots of JVM internal data
End Users
Anyone troubleshooting a crash
Lots of useful non-internal data
JVM
hs_err_pid4242.log
17. Core File
Memory dump of the JVM process
Large heap -> Large core file
May contain sensitive data (passwords, PII, etc.)
Truncation is very often an issue
May consume significant disk space
Automatic restart could result in disk space exhaustion
Make sure you have plenty of space on file system
Configure core file output to non-critical file system
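A rough sketch of a pre-restart disk space check, in the spirit of the advice above. The class name and the idea of using the maximum heap size as a lower bound for the core size are assumptions; a real core file also contains native memory, so treat the estimate as a minimum.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: is there enough room on the dump file system for another core file?
final class CoreDumpSpace {

    // True if the file system holding dumpDir reports at least expectedCoreBytes usable.
    static boolean hasRoomFor(Path dumpDir, long expectedCoreBytes) throws IOException {
        long usable = Files.getFileStore(dumpDir).getUsableSpace();
        return usable >= expectedCoreBytes;
    }

    // Very rough lower bound for the core size: the maximum Java heap of this JVM.
    static long roughCoreSizeLowerBound() {
        return Runtime.getRuntime().maxMemory();
    }
}
```

A restart script could call hasRoomFor before relaunching and skip core collection (or stop restarting) when the check fails.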
20. Core File
Not enabled by default in many configurations
Linux: be sure to set “ulimit -c unlimited”
Disabled by default on non-Server Windows
-XX:+CreateCoredumpOnCrash (JDK >= 9)
-XX:+CreateMinidumpOnCrash (JDK <= 8)
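A small sketch of picking the right flag for a given runtime, following the version split above. The class name is hypothetical; the input is assumed to be the value of the java.specification.version system property ("1.8" on JDK 8, "9", "11", "17" and so on afterwards).

```java
// Hypothetical helper: choose the core dump flag that matches the JDK version.
final class CoreDumpFlag {

    // Converts "1.8" to 8 and "11" to 11.
    static int featureVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    // Returns the flag name used by the given JDK version.
    static String forSpecVersion(String specVersion) {
        return featureVersion(specVersion) >= 9
                ? "-XX:+CreateCoredumpOnCrash"
                : "-XX:+CreateMinidumpOnCrash";
    }
}
```

A launcher script could pass System.getProperty("java.specification.version") of the target runtime to forSpecVersion when assembling the command line.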
24. Other Important Data
Native libraries loaded by JVM
Copies of libraries (Linux / macOS / Solaris)
PDB files (Windows)
Any unexpected output in log files / stdout / stderr
OutOfMemoryError
StackOverflowError
Strange OS or native library output
25. Identifying native libraries
Linux: gdb “info shared”
Windows: windbg “lm”
Solaris: “pldd corefile” (yes... this works!)
macOS: lldb “image list”
Can be automated (e.g. pkgapp)
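When no debugger is at hand, the library list can often be recovered from text instead. The sketch below assumes Linux and input lines in /proc/<pid>/maps format (address range, permissions, offset, device, inode, path), which is what the "Dynamic libraries" section of the fatal error log typically contains on that platform. The class name is hypothetical, and paths containing spaces are not handled.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: extract the distinct shared library paths from maps-style lines.
final class LoadedLibraries {

    static Set<String> fromMapsLines(List<String> lines) {
        Set<String> libs = new LinkedHashSet<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 6) {
                continue; // anonymous mapping: no path column
            }
            String path = fields[fields.length - 1];
            boolean looksLikeLibrary = path.endsWith(".so") || path.contains(".so.");
            if (path.startsWith("/") && looksLikeLibrary) {
                libs.add(path);
            }
        }
        return libs;
    }
}
```

The resulting set is the list of files to copy alongside the core file.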
27. Stack Overflow
Way more dangerous than many people think
HotSpot is able to recover most of the time
Can silently corrupt memory
JVM behavior is considered undefined until the JVM is restarted
Very easy to handle while interpreting bytecode
Impossible to guarantee proper handling in native code
37. Unwinding a Stack Overflow
If a StackOverflowError (SOFE) is thrown in a critical section
Java-level data may be left in inconsistent state
Java-level lock may be left “held” by nobody (likely hang)
No JVM crash, but system unlikely to be able to continue running
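A deliberately simplified illustration of the scenario above. All names are hypothetical, and the overflow is forced by unbounded recursion inside the critical section; in real systems it tends to strike somewhere far less obvious, for example inside library code called while the lock is held.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo: a "recovered" StackOverflowError leaves a lock held and data half-updated.
final class CriticalSectionOverflow {
    static final ReentrantLock lock = new ReentrantLock();
    static int debits = 0;
    static int credits = 0;

    static void recurseForever() {
        recurseForever();
    }

    // Invariant: debits == credits whenever the lock is free.
    static void transfer() {
        lock.lock();
        debits++;
        recurseForever(); // throws StackOverflowError before the invariant is restored
        credits++;
        lock.unlock();
    }

    // Returns true if, after the worker "recovers", the lock is stuck and the data is inconsistent.
    static boolean demonstrate() throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                transfer();
            } catch (StackOverflowError e) {
                // The thread survives the error, but transfer() never reached unlock().
            }
        });
        worker.start();
        worker.join();
        boolean lockLeaked = !lock.tryLock(); // still owned by the finished worker thread
        return lockLeaked && debits != credits;
    }
}
```

A try/finally around the critical section would release the lock in this toy case, but it cannot help when the overflow hits inside the locking code itself or in code you do not control.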
38. Unwinding a Stack Overflow
No way to unwind arbitrary native code
Must be executing Java when we “discover” the overflow
42. Stack Banging
Can be controlled by StackShadowPages
Too low of a value makes you more vulnerable to unrecoverable
stack overflow
Too high of a value could waste stack space
49. Red / Yellow Pages
StackYellowPages
StackRedPages
Too low of a value makes you more vulnerable to unrecoverable stack
overflow
Too high of a value could waste stack space
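To see which values a running JVM is using for the stack guard flags (StackShadowPages, StackYellowPages, StackRedPages), something like the following sketch may help. It assumes a HotSpot JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name is hypothetical, and a flag the runtime does not report is recorded as "unavailable" rather than guessed.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: read the current stack guard settings from a HotSpot JVM.
final class StackGuardFlags {

    static Map<String, String> current() {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Map<String, String> values = new LinkedHashMap<>();
        String[] flags = {"StackShadowPages", "StackYellowPages", "StackRedPages"};
        for (String flag : flags) {
            try {
                values.put(flag, hotspot.getVMOption(flag).getValue());
            } catch (IllegalArgumentException e) {
                values.put(flag, "unavailable"); // flag not known to this runtime
            }
        }
        return values;
    }
}
```

Logging these values at startup makes it easier to tell later whether a crash happened on a JVM with non-default guard settings.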
50. Stack Overflow
Even when we recover, the JVM should be restarted
Locks held at the time of SOFE may be left locked
Data structures may have been left in an inconsistent state
Other stack overflows may have silently corrupted native data
51. VirtualMachineError
Thrown to indicate that the Java Virtual Machine is broken or has
run out of resources necessary for it to continue operating.
VirtualMachineError
Subclasses: InternalError, OutOfMemoryError, StackOverflowError, UnknownError
52. If You See One Stack Overflow…
Public Domain, https://commons.wikimedia.org/w/index.php?curid=696464
53. Stack Overflow
OutOfMemory and StackOverflow Exception counts:
StackOverflowErrors=1
Error log (JDK >= 8) records the number of SOFEs that were
successfully handled
Many stack overflow crashes show no obvious sign of stack
overflow
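Checking for handled stack overflows can be scripted. The sketch below assumes the counter appears in the log exactly as in the sample above (StackOverflowErrors=<n>); the class name is hypothetical, and a missing counter is treated as zero.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pull the handled-SOFE counter out of a fatal error log.
final class StackOverflowCount {
    private static final Pattern COUNT = Pattern.compile("StackOverflowErrors=(\\d+)");

    // Returns the recorded count, or 0 if the log has no such line.
    static int fromErrorLog(String logText) {
        Matcher m = COUNT.matcher(logText);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }
}
```

Any non-zero result is worth flagging during triage, even when the crash itself looks unrelated.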
54. Stack Overflow
No StackOverflowError is benign
All SOFEs should be investigated for root cause and resolved
SOFE is very hard to eliminate as a possible root cause for many
crashes
56. sun.misc.Unsafe
Java code can directly access various JVM internal functionality
Used sparingly to implement parts of the Java SE Class Library
Never intended for use outside of Sun / Oracle
57. By Cbmeeks / processed by Pixel8 - Original uploader was Cbmeeks at en.wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3672924
59. BASIC Support for Direct Access
PEEK
Retrieve data from an arbitrary address
POKE
Write an arbitrary value to an arbitrary address
62.
private static final Unsafe theUnsafe = new Unsafe();

public static Unsafe getUnsafe() {
    Class cc = sun.reflect.Reflection.getCallerClass(2);
    if (cc.getClassLoader() != null)
        throw new SecurityException("Unsafe");
    return theUnsafe;
}
63.
Field f = Unsafe.class.getDeclaredField("theUnsafe");
f.setAccessible(true);
unsafe = (Unsafe) f.get(null);
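For reference, a self-contained version of the reflection trick above, with the imports and exception handling the fragment leaves out. The class name is hypothetical. It is shown only to illustrate how easily the class loader check in getUnsafe() is sidestepped, not as a recommendation; the compiler will warn that sun.misc.Unsafe is an internal API.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical wrapper around the well-known "theUnsafe" reflection loophole.
final class UnsafeAccess {

    // Reads the private static singleton directly, bypassing getUnsafe() and its caller check.
    static Unsafe obtain() throws ReflectiveOperationException {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }
}
```

Searching a code base and its dependencies for the string "theUnsafe" is a quick way to find out whether a system depends on this API.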
69. The problem with more flexibility
Without limits on what code is allowed to do, we lose the ability to
reason about it.
Tradeoffs are sometimes reasonable, but only if you know you’re
making them.
Most “users” of Unsafe are not aware that their systems depend on
an unsupported and dangerous API.
70. Jigsaw: closing the loophole
By Jared Tarbell - Flickr: sky puzzle, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=31953973
71. Native Code
Native code can do anything (Unsafe on steroids!)
JNI used heavily within the JDK
HotSpot: ~1.1 MLOC (C and C++)
Class library / tools: ~0.9 MLOC (C and C++)
The vast majority of native-caused crashes are in 3rd party code
72. Native Code
Debugging / troubleshooting native code requires close familiarity
with platform and native tools.
Native code can cause memory corruption that only manifests as a
crash later in JVM code.
73. Native Code – strict JNI checking
-Xcheck:jni tells the JVM to sanity check arguments and other
prerequisites during any JNI call.
The additional checking comes at a performance cost.
Can help identify mistakes in calling the JNI API, but not anything
else.
Still, JNI usage mistakes are common and are often found with
-Xcheck:jni
74. Native Code – Signal Handling
Native code may install its own signal handlers
HotSpot makes heavy use of signals internally
Error log will list any handlers installed for signals we care about:
Signal Handlers:
…
SIGILL: [libjvm.so+0x8c1cb0],
sa_mask[0]=11111111011111111101111111111110,
sa_flags=SA_RESTART|SA_SIGINFO
SIGUSR1: SIG_DFL,
sa_mask[0]=00000000000000000000000000000000,
sa_flags=none
…
76. Native Code – Signal Handling
Native code may install its own signal handlers
HotSpot makes heavy use of signals internally
Error log will list any handlers installed for signals we care about:
Signal Handlers:
…
SIGILL: [libyourmom.so+0x8c1cb0],
sa_mask[0]=11111111011111111101111111111110,
sa_flags=SA_RESTART|SA_SIGINFO
SIGUSR1: SIG_DFL,
sa_mask[0]=00000000000000000000000000000000,
sa_flags=none
…
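Spotting a foreign handler like the one above can be automated. This sketch assumes the "Signal Handlers" section uses the layout shown on these slides (SIGNAME: [library+0xoffset]); newer JDKs may print the section differently, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: list signals whose handler does not live in libjvm.
final class SignalHandlerAudit {
    private static final Pattern HANDLER =
            Pattern.compile("(SIG[A-Z0-9]+): \\[(.+?)\\+0x[0-9a-fA-F]+\\]");

    // Maps signal name to the library that owns its handler, skipping HotSpot's own handlers.
    static Map<String, String> foreignHandlers(String signalSection) {
        Map<String, String> foreign = new LinkedHashMap<>();
        Matcher m = HANDLER.matcher(signalSection);
        while (m.find()) {
            String library = m.group(2);
            if (!library.startsWith("libjvm.")) {
                foreign.put(m.group(1), library);
            }
        }
        return foreign;
    }
}
```

Signals left at SIG_DFL have no library entry and are ignored; any entry in the result points at native code that replaced a HotSpot handler.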
77. Native Code – Signal Chaining
Prevents native code from overriding HotSpot handlers
Keeps track of any custom handler native code tries to install
HotSpot signal handler is called by the OS first
Signal originated (PC) from HotSpot code -> HotSpot handles it
Signal originated elsewhere -> HotSpot calls custom handler
(Diagram: OS -> HotSpot handler -> custom handler)
78. Native Code – Signal Chaining
HotSpot signal chaining code needs to override OS-provided signal
functions (e.g. sigaction).
Easiest way to force signal chaining is to preload the HotSpot signal
chaining library:
export LD_PRELOAD=<libjvm.so dir>/libjsig.so
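When the JVM is started from another Java process rather than from a shell, the same preload can be set on the child's environment. A minimal sketch, assuming a Unix-like platform; the class name is hypothetical, and the caller must supply the actual location of libjsig.so for the target JDK.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: launch a child JVM with the HotSpot signal chaining library preloaded.
final class SignalChainingLauncher {

    // Note: this replaces any LD_PRELOAD value inherited from the parent process.
    static ProcessBuilder withLibjsig(Path libjsig, List<String> command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.environment().put("LD_PRELOAD", libjsig.toString());
        return builder;
    }
}
```

Calling start() on the returned builder launches the child with signal chaining in effect for every native library it loads.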
81. Memory Exhaustion
Address space exhaustion
32-bit JVMs have less than 4GB of address space
32-bit Windows defaults to 2GB!
64-bit platforms
Address space layout issues can prevent allocation
Most often seen on Solaris
82. Memory Exhaustion
Get Rid of OutOfMemoryError Messages [DEV3420]
Poonam Parhar
Thursday, September 19, 12:15 PM - 01:00 PM | Moscone South - Room 304
Troubleshooting Native Memory Leaks in Java Applications
CodeOne 2018
Slides available on-line
Poonam’s blog has great related content
95. Demo
public class Demo {
    public static void main(String[] args) {
        ClassA obj = new ClassC();
        System.out.println(obj.doSomething(1, 2, 3));
    }
}
96. $ java Demo
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.VerifyError: Bad type on operand stack
Exception Details:
Location:
Demo.main([Ljava/lang/String;)V @15: invokevirtual
Reason:
Type 'ClassC' (current frame, stack[1]) is not assignable to 'ClassA'
Current Frame:
bci: @15
flags: { }
locals: { '[Ljava/lang/String;', 'ClassC' }
stack: { 'java/io/PrintStream', 'ClassC', integer, integer, integer }
Bytecode:
0x0000000: bb00 0259 b700 034c b200 042b 0405 06b6
0x0000010: 0005 b600 06b1
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
98. As expected, the verifier protects us from ourselves.
What if we disable it…
99. We reap what we sow
[dbuck@dbuck02 demo1]$ java -Xverify:none Demo
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fa93be7991c, pid=22925, tid=140364857087744
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x46391c]
#
# Core dump written. Default location: /home/dbuck/BCV_TOI/demo/demo1/core or core.22925
#
# An error report file with more information is saved as:
# /home/dbuck/BCV_TOI/demo/demo1/hs_err_pid22925.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)
100. Demo Takeaways
No obvious evidence that bad bytecode was root cause of crash
A class is only valid in the context of previously loaded classes
No malicious intent / 3rd party tools used
101. OS Issue (example)
Intel keeps adding new SIMD registers
Preexisting SIMD registers keep growing
128 bit -> 256 bit -> 512 bit
Using these registers helps avoid having to spill to local memory
By XMM_registers.png: Jonasmike; derivative work: Racecar56 - XMM_registers.png, Public Domain, https://commons.wikimedia.org/w/index.php?curid=8540155
102. XMM Corruption
Linux kernels have not always correctly saved / restored XMM
register content on context switch
Can lead to virtually random memory corruption
Only hint that OS is a factor: recent kernel update
Has happened at least 3 times in the past decade
Code compiled with newer toolchains depends on XMM much
more heavily than in the past
103. JVM Bug
Most “obvious” cause of JVM crashes
Very hard for end users to identify root cause
Often possible to work around many issues
104. Performance / Stability Tradeoff
It's easy to make it fast.
It's easy to make it correct.
It’s almost impossible to do both at the same time.
108. JIT Crashes
Can happen anywhere
During JIT compilation
During execution of JITed code
Anywhere else (e.g. during GC of data corrupted by JITed code)
109. JIT Crash During Compilation
--------------- T H R E A D ---------------
Current thread (0x000000000061e800): JavaThread "C2 CompilerThread1" daemon [_thread_in_vm, id=12, stack(0xfffffd7ef87fe000,0xfffffd7ef88fe000)]
Current CompileTask:
C2: 15252 9024 b 4 com.sun.crypto.provider.CipherCore::update (609 bytes)
113. JIT Bug Workarounds
Globally disable JIT
-Xint (Interpreter only)
-XX:TieredStopAtLevel=1 (Interpreter + C1 JIT only)
Disable JIT for a particular package / class / method
-XX:CompileCommand=exclude,oracle/j2ee/ws/wsdl/extensions/AbstractSerializer,startMarshall
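When the crash happened during compilation, the method to exclude can be read straight from the "Current CompileTask" line of the error log. The sketch below turns such a line into an exclude option in the slash-separated form used above. The class name is hypothetical, and it assumes the task line contains the method as fully.qualified.Class::method.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: build a CompileCommand exclude option from a CompileTask log line.
final class JitExclude {
    private static final Pattern TASK_METHOD = Pattern.compile("([\\w.$]+)::([\\w$<>]+)");

    // Returns the option, or empty if the line does not name a method.
    static Optional<String> fromCompileTask(String compileTaskLine) {
        Matcher m = TASK_METHOD.matcher(compileTaskLine);
        if (!m.find()) {
            return Optional.empty();
        }
        String className = m.group(1).replace('.', '/');
        return Optional.of("-XX:CompileCommand=exclude," + className + "," + m.group(2));
    }
}
```

For the CipherCore::update task shown earlier this yields an exclude option for com/sun/crypto/provider/CipherCore,update, which keeps the rest of the application compiled while the suspect method runs in the interpreter.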
114. Always Use Up-to-date Runtime
1000s of stability fixes during lifetime of a major release
Tremendous effort to avoid regression / incompatibilities in update
releases
Security vulnerabilities alone should justify staying up to date
Risk of known issues / vulnerabilities > Risk of updating
115. Conclusion
Most JVM crashes reported can be resolved or worked around by
end users
Lots of end-user actionable data in hs_err log
A quick sanity check of the “usual suspects” can resolve most crash
issues