These slides walk you through how Tanel troubleshot another complex Oracle Database performance issue that involved "cache buffers chains" latch contention, the "log file sequential read" wait event and an ORA-600 bug!
2. blog.tanelpoder.com 2
Let's get started!
This is a reconstruction using the real problem screenshots and AWR/ASH reports, plus follow-up experiments on my test environment.
Symptoms
• Some queries intermittently extremely slow
• Concurrency wait class waits going through the roof
• Performance gets back to normal by itself
The AWR approach
An average wait of 28ms for a latch is very long! Could it be CPU starvation (scheduling latency)?
The AWR approach
Plenty of idle CPU time (remember, this is an average over a 30-minute AWR period!)
Warning: on Linux the LOAD average also includes processes waiting for disk IO!
The ASH approach
ASH links together many facts about a session's doings (wait event, sql_id, username, block#, etc.). And since ASH has 1-second measurement granularity, we can query it manually.
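The ASH counting technique can be sketched in plain Python (a conceptual illustration, not Oracle code): treat each 1-second sample as roughly one second of DB time, then group and count samples to get a time profile. The sample data below is made up.

```python
from collections import Counter

# Hypothetical 1-second ASH-style samples: each sample snapshots what a
# session was doing at that instant (wait event, sql_id, etc.).
samples = [
    {"event": "latch: cache buffers chains", "sql_id": "6wvgqn05s855u"},
    {"event": "latch: cache buffers chains", "sql_id": "6wvgqn05s855u"},
    {"event": None, "sql_id": "2bdg4ygkpyxc9"},            # ON CPU sample
    {"event": "log file sequential read", "sql_id": "2bdg4ygkpyxc9"},
]

def profile(samples):
    """Count samples per (state, sql_id); each sample ~= 1 second of DB time."""
    c = Counter((s["event"] or "ON CPU", s["sql_id"]) for s in samples)
    return c.most_common()

for (state, sql_id), seconds in profile(samples):
    print(f"{seconds:4d}s  {state:35s} {sql_id}")
```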
One latch or many?
The P1 value is the latch address! Now we know that almost all of the latch contention was against a single child latch!
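Conceptually, finding the hot child latch is just grouping the latch-wait samples by their P1 value. A toy Python illustration (the addresses are made up, apart from the 0x385BA5C4 child from this case):

```python
from collections import Counter

# Hypothetical ASH samples for "latch: cache buffers chains" waits;
# for latch waits, P1 is the address of the child latch being waited for.
wait_samples = [
    "0x385BA5C4", "0x385BA5C4", "0x385BA5C4", "0x385BA5C4",
    "0x385BA5C4", "0x385BA5C4", "0x385BA5C4", "0x1F2C0010",
]

by_child = Counter(wait_samples)
total = sum(by_child.values())
for addr, n in by_child.most_common():
    print(f"{addr}  {n:3d} samples  {100 * n / total:5.1f}%")
```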
So, who's blocking us?
Bummer, we know the blocker for only 1% of the latch waits! So we don't know who blocked us for 99% of the waits... This is a wait interface shortcoming.
The ASH approach
The ASH report incorrectly labels the "unknown" blocker as "Held Shared". Sometimes the latch waits are extremely long! Let's pick the one blocker reported and hope it's relevant :-) *
What was the blocker doing?
SQL> SELECT session_state, event, sql_id, COUNT(*) * 10 seconds
FROM dba_hist_active_sess_history
WHERE session_id = 5914
AND sample_time BETWEEN <spike_start> AND <spike_end>
GROUP BY session_state, event, sql_id
ORDER BY seconds DESC;
%This SESSION EVENT SQL_ID
------ ------- ---------------------------------------- -------------
67% ON CPU 2bdg4ygkpyxc9
14% WAITING log file sequential read 2bdg4ygkpyxc9
10% ON CPU fmdctt76kf3mb
5% WAITING library cache: mutex X
5% WAITING log buffer space 2bdg4ygkpyxc9
It was a user session by "JDBC Thin Client". Why would a user session read a redo log file?!
Alternative: LatchProf Collector ("ASH" of Latch Holders)
SQL> SELECT latch_name, hold_mode, sid, COUNT(*)
FROM latchprof_view
WHERE latch_name = 'cache buffers chains'
AND child_address = '385BA5C4'
GROUP BY latch_name, hold_mode, sid
ORDER BY COUNT(*) DESC;
LATCH_NAME HOLD_MODE SID COUNT(*)
------------------------------ -------------- ---------- ----------
cache buffers chains MAYBE-SHARED 19 3
cache buffers chains SHARED 201 2
cache buffers chains MAYBE-SHARED 78 2
cache buffers chains EXCLUSIVE 201 2
cache buffers chains MAYBE-SHARED 132 1
cache buffers chains MAYBE-SHARED 79 1
Report holders for only the child latch involved in the waits
• http://tech.e2sn.com/oracle/troubleshooting/latch-contention-troubleshooting
• http://blog.tanelpoder.com/files/scripts/tools/collectors/latchprof_install.sql
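LatchProf's core idea can be sketched in Python: since the wait interface rarely captures the latch holder, sample a v$latchholder-style source at high frequency and count which SIDs keep showing up holding the child latch of interest. `get_holders` and the snapshot data below are stand-ins for illustration, not real Oracle APIs.

```python
from collections import Counter

def sample_holders(get_holders, n_samples, child_addr):
    """Sample the holder list n_samples times; count (sid, mode) pairs
    seen holding the given child latch address."""
    holds = Counter()
    for _ in range(n_samples):
        for sid, addr, mode in get_holders():
            if addr == child_addr:
                holds[(sid, mode)] += 1
    return holds

# Fake sampler: SID 201 holds the latch in 2 of the 5 snapshots.
snapshots = iter([
    [(201, "0x385BA5C4", "EXCLUSIVE")],
    [],
    [(201, "0x385BA5C4", "SHARED")],
    [],
    [],
])
result = sample_holders(lambda: next(snapshots), 5, "0x385BA5C4")
print(result.most_common())
```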
What are the reasons for reading a redo log file?
1. LGWR doing a log switch (but ours was a user session!)
2. A Streams/GoldenGate/LogMiner log mining operation?
• But this was a regular application SELECT query against normal tables
3. Manual dumping of redo log contents
• ALTER SYSTEM DUMP LOGFILE …
4. Automatic Block Media Recovery?
• http://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmblock.htm#BRADV118
• v$database_block_corruption was empty (IIRC)
• I thought to check the alert log for any corruption warnings…
alert.log
Mon Feb 24 15:54:18 2014
Dumping diagnostic data in directory=[cdmp_20140224155418], requested by
(instance=1, osid=25519), summary=[incident=74688].
Mon Feb 24 15:56:58 2014
Errors in file
/u01/app/oracle/diag/rdbms/lin112/LIN112/trace/LIN112_ora_25519.trc
(incident=74689):
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [],
[], [], [], []
Incident details in:
/u01/app/oracle/diag/rdbms/lin112/LIN112/incident/incdir_74689/LIN112_ora_2551
9_i74689.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 24 15:56:59 2014
Sweep [inc][74689]: completed
Sweep [inc2][74689]: completed
Wow, ORA-600s! kdsgrp = Kernel Data Get Row Piece
Process trace file – errorstack
Dump continued from file: /u01/app/oracle/diag/rdbms/lin112/.../LIN112_ora_25519.trc
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
----- Current SQL Statement for this session (sql_id=6wvgqn05s855u) -----
SELECT /*+ INDEX_RS_ASC(t) */ COUNT(LENGTH(owner)) FROM t_c_hotsos t WHERE object_id > 1
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
kgerinv()+41 call kgerinv_internal() 11B60460 ? F6970020 ?
10809D5C ? 258 ? 0 ? 0 ?
kgeasnmierr()+47 call kgerinv() 11B60460 ? F6970020 ?
10809D5C ? 0 ? FFB53640 ?
kdsgrp1_dump()+1138 call kgeasnmierr() 11B60460 ? F6970020 ?
10809D5C ? 0 ?
kdsgrp1()+32 call kdsgrp1_dump() F663C9D0 ? F663C9D0 ? 0 ?
qetlbr()+247 call kdsgrp1() F663C9D0 ? F663C9D0 ? 0 ?
qertbFetchByRowID() call qetlbr() F663C9D0 ? F6639F08 ?
+5617 2233A6E0 ? 0 ? F663C8FC ? 0 ?
0 ?
qergsFetch()+497 call 00000000 2233A6E0 ? F663C8D0 ?
1024AA74 ? FFB53A08 ? 7FFF ?
opifch2()+2659 call 00000000 2233A5E4 ? F663CB38 ?
Apparently the failure in the kdsgrp1 function causes some special dump function to be called.
This is a reproduced test query.
Process trace file – pinned buffer history
...
END OF PROCESS STATE
----- Pinned Buffer History -----
---------------------
PINNED BUFFER HISTORY (oldest pin first)
---------------------
BH (0x283f2628) file#: 1 rdba: 0x00430445 (1/197701) class: 1 ba: 0x28264000
set: 9 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 0,25
dbwrid: 0 obj: 132533 objn: 132533 tsn: 0 afn: 1 hint: f
hash: [0x38563cf4,0x38563cf4] lru: [0x23be8998,0x323fa1c0]
ckptq: [NULL] fileq: [NULL] objq: [0x23be89b0,0x323fa1d8] objaq: [0x23be89b8,0x323fa1e0]
st: XCURRENT md: NULL fpin: 'kdiwh100: kdircys' tch: 3
flags: only_sequential_access
LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [65535]
buffer tsn: 0 rdba: 0x00430445 (1/197701)
scn: 0x0000.02919937 seq: 0x02 flg: 0x04 tail: 0x99370602
frmt: 0x02 chkval: 0x7eb5 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Dump of memory from 0x28264000 to 0x28266000
28264000 0000A206 00430445 02919937 04020000 [....E.C.7.......]
28264010 00007EB5 00000002 000205B5 02919929 [.~..........)...]
28264020 00000000 00020002 00000000 00000000 [................]
28264030 00000000 00000000 00000000 00000000 [................]
A number of recently accessed buffers/blocks will also be dumped (from memory). You can issue a "dump_pinned_buffer_history" manually too.
Process trace file – buffer change history (REDO)
END OF PINNED BUFFER HISTORY
*** timestamp before redo dump: 02/24/2014 15:52:16
***********************************************
* Dump Online Redo for Buffers in Pin History *
***********************************************
$$$$$$$ Dump Online Redo for DBA list (tsn.rdba in hex):
0x0.00430445 0x9.020001f9 0x9.020001fa 0x9.020001fb 0x9.020001fc 0x9.020001fd
0x9.020001fe 0x9.020001ff 0x9.02000202 0x9.02000203 0x9.02000204:
DUMP REDO
Opcodes *.*
DBAs (file#, block#):
(1, 197701) (8, 505) (8, 506) (8, 507) (8, 508) (8, 509) (8, 510) (8, 511) (8, 514) .
SCNs: scn: 0x0000.00000000 thru scn: 0xffff.ffffffff
Times: 02/24/2014 14:52:16 thru eternity
REDO RECORD - Thread:1 RBA: 0x000a13.0000b070.0010 LEN: 0x10e0 VLD: 0x0d
SCN: 0x0000.0291d462 SUBSCN: 1 02/24/2014 15:13:56
(LWN RBA: 0x000a13.0000b070.0010 LEN: 0115 NST: 0001 SCN: 0x0000.0291d462)
CHANGE #1 TYP:2 CLS:1 AFN:2 DBA:0x00826444 OBJ:5823 SCN:0x0000.0291d2a7 SEQ:2 OP:11.2 ENC:0
RBL:0
KTB Redo
op: 0x01 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: F xid: 0x0023.017.0000028a uba: 0x01c0156a.01c6.03
KDO Op code: IRP row dependencies Disabled
...
Oracle also dumps the online REDO for the recently pinned blocks!!! This is why the blocking session was waiting for the log file sequential read!
A Corruption?
• So, do we have a corruption?
• No block checksum / checking errors
• dbverify, RMAN, DBMS_REPAIR tools didn't report a problem
• As this was an ORA-600 error, it's time to search the MOS
• "kdsgrp1 ora 600"
Causal Chain
1. A session took a cache buffers chains latch and accessed a
buffer using the data layer function kdsgrp1
2. The session hit an ORA-600 due to a bug (not corruption)
3. As the data access function failed, an errorstack dump with
the recently accessed buffer dump was invoked
4. The recent buffer dump also read and dumped relevant
changes from the online redo logs (log file sequential read!)
5. The cache buffers chains latch was held until the end of the
dump!
Shouldn't we have waited for buffer busy waits?
โข With regular logical IOs the buffer contents are not read while
holding the CBC latch:
1. Take CBC latch in shared mode
2. Walk the buffer hash chain until you find the relevant buffer header
3. Upgrade the CBC latch to exclusive mode
4. Pin the buffer header
5. Release the CBC latch
6. Now access the buffer (call transaction, data layer etc)
7. Take the CBC latch again (in shared mode)
8. Unpin the buffer header
9. Release the CBC latch
If someone else wants to pin the buffer now, they'd wait on "buffer busy waits".
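The nine steps above can be sketched as a toy Python model (purely illustrative; Oracle's real implementation is in C and far more involved). The key point it demonstrates: because the latch is released before the buffer access, a concurrent session blocks on the pin ("buffer busy waits"), not on the latch.

```python
# Toy model: the CBC latch protects the hash chain, the pin protects the
# buffer itself. The latch is dropped before the (potentially long) buffer
# access, so a second session collides on the pin, not the latch.
class Buffer:
    def __init__(self):
        self.pinned_by = None   # SID currently pinning this buffer, if any

def regular_lio(sid, buf, latch_holders, access):
    latch_holders.add(sid)                 # 1. take CBC latch (shared)
    # 2. walk the hash chain, find the buffer header (elided)
    if buf.pinned_by is not None:
        latch_holders.discard(sid)
        return "buffer busy waits"         # someone else holds the pin
    buf.pinned_by = sid                    # 3-4. upgrade latch, pin buffer
    latch_holders.discard(sid)             # 5. release latch before access
    access(buf)                            # 6. read buffer; latch NOT held
    latch_holders.add(sid)                 # 7. retake latch (shared)
    buf.pinned_by = None                   # 8. unpin
    latch_holders.discard(sid)             # 9. release latch
    return "ok"

latch = set()
b = Buffer()
b.pinned_by = 42                           # another session is mid-access
print(regular_lio(7, b, latch, lambda buf: None))   # -> buffer busy waits
```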
Sometimes "short" logical IOs can skip a few steps
• With "short" LIOs, like unique index lookup LIOs, Oracle can avoid the buffer pinning codepath:
1. Take CBC latch in shared mode
2. Walk the buffer hash chain until you find the relevant buffer header
3. Upgrade the CBC latch to exclusive mode (skipped)
4. Pin the buffer header (skipped)
5. Release the CBC latch (skipped)
6. Now access the buffer while still holding the CBC latch
7. Take the CBC latch again (skipped)
8. Unpin the buffer header (skipped)
9. Release the CBC latch
This shows up as the "consistent gets – examination" counter in v$sesstat.
If someone wants to get the CBC latch in exclusive mode now, they'd wait on the latch.
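The examination path, in the same toy-model spirit (again an illustration, not Oracle code): the buffer is read while the latch is still held, so that is exactly where concurrent sessions now collide.

```python
# Toy model of a "short" LIO (examination): the pin/unpin steps are skipped
# and the buffer is read while the CBC latch is still held. A concurrent
# session wanting the latch must therefore wait on the latch itself.
def examination_lio(sid, buf, latch_holders, access):
    latch_holders.add(sid)      # take CBC latch in shared mode
    access(buf)                 # walk chain + read buffer UNDER the latch
    latch_holders.discard(sid)  # release latch

seen = []
latch = set()
# The access callback records whether SID 7 held the latch during the read.
examination_lio(7, {"data": 1}, latch, lambda buf: seen.append(7 in latch))
print(seen)
```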
Conclusion
1. Troubleshoot by following the causal chain of events
2. Don't try to jump to the "solution" or "root cause" immediately
• There are many possible root causes
3. Sometimes you need to bridge a gap in the chain with your own reasoning (and later verify it)
• "Why would a user session need to read from a redo log?"
4. Sometimes you need to selectively ignore/postpone evidence
• Latch contention is not always a "too heavy usage" issue