These slides walk you through how Tanel troubleshot another complex Oracle Database performance issue that involved "cache buffers chains" latch contention, the "log file sequential read" wait event and an ORA-600 bug!
2. blog.tanelpoder.com 2
Let's get started!
This is a reconstruction using the real problem screenshots and AWR/ASH reports, plus follow-up experiments on my test environment.
Symptoms
• Some queries intermittently extremely slow
• Concurrency wait class waits going through the roof
• Performance gets back to normal by itself
The AWR approach
An average wait of 28ms for a latch is very long! Could it be CPU starvation (scheduling latency)?
The AWR approach
Plenty of idle CPU time (remember, this is an average over a 30-minute AWR period!)
Warning: on Linux the LOAD average also includes processes waiting for disk IO!
The ASH approach
ASH links together many facts about a session's doings (wait event, sql_id, username, block#, etc.). And since ASH has 1-second measurement granularity, we can query it manually.
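The ASH counting technique can be sketched in plain Python (a conceptual illustration, not Oracle code): treat each 1-second sample as roughly one second of DB time, then group and count samples to get a time profile. The sample data below is made up.

```python
from collections import Counter

# Hypothetical 1-second ASH-style samples: each sample snapshots what a
# session was doing at that instant (wait event, sql_id, etc.).
samples = [
    {"event": "latch: cache buffers chains", "sql_id": "6wvgqn05s855u"},
    {"event": "latch: cache buffers chains", "sql_id": "6wvgqn05s855u"},
    {"event": None, "sql_id": "2bdg4ygkpyxc9"},            # ON CPU sample
    {"event": "log file sequential read", "sql_id": "2bdg4ygkpyxc9"},
]

def profile(samples):
    """Count samples per (state, sql_id); each sample ~= 1 second of DB time."""
    c = Counter((s["event"] or "ON CPU", s["sql_id"]) for s in samples)
    return c.most_common()

for (state, sql_id), seconds in profile(samples):
    print(f"{seconds:4d}s  {state:35s} {sql_id}")
```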
One latch or many?
The P1 value is the latch address! Now we know that almost all of the latch contention was against a single child latch!
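Conceptually, finding the hot child latch is just grouping the latch-wait samples by their P1 value. A toy Python illustration (the addresses are made up, apart from the 0x385BA5C4 child from this case):

```python
from collections import Counter

# Hypothetical ASH samples for "latch: cache buffers chains" waits;
# for latch waits, P1 is the address of the child latch being waited for.
wait_samples = [
    "0x385BA5C4", "0x385BA5C4", "0x385BA5C4", "0x385BA5C4",
    "0x385BA5C4", "0x385BA5C4", "0x385BA5C4", "0x1F2C0010",
]

by_child = Counter(wait_samples)
total = sum(by_child.values())
for addr, n in by_child.most_common():
    print(f"{addr}  {n:3d} samples  {100 * n / total:5.1f}%")
```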
So, who's blocking us?
Bummer, we know the blocker for only 1% of the latch waits! So we don't know who blocked us for 99% of the waits... This is a wait interface shortcoming.
The ASH approach
The ASH report incorrectly labels the "unknown" blocker as "Held Shared". Sometimes the latch waits are extremely long! Let's pick the one blocker reported and hope it's relevant :-) *
What was the blocker doing?
SQL> SELECT session_state, event, sql_id, COUNT(*) * 10 seconds
FROM dba_hist_active_sess_history
WHERE session_id = 5914
AND sample_time BETWEEN <spike_start> AND <spike_end>
GROUP BY session_state, event, sql_id
ORDER BY seconds DESC;
%This SESSION EVENT SQL_ID
------ ------- ---------------------------------------- -------------
67% ON CPU 2bdg4ygkpyxc9
14% WAITING log file sequential read 2bdg4ygkpyxc9
10% ON CPU fmdctt76kf3mb
5% WAITING library cache: mutex X
5% WAITING log buffer space 2bdg4ygkpyxc9
It was a user session by "JDBC Thin Client". Why would a user session read a redo log file?!
Alternative: LatchProf Collector ("ASH" of Latch Holders)
SQL> SELECT latch_name, hold_mode, sid, COUNT(*)
FROM latchprof_view
WHERE latch_name = 'cache buffers chains'
AND child_address = '385BA5C4'
GROUP BY latch_name, hold_mode, sid
ORDER BY COUNT(*) DESC;
LATCH_NAME HOLD_MODE SID COUNT(*)
------------------------------ -------------- ---------- ----------
cache buffers chains MAYBE-SHARED 19 3
cache buffers chains SHARED 201 2
cache buffers chains MAYBE-SHARED 78 2
cache buffers chains EXCLUSIVE 201 2
cache buffers chains MAYBE-SHARED 132 1
cache buffers chains MAYBE-SHARED 79 1
Report holders for only the child latch involved in the waits
• http://tech.e2sn.com/oracle/troubleshooting/latch-contention-troubleshooting
• http://blog.tanelpoder.com/files/scripts/tools/collectors/latchprof_install.sql
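LatchProf's core idea can be sketched in Python: since the wait interface rarely captures the latch holder, sample a v$latchholder-style source at high frequency and count which SIDs keep showing up holding the child latch of interest. `get_holders` and the snapshot data below are stand-ins for illustration, not real Oracle APIs.

```python
from collections import Counter

def sample_holders(get_holders, n_samples, child_addr):
    """Sample the holder list n_samples times; count (sid, mode) pairs
    seen holding the given child latch address."""
    holds = Counter()
    for _ in range(n_samples):
        for sid, addr, mode in get_holders():
            if addr == child_addr:
                holds[(sid, mode)] += 1
    return holds

# Fake sampler: SID 201 holds the latch in 2 of the 5 snapshots.
snapshots = iter([
    [(201, "0x385BA5C4", "EXCLUSIVE")],
    [],
    [(201, "0x385BA5C4", "SHARED")],
    [],
    [],
])
result = sample_holders(lambda: next(snapshots), 5, "0x385BA5C4")
print(result.most_common())
```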
What are the reasons for reading a redo log file?
1. LGWR doing a log switch (but ours was a user session!)
2. A Streams/GoldenGate/LogMiner log mining operation?
• But this was a regular application SELECT query against normal tables
3. Manual dumping of redo log contents
• ALTER SYSTEM DUMP LOGFILE …
4. Automatic Block Media Recovery?
• http://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmblock.htm#BRADV118
• v$database_block_corruption was empty (IIRC)
• I thought to check the alert log for any corruption warnings…
alert.log
Mon Feb 24 15:54:18 2014
Dumping diagnostic data in directory=[cdmp_20140224155418], requested by
(instance=1, osid=25519), summary=[incident=74688].
Mon Feb 24 15:56:58 2014
Errors in file
/u01/app/oracle/diag/rdbms/lin112/LIN112/trace/LIN112_ora_25519.trc
(incident=74689):
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [],
[], [], [], []
Incident details in:
/u01/app/oracle/diag/rdbms/lin112/LIN112/incident/incdir_74689/LIN112_ora_2551
9_i74689.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 24 15:56:59 2014
Sweep [inc][74689]: completed
Sweep [inc2][74689]: completed
Wow, ORA-600s! kdsgrp = Kernel Data Get Row Piece
Process trace file – errorstack
Dump continued from file: /u01/app/oracle/diag/rdbms/lin112/.../LIN112_ora_25519.trc
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
----- Current SQL Statement for this session (sql_id=6wvgqn05s855u) -----
SELECT /*+ INDEX_RS_ASC(t) */ COUNT(LENGTH(owner)) FROM t_c_hotsos t WHERE object_id > 1
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
kgerinv()+41 call kgerinv_internal() 11B60460 ? F6970020 ?
10809D5C ? 258 ? 0 ? 0 ?
kgeasnmierr()+47 call kgerinv() 11B60460 ? F6970020 ?
10809D5C ? 0 ? FFB53640 ?
kdsgrp1_dump()+1138 call kgeasnmierr() 11B60460 ? F6970020 ?
10809D5C ? 0 ?
kdsgrp1()+32 call kdsgrp1_dump() F663C9D0 ? F663C9D0 ? 0 ?
qetlbr()+247 call kdsgrp1() F663C9D0 ? F663C9D0 ? 0 ?
qertbFetchByRowID() call qetlbr() F663C9D0 ? F6639F08 ?
+5617 2233A6E0 ? 0 ? F663C8FC ? 0 ?
0 ?
qergsFetch()+497 call 00000000 2233A6E0 ? F663C8D0 ?
1024AA74 ? FFB53A08 ? 7FFF ?
opifch2()+2659 call 00000000 2233A5E4 ? F663CB38 ?
Apparently the failure in the kdsgrp1 function causes some special dump function to be called.
This is a reproduced test query.
Process trace file – pinned buffer history
...
END OF PROCESS STATE
----- Pinned Buffer History -----
---------------------
PINNED BUFFER HISTORY (oldest pin first)
---------------------
BH (0x283f2628) file#: 1 rdba: 0x00430445 (1/197701) class: 1 ba: 0x28264000
set: 9 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 0,25
dbwrid: 0 obj: 132533 objn: 132533 tsn: 0 afn: 1 hint: f
hash: [0x38563cf4,0x38563cf4] lru: [0x23be8998,0x323fa1c0]
ckptq: [NULL] fileq: [NULL] objq: [0x23be89b0,0x323fa1d8] objaq: [0x23be89b8,0x323fa1e0]
st: XCURRENT md: NULL fpin: 'kdiwh100: kdircys' tch: 3
flags: only_sequential_access
LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [65535]
buffer tsn: 0 rdba: 0x00430445 (1/197701)
scn: 0x0000.02919937 seq: 0x02 flg: 0x04 tail: 0x99370602
frmt: 0x02 chkval: 0x7eb5 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Dump of memory from 0x28264000 to 0x28266000
28264000 0000A206 00430445 02919937 04020000 [....E.C.7.......]
28264010 00007EB5 00000002 000205B5 02919929 [.~..........)...]
28264020 00000000 00020002 00000000 00000000 [................]
28264030 00000000 00000000 00000000 00000000 [................]
A number of recently accessed buffers/blocks will also be dumped (from memory). You can issue a "dump_pinned_buffer_history" manually too.
Process trace file – buffer change history (REDO)
END OF PINNED BUFFER HISTORY
*** timestamp before redo dump: 02/24/2014 15:52:16
***********************************************
* Dump Online Redo for Buffers in Pin History *
***********************************************
$$$$$$$ Dump Online Redo for DBA list (tsn.rdba in hex):
0x0.00430445 0x9.020001f9 0x9.020001fa 0x9.020001fb 0x9.020001fc 0x9.020001fd
0x9.020001fe 0x9.020001ff 0x9.02000202 0x9.02000203 0x9.02000204:
DUMP REDO
Opcodes *.*
DBAs (file#, block#):
(1, 197701) (8, 505) (8, 506) (8, 507) (8, 508) (8, 509) (8, 510) (8, 511) (8, 514) .
SCNs: scn: 0x0000.00000000 thru scn: 0xffff.ffffffff
Times: 02/24/2014 14:52:16 thru eternity
REDO RECORD - Thread:1 RBA: 0x000a13.0000b070.0010 LEN: 0x10e0 VLD: 0x0d
SCN: 0x0000.0291d462 SUBSCN: 1 02/24/2014 15:13:56
(LWN RBA: 0x000a13.0000b070.0010 LEN: 0115 NST: 0001 SCN: 0x0000.0291d462)
CHANGE #1 TYP:2 CLS:1 AFN:2 DBA:0x00826444 OBJ:5823 SCN:0x0000.0291d2a7 SEQ:2 OP:11.2 ENC:0
RBL:0
KTB Redo
op: 0x01 ver: 0x01
compat bit: 4 (post-11) padding: 1
op: F xid: 0x0023.017.0000028a uba: 0x01c0156a.01c6.03
KDO Op code: IRP row dependencies Disabled
...
Oracle also dumps the online REDO for the recently pinned blocks!!! This is why the blocking session was waiting for the log file sequential read!
A Corruption?
• So, do we have a corruption?
• No block checksum / checking errors
• dbverify, RMAN, DBMS_REPAIR tools didn't report a problem
• As this was an ORA-600 error, it's time to search the MOS
• "kdsgrp1 ora 600"
Causal Chain
1. A session took a cache buffers chains latch and accessed a
buffer using the data layer function kdsgrp1
2. The session hit an ORA-600 due to a bug (not corruption)
3. As the data access function failed, an errorstack dump with
the recently accessed buffer dump was invoked
4. The recent buffer dump also read and dumped relevant
changes from the online redo logs (log file sequential read!)
5. The cache buffers chains latch was held until the end of the
dump!
Shouldn't we have waited for buffer busy waits?
โข With regular logical IOs the buffer contents are not read while
holding the CBC latch:
1. Take CBC latch in shared mode
2. Walk the buffer hash chain until you find the relevant buffer header
3. Upgrade the CBC latch to exclusive mode
4. Pin the buffer header
5. Release the CBC latch
6. Now access the buffer (call transaction, data layer etc)
7. Take the CBC latch again (in shared mode)
8. Unpin the buffer header
9. Release the CBC latch
If someone else wants to pin the buffer now, they'd wait on "buffer busy waits".
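The nine steps above can be sketched as a toy Python model (purely illustrative; Oracle's real implementation is in C and far more involved). The key point it demonstrates: because the latch is released before the buffer access, a concurrent session blocks on the pin ("buffer busy waits"), not on the latch.

```python
# Toy model: the CBC latch protects the hash chain, the pin protects the
# buffer itself. The latch is dropped before the (potentially long) buffer
# access, so a second session collides on the pin, not the latch.
class Buffer:
    def __init__(self):
        self.pinned_by = None   # SID currently pinning this buffer, if any

def regular_lio(sid, buf, latch_holders, access):
    latch_holders.add(sid)                 # 1. take CBC latch (shared)
    # 2. walk the hash chain, find the buffer header (elided)
    if buf.pinned_by is not None:
        latch_holders.discard(sid)
        return "buffer busy waits"         # someone else holds the pin
    buf.pinned_by = sid                    # 3-4. upgrade latch, pin buffer
    latch_holders.discard(sid)             # 5. release latch before access
    access(buf)                            # 6. read buffer; latch NOT held
    latch_holders.add(sid)                 # 7. retake latch (shared)
    buf.pinned_by = None                   # 8. unpin
    latch_holders.discard(sid)             # 9. release latch
    return "ok"

latch = set()
b = Buffer()
b.pinned_by = 42                           # another session is mid-access
print(regular_lio(7, b, latch, lambda buf: None))   # -> buffer busy waits
```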
Sometimes "short" logical IOs can skip a few steps
• With "short" LIOs, like unique index lookup LIOs, Oracle can avoid the buffer pinning codepath:
1. Take CBC latch in shared mode
2. Walk the buffer hash chain until you find the relevant buffer header
3. Upgrade the CBC latch to exclusive mode (skipped)
4. Pin the buffer header (skipped)
5. Release the CBC latch (skipped)
6. Now access the buffer while still holding the CBC latch
7. Take the CBC latch again (skipped)
8. Unpin the buffer header (skipped)
9. Release the CBC latch
This shows up as the "consistent gets – examination" counter in v$sesstat.
If someone wants to get the CBC latch in exclusive mode now, they'd wait on the latch.
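The examination path, in the same toy-model spirit (again an illustration, not Oracle code): the buffer is read while the latch is still held, so that is exactly where concurrent sessions now collide.

```python
# Toy model of a "short" LIO (examination): the pin/unpin steps are skipped
# and the buffer is read while the CBC latch is still held. A concurrent
# session wanting the latch must therefore wait on the latch itself.
def examination_lio(sid, buf, latch_holders, access):
    latch_holders.add(sid)      # take CBC latch in shared mode
    access(buf)                 # walk chain + read buffer UNDER the latch
    latch_holders.discard(sid)  # release latch

seen = []
latch = set()
# The access callback records whether SID 7 held the latch during the read.
examination_lio(7, {"data": 1}, latch, lambda buf: seen.append(7 in latch))
print(seen)
```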
Conclusion
1. Troubleshoot by following the causal chain of events
2. Don't try to jump to the "solution" or "root cause" immediately
• There are many possible root causes
3. Sometimes you need to bridge a gap in the chain with your own reasoning (and later verify it)
• "Why would a user session need to read from a redo log?"
4. Sometimes you need to selectively ignore/postpone evidence
• Latch contention is not always a "too heavy usage" issue