SQLPASS presentation on performance tuning and best practices for XML and XQuery in Microsoft SQL Server 2005, SQL Server 2008, SQL Server 2008 R2 and SQL Server 2012.
1. Best Practices and Performance Tuning of XML Queries in SQL Server
AD-501-M
Michael Rys
Principal Program Manager
Microsoft Corp
mrys@microsoft.com
@SQLServerMike
October 11-14, Seattle, WA
2. Session Objectives
• Understand when and how to use XML in SQL Server
• Understand and correct common performance problems with XML and XQuery
3. Session Agenda
XML Scenarios and when to store XML
XML Design Optimizations
General Optimizations
XML Datatype method Optimizations
XQuery Optimizations
XML Index Optimizations
AD-501-M| XQuery Performance 3
5. XML Scenarios
Data Exchange between loosely-coupled systems
• XML is a ubiquitous, extensible, platform-independent transport format
• Message Envelope in XML
Simple Object Access Protocol (SOAP), RSS, REST
• Message Payload/Business Data in XML
• Vertical Industry Exchange schemas
Document Management
• XHTML, DocBook, home-grown, domain-specific markup (e.g. contracts), OpenOffice, Microsoft Office XML (both default and user-extended)
Ad-hoc modeling of semistructured data
• Storing and querying heterogeneous complex objects
• Semistructured data with sparse, highly-varying structure at the instance level
• XML provides a self-describing format and extensible schemas
→Transport, Store, and Query XML data
6. Decision Tree: Processing XML In SQL Server
[Figure: decision tree; the flowchart layout was lost in extraction. Recoverable questions and outcomes:]
• Does the data fit the relational model? Yes: shred the XML into relations
• Is the data semi-structured? Yes: shred the structured XML into relations, store semistructured aspects as XML and/or sparse columns
• Branch labels under semi-structured data: "Known sparse" (shred known sparse data into sparse columns) and "Open schema"
• Is the data a document?
• Is the XML constrained by schemas? Yes: constrain XML if validation cost is ok
• Query into the XML? Yes: promote frequently queried properties relationally; use primary and secondary XML indexes as needed
• Search within the XML? Yes: define a full-text index
• Storage outcomes: store as XML, store as varbinary(max)
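7. SQL Server XML Data Type Architecture
[Figure: architecture diagram; layout lost in extraction. Recoverable components:]
• XML side: XML Parser, XML Schema Validation against an XML Schema Collection (XML Schemata)
• XML data type (binary XML), accessed through XQuery and XML-DML
• Relational side: Rowsets
• XML to relational: OpenXML/nodes(); relational to XML: FOR XML with TYPE directive
• PRIMARY XML INDEX (Node Table) with secondary PATH, PROP and VALUE indexes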
8. General Impacts
Concurrency Control
• Locks on both XML data type and relevant rows in primary and secondary XML Indices
• Lock escalation on indices
• Snapshot Isolation reduces locks and lock contention
Transaction Logs
• Bulk insert into XML Indices may fill the transaction log
• Delay the creation of the XML indexes and use the SIMPLE recovery model
• Preallocate database file instead of dynamically growing
• Place log on different disk
In-Row/Out-of-Row of XML large object
• Moving XML into a side table or out-of-row, if mixed with relational data, reduces scan time
Due to clustering, insertion into XML Index may not be linear
• Choose integer/bigint identity column as key
9. Choose The Right XML Model
• Element-centric versus attribute-centric
<Customer><name>Joe</name></Customer>
<Customer name="Joe" />
+: Attributes often perform better in queries
–: Parsing attributes requires a uniqueness check
• Generic element names with type attribute vs Specific
element names
<Entity type="Customer">
<Prop type="Name">Joe</Prop>
</Entity>
<Customer><name>Joe</name></Customer>
+: Specific names give shorter path expressions
+: Specific names need no filter on the type attribute
/Entity[@type="Customer"]/Prop[@type="Name"] vs /Customer/name
• Wrapper elements
<Orders><Order id="1"/></Orders>
+: Without wrapper elements: smaller XML, shorter path expressions
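The naming trade-off can be tried out with a minimal sketch like the one below. It assumes untyped XML held in variables; the variable names and column aliases are illustrative and not from the session.

```sql
-- Illustrative only: variable names and aliases are assumptions.
DECLARE @generic xml, @specific xml;
SET @generic  = N'<Entity type="Customer"><Prop type="Name">Joe</Prop></Entity>';
SET @specific = N'<Customer><name>Joe</name></Customer>';

SELECT
    -- Generic element names: two predicates on the type attribute are needed
    @generic.value('(/Entity[@type="Customer"]/Prop[@type="Name"]/text())[1]',
                   'nvarchar(50)') AS GenericName,
    -- Specific element names: a plain child-axis path is enough
    @specific.value('(/Customer/name/text())[1]', 'nvarchar(50)') AS SpecificName;
```

Both columns extract the same value; only the length and shape of the path expression differ.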
10. Use an XML Schema Collection?
Using no XML Schema (untyped XML)
• Can still use XQuery and XML Index!!!
• Atomic values are always weakly typed strings: compare as strings to avoid runtime conversions and loss of index usage
• No schema validation overhead
• No schema evolution revalidation costs
XML Schema provides structural information
• Atomic typed elements use only one row instead of two in the node table/XML index (closer to attributes)
• Static typing can detect cardinality and feasibility of expression
XML Schema provides semantic information
• Elements/attributes have the correct atomic type for comparison and order semantics
• No runtime casts required and better use of index for value lookup
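The difference between untyped and typed XML can be sketched as follows. PersonSchema and its single xs:int element are hypothetical and exist only for this illustration; the GO separator is assumed so that the schema collection exists before the second batch is compiled.

```sql
-- PersonSchema is a hypothetical schema collection used only for this sketch.
CREATE XML SCHEMA COLLECTION PersonSchema AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="person">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="age" type="xs:int"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>';
GO

DECLARE @untyped xml, @typed xml(PersonSchema);
SET @untyped = N'<person><age>42</age></person>';
SET @typed   = N'<person><age>42</age></person>';   -- validated on assignment

SELECT
    -- Untyped: compare as a string to avoid a runtime conversion
    @untyped.exist('/person[age = "42"]') AS UntypedStringCompare,
    -- Typed: age is known to be xs:int, so a numeric comparison needs no cast
    @typed.exist('/person[age = 42]')     AS TypedIntCompare;
```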
11. XQuery Methods
query() creates a new, untyped XML data type instance
exist() returns 1 if the XQuery expression returns at least one item, 0 otherwise
value() extracts an XQuery value into the SQL value and type space
• Expression has to statically be a singleton
• String value of atomized XQuery item is cast to SQL type
• SQL type has to be SQL scalar type (no XML or CLR UDT)
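A small sketch of the three methods side by side, assuming an untyped XML variable with made-up sample data (the element names and aliases are illustrative):

```sql
-- Sample data and aliases are assumptions for illustration.
DECLARE @x xml;
SET @x = N'<a><b id="1">10</b><b id="2">20</b></a>';

SELECT
    @x.query('/a/b[@id = 2]')                   AS QueryResult,  -- xml fragment
    @x.exist('/a/b[@id = 2]')                   AS ExistResult,  -- bit: 1 or 0
    -- The parentheses plus [1] make the expression a static singleton
    @x.value('(/a/b[@id = 2]/text())[1]', 'int') AS ValueResult; -- SQL int
```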
12. XQuery: nodes()
Returns a row per selected node as a special XML data type instance
• Preserves the original structure and types
• Can only be used with the XQuery methods (but not modify()), count(*), and IS (NOT) NULL
Appears as Table-valued Function (TVF) in query plan if no index present
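As a hedged illustration of nodes(), the sketch below shreds a made-up document held in a variable. Each row exposes a context node that can only be consumed through further XQuery methods.

```sql
-- Sample document and aliases are assumptions for illustration.
DECLARE @x xml;
SET @x = N'<doc><customer id="1" name="Joe"/><customer id="2" name="Ann"/></doc>';

SELECT
    c.value('@id',   'int')          AS CustID,
    c.value('@name', 'nvarchar(50)') AS CName,
    c.query('.')                     AS CustomerNode  -- original structure preserved
FROM @x.nodes('/doc/customer') AS N(c);               -- one row per <customer>
```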
13. sql:column()/sql:variable()
Map SQL value and type into XQuery values and types in context of XQuery or
XML-DML
• sql:variable(): accesses a SQL variable/parameter
declare @value int
set @value=42
select * from T
where
T.x.exist('/a/b[@id=sql:variable("@value")]')=1
• sql:column(): accesses another column value
tables: T([key] int, x xml), S([key] int, val int)
select * from T join S on T.[key]=S.[key]
where T.x.exist('/a/b[@id=sql:column("S.val")]')=1
• Restrictions in SQL Server:
No XML, CLR UDT, datetime, or deprecated text/ntext/image
15. Optimal Use Of Methods
How to Cast from XML to SQL
BAD:
CAST( CAST(xmldoc.query('/a/b/text()') as
nvarchar(500)) as int)
GOOD:
xmldoc.value('(/a/b/text())[1]', 'int')
BAD:
node.query('.').value('@attr',
'nvarchar(50)')
GOOD:
node.value('@attr', 'nvarchar(50)')
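The BAD and GOOD forms above can be compared directly with a sketch like this one. It assumes an untyped variable with invented sample data; both forms return the same values, the difference is in the work the engine has to do.

```sql
-- Sample data is an assumption for illustration.
DECLARE @xmldoc xml;
SET @xmldoc = N'<a><b attr="x">42</b></a>';

SELECT
    -- BAD: builds an intermediate XML instance, then casts twice
    CAST(CAST(@xmldoc.query('/a/b/text()') AS nvarchar(500)) AS int) AS ViaQuery,
    -- GOOD: extracts the value directly
    @xmldoc.value('(/a/b/text())[1]', 'int')                         AS ViaValue;

SELECT
    -- BAD: query('.') re-materializes the node before value() is applied
    n.query('.').value('(/b/@attr)[1]', 'nvarchar(50)') AS ViaQueryDot,
    -- GOOD: value() applied to the context node itself
    n.value('@attr', 'nvarchar(50)')                    AS Direct
FROM @xmldoc.nodes('/a/b') AS T(n);
```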
16. Optimal Use Of Methods
Grouping value() method
Group value() methods on the same XML instance next to each other if the path expressions in the value() methods are
• Simple path expressions that only use child and attribute axis and do not contain wildcards, predicates, node tests, ordinals
• Path expressions whose result can statically be inferred to be a singleton
The singleton can be statically inferred from
• the DOCUMENT and XML Schema Collection
• Relative paths on the context node provided by the nodes() method
Requires XML index to be present
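A sketch of the grouping rule, assuming a hypothetical table dbo.GroupDemo with a primary XML index. The table, index and column names are invented for this illustration.

```sql
-- dbo.GroupDemo and its index are hypothetical names for this sketch.
CREATE TABLE dbo.GroupDemo (id int IDENTITY PRIMARY KEY, note nvarchar(50), x xml);
CREATE PRIMARY XML INDEX pxi_GroupDemo ON dbo.GroupDemo (x);

INSERT INTO dbo.GroupDemo (note, x)
VALUES (N'first', N'<doc><customer id="1" name="Joe"/></doc>');

SELECT
    d.id,
    -- value() calls on the same context node are kept next to each other;
    -- '@id' and '@name' are simple relative paths that are static singletons
    c.value('@id',   'int')          AS CustID,
    c.value('@name', 'nvarchar(50)') AS CName,
    d.note                           -- relational column placed after the group
FROM dbo.GroupDemo AS d
CROSS APPLY d.x.nodes('/doc/customer') AS N(c);
```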
17. Optimal Use of Methods
Using the right method to join and compare
Use the exist() method with sql:column()/sql:variable() and an XQuery comparison to check for a value or to join when secondary XML indices are present
BAD:*
select doc
from doc_tab join authors
on doc.value('(/doc/mainauthor/lname/text())[1]',
'nvarchar(50)') = lastname
GOOD:
select doc
from doc_tab join authors
on 1 = doc.exist('/doc/mainauthor/lname/text()[. =
sql:column("lastname")]')
* If applied to an XML variable, or if no index is present, the value() method is usually more efficient
18. Optimal Use of Methods
Avoiding bad costing with nodes()
nodes() without XML index is a Table-valued function (details later)
Bad cardinality estimates can lead to bad plans
• BAD:
select c.value('@id', 'int') as CustID
, c.value('@name', 'nvarchar(50)') as CName
from Customer, @x.nodes('/doc/customer') as N(c)
where Customer.ID = c.value('@id', 'int')
• BETTER (if only one wrapper doc element):
select c.value('@id', 'int') as CustID
, c.value('@name', 'nvarchar(50)') as CName
from Customer, @x.nodes('/doc[1]') as D(d)
cross apply d.nodes('customer') as N(c)
where Customer.ID = c.value('@id', 'int')
Use temp table (insert into #temp select … from nodes()) or Table-valued parameter instead of XML to get better estimates
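The temp-table suggestion could look roughly like the sketch below. #Customer stands in for the Customer table from the slide, and its Region column and contents are invented; #shredded is likewise a hypothetical name.

```sql
DECLARE @x xml;
SET @x = N'<doc><customer id="1" name="Joe"/><customer id="2" name="Ann"/></doc>';

-- #Customer is a stand-in for the slide's Customer table (hypothetical contents).
CREATE TABLE #Customer (ID int PRIMARY KEY, Region nvarchar(20));
INSERT INTO #Customer (ID, Region) VALUES (1, N'West');
INSERT INTO #Customer (ID, Region) VALUES (2, N'East');

-- Shred once into a temp table so the join sees real row counts and statistics
CREATE TABLE #shredded (CustID int PRIMARY KEY, CName nvarchar(50));
INSERT INTO #shredded (CustID, CName)
SELECT c.value('@id', 'int'), c.value('@name', 'nvarchar(50)')
FROM @x.nodes('/doc/customer') AS N(c);

SELECT s.CustID, s.CName, cu.Region
FROM #Customer AS cu
JOIN #shredded AS s ON cu.ID = s.CustID;

DROP TABLE #shredded;
DROP TABLE #Customer;
```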
19. Optimal Use Of Methods
Avoiding multiple method evaluations
Use subqueries
• BAD:
SELECT CASE isnumeric (doc.value(
'(/doc/customer/order/price)[1]', 'nvarchar(32)'))
WHEN 1 THEN doc.value(
'(/doc/customer/order/price)[1]', 'decimal(5,2)')
ELSE 0 END
FROM T
• GOOD:
SELECT CASE isnumeric (Price)
WHEN 1 THEN CAST(Price as decimal(5,2))
ELSE 0 END
FROM (SELECT doc.value(
'(/doc/customer/order/price)[1]',
'nvarchar(32)') as Price FROM T) X
Use subqueries also with NULLIF()
20. Combined SQL And XQuery/DML Processing
[Figure: processing pipeline; layout lost in extraction. Recoverable stages for a statement such as SELECT x.query('…'), y FROM T WHERE …:]
Static Phase
• SQL Parser and XQuery Parser
• Static Typing of both parts, using Metadata and the XML Schema Collection
• Algebrization of both parts
• Static Optimization of combined Logical and Physical Operation Tree
Dynamic Phase
• Runtime Optimization and Execution of physical Op Tree, using XML and rel. Indices
21. New XQuery Algebra Operators
XML Reader TVF
Table-Valued Function XML Reader UDF with XPath Filter
Used if no Primary XML Index is present
Creates node table rowset in query flow
Multiple XPath filters can be pushed in to reduce node table
to subtree
Base cardinality estimate is always 10’000 rows!
Some adjustment based on pushed path filters
XMLReader node table format example (simplified)
ID | TAG ID | Node | Type-ID | VALUE | HID
1.3.1 | 4 (TITLE) | Element | 2 (xs:string) | Bad Bugs | #title#section#book
22. New XQuery Algebra Operators
UDX
• Serializer UDX: serializes the query result as XML
• XQuery String UDX: evaluates the XQuery string() function
• XQuery Data UDX: evaluates the XQuery data() function
• Check UDX: validates XML being inserted
• UDX name visible in SSMS properties window
23. Optimal Use Of XQuery
Atomization of nodes
Value comparisons, XQuery casts and value() method
casts require atomization of item
• attribute:
/person[@age = 42]
/person[data(@age) = 42]
• Atomic typed element:
/person[age = 42] /person[data(age) = 42]
• Untyped, mixed content typed element (adds UDX):
/person[age = 42] /person[data(age) = 42]
/person[string(age) = 42]
• If only one text node for untyped element (better):
/person[age/text() = 42]
/person[data(age/text()) = 42]
• value() method on untyped elements:
value('/person/age', 'int')
value('/person/age/text()', 'int')
string() aggregates all text nodes and prohibits index use
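A sketch of the atomization variants on an untyped element, assuming made-up sample data in a variable. The forms return the same answers here; the text() forms are the ones recommended above when the element has a single text node.

```sql
-- Sample data is an assumption for illustration.
DECLARE @x xml;
SET @x = N'<person><age>42</age></person>';

SELECT
    -- Atomizes the whole element (may add a data() UDX for untyped elements)
    @x.exist('/person[age = 42]')              AS ElementCompare,
    -- Atomizes only the single text node
    @x.exist('/person[age/text() = 42]')       AS TextNodeCompare,
    -- value() on the text node, made a static singleton with ( ... )[1]
    @x.value('(/person/age/text())[1]', 'int') AS AgeValue;
```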
24. Optimal Use Of XQuery
Casting Values
Value comparisons require casts and type promotion
• Untyped attribute:
/person[@age = 42] /person[xs:decimal(@age) = 42]
• Untyped text node():
/person[age/text() = 42]
/person[xs:decimal(age/text()) = 42]
• Typed element (typed as xs:int):
/person[salary = 3e4] /person[xs:double(salary) = 3e4]
Casting is expensive and prohibits index lookup
Tips to avoid casting
• Use appropriate types for comparison (string for untyped)
• Use schema to declare type
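The tip about using string comparisons for untyped data can be sketched as below, with invented sample data. Note that a string comparison is not semantically identical to a numeric one (for example, "042" does not equal "42" as a string), so this only applies when the lexical form of the data is known.

```sql
-- Sample data is an assumption for illustration.
DECLARE @x xml;
SET @x = N'<person age="42"><age>42</age></person>';

SELECT
    -- Untyped values compared as strings: no cast needed
    @x.exist('/person[@age = "42"]')           AS AttrAsString,
    @x.exist('/person[age/text() = "42"]')     AS TextAsString,
    -- Explicit cast spelled out: same answer here, but the cast has a cost
    @x.exist('/person[xs:decimal(@age) = 42]') AS AttrExplicitCast;
```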
25. Optimal Use Of XQuery
Maximize XPath expressions
Single paths are more efficient than twig paths
Avoid predicates in the middle of path expressions
/book[@ISBN = "1-8610-0157-6"]/author[first-name = "Davis"]
is evaluated as the intersection ("∩") of
/book[@ISBN = "1-8610-0157-6"]
and
/book/author[first-name = "Davis"]
Move ordinals to the end of path expressions
• Make sure you get the same semantics!
• /a[1]/b[1] ≠ (/a/b)[1] ≠ /a/b[1]
• (/book/@isbn)[1] is better than /book[1]/@isbn
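The three ordinal placements really do select different nodes. A minimal sketch with made-up data chosen so that all three differ (an XML variable may hold several top-level elements):

```sql
-- Sample data is an assumption, chosen so that the three paths differ.
DECLARE @x xml;
SET @x = N'<a><c/></a><a><b>1</b><b>2</b></a><a><b>3</b></a>';

SELECT
    @x.query('/a[1]/b[1]') AS FirstBOfFirstA,  -- first <a> has no <b>: no nodes
    @x.query('(/a/b)[1]')  AS FirstBOverall,   -- <b>1</b>
    @x.query('/a/b[1]')    AS FirstBOfEachA;   -- <b>1</b><b>3</b>
```

Moving the ordinal to the end of the path is therefore only safe when the data guarantees that both forms pick the same node.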
26. Optimal Use Of XQuery
Maximize XPath expressions in exist()
Use context item in predicate to lengthen path in exist()
• Existential quantification makes returned node irrelevant
• BAD:
SELECT * FROM docs WHERE 1 = xCol.exist
('/book/subject[text() = "security"]')
• GOOD:
SELECT * FROM docs WHERE 1 = xCol.exist
('/book/subject/text()[. = "security"]')
• BAD:
SELECT * FROM docs WHERE 1 = xCol.exist
('/book[@price > 9.99 and @price < 49.99]')
• GOOD:
SELECT * FROM docs WHERE 1 = xCol.exist
('/book/@price[. > 9.99 and . < 49.99]')
This does not work with or-predicate
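The BAD and GOOD pairs above return the same truth value, which can be checked with a sketch on a variable (sample data invented). The benefit of the longer path only shows up in the plan when XML indexes are present.

```sql
-- Sample data is an assumption for illustration.
DECLARE @x xml;
SET @x = N'<book price="19.99"><subject>security</subject></book>';

SELECT
    @x.exist('/book/subject[text() = "security"]')       AS SubjectPredicate,
    @x.exist('/book/subject/text()[. = "security"]')     AS SubjectContextItem,
    @x.exist('/book[@price > 9.99 and @price < 49.99]')  AS PricePredicate,
    @x.exist('/book/@price[. > 9.99 and . < 49.99]')     AS PriceContextItem;
```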
27. Optimal Use Of XQuery
Inefficient operations: Parent axis
Most frequent offender: parent axis with nodes()
• BAD:
select o.value('../@id', 'int') as CustID
, o.value('@id', 'int') as OrdID
from T
cross apply x.nodes('/doc/customer/orders') as N(o)
• GOOD:
select c.value('@id', 'int') as CustID
, o.value('@id', 'int') as OrdID
from T cross apply x.nodes('/doc/customer') as N1(c)
cross apply c.nodes('orders') as N2(o)
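A self-contained variant of the GOOD form, assuming an XML variable with invented sample data instead of table T:

```sql
-- Sample data is an assumption for illustration.
DECLARE @x xml;
SET @x = N'<doc>
  <customer id="1"><orders id="10"/><orders id="11"/></customer>
  <customer id="2"><orders id="20"/></customer>
</doc>';

-- The outer nodes() call fixes the customer; the inner one walks down to orders,
-- so no parent-axis step (../@id) is needed
SELECT
    c.value('@id', 'int') AS CustID,
    o.value('@id', 'int') AS OrdID
FROM @x.nodes('/doc/customer') AS N1(c)
CROSS APPLY c.nodes('orders')  AS N2(o);
```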
28. Optimal Use Of XQuery
Inefficient operations
Avoid descendant axes and // in the middle of path
expressions if the data structure is known.
• // can still use the HID lookup, but is less efficient
XQuery construction performs worse than FOR XML
• BAD:
SELECT notes.query('
<Customer cid="{sql:column(''cid'')}">{
<name>{sql:column("name")}</name>, /
}</Customer>')
FROM Customers WHERE cid=1
• GOOD:
SELECT cid as "@cid", name, notes as "*"
FROM Customers WHERE cid=1
FOR XML PATH('Customer'), TYPE
29. Optimal Use Of FOR XML
Use TYPE directive when assigning result to XML
• BAD:
declare @x xml;
set @x =
(select * from Customers for xml raw);
• GOOD:
declare @x xml;
set @x =
(select * from Customers for xml raw, type);
Use FOR XML PATH for complex grouping and additional
hierarchy levels over FOR XML EXPLICIT
Use FOR XML EXPLICIT for complex nesting if FOR XML PATH
performance is not appropriate
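As a sketch of adding a hierarchy level with FOR XML PATH, the example below nests orders under customers through a correlated subquery with the TYPE directive. The table variables and their columns are invented for this illustration.

```sql
-- @Customers and @Orders are hypothetical tables for this sketch.
DECLARE @Customers TABLE (cid int, name nvarchar(50));
DECLARE @Orders    TABLE (oid int, cid int);
INSERT INTO @Customers VALUES (1, N'Joe');
INSERT INTO @Orders    VALUES (10, 1);
INSERT INTO @Orders    VALUES (11, 1);

SELECT
    c.cid  AS "@cid",                    -- becomes an attribute of <Customer>
    c.name,                              -- becomes a <name> child element
    (SELECT o.oid AS "@id"
     FROM @Orders AS o
     WHERE o.cid = c.cid
     FOR XML PATH('Order'), TYPE) AS Orders  -- nested XML under an <Orders> wrapper
FROM @Customers AS c
FOR XML PATH('Customer'), TYPE;
```

Without TYPE on the inner query, the nested result would be returned as a string and entitized instead of being embedded as XML.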
30. XML Indices
Create XML index on XML column
CREATE PRIMARY XML INDEX idx_1 ON docs (xDoc)
Create secondary indexes on tags, values, paths
Creation:
• Single-threaded only for primary XML index
• Multi-threaded for secondary XML indexes
Uses:
• Primary Index will always be used if defined (not a cost
based decision)
• Results can be served directly from index
• SQL’s cost based optimizer will consider secondary indexes
Maintenance:
• Primary and Secondary Indices will be efficiently maintained
during updates
• Only subtree that changes will be updated
• No online index rebuild
• Clustered key may lead to non-linear maintenance cost
Schema revalidation still checks whole instance
31. Example Index Contents
insert into Person values (42,
'<book ISBN="1-55860-438-3">
<section>
<title>Bad Bugs</title>
Nobody loves bad bugs.
</section>
<section>
<title>Tree Frogs</title>
All right-thinking people
<bold>love</bold> tree frogs.
</section>
</book>')
32. Primary XML Index
CREATE PRIMARY XML INDEX PersonIdx ON Person (Pdesc)
PK | XID | TAG ID | Node | Type-ID | VALUE | HID
42 | 1 | 1 (book) | Element | 1 (bookT) | null | #book
42 | 1.1 | 2 (ISBN) | Attribute | 2 (xs:string) | 1-55860-438-3 | #@ISBN#book
42 | 1.3 | 3 (section) | Element | 3 (sectionT) | null | #section#book
42 | 1.3.1 | 4 (title) | Element | 2 (xs:string) | Bad Bugs | #title#section#book
42 | 1.3.3 | -- | Text | -- | Nobody loves bad bugs. | #text()#section#book
42 | 1.5 | 3 (section) | Element | 3 (sectionT) | null | #section#book
42 | 1.5.1 | 4 (title) | Element | 2 (xs:string) | Tree Frogs | #title#section#book
42 | 1.5.3 | -- | Text | -- | All right-thinking people | #text()#section#book
42 | 1.5.5 | 7 (bold) | Element | 4 (boldT) | love | #bold#section#book
42 | 1.5.7 | -- | Text | -- | tree frogs | #text()#section#book
Assumes typed data; Columns and Values are simplified, see VLDB 2004 paper for details
33. Secondary XML Indices
[Figure: index layout diagram; layout lost in extraction. Recoverable content:]
• XML column in table T(id, x): each row holds one Binary XML instance
• Primary XML Index (1 per XML column), clustered on Primary Key (of table T), XID
• Primary XML Index columns: PK, XID, NID, TID, VALUE, LVALUE, HID, xsinil, …
• Non-clustered Secondary Indices (n per primary Index): Value Index, Property Index, Path Index
35. Takeaway: XML Indices
PRIMARY XML Index – Use when there are lots of XQuery queries
FOR VALUE – Useful for queries where values are more selective than paths, such as //*[.="Seattle"]
FOR PATH – Useful for path expressions: avoids joins by mapping paths to hierarchical index (HID) numbers. Example: /person/address/zip
FOR PROPERTY – Useful when the optimizer chooses another index (for example, on a relational column, or FT Index) in addition, so the row is already known
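The index types above map onto DDL roughly as in this sketch. The table dbo.IndexDemo and all index names are hypothetical; the primary XML index requires a clustered primary key on the base table.

```sql
-- dbo.IndexDemo and the index names are hypothetical.
CREATE TABLE dbo.IndexDemo (id int IDENTITY PRIMARY KEY, xDoc xml);

-- The primary XML index must exist before any secondary XML index
CREATE PRIMARY XML INDEX pxi_IndexDemo ON dbo.IndexDemo (xDoc);

-- Secondary indexes are built over the primary XML index
CREATE XML INDEX sxi_IndexDemo_path  ON dbo.IndexDemo (xDoc)
    USING XML INDEX pxi_IndexDemo FOR PATH;
CREATE XML INDEX sxi_IndexDemo_value ON dbo.IndexDemo (xDoc)
    USING XML INDEX pxi_IndexDemo FOR VALUE;
CREATE XML INDEX sxi_IndexDemo_prop  ON dbo.IndexDemo (xDoc)
    USING XML INDEX pxi_IndexDemo FOR PROPERTY;
```

In practice only the secondary indexes that match the workload would be created, since each one adds storage and maintenance cost.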
36. Shredding Approaches
Approach | Complex Shapes | Bulkload | Server vs Midtier | Business logic | Programming | Scale/Performance
SQLXML Bulkload with annotated schema | Yes with limits | Yes | midtier | staging tables on server, XSLT on midtier | annotated XSD and small API | very good/very good
ADO.Net DataSet | No | No | midtier | midtier, SSIS | DataSet API or SSIS | good/good
CLR Table-valued function | Yes | No | Server or midtier | Server or midtier | C#, VB custom code | limited/good
OpenXML | Yes | No | Server | T-SQL | declarative T-SQL, XPath against variable | limited/good
nodes() | Yes | No | Server | T-SQL | declarative SQL, XQuery against var or table | good/careful
37. To Promote or Not Promote…
Promotion pre-calculates paths
Requires relational query
• XQuery does not know about promotion
Promotion during loading of the data
• Using any of the shredding mechanisms
• 1-to-1 or 1-to-many relationships
Promotion using computed columns
• 1-to-1 only
• Persist computed column: Fast lookup and retrieval
• Relational index on persisted computed column: Fast lookup
• Have to be precise
Promotion using Triggers
• 1-to-1 or 1-to-many relationships
• Trigger overhead
Relational View over XML data
• Filters on relational view are not pushed down due to different type/value system
38. Promotion using computed columns
Use a schema-bound UDF that encapsulates XQuery
Persist computed column
• Fast lookup and retrieval
Relational index on persisted computed column
• Fast lookup
Query will have to use the schema-bound UDF to match
CAVEAT: No parallel plans with a persisted computed
column based on a UDF
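The promotion pattern described above could be sketched as follows. The table dbo.PromoDemo, the function dbo.udf_GetIsbn, the promoted ISBN attribute and the index name are all assumptions made for this illustration.

```sql
-- All object names here are hypothetical.
CREATE TABLE dbo.PromoDemo (id int IDENTITY PRIMARY KEY, xCol xml);
GO

-- Schema-bound scalar UDF that encapsulates the XQuery
CREATE FUNCTION dbo.udf_GetIsbn (@x xml)
RETURNS nvarchar(20)
WITH SCHEMABINDING
AS
BEGIN
    RETURN @x.value('(/book/@ISBN)[1]', 'nvarchar(20)');
END;
GO

-- Persisted computed column plus a relational index on it
ALTER TABLE dbo.PromoDemo ADD isbn AS dbo.udf_GetIsbn(xCol) PERSISTED;
CREATE INDEX ix_PromoDemo_isbn ON dbo.PromoDemo (isbn);
GO

-- The query is written relationally, using the same UDF expression (or the column)
SELECT id
FROM dbo.PromoDemo
WHERE dbo.udf_GetIsbn(xCol) = N'1-55860-438-3';
```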
39. Use of Full-Text Index for Optimization
Can provide improvement for XQuery contains() queries
Query for documents where section title contains “optimization”
Use Fulltext index to prefilter candidates (includes false positives)
SELECT * FROM docs
WHERE contains(xCol, 'optimization')
AND 1 = xCol.exist('
/book/section/title/text()[contains(.,"optimization")]
')
40. Futures: Selective XML Index
CREATE SELECTIVE XML INDEX pxi_index ON Tbl(xmlcol)
FOR (
-- the first four match XQuery predicates
-- in all XML data type methods
-- simple flavor - default mapping (xs:untypedAtomic),
-- no optimization hints
node42 = '/a/b',
pathatc = '/a/b/c/@atc',
-- advanced flavor - use of optimization hints
path02 = '/a/b/c' as XQUERY 'xs:string' MAXLENGTH(25),
node13 = '/a/b/d' as XQUERY 'xs:double' SINGLETON,
-- the next two match value() method
-- require regular SQL Server type semantics
-- they can be mixed with the XQUERY ones
-- specifying a type is mandatory for the SQL type semantics
pathfloat = '/a/b/c' as SQL FLOAT,
pathabd = '/a/b/d' as SQL VARCHAR(200)
)
41. Session Takeaways
• Understand when and how to use XML in SQL Server
• Understand and correct common performance problems with XML and XQuery
• Shred "relational" XML to relations
• Use XML datatype for semistructured and markup scenarios
• Write your XQueries so that XML Indices can be used
• Use persisted computed columns to promote XQuery results (with caveat)
45. Thank you
for attending this session and the
2011 PASS Summit in Seattle
October 11-14, Seattle, WA