Schemaless Change detection in XML Documents using Semantic Identifiers


Change Detection in XML Documents
using Semantic Identifiers
BY
KAILAASH BALACHANDRAN

Outline

• Motivation
• Introduction
• The Approach
   • 2-step Algorithm
   • Identifiers
   • Axioms
• Semantic Change Detection
   • Finding Identifiers
   • Matching Nodes
• Examples
• Conclusion

Motivation(1)

Fig.1. Version 1

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>

Fig.2. Version 2

<author>
  <book>
    <salesprice>$35</salesprice>
    <isbn>0385504209</isbn>
  </book>
  <book>
    <title>Angels & Demons</title>
    <price>$56</price>
  </book>
</author>


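Motivation(2)

Fig.1. Version 1

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>

Fig.3. Version 3

<publisher>Doubleday
  <book>
    <author>
    </author>
    <price> $35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <author>
    </author>
    <price> $56</price>
  </book>
</publisher>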
Motivation(3)
Disadvantages of the structural change detection approach:

• It is difficult to associate elements across different versions.
• It breaks down when the changes are significant.
• It hampers incremental evaluation.
• It makes changes to the data costly to handle.

Introduction
What is Semantics-Based Change Detection?
A process of identifying changes between successive versions of a document
based on its semantics rather than on its structure.
The Approach:
1. Find a semantic identifier for each node in the XML model.
2. Use these identifiers to associate nodes across multiple versions.

Identifiers

• The type of an element is the list of labels on the path from the root to the element, separated by '/'.
• An identifier serves to distinguish elements of the same type.
• Two nodes x and y are semantically the same if and only if their identifiers evaluate to
the same result:
Eval(x, L) = Eval(y, L)

where
• x and y are the nodes,
• L = {E1, E2, …, En} is the list of expressions.
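The definitions above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation behind the slides: the helper names (parent_map, node_type, eval_identifier) are hypothetical, expressions are restricted to child paths ending in text(), and the sample document is Version 1 from Fig.1.

```python
import xml.etree.ElementTree as ET

# Version 1 from Fig.1 (prices omitted for brevity).
DOC = """
<author>
  <name>Dan Brown</name>
  <book><title>The Da Vinci Code</title><publisher>Doubleday</publisher></book>
  <book><title>Angels and Demons</title><publisher>Pocket Star</publisher></book>
</author>
"""

def parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent pointers)."""
    return {child: parent for parent in root.iter() for child in parent}

def node_type(node, parents):
    """Type: the labels on the path from the root to the node, joined by '/'."""
    labels = [node.tag]
    while node in parents:
        node = parents[node]
        labels.append(node.tag)
    return "/".join(reversed(labels))

def eval_identifier(node, expressions):
    """Eval(x, L): evaluate every expression in L with x as the context node.

    Assumption: each expression has the form "a/b/text()" (child steps only).
    """
    values = []
    for expr in expressions:
        path = expr[:-len("/text()")] if expr.endswith("/text()") else expr
        values.append((node.findtext(path) or "").strip())
    return tuple(values)

root = ET.fromstring(DOC)
parents = parent_map(root)
books = root.findall("book")
L = ["title/text()"]                      # hypothetical identifier for <book>
ids = [eval_identifier(b, L) for b in books]
same = ids[0] == ids[1]                   # Eval(x, L) = Eval(y, L)?
print(node_type(books[0], parents), ids, same)
```

Both books have the same type, and the identifier tells them apart because their titles differ.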


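Identifiers

Local Identifier: An identifier is local if it evaluates to descendants of the context
node; otherwise it is non-local.

Version 1: <name> is local

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>

Version 3: <name> is non-local

<publisher>Doubleday
  <book>
    <author>
    </author>
    <price> $35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <author>
    </author>
    <price> $56</price>
  </book>
</publisher>

Identify nodes based on their semantics.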
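One way to tell local from non-local identifiers is a purely syntactic check on the expressions. The sketch below assumes identifiers are written as relative paths built from child steps and ".." parent steps, as in the expressions (../author/name/text(), title/text()) shown later for <book>. The function names are hypothetical, and the check is conservative: any expression that climbs to a parent is treated as non-local.

```python
def is_local(expr: str) -> bool:
    """True if the expression can only reach descendants of the context node.

    Assumption: expr is a path of '/'-separated steps; a leading '/' (absolute
    path) or any '..' (parent) step is taken to leave the context node's subtree.
    """
    steps = expr.split("/")
    return not expr.startswith("/") and ".." not in steps

def identifier_is_local(expressions) -> bool:
    """An identifier is local only if every one of its expressions is local."""
    return all(is_local(e) for e in expressions)

book_identifier = ["../author/name/text()", "title/text()"]
print([is_local(e) for e in book_identifier])
print(identifier_is_local(book_identifier))
```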

The Algorithm
Phase 1:
• Works bottom-up.
• Identifies all local identifiers.
• Identifies semantically different nodes.
Phase 2:
• Runs recursively and identifies non-local identifiers.
• Finds all semantically distinct nodes.
Any remaining node is a redundant copy of another node in the document.

Semantics(Phase 1)
Axiom 1: Nodes that are structurally different are semantically different.

<publisher>Doubleday
  <book>
    <author>
    </author>
    <price> $35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <author>
    </author>
    <price> $56</price>
  </book>
</publisher>

Nodes that differ structurally, such as the two <publisher> nodes or the two <price> nodes, are semantically different.

The two <author> nodes are structurally identical. Are they semantically the same?

Semantics(Phase 2)
Axiom 2: Nodes that are structurally identical are semantically identical
if and only if their respective parents are semantically identical or
they are both root nodes.

No, because they are in the context of two different books.
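The two axioms can be read as a recursive test. The following is a rough sketch under assumptions that are not in the slides: structural identity is taken to mean equal tag, equal trimmed text, and equal children regardless of order; the <catalog> wrapper element and the author name inside <author> are added only so the Version 3 fragment parses as one document with non-empty authors.

```python
import xml.etree.ElementTree as ET

DOC = """
<catalog>
  <publisher>Doubleday
    <book><author><name>Dan Brown</name></author><price>$35</price></book>
  </publisher>
  <publisher>Pocket Star
    <book><author><name>Dan Brown</name></author><price>$56</price></book>
  </publisher>
</catalog>
"""

def canonical(node):
    """Structural signature: tag, trimmed text, sorted signatures of children.

    Assumption: child order and surrounding whitespace are not significant.
    """
    text = (node.text or "").strip()
    return (node.tag, text, tuple(sorted(canonical(c) for c in node)))

def semantically_same(x, y, parents):
    # Axiom 1: structurally different nodes are semantically different.
    if canonical(x) != canonical(y):
        return False
    # Axiom 2: structurally identical nodes are semantically identical iff
    # both are roots or their parents are semantically identical.
    px, py = parents.get(x), parents.get(y)
    if px is None and py is None:
        return True
    if px is None or py is None:
        return False
    return semantically_same(px, py, parents)

root = ET.fromstring(DOC)
parents = {child: parent for parent in root.iter() for child in parent}
a1, a2 = root.findall("publisher/book/author")
structurally_identical = canonical(a1) == canonical(a2)
same = semantically_same(a1, a2, parents)
print(structurally_identical, same)
```

The two authors share a structure, but their parent books differ in price, so the recursion stops at the books and reports the authors as semantically different.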

How to handle structural changes?

[Figure: example trees for Version 1 and Version 2, with nodes labeled A, X, Y, and Z]

Assumption: Identifying information will remain nearby.

• Type Territory: The territory of a type T is the set of all text nodes that are
descendants of the least common ancestor (lca) of all of the type T nodes.
• Within the type territory is the territory controlled by individual nodes of that
type.
• Node Territory: The territory of a type T node p is the type territory of T
excluding all text nodes that are descendants of other type T nodes.
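The two territory definitions can be computed directly from a parsed document. The sketch below is illustrative only: the function names are hypothetical, a text node is represented by the element that holds the text (ElementTree has no separate text nodes, and tail text is ignored), and the sample document is Version 2 of the bib example used in the following slides.

```python
import xml.etree.ElementTree as ET

V2 = """
<bib>
  <pub>p1
    <book><title>t1</title><author><name>n1</name></author></book>
  </pub>
  <pub>p2
    <book><title>t2</title><author><name>n2</name></author></book>
  </pub>
</bib>
"""

def path_from_root(node, parents):
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]                      # root first

def nodes_of_type(root, type_path, parents):
    return [n for n in root.iter()
            if "/".join(a.tag for a in path_from_root(n, parents)) == type_path]

def lca(nodes, parents):
    """Least common ancestor: the last node shared by all root-to-node paths."""
    common = None
    for level in zip(*(path_from_root(n, parents) for n in nodes)):
        if any(n is not level[0] for n in level):
            break
        common = level[0]
    return common

def type_territory(root, type_path):
    """Elements holding text below the lca of all nodes of the given type."""
    parents = {c: p for p in root.iter() for c in p}
    members = nodes_of_type(root, type_path, parents)
    top = lca(members, parents)
    return [e for e in top.iter() if e.text and e.text.strip()]

def node_territory(root, type_path, p):
    """Type territory minus the text that lies under other nodes of the type."""
    parents = {c: q for q in root.iter() for c in q}
    members = nodes_of_type(root, type_path, parents)
    excluded = {e for m in members if m is not p for e in m.iter()}
    return [e for e in type_territory(root, type_path) if e not in excluded]

def values(elements):
    return [e.text.strip() for e in elements]

root = ET.fromstring(V2)
book1, book2 = root.findall("pub/book")
tt = values(type_territory(root, "bib/pub/book"))
nt = values(node_territory(root, "bib/pub/book", book1))
print(tt)
print(nt)
```

For the first book, the node territory keeps both publisher names and the book's own title and author, and drops the title and author that belong to the second book.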

Node and Type Territory

[Figure: a tree under the document root. The subtree below lca(p) is the type territory of p; inside it are the node territories of the individual nodes p1, p2, and p3.]

Finding Identifiers

Version 1:

<bib>
  <author>
    <name>n1</name>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
  <author>
    <name>n2</name>
    <book>
      <title>t2</title>
      <publisher>p2</publisher>
    </book>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
</bib>

Version 2:

<bib>
  <pub> p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </pub>
  <pub> p2
    <book>
      <title>t2</title>
      <author>
        <name>n2</name>
      </author>
    </book>
  </pub>
</bib>
Identifiers

Identifier for <book> in Version 1:

Node    Identifier
book    (../author/name/text(), title/text())
Identifiers

Values of Identifiers for <book> in Version 1

Value of identifier for the first <book> (title t1, under the author named n1) = n1, t1




Identifiers

Values of Identifiers for <book> in Version 2

<bib>
  <pub> p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </pub>
  <pub> p2
    <book>
      <title>t2</title>
      <author>
        <name>n2</name>
      </author>
    </book>
  </pub>
</bib>

Value of identifier for the first <book> = p1, t1
Value of identifier for the second <book> = p2, t2

Identifiers

Values of Identifiers for <book> in both versions:

Version 1                        Version 2
Node             Identifier      Node              Identifier
book (top)       n1, t1          book 1 (top)      p1, t1
book (middle)    n2, t2          book 2 (bottom)   p2, t2
book (bottom)    n2, t1

How to map both?
Matching

• Consider nodes p and q that reside in different versions Vp and Vq.
• Admits: q admits p if and only if q is in the node territory of p.
• Nodes p and q are matched if and only if p and q admit each other.

[Figure: node q in Vq and node p in Vp, each annotated with the values q1, q2, …, qn]
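The matching rule can be tried on the two bib versions. The sketch below rests on one interpretive assumption that the slides do not spell out: "q is in the node territory of p" is read as "every value of q's identifier occurs among the text values in the node territory of p". All function names are hypothetical, and the identifier values are computed as on the earlier slides: (author name, title) in Version 1 and (publisher, title) in Version 2.

```python
import xml.etree.ElementTree as ET

V1 = """
<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>
"""
V2 = """
<bib>
  <pub>p1
    <book><title>t1</title><author><name>n1</name></author></book>
  </pub>
  <pub>p2
    <book><title>t2</title><author><name>n2</name></author></book>
  </pub>
</bib>
"""

def parent_map(root):
    return {c: p for p in root.iter() for c in p}

def territory_values(root, members, p):
    """Text values in the node territory of p; members are all nodes of p's type."""
    parents = parent_map(root)
    def path(n):
        out = [n]
        while out[-1] in parents:
            out.append(parents[out[-1]])
        return out[::-1]
    common = None                                   # lca of all members
    for level in zip(*(path(m) for m in members)):
        if any(n is not level[0] for n in level):
            break
        common = level[0]
    excluded = {e for m in members if m is not p for e in m.iter()}
    return {e.text.strip() for e in common.iter()
            if e not in excluded and e.text and e.text.strip()}

def admits(identifier_values, territory):
    # Assumed reading of "admits": all identifier values lie in the territory.
    return set(identifier_values) <= territory

def match(root_p, nodes_p, ids_p, root_q, nodes_q, ids_q):
    """Pairs (i, j) such that nodes_p[i] and nodes_q[j] admit each other."""
    pairs = []
    for i, p in enumerate(nodes_p):
        terr_p = territory_values(root_p, nodes_p, p)
        for j, q in enumerate(nodes_q):
            terr_q = territory_values(root_q, nodes_q, q)
            if admits(ids_q[j], terr_p) and admits(ids_p[i], terr_q):
                pairs.append((i, j))
    return pairs

r1, r2 = ET.fromstring(V1), ET.fromstring(V2)
books1, books2 = r1.findall("author/book"), r2.findall("pub/book")
pm1, pm2 = parent_map(r1), parent_map(r2)
ids1 = [(pm1[b].findtext("name").strip(), b.findtext("title").strip()) for b in books1]
ids2 = [(pm2[b].text.strip(), b.findtext("title").strip()) for b in books2]
pairs = match(r1, books1, ids1, r2, books2, ids2)
print(ids1, ids2, pairs)
```

Under this reading, the top and middle books of Version 1 match the two books of Version 2, while the bottom book of Version 1 (n2, t1) finds no partner because n2 does not occur in the node territory of the first Version 2 book.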
Book matches:

[Figure: Version 1 and Version 2 of the bib document drawn as trees. In Version 1, bib contains author nodes, each with a name and book nodes holding a title and a pub. In Version 2, bib contains pub nodes, each with a book holding a title and an author with a name. The build steps mark the admits relation and the resulting node matches between book nodes.]

Author matches:

[Figure: the same Version 1 and Version 2 trees, marking matches between author nodes.]
Conclusion

• Semantic change detection technique:
   • Find identifiers for each node in the XML document.
   • Associate nodes across versions.
• Information that identifies an element is conserved across changes.
• Time complexity is O(n log n).
• We can match nodes even when structural changes are significant.
