Change detection is the process of comparing successive versions of a document to identify what has changed. The success of XML as the standard for data exchange has paved the way for a number of change detection techniques that focus on structural changes rather than on semantics. Existing structural change detection mechanisms tend to break down when the changes are large. This paper discusses a schema-less, semantics-based framework that associates semantic identifiers with elements in successive versions, so that elements can still be associated efficiently even when the structural change is significant.
5. Motivation(2)
Fig.1. Version 1
<author>
<name>Dan Brown</name>
<book>
<title>The Da Vinci Code</title>
<publisher>Doubleday</publisher>
<price> $35 </price>
</book>
<book>
<title>Angels and Demons</title>
<publisher>Pocket Star</publisher>
<price> $56</price>
</book>
</author>
Fig.3. Version 3
<publisher>Doubleday
<book>
<title>The Da Vinci Code</title>
<author>
<name>Dan Brown</name>
</author>
<price> $35</price>
</book>
</publisher>
<publisher>Pocket Star
<book>
<title>Angels and Demons</title>
<author>
<name>Dan Brown</name>
</author>
<price> $56</price>
</book> </publisher>
7. Motivation(3)
Disadvantages of the structural detection approach:
Difficult to associate elements across different versions.
Breaks down when the changes are significant.
Affects incremental evaluation.
High cost of changes to data.
8. Introduction
What is Semantics-Based Change Detection?
A process of identifying changes between successive versions of a document based on its semantics, rather than on the structure of the document.
The Approach:
1. Find a semantic identifier for each node in the XML model.
2. Use these identifiers to associate nodes across multiple versions.
9. Identifiers
Type: the list of labels from the root to the element, separated by '/'.
Identifier: serves to distinguish elements of the same type.
Two nodes x and y are semantically the same if and only if their identifiers evaluate to the same result:
Eval(x, L) = Eval(y, L)
where
• x, y are the nodes,
• L = {E1, E2, …, En} is the list of expressions.
[Figure: node x and node y evaluate to the same result]
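As a minimal sketch of the definition above, the Python below computes each element's type and evaluates an identifier as a list of path expressions relative to a node. The function names (`walk_types`, `eval_identifier`, `semantically_same`) and the choice of `title` and `publisher` as the identifier of a book are assumptions made for illustration; they do not come from the slides.

```python
import xml.etree.ElementTree as ET

VERSION_1 = """<author>
  <name>Dan Brown</name>
  <book><title>The Da Vinci Code</title><publisher>Doubleday</publisher></book>
  <book><title>Angels and Demons</title><publisher>Pocket Star</publisher></book>
</author>"""

def walk_types(elem, prefix=""):
    """Yield (type, element) pairs; a type is the label path from the root."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    yield path, elem
    for child in elem:
        yield from walk_types(child, path)

def eval_identifier(x, exprs):
    """Eval(x, L): evaluate every expression in L relative to node x."""
    return tuple(
        tuple((hit.text or "").strip() for hit in x.findall(e)) for e in exprs
    )

def semantically_same(x, y, exprs):
    """Nodes are the same iff their identifiers evaluate to the same result."""
    return eval_identifier(x, exprs) == eval_identifier(y, exprs)

# Hypothetical identifier for nodes of type author/book (an assumption).
L = ["title", "publisher"]

# Parse the same text twice to stand in for two versions of a document.
old = [e for t, e in walk_types(ET.fromstring(VERSION_1)) if t == "author/book"]
new = [e for t, e in walk_types(ET.fromstring(VERSION_1)) if t == "author/book"]

print(eval_identifier(old[0], L))
print(semantically_same(old[0], new[0], L))  # same book in both copies
print(semantically_same(old[0], old[1], L))  # two different books
```

Only local expressions are used here, because `findall` searches below the node it is called on; a non-local identifier would need access to ancestors as well.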
11. Identifiers
Local Identifier: An identifier is local if it evaluates to descendants of the context node; otherwise it is non-local.
Version 1: <name> is local
<author>
<name>Dan Brown</name>
<book>
<title>The Da Vinci Code</title>
<publisher>Doubleday</publisher>
<price> $35 </price>
</book>
<book>
<title>Angels and Demons</title>
<publisher>Pocket Star</publisher>
<price> $56</price>
</book>
</author>
Version 3: <name> is non-local
<publisher>Doubleday
<book>
<title>The Da Vinci Code</title>
<author>
<name>Dan Brown</name>
</author><price> $35</price>
</book>
</publisher>
<publisher>Pocket Star <book>
<title>Angels and Demons</title>
<author>
<name>Dan Brown</name>
</author><price> $56</price>
</book> </publisher>
12. Identify nodes based on its Semantics
The Algorithm
Phase 1:
Works in a bottom-up fashion.
Identifies all local identifiers.
Identifies semantically different nodes.
Phase 2:
Runs recursively and identifies non-local identifiers.
Finds all semantically distinct nodes.
Any remaining node is a redundant copy of another node in the document.
13. Identify nodes based on its Semantics (Phase 1)
Axiom 1: Nodes that are structurally different are semantically different.
<publisher>Doubleday
<book>
<title>The Da Vinci Code</title>
<author>
<name>Dan Brown</name>
</author>
</book>
</publisher>
<publisher>Pocket Star
<book>
<title>Angels and Demons</title>
<author>
<name>Dan Brown</name>
</author>
</book> </publisher>
Semantically different.
14. Identify nodes based on its Semantics (Phase 1)
Are the two <name>Dan Brown</name> nodes semantically the same?
15. Identify nodes based on its Semantics (Phase 2)
Axiom 2: Nodes that are structurally identical are semantically identical if and only if their respective parents are semantically identical or if they are both root nodes.
No, because they are in the context of two different books.
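The two axioms can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the algorithm from the slides: "structurally identical" is taken to mean that two subtrees have the same tags and the same text, and the example is wrapped in a hypothetical `<bib>` root so that it parses as a single document. The names `signature` and `semantically_identical` are likewise made up for this sketch.

```python
from xml.dom import minidom

DOC = """<bib>
<publisher>Doubleday<book><title>The Da Vinci Code</title>
<author><name>Dan Brown</name></author></book></publisher>
<publisher>Pocket Star<book><title>Angels and Demons</title>
<author><name>Dan Brown</name></author></book></publisher>
</bib>"""

def signature(node):
    """Canonical form of a subtree: tag plus the signatures of its children.

    ASSUMPTION: equal signatures stand for 'structurally identical'.
    """
    if node.nodeType == node.TEXT_NODE:
        return node.data.strip()
    kids = [signature(child) for child in node.childNodes]
    return (node.tagName, tuple(k for k in kids if k != ""))

def is_root(node):
    """The root element is the one whose parent is the document itself."""
    return node.parentNode.nodeType == node.DOCUMENT_NODE

def semantically_identical(x, y):
    # Axiom 1: structurally different nodes are semantically different.
    if signature(x) != signature(y):
        return False
    # Axiom 2: structurally identical nodes are semantically identical
    # iff both are roots or their parents are semantically identical.
    if is_root(x) or is_root(y):
        return is_root(x) and is_root(y)
    return semantically_identical(x.parentNode, y.parentNode)

names = minidom.parseString(DOC).getElementsByTagName("name")
print(signature(names[0]) == signature(names[1]))
print(semantically_identical(names[0], names[1]))
```

The two `<name>` nodes have equal signatures, and so do their `<author>` parents, but the enclosing `<book>` nodes differ in their titles, so the recursion reports the names as semantically different.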
16. Semantic Change Detection
How to handle structural changes?
[Figure: Version 1 tree with nodes A, X, Y, Z; Version 2 tree with nodes Y, X, Z]
Assumption: Identifying information will remain nearby.
17. Semantic Change Detection
Type Territory: The territory of a type T is the set of all text nodes that are descendants of the least common ancestor (lca) of all the type T nodes. Within the type territory is the territory controlled by individual nodes of that type.
Node Territory: The territory of a type T node p is the type territory of T, excluding all text nodes that are descendants of other type T nodes.
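The two definitions translate fairly directly into code. The sketch below uses Python's `xml.dom.minidom` and a small bibliography document as an example; the helper names (`type_of`, `lca`, `node_territory` and so on) are illustrative assumptions, and the lca of a single node is taken to be the node itself.

```python
from xml.dom import minidom

V1 = ("<bib><author><name>n1</name>"
      "<book><title>t1</title><publisher>p1</publisher></book></author>"
      "<author><name>n2</name>"
      "<book><title>t2</title><publisher>p2</publisher></book>"
      "<book><title>t1</title><publisher>p1</publisher></book></author></bib>")

def type_of(node):
    """Type = labels from the root down to the element, joined by '/'."""
    labels = []
    while node.nodeType == node.ELEMENT_NODE:
        labels.append(node.tagName)
        node = node.parentNode
    return "/".join(reversed(labels))

def elements(node):
    """Yield every element below (and including) node, in document order."""
    if node.nodeType == node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        yield from elements(child)

def text_nodes(node):
    """Yield the non-blank text nodes below node, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip():
                yield child
        else:
            yield from text_nodes(child)

def ancestors_or_self(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parentNode
    return chain  # the node itself first, the document last

def lca(nodes):
    """Least common ancestor: the deepest node on every ancestor chain."""
    common = ancestors_or_self(nodes[0])
    for other in nodes[1:]:
        chain = ancestors_or_self(other)
        common = [a for a in common if any(a is b for b in chain)]
    return common[0]

def node_territory(root, p):
    """Type territory of p's type minus text owned by other nodes of that type."""
    members = [e for e in elements(root) if type_of(e) == type_of(p)]
    type_territory = list(text_nodes(lca(members)))
    foreign = {id(t) for m in members if m is not p for t in text_nodes(m)}
    return [t.data.strip() for t in type_territory if id(t) not in foreign]

root = minidom.parseString(V1).documentElement
books = root.getElementsByTagName("book")
print(node_territory(root, books[0]))
```

For the first book the lca of all `bib/author/book` nodes is `<bib>`, so its node territory is every text node in the document except those inside the other two books.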
18. Node and Type Territory
[Figure: node and type territory. Shows the document root, lca(p), the type territory of p, nodes p1, p2 and p3, and the node territories of p1 and p2.]
21. Identifiers
Values of Identifiers for <book> in Version 1
<bib>
<author><name>n1</name>
<book>
<title>t1</title>
<publisher>p1</publisher>
</book>
</author>
<author><name>n2</name>
<book>
<title>t2</title>
<publisher>p2</publisher>
</book>
<book>
<title>t1</title>
<publisher>p1</publisher>
</book></author>
</bib>
Value of Identifier (top book) = n1, t1
Value of Identifier (middle book) = n2, t2
Value of Identifier (bottom book) = n2, t1
23. Identifiers
Values of Identifiers for <book> in Version 2
<bib>
<pub> p1
<book>
<title>t1</title>
<author>
<name>n1</name>
</author>
</book>
</pub>
<pub> p2
<book>
<title>t2</title>
<author>
<name>n2</name>
</author>
</book></pub>
</bib>
Value of Identifier (top book) = p1, t1
Value of Identifier (bottom book) = p2, t2
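The identifier values listed for `<book>` in the two versions can be reproduced with a short sketch. It assumes, based only on the values shown, that the identifier of a book pairs a value taken from the enclosing element (the author's name in Version 1, the text of `<pub>` in Version 2) with the book's own title; the function name `book_identifiers` is hypothetical.

```python
import xml.etree.ElementTree as ET

V1 = """<bib>
<author><name>n1</name>
  <book><title>t1</title><publisher>p1</publisher></book>
</author>
<author><name>n2</name>
  <book><title>t2</title><publisher>p2</publisher></book>
  <book><title>t1</title><publisher>p1</publisher></book>
</author>
</bib>"""

V2 = """<bib>
<pub> p1
  <book><title>t1</title><author><name>n1</name></author></book>
</pub>
<pub> p2
  <book><title>t2</title><author><name>n2</name></author></book>
</pub>
</bib>"""

def book_identifiers(xml_text, parent_value):
    """Return (value from the enclosing element, title) for every <book>.

    ASSUMPTION: this pair is the identifier; the slides show only its values.
    """
    root = ET.fromstring(xml_text)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        (parent_value(parent_of[book]), book.findtext("title"))
        for book in root.iter("book")
    ]

ids_v1 = book_identifiers(V1, lambda author: author.findtext("name"))
ids_v2 = book_identifiers(V2, lambda pub: pub.text.strip())
print(ids_v1)
print(ids_v2)
```

Both components of each value reach outside or inside the book as needed: the title is local to the book, while the author name or publisher text comes from its parent and is therefore non-local.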
24. Identifiers
Values of Identifiers for <book> in both versions:
Version 1                      Version 2
Node            IDENTIFIER     Node              IDENTIFIER
book (top)      n1, t1         book 1 (top)      p1, t1
book (middle)   n2, t2         book 2 (bottom)   p2, t2
book (bottom)   n2, t1
How to map both?
25. Matching
Consider nodes p and q that reside in different versions Vp and Vq.
Admits: q admits p if and only if q is in the node territory of p.
Nodes p and q are matched if and only if p and q admit each other.
[Figure: node p in Vp and node q in Vq, each shown with q1, q2, …, qn]
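A rough sketch of mutual admission follows. The slide does not spell out how admission is tested, so the sketch makes a loud assumption: q admits p when every value of p's identifier occurs among the text values of q's node territory. The identifier values are copied from the slides, the node territories are computed from the XML, and the lca of the book nodes is simply taken to be the document element, which holds for both example documents. All function names are hypothetical.

```python
from xml.dom import minidom

V1 = ("<bib><author><name>n1</name>"
      "<book><title>t1</title><publisher>p1</publisher></book></author>"
      "<author><name>n2</name>"
      "<book><title>t2</title><publisher>p2</publisher></book>"
      "<book><title>t1</title><publisher>p1</publisher></book></author></bib>")
V2 = ("<bib><pub>p1<book><title>t1</title>"
      "<author><name>n1</name></author></book></pub>"
      "<pub>p2<book><title>t2</title>"
      "<author><name>n2</name></author></book></pub></bib>")

# Identifier values of each <book>, in document order, as listed on the slides.
IDS_V1 = [{"n1", "t1"}, {"n2", "t2"}, {"n2", "t1"}]
IDS_V2 = [{"p1", "t1"}, {"p2", "t2"}]

def text_nodes(node):
    """Yield the non-blank text nodes below node, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip():
                yield child
        else:
            yield from text_nodes(child)

def book_territories(xml_text):
    """Node territory of every <book>, as a set of text values."""
    root = minidom.parseString(xml_text).documentElement
    books = root.getElementsByTagName("book")
    territories = []
    for p in books:
        # Text inside the other books is outside p's node territory.
        foreign = {id(t) for b in books if b is not p for t in text_nodes(b)}
        territories.append(
            {t.data.strip() for t in text_nodes(root) if id(t) not in foreign}
        )
    return territories

def admits(q_territory, p_identifier):
    # ASSUMPTION: q admits p when all of p's identifier values are found
    # in q's node territory.
    return p_identifier <= q_territory

def match(ids_p, terr_p, ids_q, terr_q):
    """Pairs (i, j) whose nodes admit each other."""
    return [
        (i, j)
        for i in range(len(ids_p))
        for j in range(len(ids_q))
        if admits(terr_q[j], ids_p[i]) and admits(terr_p[i], ids_q[j])
    ]

pairs = match(IDS_V1, book_territories(V1), IDS_V2, book_territories(V2))
print(pairs)
```

Under this assumption the top and middle books of Version 1 pair with books 1 and 2 of Version 2, while the bottom book of Version 1 (n2, t1) finds no partner, because no Version 2 book has both n2 and t1 in its territory.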
26. Semantic Change Detection
[Figure: Book matches. Tree views of Version 1 (bib, author, name, book, title, pub) and Version 2 (bib, pub, book, title, author, name), with leaf values n1, n2, t1, t2, p1, p2.]
27. Semantic Change Detection
[Figure: Book matches. The Version 1 and Version 2 trees (bib, author, name, book, title, pub; leaf values n1, n2, t1, t2, p1, p2), annotated "admits".]
28. Semantic Change Detection
[Figure: Book matches. The Version 1 and Version 2 trees (bib, author, name, book, title, pub; leaf values n1, n2, t1, t2, p1, p2), annotated "Node match".]
30. Semantic Change Detection
[Figure: Author matches. Tree views of Version 1 (bib, author, name, book, title, pub) and Version 2 (bib, pub, book, title, author, name), with leaf values n1, n2, t1, t2, p1, p2.]
31. Conclusion
Semantic change detection technique:
• Find identifiers for each node in the XML document.
• Associate nodes across versions.
Information that identifies an element is conserved across changes.
Time complexity is O(n*log(n)).
We can match nodes even when structural changes are significant.