Change Detection in XML Documents
using Semantic Identifiers
BY
KAILAASH BALACHANDRAN
Outline
• Motivation
• Introduction
• The Approach
  • Identifiers
  • Axioms
  • 2-step Algorithm
• Semantic Change Detection
  • Finding Identifiers
  • Matching Nodes
• Examples
• Conclusion

Motivation (1)

Fig. 1. Version 1

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price>$35</price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price>$56</price>
  </book>
</author>

Fig. 2. Version 2

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <salesprice>$35</salesprice>
    <isbn>0385504209</isbn>
  </book>
  <book>
    <title>Angels &amp; Demons</title>
    <publisher>Pocket Star</publisher>
    <price>$56</price>
  </book>
</author>

Motivation (2)

Fig. 1. Version 1

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price>$35</price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price>$56</price>
  </book>
</author>

Fig. 3. Version 3

<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price>$35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price>$56</price>
  </book>
</publisher>

Motivation (3)
Disadvantages of the structural detection approach:
• It is difficult to associate elements across different versions.
• It breaks down when the changes are significant.
• It affects incremental evaluation.
• Changing the data carries a high cost.

Introduction
What is semantic-based change detection?
A process of identifying changes between successive versions of a document based on its semantics rather than on its structure.
The approach:
1. Find a semantic identifier for each node in the XML model.
2. Evaluate these identifiers to associate nodes across multiple versions.

Identifiers
• The type of an element is the list of labels from the root to the element, separated by '/'.
• An identifier serves to distinguish elements of the same type.
• Two nodes x and y are semantically the same if and only if their identifiers evaluate to the same result:
Eval(x, L) = Eval(y, L)
where
• x and y are the nodes,
• L = {E1, E2, …, En} is the list of expressions.
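
The Eval(x, L) comparison can be sketched in a few lines. This is only an illustration under assumptions: each expression in L is taken to be a simple child path that Python's xml.etree.ElementTree can evaluate, and the helper name eval_identifier and the choice L = ["title"] are hypothetical, not taken from the slides.

```python
import xml.etree.ElementTree as ET


def eval_identifier(node, expressions):
    """Evaluate every expression relative to node and return the text values.

    The result is a tuple with one entry per expression, so two nodes can be
    compared with ==. (Hypothetical helper, not the author's code.)
    """
    result = []
    for expr in expressions:
        values = sorted((el.text or "").strip() for el in node.findall(expr))
        result.append(tuple(values))
    return tuple(result)


doc = ET.fromstring(
    "<author><name>Dan Brown</name>"
    "<book><title>The Da Vinci Code</title><publisher>Doubleday</publisher></book>"
    "<book><title>Angels and Demons</title><publisher>Pocket Star</publisher></book>"
    "</author>"
)
book1, book2 = doc.findall("book")

L = ["title"]  # assumed identifier for <book>: its title
same = eval_identifier(book1, L) == eval_identifier(book2, L)
print(same)  # the two books have different titles
```

With this choice of L the two books evaluate to different results, so they count as semantically different nodes of the same type.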
Identifiers
Local identifier: An identifier is local if it evaluates to descendants of the context node; otherwise it is non-local.

Version 1: <name> is local

<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price>$35</price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price>$56</price>
  </book>
</author>

Version 3: <name> is non-local

<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price>$35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price>$56</price>
  </book>
</publisher>

Identify nodes based on their semantics
The algorithm
Phase 1:
• Works bottom-up.
• Identifies all local identifiers.
• Identifies the semantically different nodes.
Phase 2:
• Runs recursively and identifies non-local identifiers.
• Finds all semantically distinct nodes.
Any remaining node is a redundant copy of another node in the document.

Identify nodes based on their semantics (Phase 1)
Axiom 1: Nodes that are structurally different are semantically different.

<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
  </book>
</publisher>

Semantically different.

The two <author> nodes in this fragment, however, are structurally identical. Are they semantically the same?

Identify nodes based on their semantics (Phase 2)
Axiom 2: Nodes that are structurally identical are semantically identical if and only if their respective parents are semantically identical or if they are both root nodes.

No, because they are in the context of two different books.
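
The two axioms can be sketched as a recursive check. This is a minimal illustration under assumptions, not the author's algorithm: structural identity is approximated by a canonical signature of tag, text, and child signatures; the names signature and semantically_same are hypothetical; and a <catalog> wrapper element is added only so that the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET


def signature(node):
    """Canonical structural value: tag, trimmed text, sorted child signatures."""
    children = tuple(sorted(signature(child) for child in node))
    return (node.tag, (node.text or "").strip(), children)


def semantically_same(x, y, parent):
    """Apply Axiom 1, then Axiom 2, to two nodes of the same document."""
    if x is y:
        return True
    if signature(x) != signature(y):      # Axiom 1: structure differs
        return False
    px, py = parent.get(x), parent.get(y)
    if px is None and py is None:         # Axiom 2: both are root nodes
        return True
    if px is None or py is None:
        return False
    return semantically_same(px, py, parent)  # Axiom 2: compare the parents


doc = ET.fromstring(
    "<catalog>"  # wrapper assumed for well-formedness
    "<publisher>Doubleday<book><title>The Da Vinci Code</title>"
    "<author><name>Dan Brown</name></author></book></publisher>"
    "<publisher>Pocket Star<book><title>Angels and Demons</title>"
    "<author><name>Dan Brown</name></author></book></publisher>"
    "</catalog>"
)
parent = {child: p for p in doc.iter() for child in p}

a1, a2 = doc.findall(".//author")
structurally_identical = signature(a1) == signature(a2)
same_authors = semantically_same(a1, a2, parent)
print(structurally_identical, same_authors)
```

The two <author> nodes share a signature, but their parent books differ in title, so the recursion reports them as semantically different, which agrees with the answer given above.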
Semantic Change Detection
How to handle structural changes?
[Figure: tree diagrams of Version 1 (nodes A, X, Y, Z) and Version 2 (nodes Y, X, Z)]
Assumption: Identifying information will remain nearby.

Semantic Change Detection
• Type territory: The territory of a type T is the set of all text nodes that are descendants of the least common ancestor (lca) of all of the type T nodes.
• Within the type territory is the territory controlled by individual nodes of that type.
• Node territory: The territory of a type T node p is the type territory of T excluding all text nodes that are descendants of other type T nodes.
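
The two territory definitions can be sketched with Python's xml.etree.ElementTree. Several things here are assumptions for illustration: a text node is represented by the element that owns it, tail text is ignored, and the helper names (type_path, lca, text_owners, type_territory, node_territory) are hypothetical. The sample document is the bibliography used in the later examples.

```python
import xml.etree.ElementTree as ET


def type_path(node, parent):
    """Type of a node: labels from the root down to the node, joined by '/'."""
    labels = []
    while node is not None:
        labels.append(node.tag)
        node = parent.get(node)
    return "/".join(reversed(labels))


def ancestors_or_self(node, parent):
    chain = []
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain  # node first, root last


def lca(nodes, parent):
    """Least common ancestor of a non-empty list of nodes."""
    common = ancestors_or_self(nodes[0], parent)
    for n in nodes[1:]:
        keep = set(ancestors_or_self(n, parent))
        common = [a for a in common if a in keep]
    return common[0]


def text_owners(node):
    """Elements at or below node that carry non-empty text."""
    return {el for el in node.iter() if el.text and el.text.strip()}


def type_territory(nodes, parent):
    return text_owners(lca(nodes, parent))


def node_territory(p, nodes, parent):
    territory = type_territory(nodes, parent)
    for other in nodes:
        if other is not p:
            territory -= text_owners(other)
    return territory


doc = ET.fromstring(
    "<bib><author><name>n1</name>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "<author><name>n2</name>"
    "<book><title>t2</title><publisher>p2</publisher></book>"
    "<book><title>t1</title><publisher>p1</publisher></book></author></bib>"
)
parent = {child: p for p in doc.iter() for child in p}
books = doc.findall(".//book")

first_book_values = sorted(el.text for el in node_territory(books[0], books, parent))
print(type_path(books[0], parent), first_book_values)
```

For the first book, the node territory keeps its own title and publisher plus both author names, because the names are not descendants of any other book.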
Node and Type Territory
[Figure: the document root with lca(p) beneath it; the type territory of p covers the subtree under lca(p), which contains nodes p1, p2, and p3, with the node territories of p1 and p2 marked]

Finding Identifiers

Version 1:

<bib>
  <author><name>n1</name>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
  <author><name>n2</name>
    <book>
      <title>t2</title>
      <publisher>p2</publisher>
    </book>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
</bib>

Version 2:

<bib>
  <pub>p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </pub>
  <pub>p2
    <book>
      <title>t2</title>
      <author>
        <name>n2</name>
      </author>
    </book>
  </pub>
</bib>

Identifiers
Identifier for <book> in Version 1:

Node    Identifier
book    (../author/name/text(), title/text())

Identifiers
Values of identifiers for <book> in Version 1:
• book (top), under author n1: value of identifier = n1, t1
• book (middle), under author n2: value of identifier = n2, t2
• book (bottom), under author n2: value of identifier = n2, t1
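
The Version 1 values listed above can be reproduced with a short sketch. It rests on one assumption: with <book> as the context node, standard XPath reaches the author's name through the parent step, ../name, and that is taken to be what the slide's ../author/name/text() intends. ElementTree has no parent pointers, so the sketch builds a parent map, and the name book_identifier is hypothetical.

```python
import xml.etree.ElementTree as ET

V1 = (
    "<bib><author><name>n1</name>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "<author><name>n2</name>"
    "<book><title>t2</title><publisher>p2</publisher></book>"
    "<book><title>t1</title><publisher>p1</publisher></book></author></bib>"
)

doc = ET.fromstring(V1)
# Map every element to its parent so the ".." step can be followed.
parent = {child: p for p in doc.iter() for child in p}


def book_identifier(book):
    """(author name, title): a non-local part followed by a local part."""
    author = parent[book]
    return (author.findtext("name").strip(), book.findtext("title").strip())


values = [book_identifier(book) for book in doc.iter("book")]
print(values)
```

The first component comes from outside the book's subtree, which is why this identifier is non-local, while the title component is local.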
Identifiers
Values of identifiers for <book> in Version 2:

<bib>
  <pub>p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </pub>
  <pub>p2
    <book>
      <title>t2</title>
      <author>
        <name>n2</name>
      </author>
    </book>
  </pub>
</bib>

• book 1 (top): value of identifier = p1, t1
• book 2 (bottom): value of identifier = p2, t2

Identifiers
Values of identifiers for <book> in both versions:

Version 1                      Version 2
Node            Identifier     Node              Identifier
book (top)      n1, t1         book 1 (top)      p1, t1
book (middle)   n2, t2         book 2 (bottom)   p2, t2
book (bottom)   n2, t1

How to map both?

Matching
• Admits: q admits p if and only if q is in the node territory of p.
• Nodes p and q are matched if and only if p and q admit each other.
• Consider nodes p and q that reside in different versions Vp and Vq.
[Figure: node q in Vq and node p in Vp, each labeled with the values q1, q2, …, qn]
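
The matching rule can be sketched on the two bibliography versions. The slides state the admits relation tersely, so the sketch adopts one reading as an explicit assumption: q admits p when every text value of p's identifier appears in the node territory of q. The identifier used for Version 2 books, (publisher text, title), is inferred from the listed values p1, t1 and p2, t2. All helper names are hypothetical, and the lca of the book nodes is taken to be <bib>, which holds for both sample documents.

```python
import xml.etree.ElementTree as ET

V1 = (
    "<bib><author><name>n1</name>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "<author><name>n2</name>"
    "<book><title>t2</title><publisher>p2</publisher></book>"
    "<book><title>t1</title><publisher>p1</publisher></book></author></bib>"
)
V2 = (
    "<bib><pub>p1<book><title>t1</title>"
    "<author><name>n1</name></author></book></pub>"
    "<pub>p2<book><title>t2</title>"
    "<author><name>n2</name></author></book></pub></bib>"
)


def owners(node):
    """Elements at or below node that carry non-empty text (tails ignored)."""
    return {el for el in node.iter() if el.text and el.text.strip()}


def territory_values(p, same_type, lca_node):
    """Text values in the node territory of p."""
    keep = owners(lca_node)
    for other in same_type:
        if other is not p:
            keep -= owners(other)
    return {el.text.strip() for el in keep}


def admits(q_territory, p_identifier):
    # Assumed reading: all identifier values of p lie in q's node territory.
    return p_identifier <= q_territory


d1, d2 = ET.fromstring(V1), ET.fromstring(V2)
par1 = {c: p for p in d1.iter() for c in p}
par2 = {c: p for p in d2.iter() for c in p}
books1, books2 = list(d1.iter("book")), list(d2.iter("book"))


def id_v1(book):  # (author name, title)
    return {par1[book].findtext("name").strip(), book.findtext("title").strip()}


def id_v2(book):  # (publisher text, title), inferred from the listed values
    return {par2[book].text.strip(), book.findtext("title").strip()}


matches = []
for i, p in enumerate(books1):
    p_terr = territory_values(p, books1, d1)
    for j, q in enumerate(books2):
        q_terr = territory_values(q, books2, d2)
        if admits(q_terr, id_v1(p)) and admits(p_terr, id_v2(q)):
            matches.append((i, j))
print(matches)
```

Under this reading the top and middle Version 1 books match the two Version 2 books, while the bottom Version 1 book (n2, t1) is admitted by neither and stays unmatched.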
Semantic Change Detection
Book matches:
[Figure sequence: Version 1 and Version 2 of the bibliography drawn as trees. Version 1 is rooted at bib with author nodes that hold name and book children, each book holding title and pub; Version 2 is rooted at bib with pub nodes that hold book children, each book holding title and author with a name. Leaf values are n1, n2, t1, t2, p1, p2. Successive slides mark an "admits" relationship and then a "Node match" between books of the two versions.]

Semantic Change Detection
Author matches:
[Figure: the same Version 1 and Version 2 trees, with the matches between author nodes marked]

Conclusion
• Semantic change detection technique:
  • Find identifiers for each node in the XML document.
  • Associate nodes across versions.
• Information that identifies an element is conserved across changes.
• Time complexity is O(n log n).
• Nodes can be matched even when structural changes are significant.

More Related Content

Viewers also liked

Testing Taxonomies: Beyond Card Sorting
Testing Taxonomies: Beyond Card SortingTesting Taxonomies: Beyond Card Sorting
Testing Taxonomies: Beyond Card SortingAlberta Soranzo
 
Eltra Opulent Associates Ltd Powerpoint Presentation Web Company Profile
Eltra  Opulent Associates Ltd   Powerpoint Presentation Web  Company ProfileEltra  Opulent Associates Ltd   Powerpoint Presentation Web  Company Profile
Eltra Opulent Associates Ltd Powerpoint Presentation Web Company ProfileEltra Consultants
 
Introduction to web designing
Introduction to web designingIntroduction to web designing
Introduction to web designingRajat Shah
 
Information Architecture. Card Sorting
Information Architecture. Card SortingInformation Architecture. Card Sorting
Information Architecture. Card SortingDCU_MPIUA
 
Life at Siegel+Gale
Life at Siegel+Gale Life at Siegel+Gale
Life at Siegel+Gale Siegel+Gale
 
THANATOS Digital Agency | Company Profile ENG
THANATOS Digital Agency | Company Profile ENGTHANATOS Digital Agency | Company Profile ENG
THANATOS Digital Agency | Company Profile ENGTHANATOS Digital Agency
 
Company Profile Design: Best Practices 2016
Company Profile Design: Best Practices 2016Company Profile Design: Best Practices 2016
Company Profile Design: Best Practices 2016Company Profile Design
 
eXo Digital Agency - Company Profile
eXo Digital Agency - Company ProfileeXo Digital Agency - Company Profile
eXo Digital Agency - Company ProfileeXo Digital Agency
 
TEN Creative Design Agency Creds
TEN Creative Design Agency CredsTEN Creative Design Agency Creds
TEN Creative Design Agency CredsTEN Creative
 
LEAP Agency Company Profile
LEAP Agency Company ProfileLEAP Agency Company Profile
LEAP Agency Company ProfilePrecision Group
 
Ppt of company profile in project
Ppt of company profile in projectPpt of company profile in project
Ppt of company profile in projectshivakumaranupama
 
Tcs company profile presentation -sample
Tcs company profile presentation  -sampleTcs company profile presentation  -sample
Tcs company profile presentation -sampleSivaraj Ganapathy
 
Company Profile Sample
Company Profile SampleCompany Profile Sample
Company Profile SampleYagika Madan
 

Viewers also liked (14)

Testing Taxonomies: Beyond Card Sorting
Testing Taxonomies: Beyond Card SortingTesting Taxonomies: Beyond Card Sorting
Testing Taxonomies: Beyond Card Sorting
 
Eltra Opulent Associates Ltd Powerpoint Presentation Web Company Profile
Eltra  Opulent Associates Ltd   Powerpoint Presentation Web  Company ProfileEltra  Opulent Associates Ltd   Powerpoint Presentation Web  Company Profile
Eltra Opulent Associates Ltd Powerpoint Presentation Web Company Profile
 
Introduction to web designing
Introduction to web designingIntroduction to web designing
Introduction to web designing
 
Information Architecture. Card Sorting
Information Architecture. Card SortingInformation Architecture. Card Sorting
Information Architecture. Card Sorting
 
Life at Siegel+Gale
Life at Siegel+Gale Life at Siegel+Gale
Life at Siegel+Gale
 
THANATOS Digital Agency | Company Profile ENG
THANATOS Digital Agency | Company Profile ENGTHANATOS Digital Agency | Company Profile ENG
THANATOS Digital Agency | Company Profile ENG
 
Company Profile Design: Best Practices 2016
Company Profile Design: Best Practices 2016Company Profile Design: Best Practices 2016
Company Profile Design: Best Practices 2016
 
eXo Digital Agency - Company Profile
eXo Digital Agency - Company ProfileeXo Digital Agency - Company Profile
eXo Digital Agency - Company Profile
 
TEN Creative Design Agency Creds
TEN Creative Design Agency CredsTEN Creative Design Agency Creds
TEN Creative Design Agency Creds
 
LEAP Agency Company Profile
LEAP Agency Company ProfileLEAP Agency Company Profile
LEAP Agency Company Profile
 
Mix Digital Marketing Agency Credentials
Mix Digital Marketing Agency CredentialsMix Digital Marketing Agency Credentials
Mix Digital Marketing Agency Credentials
 
Ppt of company profile in project
Ppt of company profile in projectPpt of company profile in project
Ppt of company profile in project
 
Tcs company profile presentation -sample
Tcs company profile presentation  -sampleTcs company profile presentation  -sample
Tcs company profile presentation -sample
 
Company Profile Sample
Company Profile SampleCompany Profile Sample
Company Profile Sample
 

Recently uploaded

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 

Recently uploaded (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 

Schemaless Change detection in XML Documents using Semantic Identifiers

  • 1. 1 Change Detection in XML Documents using Semantic Identifiers BY KAILAASH BALACHANDRAN
  • 2. Outline  Motivation  Introduction  The Approach • • 2-step Algorithm •  Identifiers Axioms Semantic Change Detection • Finding Identifiers • Matching Nodes  Examples  Conclusion 2
  • 3. Motivation(1) 3 Fig.1. Version 1 Fig.2. Version 2 <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <salesprice>$35</salesprice> <isbn>0385504209</isbn> </book> <book> <title>Angels & Demons</title> <publisher>Pocket Star</publisher> <price>$56</price> </book> </author>
  • 4. Motivation(1) 4 Fig.1. Version 1 Fig.2. Version 2 <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <salesprice>$35</salesprice> <isbn>0385504209</isbn> </book> <book> <title>Angels & Demons</title> <publisher>Pocket Star</publisher> <price>$56</price> </book> </author>
  • 5. Motivation(2) Fig.1. Version 1 <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> Fig.3. Version 3 5 <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> <price> $35</price> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> <price> $56</price> </book> </publisher>
  • 6. Motivation(2) 6 Fig.1. Version 1 Fig.3. Version 3 <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> <price> $35</price> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> <price> $56</price> </book> </publisher>
  • 7. Motivation(3) Disadvantages of Structural detection approach:  Difficult to associate elements in different versions.  Break down when the changes are significant.  Affects Incremental Evaluation.  High cost of change of data. 7
  • 8. Introduction What is Semantic Based Change Detection? A process of Identifying changes between successive versions of a document based on its semantics, rather than on the structure of the document. The Approach: 1. Find Semantic Identifier for each node in the XML model. 2. Compute these Identifiers to associate nodes across multiple versions. 8
  • 9. Identifiers 9  Type is list of labels from root to element separated by a ‘/’.  Identifier serves to distinguish elements of same type.  Two nodes x and y, are semantically the same if and only if their identifiers evaluate to the same result. Eval(x,L) = Eval(y,L) Node x Same Result Node y where, • x,y are the nodes, • List of Expressions L = { E1,E2…En}
  • 10. Identifiers 10 Local Identifier: An identifier is local if it evaluates to descendants of the context node, otherwise it is non-local. Version 1: Version 3: <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author><price> $35</price> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author><price> $56</price> </book> </publisher>
  • 11. Identifiers 11 Local Identifier: An identifier is local if it evaluates to descendants of the context node, otherwise it is non-local. Version 1: <name> is local <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> Version 3: <name> is non-local <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author><price> $35</price> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author><price> $56</price> </book> </publisher>
  • 12. Identify nodes based on its Semantics 12 The Algorithm Phase 1:  Bottom up fashion.  Identifies all local identifiers.  Semantically different nodes are identified. Phase 2:  Runs recursively and identifies non-local identifiers.  All semantically distinct nodes are found. Any remaining node is a redundant copy of another node in the document.
  • 13. Identify nodes based on its Semantics(Phase 1) Axiom 1: Nodes that are structurally different are semantically different. <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> </book> </publisher> Semantically different. 13
  • 14. Identify nodes based on its Semantics(Phase 1) Axiom 1: Nodes that are structurally different are semantically different. <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> </book> </publisher> Are they semantically the same? 14
  • 15. Identify nodes based on their semantics (Phase 2). Axiom 2: Nodes that are structurally identical are semantically identical if and only if their respective parents are semantically identical or if they are both root nodes. <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> </book> </publisher> No, because they are in the context of two different books.
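Axioms 1 and 2 can be read together as a recursive test. The sketch below is one possible reading, not the author's implementation. It assumes that "structurally identical" means same tag, same trimmed text, and pairwise structurally identical children in the same order; the slides do not define the term. The function names and the wrapping `<root>` element are hypothetical.

```python
import xml.etree.ElementTree as ET

# Slide 15 fragments, wrapped in a hypothetical <root> so they parse as one document.
DOC = """
<root>
  <publisher>Doubleday
    <book><title>The Da Vinci Code</title>
      <author><name>Dan Brown</name></author>
    </book>
  </publisher>
  <publisher>Pocket Star
    <book><title>Angels and Demons</title>
      <author><name>Dan Brown</name></author>
    </book>
  </publisher>
</root>
"""

def text_of(node):
    """Trimmed text directly inside an element ('' if none)."""
    return (node.text or "").strip()

def structurally_identical(a, b):
    # Assumed definition: same tag, same text, same children in the same order.
    return (a.tag == b.tag
            and text_of(a) == text_of(b)
            and len(a) == len(b)
            and all(structurally_identical(x, y) for x, y in zip(a, b)))

def semantically_identical(a, b, parents):
    if not structurally_identical(a, b):
        return False  # Axiom 1
    pa, pb = parents.get(a), parents.get(b)
    if pa is None or pb is None:
        return pa is None and pb is None  # Axiom 2: both are root nodes
    return semantically_identical(pa, pb, parents)  # Axiom 2: compare parents

root = ET.fromstring(DOC)
parents = {child: parent for parent in root.iter() for child in parent}
author_1, author_2 = root.iter("author")

print(structurally_identical(author_1, author_2))           # True
print(semantically_identical(author_1, author_2, parents))  # False
```

Under these assumptions the two `<author>` nodes are structurally identical, but their parent `<book>` nodes differ in title, so the test reports them as semantically different, which agrees with the answer on the slide.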
  • 16. Semantic Change Detection. How to handle structural changes? [Figure: node trees for Version 1 and Version 2, with nodes labeled A, X, Y, Z.] Assumption: identifying information will remain nearby.
  • 17. Semantic Change Detection. Type territory: the territory of a type T is the set of all text nodes that are descendants of the least common ancestor (lca) of all of the type T nodes. Within the type territory is the territory controlled by individual nodes of that type. Node territory: the territory of a type T node p is the type territory of T excluding all text nodes that are descendants of other type T nodes.
  • 18. Node and Type Territory. [Figure: document root; lca(p); type territory of p; nodes p1, p2, p3; node territory of p1; node territory of p2.]
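The two territory definitions can be sketched directly on an XML tree. This is a minimal illustration under stated assumptions, not the author's code: a "text node" is approximated by an element whose own `.text` is non-blank (tail text is ignored), at least one node of the given type is assumed to exist, and all function names are hypothetical. The sample document is Version 1 from slide 19.

```python
import xml.etree.ElementTree as ET

VERSION_1 = """
<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>
"""

def parent_map(root):
    return {child: parent for parent in root.iter() for child in parent}

def path_from_root(node, parents):
    """Elements from the root down to node, inclusive."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]

def lca(nodes, parents):
    """Least common ancestor: deepest element shared by every root path."""
    common = None
    for level in zip(*(path_from_root(n, parents) for n in nodes)):
        if all(n is level[0] for n in level):
            common = level[0]
        else:
            break
    return common

def text_holders(node):
    """Elements at or below node that carry non-blank text (stand-in for text nodes)."""
    return [e for e in node.iter() if e.text and e.text.strip()]

def type_territory(root, tag):
    return text_holders(lca(list(root.iter(tag)), parent_map(root)))

def node_territory(root, node):
    # Type territory minus the text nodes that sit under other nodes of the same type.
    others = [n for n in root.iter(node.tag) if n is not node]
    excluded = {id(e) for other in others for e in text_holders(other)}
    return [e for e in type_territory(root, node.tag) if id(e) not in excluded]

root = ET.fromstring(VERSION_1)
top_book = next(root.iter("book"))
print([e.text.strip() for e in type_territory(root, "book")])
# ['n1', 't1', 'p1', 'n2', 't2', 'p2', 't1', 'p1']
print([e.text.strip() for e in node_territory(root, top_book)])
# ['n1', 't1', 'p1', 'n2']
```

Because the three `<book>` nodes sit under two different `<author>` nodes, their least common ancestor is `<bib>`, so the type territory spans the whole document and each book's node territory also reaches the author names.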
  • 19. Finding Identifiers. Version 1: <bib> <author><name>n1</name> <book> <title>t1</title> <publisher>p1</publisher> </book> </author> <author><name>n2</name> <book> <title>t2</title> <publisher>p2</publisher> </book> <book> <title>t1</title> <publisher>p1</publisher> </book></author> </bib> Version 2: <bib> <pub> p1 <book> <title>t1</title> <author> <name>n1</name> </author> </book> </pub> <pub> p2 <book> <title>t2</title> <author> <name>n2</name> </author> </book> </pub> </bib>
  • 21. Identifiers. Values of identifiers for <book> in Version 1: <bib> <author><name>n1</name> <book> <title>t1</title> <publisher>p1</publisher> </book> </author> <author><name>n2</name> <book> <title>t2</title> <publisher>p2</publisher> </book> <book> <title>t1</title> <publisher>p1</publisher> </book></author> </bib> Values of the identifier, in document order: (n1, t1); (n2, t2); (n2, t1).
  • 22. Identifiers. Values of identifiers for <book> in Version 2: <bib> <pub> p1 <book> <title>t1</title> <author> <name>n1</name> </author> </book> </pub> <pub> p2 <book> <title>t2</title> <author> <name>n2</name> </author> </book></pub> </bib>
  • 23. Identifiers. Values of identifiers for <book> in Version 2: <bib> <pub> p1 <book> <title>t1</title> <author> <name>n1</name> </author> </book> </pub> <pub> p2 <book> <title>t2</title> <author> <name>n2</name> </author> </book></pub> </bib> Values of the identifier, in document order: (p1, t1); (p2, t2).
  • 24. Identifiers. Values of identifiers for <book> in both versions (node: identifier). Version 1: book (top): (n1, t1); book (middle): (n2, t2); book (bottom): (n2, t1). Version 2: book 1 (top): (p1, t1); book 2 (bottom): (p2, t2). How do we map the two?
  • 25. Matching 25  Admits: q admits p if and only if q is in the node territory of p.  Nodes p and q are matched if and only if p and q admit each other.  Consider nodes p and q that reside in different versions Vp and Vq. q1, q2….qn q1, q2….qn Node q in Vq Node p in Vp
  • 26. Semantic Change Detection. Book matches. [Figure: Version 1 tree (bib; author; name n1, n2; book; title t1, t2; pub p1, p2) beside Version 2 tree (bib; pub p1, p2; book; title t1, t2; author; name n1, n2).]
  • 27. Semantic Change Detection. Book matches. [Figure: the same Version 1 and Version 2 trees, with an "admits" relation marked between them.]
  • 28. Semantic Change Detection. Book matches. [Figure: the same Version 1 and Version 2 trees, with a "Node match" marked between them.]
  • 29. Semantic Change Detection. Book matches. [Figure: the same Version 1 and Version 2 trees, with a "Node match" marked between them.]
  • 30. Semantic Change Detection. Author matches. [Figure: the same Version 1 and Version 2 trees.]
  • 31. Conclusion  Semantic change detection technique. • Find identifiers for each node in the XML document • Associate nodes across versions.  Information that identifies an element is conserved across changes.  Time complexity is O(n*log(n))  We can match nodes even when structural changes are significant. 31