Archive integration with RDF

Archive integration at Mattilsynet
Bouvet Tech Meetup 2014-06-11
Lars Marius Garshol, larsga@sesam.no, http://twitter.com/larsga

Archive integrations
A few systems integrated with the archive
– every integration is expensive and painful
Need many more integrations
– to reduce amount of manual work
– hesitation because of cost
Consequences of integrations
– if archive upgraded, must retest all systems
– archive slows down integrated systems
– changes to archive structure require rewriting all integrations
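[Diagram: Arkiv (the archive) together with Regelverk (regulations), Fagsystem #1 and #2 (case-handling systems), Nettsider (web site), Rekruttering (recruitment) and Kvalitetssystemet (the quality system)]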

WebCruiter integration
Very simple project
– integrate WebCruiter with ePhorte
Doing it with RDF because
– it’s much easier and cheaper
– want to extend to more integrations later
– first step toward new architecture
Good example project
– because it’s so simple

SESAM principles
Base everything on RDF and SDShare feeds
– dynamic flows of structured data
Extracts from data sources do not map to a common model
– instead, extract data as they are in the source
– later translate to representation needed by consumers
– this way, changes in source or target do not spill over to the other
No hard bindings from code to data model
– code should have no knowledge of the data model
– all data model-specific logic should be configuration
– makes data changes much easier to handle

RDF?
W3C standard
– for interchange of structured data
– has query language, schema languages, formats, ...
Essentially a graph database
– known as a triple store
– like Neo4j or similar
– but standardized
– and with many extra features
Note that databases are schemaless
– so this is NoSQL
– powerful query language with SPARQL
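To make the "triple store" idea concrete, here is a minimal in-memory sketch, not a real RDF implementation. The class `TripleStore`, the `ex:` identifiers and the sample name are all made up for illustration; a real store would use full URIs and SPARQL rather than a `match` helper.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Toy triple store: a set of triples plus wildcard pattern matching."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self.triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for t in sorted(self.triples):
            if ((s is None or t[0] == s)
                    and (p is None or t[1] == p)
                    and (o is None or t[2] == o)):
                yield t


store = TripleStore()
store.add("ex:doc1", "ex:title", "Søknad om betalingsutsettelse")
store.add("ex:doc1", "ex:author", "ex:user123")
# No schema was declared up front: a new property can appear at any time.
store.add("ex:user123", "ex:name", "Kari")

# Who wrote ex:doc1? Follow the ex:author edge of the graph.
authors = [o for (_, _, o) in store.match(s="ex:doc1", p="ex:author")]
```

The point of the sketch is that every fact has the same shape, so adding a property never requires a schema change.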

Architecture
[Diagram components: WebCruiter WS, XML in files, SDShare (four times), Oversettelse (translation, twice), ePhorte RDF, ePhorte adapter, Bus. Connection labels: HTTP POST, SPARQL Update, external call.]
Boxes in orange are Sesam components

SDShare
A protocol for tracking changes in a data source
– essentially allows clients to keep track of all changes, for replication purposes
– based on Atom and REST
Data source can be anything
– triple store
– relational database
– XML files on disk
– ...
Data flows as RDF
– not an absolute must, but it’s how we do things
A CEN specification
– http://sdshare.org

Basic workings
[Diagram: Server → Fragment, Fragment, Fragment, Fragment → Client]
Server publishes fragments representing changes in datastore
Client pulls these in, updates local copy of dataset
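The publish-and-pull loop can be sketched in memory as follows. This is only an illustration: the real protocol exchanges Atom feeds over HTTP, and the names `Fragment`, `Server` and `Client`, the integer timestamps, and the rule that a fragment replaces everything known about its resource are simplifying assumptions of the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Fragment:
    updated: int          # change timestamp (a plain counter here)
    resource: str         # the resource that changed
    triples: Set[Triple]  # current statements about that resource


@dataclass
class Server:
    fragments: List[Fragment] = field(default_factory=list)

    def publish(self, fragment: Fragment) -> None:
        self.fragments.append(fragment)

    def feed(self, since: int) -> List[Fragment]:
        """Return fragments newer than `since`, oldest first."""
        newer = (f for f in self.fragments if f.updated > since)
        return sorted(newer, key=lambda f: f.updated)


@dataclass
class Client:
    last_seen: int = 0
    store: Set[Triple] = field(default_factory=set)

    def sync(self, server: Server) -> int:
        """Pull new fragments, apply them, and return how many were applied."""
        new = server.feed(self.last_seen)
        for frag in new:
            # Drop what we knew about the resource, then add the new state.
            self.store = {t for t in self.store if t[0] != frag.resource}
            self.store |= frag.triples
            self.last_seen = frag.updated
        return len(new)


server = Server()
client = Client()
server.publish(Fragment(1, "ex:doc1", {("ex:doc1", "ex:title", "Draft")}))
client.sync(server)
server.publish(Fragment(2, "ex:doc1", {("ex:doc1", "ex:title", "Final")}))
client.sync(server)  # the client's copy now holds only the "Final" title
```

Because the client remembers the last timestamp it saw, repeated syncs are cheap and only transfer what changed.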

From WebCruiter to triple store
[Diagram: fragments flow from the XML adapter (SDShare server) to the SDShare client, which loads them into the triple store]
On the server:
• XPath queries to map to RDF
On the client:
• Two URLs
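As a rough sketch of what "XPath queries to map to RDF" could look like, the example below drives the mapping from a small configuration table. The XML layout, the `example.org` URIs and the function `xml_to_triples` are hypothetical, not the actual adapter configuration, and Python's `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping configuration: RDF property -> XPath relative to a record.
MAPPING: Dict[str, str] = {
    "http://example.org/wc/title": "title",
    "http://example.org/wc/applicant": "applicant/name",
}


def xml_to_triples(xml_text: str, record_path: str, id_attr: str,
                   base_uri: str, mapping: Dict[str, str]) -> List[Triple]:
    """Turn each record element into triples, one per mapped XPath hit."""
    root = ET.fromstring(xml_text.strip())
    triples: List[Triple] = []
    for record in root.findall(record_path):
        subject = base_uri + record.get(id_attr, "")
        for prop, xpath in mapping.items():
            for node in record.findall(xpath):
                if node.text:
                    triples.append((subject, prop, node.text.strip()))
    return triples


# Made-up sample data, only to exercise the mapping.
SAMPLE = """
<positions>
  <position id="42">
    <title>Inspector</title>
    <applicant><name>Kari Nordmann</name></applicant>
  </position>
</positions>
"""

triples = xml_to_triples(SAMPLE, "position", "id",
                         "http://example.org/wc/position/", MAPPING)
```

Supporting a new field then means adding one line to `MAPPING`, with no change to the function.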

Translation of metadata
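Title: Søknad om betalingsutsettelse (application for deferred payment)
Process: 384192
Author: 123
Customer: 789
Oversetter (translator)
Tittel (title): Søknad om betalingsutsettelse
Sak (case): 485283
Ansvarlig (responsible): 456
Kontakt (contact): 987
Doktype (document type): I
Arkivdel (archive part): 17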
[Diagram: identifiers 123, xyz, 456, 789 and 987, spread across three sources: Application, Archive and Active Directory]

How the mapping works
Standard RDF vocabulary
– mapping between properties
– traversing properties to add values
– uses owl:sameAs to map values
Java implementation
– called metadata-translator (~500 LOC)
– uses very simple SDShare push protocol
– writes translated data to Virtuoso
Supports multiple mappings
– configured using graphs so we know which properties and values to translate to
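The two mapping steps (property to property, and value to value via owl:sameAs) can be sketched as below. This is not the metadata-translator itself, which is written in Java: the prefixes `app:`, `ad:` and `arkiv:`, the lookup tables, and the specific sameAs links between the sample identifiers are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping graph, reduced to two lookup tables.
PROPERTY_MAP: Dict[str, str] = {
    "app:title": "arkiv:tittel",
    "app:author": "arkiv:ansvarlig",
    "app:customer": "arkiv:kontakt",
}
# owl:sameAs statements linking identifiers in different systems (made up).
SAME_AS: Set[Tuple[str, str]] = {
    ("app:user/123", "ad:user/xyz"),
    ("ad:user/xyz", "arkiv:user/456"),
    ("app:customer/789", "arkiv:contact/987"),
}


def equivalents(value: str) -> Set[str]:
    """All identifiers reachable from `value` via sameAs, in either direction."""
    seen, todo = {value}, [value]
    while todo:
        current = todo.pop()
        for a, b in SAME_AS:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
    return seen


def translate(triples: List[Triple], source_prefix: str,
              target_prefix: str) -> List[Triple]:
    out: List[Triple] = []
    for s, p, o in triples:
        if p not in PROPERTY_MAP:
            continue  # no property mapping: the statement is dropped
        if o.startswith(source_prefix):  # an identifier, not a literal
            targets = sorted(v for v in equivalents(o)
                             if v.startswith(target_prefix))
            if not targets:
                continue  # no known equivalent: the information is lost
            o = targets[0]
        out.append((s, PROPERTY_MAP[p], o))
    return out


source = [
    ("app:doc/1", "app:title", "Søknad om betalingsutsettelse"),
    ("app:doc/1", "app:author", "app:user/123"),
    ("app:doc/1", "app:customer", "app:customer/789"),
    ("app:doc/1", "app:reviewer", "app:user/999"),  # unmapped property
]
translated = translate(source, "app:", "arkiv:")
```

Keeping both tables as data rather than code is what allows several mappings to coexist: selecting a different pair of tables selects a different target vocabulary.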

What’s to be mapped?
Department cannot be mapped
– structure in WebCruiter added manually
Users cannot be mapped, either
– no common key
– solved using Duke (a record linkage engine)
Department can be defaulted
– in the cases where we know the user
[Diagram labels: WebCruiter, ePhorte]

Data transfer to translation
Simply write SPARQL queries to
– produce fragment feed (based on timestamps)
– produce a fragment (trivial)
– produce a snapshot (trivial)
Then configure SDShare client
– just requires two URLs
– translation receives an HTTP POST with the fragment, then does its job
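The three queries might look roughly like the templates below. This is a sketch under assumptions: it supposes the data records change times with `dcterms:modified`, which the slides do not state, and the helper names `feed_query` and `fragment_query` are made up. The resource URI is inserted without escaping, so it must come from a trusted source.

```python
from datetime import datetime, timezone

# Fragment feed: which resources changed after a given time?
# Assumes change times are stored as dcterms:modified (not confirmed by the slides).
FEED_QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?resource ?updated WHERE {{
  ?resource dcterms:modified ?updated .
  FILTER (?updated > "{since}"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
}} ORDER BY ?updated
"""

# Fragment: every statement about one resource.
FRAGMENT_QUERY = "CONSTRUCT {{ <{resource}> ?p ?o }} WHERE {{ <{resource}> ?p ?o }}"

# Snapshot: the whole dataset.
SNAPSHOT_QUERY = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"


def feed_query(since: datetime) -> str:
    """Build the feed query for changes after `since` (timezone-aware)."""
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return FEED_QUERY.format(since=stamp)


def fragment_query(resource: str) -> str:
    """Build the fragment query for one resource URI."""
    return FRAGMENT_QUERY.format(resource=resource)


query = feed_query(datetime(2014, 6, 11, 12, 0, tzinfo=timezone.utc))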

ePhorte adapter
Receives RDF
– introspects the RDF and translates to Java API
– Java API is stubs calling SOAP services
Given <foo> rdf:type <.../MyClass>
– it looks up the Java class “MyClass” then instantiates
Then, given <foo> <.../prop> “value”
– it looks up method “setProp” on MyClass
– calls object.setProp(“value”)
That’s it
– requires translation to produce RDF exactly aligned with Java API
– means there’s no code
https://github.com/Mattilsynet/arkivgrensesnitt
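The lookup-by-name idea can be illustrated with the sketch below. The real adapter uses Java reflection against SOAP stubs; this Python version uses `getattr` instead, and the class `Case`, the registry `CLASSES` and the `example.org` URIs are hypothetical stand-ins, not the ePhorte API.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Case:
    """Hypothetical stand-in for a generated API stub class."""

    def __init__(self) -> None:
        self.title = None

    def setTitle(self, value: str) -> None:
        self.title = value


# Registry playing the role of class lookup by name.
CLASSES = {"Case": Case}


def local_name(uri: str) -> str:
    """Last path or fragment segment of a URI, e.g. '.../Case' -> 'Case'."""
    return uri.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]


def build_objects(triples: List[Triple]) -> Dict[str, object]:
    objects: Dict[str, object] = {}
    # First pass: rdf:type statements decide which class to instantiate.
    for s, p, o in triples:
        if p == RDF_TYPE:
            objects[s] = CLASSES[local_name(o)]()
    # Second pass: every other property becomes a setter call, found by name.
    for s, p, o in triples:
        if p != RDF_TYPE and s in objects:
            name = local_name(p)
            setter = getattr(objects[s], "set" + name[0].upper() + name[1:])
            setter(o)
    return objects


objects = build_objects([
    ("http://example.org/foo", "http://example.org/api/title", "Søknad"),
    ("http://example.org/foo", RDF_TYPE, "http://example.org/api/Case"),
])
case = objects["http://example.org/foo"]
```

Nothing in `build_objects` mentions `Case` or `title`, which is why the incoming RDF has to line up exactly with the class and method names.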

Configuration
[Diagram: the architecture diagram again, annotated with the configuration used along the flow: XPath mapping, RDF mapping, SQL queries, SPARQL queries. Callouts: “Look, ma, no code!” and “not much code!”]

Properties
Adding more object types or properties is simple
– we just extend the mapping (and maybe queries)
Data quality improves with more data
– if we don’t have the data needed to translate employees, that information gets lost
– if the necessary mapping is added later, translation improves automagically
Adding more systems is very easy
– requires more SDShare feeds plus mappings

The public journal problem
[Diagram: network zones Internet, DMZ and Secure zone, with components Journal app, ePhorte (twice) and Oracle]
The public journal solution
[Diagram: network zones Internet, DMZ and Secure zone, with components Journal app, ePhorte (twice) and Oracle (twice); connections labelled SDShare, carrying filtered RDF]

Conclusion
Relatively small project, not that many hours
– includes writing reusable ephorte-adapter
– parts of writing the metadata translator, too
– also the XML adapter
– system documentation
– automated deploy system based on Jenkins
Flexible, simple solution
– most of it reusable
– actually captures, as a side-effect, information not available in any other system
