1. Archive integration at Mattilsynet
Bouvet Tech Meetup 2014-06-11
Lars Marius Garshol, larsga@sesam.no, http://twitter.com/larsga
2. Archive integrations
A few systems integrated with the archive
– every integration is expensive and painful
Need many more integrations
– to reduce amount of manual work
– hesitation because of cost
Consequences of integrations
– if archive upgraded, must retest all systems
– archive slows down integrated systems
– changes to archive structure require rewriting all integrations
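[Diagram: the archive (Arkiv) and the systems around it: Regelverk (regulations), Fagsystem #1 and #2 (case-handling systems), Nettsider (web sites), Rekruttering (recruitment), Kvalitetssystemet (quality system)]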
3. WebCruiter integration
Very simple project
– integrate WebCruiter with ePhorte
Doing it with RDF because
– it’s much easier and cheaper
– want to extend to more integrations later
– first step toward new architecture
Good example project
– because it’s so simple
4. SESAM principles
Base everything on RDF and SDShare feeds
– dynamic flows of structured data
Extracts from data sources do not map to a common model
– instead, extract data as they are in the source
– later translate to representation needed by consumers
– this way, changes in source or target do not spill over to the other
No hard bindings from code to data model
– code should have no knowledge of the data model
– all data model-specific logic should be configuration
– makes data changes much easier to handle
5. RDF?
W3C standard
– for interchange of structured data
– has query language, schema languages, formats, ...
Essentially a graph database
– known as a triple store
– like Neo4j or similar
– but standardized
– and with many extra features
Note that databases are schemaless
– so this is NoSQL
– powerful query language with SPARQL
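As a rough illustration of the triple model described on this slide, the sketch below stores statements as (subject, predicate, object) tuples and matches simple patterns against them, which is roughly what a single SPARQL triple pattern does. All identifiers and values (the `ex:` names, the name "Kari") are invented for the example; a real triple store would be queried with SPARQL.

```python
# Hypothetical data: every statement is a (subject, predicate, object) triple.
triples = {
    ("ex:doc1", "rdf:type", "ex:Application"),
    ("ex:doc1", "ex:title", "Søknad om betalingsutsettelse"),
    ("ex:doc1", "ex:author", "ex:user123"),
    ("ex:user123", "ex:name", "Kari"),
}


def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Schemaless: a new kind of statement can be added without any schema change.
triples.add(("ex:doc1", "ex:customer", "ex:cust789"))

properties_of_doc1 = [p for _, p, _ in match(triples, s="ex:doc1")]
```

The point of the sketch is only that the data model is a set of uniform statements, so new properties need no migration.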
7. SDShare
A protocol for tracking changes in a data source
– essentially allows clients to keep track of all changes, for
replication purposes
– based on Atom and REST
Data source can be anything
– triple store
– relational database
– XML files on disk
– ...
Data flows as RDF
– not an absolute must, but it’s how we do things
A CEN specification
– http://sdshare.org
8. Basic workings
[Diagram: Server sends a stream of fragments to Client]
Server publishes fragments representing changes in datastore
Client pulls these in, updates local copy of dataset
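The client side of this flow can be sketched in a few lines. The sketch assumes that a fragment carries the complete current set of statements about one resource, so the client replaces whatever it held for that resource; that is my reading of SDShare, not something stated on the slide. The function name, the `ex:` identifiers and the in-memory set standing in for the local dataset are all hypothetical.

```python
def apply_fragment(local, resource, fragment_triples):
    """Replace everything known about `resource` with the fragment's content."""
    kept = {t for t in local if t[0] != resource}
    return kept | set(fragment_triples)


# Local copy of the dataset before synchronizing (invented data).
local = {
    ("ex:app1", "ex:title", "Old title"),
    ("ex:app2", "ex:title", "Other"),
}

# Fragments as (resource, statements) pairs, in the order the feed lists them.
fragments = [
    ("ex:app1", [("ex:app1", "ex:title", "New title"),
                 ("ex:app1", "ex:status", "open")]),
    ("ex:app2", []),  # an empty fragment: the resource no longer exists
]

for resource, content in fragments:
    local = apply_fragment(local, resource, content)
```

Because each fragment is applied as a whole, the client needs no knowledge of which individual statements changed.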
9. From WebCruiter to triple store
[Diagram: XML adapter (SDShare server) publishes fragments; SDShare client pulls them into the triple store]
On the server:
• XPath queries to map to RDF
On the client:
• Two URLs
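A minimal sketch of the server-side idea, mapping XML to RDF with path expressions held in configuration. The XML layout, the `wc:` property names and the function `xml_to_triples` are assumptions made for the example, not the actual WebCruiter format or the adapter's code; Python's `xml.etree.ElementTree` supports only a subset of XPath, which is enough here.

```python
import xml.etree.ElementTree as ET

# Hypothetical export; the real WebCruiter XML is not shown in the slides.
XML = """<applications>
  <application id="42">
    <title>Søknad om betalingsutsettelse</title>
    <author>123</author>
  </application>
</applications>"""

# Configuration, not code: property name -> path relative to each record.
MAPPING = {
    "wc:title": "title",
    "wc:author": "author",
}


def xml_to_triples(xml_text, record_path, id_attr, mapping, prefix):
    """Turn each matching record into triples, one per mapped element."""
    root = ET.fromstring(xml_text)
    triples = []
    for record in root.findall(record_path):
        subject = prefix + record.get(id_attr)
        for prop, path in mapping.items():
            for node in record.findall(path):
                triples.append((subject, prop, node.text))
    return triples


triples = xml_to_triples(XML, "./application", "id", MAPPING, "wc:application/")
```

Adding a field then means adding one line to `MAPPING`, which matches the principle that data-model knowledge lives in configuration.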
11. Translation of metadata
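Application metadata:
– Title: Søknad om betalingsutsettelse (application for payment deferral)
– Process: 384192
– Author: 123
– Customer: 789
Translator (Oversetter) produces archive metadata:
– Tittel (title): Søknad om betalingsutsettelse
– Sak (case): 485283
– Ansvarlig (responsible): 456
– Kontakt (contact): 987
– Doktype (document type): I
– Arkivdel (archive part): 17
[Diagram: identifiers linked across Application, Active Directory and Archive: 123, xyz, 456; 789, 987]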
12. How the mapping works
Standard RDF vocabulary
– mapping between properties
– traversing properties to add values
– uses owl:sameAs to map values
Java implementation
– called metadata-translator (~500 LOC)
– uses very simple SDShare push protocol
– writes translated data to Virtuoso
Supports multiple mappings
– configured using graphs so we know which
properties and values to translate to
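The mapping idea can be sketched as follows: property equivalences and owl:sameAs value equivalences are themselves data, and the translator only looks them up. This is an illustration, not the metadata-translator itself. The predicate `map:translatesTo`, the `wc:` and `ep:` names and the function `translate` are hypothetical, and the sketch leaves out the traversal step mentioned above.

```python
OWL_SAME_AS = "owl:sameAs"

# Mapping data (in the real system this would live in its own graph).
mapping = {
    ("wc:title", "map:translatesTo", "ep:tittel"),
    ("wc:author", "map:translatesTo", "ep:ansvarlig"),
    ("wc:user/123", OWL_SAME_AS, "ep:user/456"),
}


def translate(source, mapping):
    """Rewrite properties and values according to the mapping triples."""
    prop_map = {s: o for s, p, o in mapping if p == "map:translatesTo"}
    value_map = {s: o for s, p, o in mapping if p == OWL_SAME_AS}
    out = []
    for s, p, o in source:
        if p not in prop_map:
            continue  # no mapping for this property: the statement is dropped
        out.append((s, prop_map[p], value_map.get(o, o)))
    return out


source = [
    ("wc:application/42", "wc:title", "Søknad om betalingsutsettelse"),
    ("wc:application/42", "wc:author", "wc:user/123"),
    ("wc:application/42", "wc:internalNote", "not mapped"),
]
translated = translate(source, mapping)
```

Supporting a second target system would, under these assumptions, mean supplying a second set of mapping triples rather than changing `translate`.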
13. What’s to be mapped?
Department cannot be mapped
– structure in WebCruiter added manually
Users cannot be mapped, either
– no common key
– solved using Duke
Department can be defaulted
– in the cases where we know the user
[Diagram: WebCruiter and ePhorte]
14. Data transfer to translation
Simply write SPARQL queries to
– produce fragment feed (based on timestamps)
– produce a fragment (trivial)
– produce a snapshot (trivial)
Then configure SDShare client
– just requires two URLs
– translation receives an HTTP POST with the
fragment, then does its job
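A sketch of what a timestamp-based fragment feed query might look like, built from Python. It assumes that each resource carries a `dcterms:modified` timestamp; the slides do not say which property the real queries use, and the function name is hypothetical.

```python
from datetime import datetime, timezone

# Query template; doubled braces are literal braces after str.format().
FEED_QUERY = """\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?resource ?modified WHERE {{
  ?resource dcterms:modified ?modified .
  FILTER (?modified > "{since}"^^xsd:dateTime)
}}
ORDER BY ?modified
"""


def fragment_feed_query(since):
    """Build the query listing resources changed after `since`."""
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return FEED_QUERY.format(since=stamp)


query = fragment_feed_query(datetime(2014, 6, 1, 12, 0, tzinfo=timezone.utc))
```

Each row returned by such a query would become one entry in the fragment feed; the fragment and snapshot queries are simpler because they need no timestamp filter.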
15. ePhorte adapter
Receives RDF
– introspects the RDF and translates to Java API
– Java API is stubs calling SOAP services
Given <foo> rdf:type <.../MyClass>
– it looks up the Java class “MyClass”, then instantiates it
Then, given <foo> <.../prop> “value”
– it looks up method “setProp” on MyClass
– calls object.setProp(“value”)
That’s it
– requires translation to produce RDF exactly aligned
with Java API
– means there’s no code
https://github.com/Mattilsynet/arkivgrensesnitt
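The introspection approach described on this slide can be illustrated with a short sketch. It is written in Python rather than Java, using `getattr` where the adapter would use Java reflection; `MyClass`, `setProp`, the `CLASSES` registry and the example URIs are stand-ins, not the actual ePhorte API.

```python
class MyClass:
    """Stand-in for a generated API stub (hypothetical)."""

    def __init__(self):
        self.prop = None

    def setProp(self, value):
        self.prop = value


CLASSES = {"MyClass": MyClass}  # classes the adapter is allowed to instantiate
RDF_TYPE = "rdf:type"


def local_name(uri):
    """Return the last path segment of a URI, e.g. '.../MyClass' -> 'MyClass'."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def build_objects(triples):
    objects = {}
    # First pass: rdf:type statements decide which class to instantiate.
    for s, p, o in triples:
        if p == RDF_TYPE:
            objects[s] = CLASSES[local_name(o)]()
    # Second pass: every other property becomes a setter call.
    for s, p, o in triples:
        if p != RDF_TYPE:
            name = local_name(p)
            setter = getattr(objects[s], "set" + name[0].upper() + name[1:])
            setter(o)
    return objects


triples = [
    ("http://example.org/foo", "http://example.org/prop", "value"),
    ("http://example.org/foo", RDF_TYPE, "http://example.org/MyClass"),
]
foo = build_objects(triples)["http://example.org/foo"]
```

Nothing in `build_objects` names a specific class or property, which is why the incoming RDF has to line up exactly with the API.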
17. Properties
Adding more object types or properties is simple
– we just extend the mapping (and maybe queries)
Data quality improves with more data
– if we don’t have the data to translate employees, that information gets lost
– if the necessary mapping is added later, translation improves automagically
Adding more systems is very easy
– requires more SDShare feeds plus mappings
18. The public journal problem
[Diagram: Internet, DMZ and secure zone; components shown: Journal app, ePhorte, ePhorte Oracle database]
19. The public journal solution
[Diagram: Internet, DMZ and secure zone; components shown: Journal app with an Oracle database, ePhorte with an Oracle database; filtered RDF flows between them over SDShare]
20. Conclusion
Relatively small project, not that many hours
– includes writing reusable ephorte-adapter
– parts of writing the metadata translator, too
– also the XML adapter
– system documentation
– automated deploy system based on Jenkins
Flexible, simple solution
– most of it reusable
– actually captures, as a side-effect, information not available in any other system