Archive integration with RDF
Archive integration at Mattilsynet
Bouvet Tech Meetup 2014-06-11
Lars Marius Garshol, larsga@sesam.no, http://twitter.com/larsga
Archive integrations
A few systems integrated with the archive
– every integration is expensive and painful
Need many more integrations
– to reduce amount of manual work
– hesitation because of cost
Consequences of integrations
– if archive upgraded, must retest all systems
– archive slows down integrated systems
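– changes to archive structure require rewriting all integrations
[Diagram: the archive (Arkiv) and the systems around it: regulations (Regelverk), domain-specific systems #1 and #2 (Fagsystem), web sites (Nettsider), recruitment (Rekruttering), the quality system (Kvalitetssystemet)]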
WebCruiter integration
Very simple project
– integrate WebCruiter with ePhorte
Doing it with RDF because
– it’s much easier and cheaper
– want to extend to more integrations later
– first step toward new architecture
Good example project
– because it’s so simple
SESAM principles
Base everything on RDF and SDShare feeds
– dynamic flows of structured data
Extracts from data sources do not map to a common model
– instead, extract data as they are in the source
– later translate to representation needed by consumers
– this way, changes in source or target do not spill over to the other
No hard bindings from code to data model
– code should have no knowledge of the data model
– all data model-specific logic should be configuration
– makes data changes much easier to handle
RDF?
W3C standard
– for interchange of structured data
– has query language, schema languages, formats, ...
Stored in what is essentially a graph database
– known as a triple store
– like Neo4j or similar
– but standardized
– and with many extra features
Note that the databases are schemaless
– so this is NoSQL
– powerful query language with SPARQL
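To make the "schemaless graph of triples" idea concrete, here is a minimal sketch of an in-memory triple store with wildcard matching. It is an illustration only: the class `TinyTripleStore` and the `ex:` identifiers are made up for this example, and a real triple store would be queried with SPARQL rather than a Java method.

```java
import java.util.ArrayList;
import java.util.List;

// A toy in-memory triple store. Each statement is {subject, property, value};
// a null in a query pattern acts as a wildcard.
class TinyTripleStore {
    private final List<String[]> triples = new ArrayList<>();

    void add(String subject, String property, String value) {
        triples.add(new String[] {subject, property, value});
    }

    // Returns every statement that agrees with the non-null parts of the pattern.
    List<String[]> match(String subject, String property, String value) {
        List<String[]> hits = new ArrayList<>();
        for (String[] t : triples) {
            if ((subject == null || subject.equals(t[0]))
                    && (property == null || property.equals(t[1]))
                    && (value == null || value.equals(t[2]))) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyTripleStore store = new TinyTripleStore();
        // No schema is declared up front: any property can be added at any time.
        store.add("ex:doc1", "ex:title", "Application for deferred payment");
        store.add("ex:doc1", "ex:author", "ex:user123");
        store.add("ex:user123", "ex:name", "Example User");
        for (String[] t : store.match("ex:doc1", null, null)) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

Following the `ex:author` value to the statements about `ex:user123` is what makes this a graph rather than a table.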
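Architecture
[Diagram: data flows from the WebCruiter web service (WS), via XML in files and an SDShare feed, into RDF; from there via SDShare through translation (Oversettelse) and on to the ePhorte adapter, which makes an external call to ePhorte. Arrows are labeled SPARQL Update and HTTP POST, and the components sit on a bus.]
Boxes in orange are Sesam components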
SDShare
A protocol for tracking changes in a data source
– essentially allows clients to keep track of all changes, for replication purposes
– based on Atom and REST
Data source can be anything
– triple store
– relational database
– XML files on disk
– ...
Data flows as RDF
– not an absolute must, but it’s how we do things
A CEN specification
– http://sdshare.org
Basic workings
Server publishes fragments representing changes in datastore
Client pulls these in, updates local copy of dataset
[Diagram: fragments flowing from server to client]
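The client side of this can be sketched as follows. This is a simplified illustration, not the SDShare client itself: it assumes that a fragment carries every current statement about one resource, so applying it means replacing whatever the local copy held for that resource. The class `LocalCopy` and the `ex:` identifiers are hypothetical, and fetching the Atom feed over HTTP is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side copy of a dataset, kept in sync by applying fragments.
// Each statement is a {subject, property, value} array.
class LocalCopy {
    private final List<String[]> statements = new ArrayList<>();

    // Assumed fragment semantics: a fragment holds all current statements about
    // one resource, so the client drops what it had and stores the new ones.
    // A fragment with no statements therefore removes the resource.
    void applyFragment(String resource, String[]... fragment) {
        statements.removeIf(s -> s[0].equals(resource));
        for (String[] s : fragment) {
            statements.add(s);
        }
    }

    List<String> valuesOf(String resource, String property) {
        List<String> values = new ArrayList<>();
        for (String[] s : statements) {
            if (s[0].equals(resource) && s[1].equals(property)) {
                values.add(s[2]);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        LocalCopy copy = new LocalCopy();
        copy.applyFragment("ex:doc1", new String[] {"ex:doc1", "ex:title", "Draft"});
        // A later fragment for the same resource replaces the earlier state.
        copy.applyFragment("ex:doc1", new String[] {"ex:doc1", "ex:title", "Final"});
        System.out.println(copy.valuesOf("ex:doc1", "ex:title")); // prints [Final]
    }
}
```

Because each fragment is a full replacement for its resource, applying the same fragment twice leaves the local copy unchanged, which keeps the client simple.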
From WebCruiter to triple store
[Diagram: XML adapter (SDShare server) → fragments → SDShare client → triple store]
On the server:
• XPath queries to map to RDF
On the client:
• Two URLs
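A rough sketch of what "XPath queries to map to RDF" could look like, using the XPath API from the Java standard library. The XML element names, the property URIs and the class `XPathToRdf` are all assumptions made for this example; the real adapter's configuration format is not shown in the slides.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathToRdf {
    // Configuration: RDF property -> XPath expression. All names are hypothetical.
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("http://example.org/title", "/application/title");
        MAPPING.put("http://example.org/author", "/application/author/@id");
    }

    // Produces one N-Triples-style line per mapped value found in the XML.
    // Literal escaping is omitted to keep the sketch short.
    static List<String> toTriples(String subjectUri, String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> e : MAPPING.entrySet()) {
            try {
                // A fresh InputSource per query, since the reader is consumed.
                String value = xpath.evaluate(e.getValue(),
                        new InputSource(new StringReader(xml)));
                if (!value.isEmpty()) {
                    triples.add("<" + subjectUri + "> <" + e.getKey() + "> \"" + value + "\" .");
                }
            } catch (XPathExpressionException ex) {
                throw new IllegalArgumentException("Bad XPath or XML: " + e.getValue(), ex);
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        String xml = "<application><title>Veterinarian</title><author id=\"123\"/></application>";
        for (String line : toTriples("http://example.org/application/1", xml)) {
            System.out.println(line);
        }
    }
}
```

Adding a new property then means adding one entry to the mapping, with no change to the conversion logic.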
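Translation of metadata
Application metadata (input to the translator, "Oversetter"):
– Title: Søknad om betalingsutsettelse (application for deferred payment)
– Process: 384192
– Author: 123
– Customer: 789
Archive metadata (output):
– Tittel (title): Søknad om betalingsutsettelse
– Sak (case): 485283
– Ansvarlig (responsible): 456
– Kontakt (contact): 987
– Doktype (document type): I
– Arkivdel (archive part): 17
[Diagram: identifiers for the same entity in Application, Active Directory and Archive, linked to each other: 123, xyz, 456; and 789, 987]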
How the mapping works
Standard RDF vocabulary
– mapping between properties
– traversing properties to add values
– uses owl:sameAs to map values
Java implementation
– called metadata-translator (~500 LOC)
– uses very simple SDShare push protocol
– writes translated data to Virtuoso
Supports multiple mappings
– configured using graphs so we know which properties and values to translate to
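The two simplest parts of the mapping, property-to-property mapping and value mapping via owl:sameAs, can be sketched like this. This is not the metadata-translator code: the class `MetadataTranslatorSketch` and the `app:`, `arkiv:` and `user:` names are invented for the example, the mapping is held in plain Java maps instead of RDF graphs, and the traversal of properties to add values is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MetadataTranslatorSketch {
    // Source property -> target property (hypothetical names).
    private final Map<String, String> propertyMap = new HashMap<>();
    // Source value -> equivalent target value, as owl:sameAs statements would assert.
    private final Map<String, String> sameAs = new HashMap<>();

    void mapProperty(String from, String to) { propertyMap.put(from, to); }
    void addSameAs(String from, String to) { sameAs.put(from, to); }

    // Translates {subject, property, value} statements. Statements whose property
    // has no mapping are dropped; values without a sameAs entry pass through.
    List<String[]> translate(List<String[]> input) {
        List<String[]> out = new ArrayList<>();
        for (String[] t : input) {
            String target = propertyMap.get(t[1]);
            if (target == null) {
                continue;
            }
            out.add(new String[] {t[0], target, sameAs.getOrDefault(t[2], t[2])});
        }
        return out;
    }

    public static void main(String[] args) {
        MetadataTranslatorSketch translator = new MetadataTranslatorSketch();
        translator.mapProperty("app:title", "arkiv:tittel");
        translator.mapProperty("app:author", "arkiv:ansvarlig");
        translator.addSameAs("user:123", "user:456");

        List<String[]> input = new ArrayList<>();
        input.add(new String[] {"ex:doc1", "app:title", "Application for deferred payment"});
        input.add(new String[] {"ex:doc1", "app:author", "user:123"});
        for (String[] t : translator.translate(input)) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

Nothing in the translation loop knows about titles or authors; all of that lives in the two maps, which is the point of keeping model-specific logic in configuration.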
What’s to be mapped?
Department cannot be mapped
– structure in WebCruiter added manually
Users cannot be mapped, either
– no common key
– solved using Duke (a record linkage engine)
Department can be defaulted
– in the cases where we know the user
Data transfer to translation
Simply write SPARQL queries to
– produce fragment feed (based on timestamps)
– produce a fragment (trivial)
– produce a snapshot (trivial)
Then configure SDShare client
– just requires two URLs
– translation receives an HTTP POST with the fragment, then does its job
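As an illustration of a timestamp-based fragment feed query, the sketch below builds a SPARQL query that lists resources changed after a given instant. The property URI `http://example.org/updated` and the class `FragmentFeedQuery` are assumptions; the slides do not show the actual queries or how timestamps are stored.

```java
import java.time.Instant;

class FragmentFeedQuery {
    // Hypothetical property holding each resource's last-change timestamp.
    static final String UPDATED = "http://example.org/updated";

    // Builds a query for all resources changed after the given instant,
    // oldest change first, so a client can resume from the last one it saw.
    static String changedSince(Instant since) {
        return "SELECT ?resource ?updated WHERE {\n"
             + "  ?resource <" + UPDATED + "> ?updated .\n"
             + "  FILTER (?updated > \"" + since
             + "\"^^<http://www.w3.org/2001/XMLSchema#dateTime>)\n"
             + "}\n"
             + "ORDER BY ?updated";
    }

    public static void main(String[] args) {
        System.out.println(changedSince(Instant.parse("2014-06-11T00:00:00Z")));
    }
}
```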
ePhorte adapter
Receives RDF
– introspects the RDF and translates to Java API
– Java API is stubs calling SOAP services
Given <foo> rdf:type <.../MyClass>
– it looks up the Java class “MyClass”, then instantiates it
Then, given <foo> <.../prop> “value”
– it looks up method “setProp” on MyClass
– calls object.setProp(“value”)
That’s it
– requires translation to produce RDF exactly aligned with Java API
– means there’s no type-specific code in the adapter
https://github.com/Mattilsynet/arkivgrensesnitt
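The reflection steps described above can be sketched as follows. This is an illustration, not the code in the repository: `MyClass` stands in for a class of the archive's Java API, the `http://example.org/` URIs are made up, and the sketch resolves class names against a small list of known classes, whereas the real adapter may look classes up differently.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for one class of the archive's Java API.
class MyClass {
    private String prop;
    public void setProp(String value) { this.prop = value; }
    public String getProp() { return prop; }
}

class ReflectiveAdapter {
    // Classes the adapter may instantiate (an assumption of this sketch).
    private static final Class<?>[] KNOWN = { MyClass.class };

    // Takes the part of a URI after the last '/' or '#'.
    static String localName(String uri) {
        return uri.substring(Math.max(uri.lastIndexOf('/'), uri.lastIndexOf('#')) + 1);
    }

    // <foo> rdf:type <.../MyClass>  ->  new MyClass()
    static Object instantiate(String typeUri) {
        String name = localName(typeUri);
        for (Class<?> c : KNOWN) {
            if (c.getSimpleName().equals(name)) {
                try {
                    return c.getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot instantiate " + name, e);
                }
            }
        }
        throw new IllegalArgumentException("Unknown type: " + typeUri);
    }

    // <foo> <.../prop> "value"  ->  object.setProp("value")
    static void set(Object target, String propertyUri, String value) {
        String name = localName(propertyUri);
        String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
        try {
            Method m = target.getClass().getMethod(setter, String.class);
            m.invoke(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + setter, e);
        }
    }

    public static void main(String[] args) {
        Object foo = instantiate("http://example.org/MyClass");
        set(foo, "http://example.org/prop", "value");
        System.out.println(((MyClass) foo).getProp()); // prints value
    }
}
```

The sketch only handles string-valued setters; anything else (numbers, dates, references to other objects) would need extra conversion that is not covered here.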
Configuration
[Diagram: the same architecture as before, annotated with what each step needs: XPath mapping, RDF mapping, SQL queries, SPARQL queries]
Look, ma, no code! (or rather: not much code!)
Properties
Adding more object types or properties is simple
– we just extend the mapping (and maybe queries)
Data quality improves with more data
– if we don’t have the data needed to translate employees, that information gets lost
– if the necessary mapping is added later, translation improves automagically
Adding more systems is very easy
– requires more SDShare feeds plus mappings
The public journal problem
[Diagram labels: Internet, DMZ, secure zone; the journal app; ePhorte with its Oracle database]
The public journal solution
[Diagram labels: as in the problem diagram, plus a second ePhorte/Oracle pair, with filtered RDF passed between the two over SDShare]
Conclusion
Relatively small project, not that many hours
– includes writing reusable ephorte-adapter
– parts of writing the metadata translator, too
– also the XML adapter
– system documentation
– automated deploy system based on Jenkins
Flexible, simple solution
– most of it reusable
– actually captures, as a side-effect, information not available in any other system
Questions?
