Australian Society of Archivists
Conference 2016, Parramatta
Session 5: Description and Innovation
Binary Trees? Automatically Identifying
the links between born-digital records.
Ross Spencer
Digital Preservation Analyst
Systems Strategy and Standards team
Department of Internal Affairs
How do we view the world?
Binary Trees?
But that looks like a network graph?!
• It is!
• Records (Items) connected across many recordkeeping and
archival contexts
• Across Functions; People; Agency; Subject; Context; References...
Date; File Format...
• No boundaries!
@ArchivesNZ:ItemA -> references -> @DOC:ItemB
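The item-to-item link above can be held as a plain edge list. A minimal sketch, assuming a tab-separated file of source, relation and target; the links_from helper and the second edge (to a hypothetical @AgencyC:ItemC) are illustrative and not from the talk:

```shell
#!/bin/bash
# Hypothetical edge list: source <TAB> relation <TAB> target
edges=$(mktemp)
printf '%s\t%s\t%s\n' \
   '@ArchivesNZ:ItemA' 'references' '@DOC:ItemB' \
   '@DOC:ItemB' 'references' '@AgencyC:ItemC' > "$edges"

# links_from ITEM FILE: print "relation -> target" for each edge leaving ITEM
links_from ()
{
   awk -F '\t' -v item="$1" '$1 == item { print $2 " -> " $3 }' "$2"
}

links_from '@ArchivesNZ:ItemA' "$edges"
```

Because every edge is a row, a new kind of relation only needs a new value in the middle column, not a new structure.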
Department of Internal Affairs
We know this...
• Continuum model (Multiple contexts over space and time)
• ICA Draft Conceptual Model (RiC)
• 73 Record Relations RiC-R1 to RiC-R73
• Three of which we might be able to (more easily) automate?
• Has Copy; Is Copy Of; Has Part
• Wherein (I suggest) lies the issue...
Archives NZ Context
[In] 2011 Archives New Zealand developed its new conceptual model and metadata
schema for archival description.
Designed to accommodate description of born-digital records.
[There was] much discussion among archivists about the practicalities of describing
relationships between items.
It was acknowledged that, given the volumes of digital records likely to be in each
transfer, neither agency nor Archives staff were likely to examine the content of items
visually one-by-one to determine which other items they referred to...
~ Talei Masters
What then do we do?
• Mathematical properties of digital files...
• Signals ->
• Numbers ->
• Encoding Schemes (UTF-8, ASCII) ->
• Data Structures ->
• File Formats -> User Content.
• Reduce again to a series of numbers that we can interpret to use
numerical properties:
• Greater than; less than; equal to; not equal to...
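To make the reduction to numbers concrete, here is a small sketch that prints the bytes of two short strings as decimal numbers and compares the two series. The bytes_of helper and the sample strings are assumptions for illustration only:

```shell
#!/bin/bash
# bytes_of: read stdin and print each byte as an unsigned decimal number
bytes_of ()
{
   od -An -v -tu1 | tr -s ' \n' ' ' | sed -E 's/^ //; s/ $//'
}

a=$(printf 'Item A' | bytes_of)
b=$(printf 'Item B' | bytes_of)
echo "$a"
echo "$b"
# Equal to / not equal to, applied to the two number series
if [ "$a" = "$b" ]; then echo "equal"; else echo "not equal"; fi
```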
In the relationship between numbers we can find the
relationships between records
Relations we might be able to create...
• Relationship One: Is Identical
• Relationship Two: Is Similar
• Relationship Three: Contains Hyperlink
• Relationship Four: Contains CMS Reference
• Relationship Five: Contains Embedded Digital Objects
• Relationship Six: Contains Intra-Item Relationships
• Relationship Seven: Contains Object References
• Relationship Eight: Item Mentions
Relationship One: Is Identical
• We often have checksums available in the digital repository
• First comparison in a digital transfer... Does Checksum A still equal Checksum A?
• If yes, accept, continue to transfer...
• If no... reject! Inspect!
• Expose this information in the catalogue and compare; what happens?
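One way to explore what happens when checksums are compared across the catalogue: group files by digest and report any digest that occurs more than once. This is a sketch under assumptions, not the repository's actual process; the identical_items helper, the choice of SHA-256 and the demo directories are hypothetical:

```shell
#!/bin/bash
# identical_items DIR...: list files that share a SHA-256 digest with another file
identical_items ()
{
   find "$@" -type f -exec sha256sum {} + | sort |
      awk '{ seen[$1]++; rows[$1] = rows[$1] $0 "\n" }
           END { for (d in seen) if (seen[d] > 1) printf "%s", rows[d] }'
}

# Hypothetical demo: two recordkeeping systems holding one record in common
demo=$(mktemp -d)
mkdir "$demo/system-a" "$demo/system-b"
printf 'Minutes of meeting\n' > "$demo/system-a/minutes.txt"
printf 'Minutes of meeting\n' > "$demo/system-b/copy-of-minutes.txt"
printf 'Unrelated memo\n'     > "$demo/system-b/memo.txt"
identical_items "$demo/system-a" "$demo/system-b"
```

Each reported pair is a candidate "Is Identical" relation between items held in different contexts.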
Relationship One: Is Identical
Archival Context A
Record Keeping System A
Archival Context B
Record Keeping System B
Relationship Two: Is Similar
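• MD5 (Rivest, 1992); note the digests shown are 40 hexadecimal characters, the length of a SHA-1 digest (an MD5 digest is 32):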
• File A (Zero changes):
8c69dc0668c4c73092a7042df45e756adb170742
• File B (1 Byte Removed):
6b75b8f235c148efd1b03d9c113664895b5aa7cd
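The same effect can be reproduced on any two files. A minimal sketch, assuming GNU md5sum is available; the digest_of helper and the sample sentence are illustrative, with File B differing from File A only by one removed byte (the trailing newline):

```shell
#!/bin/bash
# digest_of FILE: print only the MD5 digest of FILE
digest_of ()
{
   md5sum "$1" | cut -d ' ' -f 1
}

work=$(mktemp -d)
# File A ends with a newline; File B is the same apart from that one removed byte
printf 'The quick brown fox jumps over the lazy dog\n' > "$work/file-a"
printf 'The quick brown fox jumps over the lazy dog'   > "$work/file-b"

digest_a=$(digest_of "$work/file-a")
digest_b=$(digest_of "$work/file-b")
echo "File A: $digest_a"
echo "File B: $digest_b"
if [ "$digest_a" != "$digest_b" ]; then
   echo "Digests differ: not identical, with no hint of how similar"
fi
```

A cryptographic digest answers "equal or not equal" only, which is why it serves Relationship One but not Relationship Two.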
Relationship Two: Is Similar
• SSDEEP (Kornblum, 2006):
• File A (Zero changes):
1536:tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:q1aYpYTESSgM2CwQGt9ZBB1U6hP
• File C (First 250 Bytes Removed*):
1536:CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:B1aYpYTESSgM2CwQGt9ZBB1U6hP
*Less than two tweets (2 × 140 bytes)
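To see how little the two signatures differ, the sketch below counts the character positions at which the first chunk of each signature differs. It assumes the wrapped signature lines on the slide join without spaces, and the differing_positions helper is a naive illustration only; ssdeep itself scores matches with an edit distance, not a position-by-position count:

```shell
#!/bin/bash
# differing_positions A B: count character positions at which A and B differ.
# A naive illustration only; this is not how ssdeep computes its match score.
differing_positions ()
{
   local a="$1" b="$2" i=0 n=0 len=${#1}
   if [ "${#b}" -gt "$len" ]; then len=${#b}; fi
   while [ "$i" -lt "$len" ]; do
      if [ "${a:$i:1}" != "${b:$i:1}" ]; then n=$((n + 1)); fi
      i=$((i + 1))
   done
   echo "$n"
}

# First chunk of each signature (the text between the two colons), wrapped lines joined
chunk_a='tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP'
chunk_c='CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP'
echo "differing positions: $(differing_positions "$chunk_a" "$chunk_c") of ${#chunk_a}"
```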
Relationship Two: Is Similar
• First experiments, SSDEEP (Kornblum), TLSH* (Oliver et al.)
• Oliver et al. (2014) Thresholds should be tuned for each application
• First application is item-level sentencing during transfer feasibility
investigations
• Manually sentence... 10 records per hour
• Follow links to those not of archival value...
* Trend Micro Locality Sensitive Hash!
Relationship Two: Is Similar
[Figure panels: MD5 Hash; Fuzzy Hash]
Relationship Two: Is Similar
You liked this record... you might also like...
Relationship Three: Contains HTTP://
• Burnhill et al. (2015)
• 64,000 e-theses, 46,000 pointed out to external sources
• Websites, external files, etc.
Relationship Three: Contains HTTP://
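#!/bin/bash
set -e
# Location of the files to analyse
FILES='/home/digital-preservation/accessions'
dp_analysis ()
{
   # Print the lines of "$1" that contain a hyperlink as one output line:
   # newlines become spaces and remaining control characters are stripped
   catdoc "$1" | grep "http://" | tr '\n' ' ' | tr -d '[:cntrl:]'
   echo
}
# Find loop: split find's output on newlines only
oIFS=$IFS
IFS=$'\n'
time (find "$FILES" -type f | while read -r file; do
   dp_analysis "$file"
done)
IFS=$oIFS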
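The script above reports whole lines that contain a link. A possible refinement, sketched here as an assumption and not as part of the original gist, is to pull out each URL on its own so that links can be recorded as individual relations. The extract_links helper and the example.org addresses are hypothetical, and the pattern will keep trailing punctuation that touches a URL:

```shell
#!/bin/bash
# extract_links: read text on stdin, print each distinct http(s) URL once
extract_links ()
{
   grep -oE 'https?://[^[:space:]"<>]+' | sort -u
}

printf '%s\n' \
   'See http://example.org/a and https://example.org/b for details.' \
   'Repeated link: http://example.org/a' | extract_links
```

In the loop above this could be fed with the text extracted by catdoc, one file at a time.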
Relationship Three: Contains HTTP://
• https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520
• ~5059 parseable born-digital records
• ~4800 lines contained hyperlinks
Relationship Four: Contains CMS Reference
echo -e $(catdoc "$file" | grep -E "A[0-9]{6}")
• Matches the Archway catalogue reference number, e.g.
A204050; A123456; and not AZ12345
• CMS reference could be sent alongside transfer metadata
for such searches.
• Flag existence (at least) - FYI to the end user – be that the
transfer archivist, the agency, or the researcher
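A sketch of how such references might be listed individually. It assumes, from the examples above, that a reference is the letter A followed by exactly six digits and stands as a whole word; the find_cms_refs helper is hypothetical:

```shell
#!/bin/bash
# find_cms_refs: read text on stdin, print each whole-word match of "A" plus six digits
find_cms_refs ()
{
   grep -owE 'A[0-9]{6}' | sort -u
}

printf 'See A204050 and A123456, but not AZ12345 or A1234567.\n' | find_cms_refs
```

The -w option is what rejects longer strings such as A1234567, which a bare A[0-9]{6} pattern would otherwise match in part.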
Relationship Five: Contains Embedded Object
$ java -jar tika-app-1.13.jar -z --extract-dir=<dirname> <filename>
Relationship Six: Contains Intra-Item Record
Relationship Seven: Contains Object Reference
A digital preservation risk...
Relationship Seven: Contains Object Reference
Extract files from PPT OLE2 -> Read PowerPoint Document Object ->
Look for:
Relationship Eight: Item Mentions
Dictionary:
Helen Clark
Helen Elizabeth Clark
John Key
United Nations
Prime Minister
University of Auckland
Jenny Shipley
Labour Party
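A dictionary like this can be matched against extracted text with fixed-string search. A minimal sketch, assuming one term per line and exact, case-sensitive matches; the item_mentions helper and the two sample sentences are illustrative only:

```shell
#!/bin/bash
# item_mentions DICTIONARY: read text on stdin and count occurrences of each dictionary term
item_mentions ()
{
   grep -oFf "$1" | sort | uniq -c | sed -E 's/^ +//'
}

dictionary=$(mktemp)
printf '%s\n' 'Helen Clark' 'Helen Elizabeth Clark' 'John Key' 'United Nations' \
   'Prime Minister' 'University of Auckland' 'Jenny Shipley' 'Labour Party' > "$dictionary"

printf '%s\n' \
   'Helen Clark succeeded Jenny Shipley as Prime Minister.' \
   'Helen Clark later worked at the United Nations.' | item_mentions "$dictionary"
```

Each output line is a count followed by the term, which is enough to flag an "Item Mentions" relation and to rank items by how often a name occurs.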
Relationship Eight: Item Mentions
Discussion
• Data structures – support needed in catalogue, and digital
preservation system...
• Extensible, flexible enough not to (need to) know what the
future holds...
• AS/NZS 5478:2015, Recordkeeping metadata property
reference set (RMPRS) states:
“The digital world is increasingly using networked
relationships”.
Discussion
• Verhoeven (2016) – Devil’s Bridges!
– Ontological, graph/network based infrastructures
– Vernacular ontologies
– Understand, Make, Improve Quality of our Connections
– redistribution of power and the possibilities of world
making (and remaking) in the archive
Providing the algorithms are transparent, what then
provides a more objective view of the world than machine
generated relations?
Discussion
• ICA... RiC-R7: ‘is Draft Of’ semantics (A Speech):
– Still a draft if 80% content is different from published?
– Draft because it’s marked as such in metadata?
– Draft when it has been delivered in the wild?
• ICA... RiC-R4: ‘has Subject’ semantics (This Presentation):
– Graph technologies?
– Digital preservation?
– Processing of digital archives?
– Binary trees?!
“RiC-CM aspires to reflect both facets of the Principle of Provenance, as
these have traditionally been understood and practiced, and at the same
time recognize a more expansive and dynamic understanding of
provenance. It is this more expansive understanding that is embodied in
the word “Contexts.” RiC-CM is intended to enable a fuller, if forever
incomplete, description of the contexts in which records emerge and exist,
so as to enable multiple perspectives and multiple avenues of access.”
Discussion
• Impact for record keeping; transfer; digital preservation;
discovery...
• Digital preservation – linked objects, hyperlinks, embedded
objects...
• Not all geekery!
• Remember the content of these records...
• Remember the connections...
• Remember the use-cases for digital preservation; it does not
operate in and of itself!
Conclusion
• “Computer forensic examiners are often overwhelmed with
data. Modern hard drives contain more information that
cannot be manually examined in a reasonable time period
creating a need for data reduction techniques.” - Kornblum
(2006)
• So how do we begin?
One relation at a time...
Links
• Checksum 101: http://www.slideshare.net/RossSpencer/checksum-101
• SSCOMPARE: https://github.com/exponential-decay/sscompare
• TLSH Experiments: https://github.com/exponential-decay/tlsh-experiments
• Parallel Lines Workshop: https://github.com/andreakb/parallel-lines-workshop
• Apache Tika: https://tika.apache.org/
• Full Paper: Hopefully in Archives and Manuscripts sometime soon!
Thank you
ross.spencer@dia.govt.nz
@beet_keeper

Binary Trees? Automatically identifying the links between born-digital records

  • 1. Australian Society of Archivists Conference 2016, Parramatta Session 5: Description and Innovation Binary Trees? Automatically Identifying the links between born-digital records. Ross Spencer Digital Preservation Analyst Systems Strategy and Standards team
  • 2. Department of Internal Affairs How do we view the world?
  • 3. Department of Internal Affairs Binary Trees?
  • 4. Department of Internal Affairs But that looks like a network graph?! • It is! • Records (Items) connected across many recordkeeping and archival contexts • Across functions; People; Agency; Subject; Context; References; Subject... Date, File Format... • No boundaries! @ArvhivesNZ:ItemA -> references -> @DOC:ItemB
  • 5. Department of Internal Affairs We know this... • Continuum model (Multiple contexts over space and time) • ICA Draft Conceptual Model (RiC) • 73 Record Relations RiC-R1 to RiC-R73 • Three of which we might be able to (more easily) automate? • Has Copy; Is Copy Of; Has Part • Wherein (I suggest) lies the issue...
  • 6. Department of Internal Affairs Archives NZ Context 2011 Archives New Zealand developed its new conceptual model and metadata schema for archival description. Designed to accommodate description of born-digital records. much discussion among archivists about the practicalities of describing relationships between items. It was acknowledged that, given the volumes of digital records likely to be in each transfer, neither agency nor Archives staff were likely to examine the content of items visually one-by-one to determine which other items they referred to... ~ Talei Masters
  • 7. Department of Internal Affairs What then do we do? • Mathematical properties of digital files... • Signals -> • Numbers -> • Encoding Schemes (UTF8, ASCII) - > • Data Structures -> • File Formats -> User Content. • Reduce again to a series of numbers that we can interpret to use numerical properties: • Greater than; less than; equal to; not equal to...
  • 8. Department of Internal Affairs In the relationship between numbers we can find the relationships between records
  • 9. Department of Internal Affairs Relations we might be able to create... • Relationship One: Is Identical • Relationship Two: Is Similar • Relationship Three: Contains Hyperlink • Relationship Four: Contains CMS Reference • Relationship Five: Contains Embedded Digital Objects • Relationship Six: Contains Intra-Item Relationships • Relationship Seven: Contains Object References • Relationship Eight: Item Mentions
  • 10. Department of Internal Affairs Relationship One: Is Identical ● We often have checksums available in digital repository ● First comparison in a digital transfer... Does Checksum A still equal Checksum A? ● If yes, accept, continue to transfer... ● If no... reject! Inspect! ● Expose this information in the catalogue and compare; what happens?
  • 11. Department of Internal Affairs Relationship One: Is Identical Archival Context A Record Keeping System A Archival Context B Record Keeping System B
  • 12. Department of Internal Affairs Relationship Two: Is Similar • MD5 (Rivest, 1992): • File A (Zero changes): 8c69dc0668c4c73092a7042df45e756adb170742 • File B (1 Byte Removed): 6b75b8f235c148efd1b03d9c113664895b5aa7cd
  • 13. Department of Internal Affairs Relationship Two: Is Similar • SSDEEP (Kornblum, 2006): • File A (Zero changes): 1536:tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4k CK7ZBEY0t5vykp6CYP:q1aYpYTESSgM2CwQGt9Z BB1U6hP • File C (First 250 Bytes Removed*): 1536:CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4k CK7ZBEY0t5vykp6CYP:B1aYpYTESSgM2CwQGt9 ZBB1U6hP *Less than two tweets (140 bytes)
  • 14. Department of Internal Affairs Relationship Two: Is Similar • First experiments, SSDEEP (Kornblum), TLSH* (Oliver et al.) • Oliver et al. (2014) Thresholds should be tuned for each application • Fiirst application is item level sentencing during transfer feasibility investigations • Manually sentence... 10 records per hour • Follow links to those not of archival value... * Trend Micro Locality Sensitivity Hash!
  • 15. Relationship Two: Is Similar MD5 Hash Fuzzy Hash
  • 16. Relationship Two: Similar
  • 17. Relationship Two: Similar
  • 18. Relationship Two: Is Similar You liked this record... you might also like...
  • 19. Relationship Three: Contains HTTP:// • Burnhill et al. (2015) • 64,000 e-theses, 46,000 pointed out to external sources • Websites, external files, etc.
  • 20. Relationship Three: Contains HTTP://
      #!/bin/bash
      set -e

      # Location of the files to analyse
      FILES='/home/digital-preservation/accessions'

      # Print the lines of a document's extracted text that contain a hyperlink
      dp_analysis () {
          echo -e $(catdoc "$1" | grep "http://") | tr -d '[:cntrl:]'
          echo
      }

      # Find loop: split on newlines only, so paths containing spaces survive
      oIFS=$IFS
      IFS=$'\n'
      time (find "$FILES" -type f | while read -r file; do
          dp_analysis "$file"
      done)
      IFS=$oIFS
  • 21. Relationship Three: Contains HTTP:// • https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520 • ~5059 parseable born-digital records • ~4800 lines contained hyperlinks
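Where the individual links are wanted rather than whole matching lines, a sketch along the following lines might help. It assumes text has already been extracted (for example by `catdoc`) and arrives on standard input; the helper name `extract_links` and the extension to `https://` are assumptions beyond what the slides show.

```shell
#!/bin/bash
# Sketch: print each distinct hyperlink found in text on standard input.
# extract_links is a hypothetical helper name used for illustration only.
extract_links () {
    # -o prints only the matched URL; a URL ends at whitespace, a quote,
    # or an angle bracket. Trailing punctuation is not trimmed.
    grep -oE 'https?://[^[:space:]"<>]+' | LC_ALL=C sort -u
}
```

For a single record this could be used as `catdoc "$file" | extract_links`, giving one candidate "Contains Hyperlink" relation per output line.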
  • 22. Relationship Four: Contains CMS Reference echo -e $(catdoc "$file" | grep -E "A[0-9]{6}") • Matches the Archway catalogue reference number, e.g. A204050; A123456; and not AZ12345 • CMS reference could be sent alongside transfer metadata for such searches. • Flag existence (at least) - FYI to the end user – be that the transfer archivist, the agency, or the researcher
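A slightly stricter sketch of the same search, assuming GNU grep. The `-w` flag is an addition here, not something the slides use: it requires the reference to stand as a whole word, so that a longer token such as A1234567 is not reported as A123456. The helper name `extract_cms_refs` is hypothetical.

```shell
#!/bin/bash
# Sketch: print each distinct catalogue-style reference (A + six digits)
# found in text on standard input.
# extract_cms_refs is a hypothetical helper name used for illustration only.
extract_cms_refs () {
    # -o prints only the match; -w rejects matches embedded in longer words.
    grep -owE 'A[0-9]{6}' | LC_ALL=C sort -u
}
```

As with the hyperlink search, `catdoc "$file" | extract_cms_refs` would flag the references found in one record.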
  • 23. Relationship Five: Contains Embedded Object $ java -jar tika-app1.13.jar -z --extract-dir=<dirname> <filename>
  • 24. Relationship Six: Contains Intra-Item Record
  • 25. Relationship Seven: Contains Object Reference A digital preservation risk...
  • 26. Relationship Seven: Contains Object Reference Extract files from PPT OLE2 -> Read PowerPoint Document Object -> Look for:
  • 27. Relationship Eight: Item Mentions Dictionary: Helen Clark Helen Elizabeth Clark John Key United Nations Prime Minister University of Auckland Jenny Shipley Labour Party
  • 28. Relationship Eight: Item Mentions
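One possible way to turn a dictionary like this into Item Mentions is a fixed-string search, sketched below. It assumes the dictionary is a plain text file with one term per line and that the record's extracted text arrives on standard input; the helper name `item_mentions` is hypothetical, and matching is case-sensitive.

```shell
#!/bin/bash
# Sketch: count how often each dictionary term is mentioned in text on
# standard input. $1 is a file holding one dictionary term per line.
# item_mentions is a hypothetical helper name used for illustration only.
item_mentions () {
    # -F treats terms as literal strings, -f reads them from the file,
    # -o prints each matched term on its own line for counting.
    grep -oFf "$1" | LC_ALL=C sort | uniq -c
}
```

Each output line is a count followed by the term, so `catdoc "$file" | item_mentions dictionary.txt` would show which people and organisations a record mentions, and how often.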
  • 29. Discussion • Data structures – support needed in catalogue, and digital preservation system... • Extensible, flexible enough not to (need to) know what the future holds... • AS/NZS 5478:2015, Recordkeeping metadata property reference set (RMPRS) states: “The digital world is increasingly using networked relationships”.
  • 30. Discussion • Verhoeven (2016) – Devil’s Bridges! – Ontological, graph/network based infrastructures – Vernacular ontologies – Understand, Make, Improve Quality of our Connections – redistribution of power and the possibilities of world making (and remaking) in the archive
  • 31. Providing the algorithms are transparent, what then provides a more objective view of the world than machine generated relations?
  • 32. Discussion • ICA... RiC-R7: ‘is Draft Of’ semantics (A Speech): – Still a draft if 80% content is different from published? – Draft because it’s marked as such in metadata? – Draft when it has been delivered in the wild? • ICA... RiC-R4: ‘has Subject’ semantics (This Presentation): – Graph technologies? – Digital preservation? – Processing of digital archives? – Binary trees?!
  • 33. “RiC-CM aspires to reflect both facets of the Principle of Provenance, as these have traditionally been understood and practiced, and at the same time recognize a more expansive and dynamic understanding of provenance. It is this more expansive understanding that is embodied in the word “Contexts.” RiC-CM is intended to enable a fuller, if forever incomplete, description of the contexts in which records emerge and exist, so as to enable multiple perspectives and multiple avenues of access.”
  • 34. Discussion • Impact for record keeping; transfer; digital preservation, discovery... • Digital preservation – linked objects, hyperlinks, embedded objects... • Not all geekery! • Remember the content of these records... • Remember the connections... • Remember use-cases for digital preservation, it does not operate, in and of itself!
  • 35. Conclusion • “Computer forensic examiners are often overwhelmed with data. Modern hard drives contain more information that cannot be manually examined in a reasonable time period creating a need for data reduction techniques.” - Kornblum (2006) • So how do we begin? One relation at a time...
  • 36. Links • Checksum 101: http://www.slideshare.net/RossSpencer/checksum-101 • SSCOMPARE: https://github.com/exponential-decay/sscompare • TLSH Experiments: https://github.com/exponential-decay/tlsh-experiments • Parallel Lines Workshop: https://github.com/andreakb/parallel-lines-workshop • Apache Tika: https://tika.apache.org/ • Full Paper: Hopefully in Archives and Manuscripts sometime soon!