Binary Trees? Automatically identifying the links between born-digital records
1. Australian Society of Archivists
Conference 2016, Parramatta
Session 5: Description and Innovation
Binary Trees? Automatically Identifying
the links between born-digital records.
Ross Spencer
Digital Preservation Analyst
Systems Strategy and Standards team
4. Department of Internal Affairs
But that looks like a network graph?!
• It is!
• Records (Items) connected across many recordkeeping and
archival contexts
• Across functions; People; Agency; Subject; Context; References;
Date; File Format...
• No boundaries!
@ArchivesNZ:ItemA -> references -> @DOC:ItemB
5. Department of Internal Affairs
We know this...
• Continuum model (Multiple contexts over space and time)
• ICA Draft Conceptual Model (RiC)
• 73 Record Relations RiC-R1 to RiC-R73
• Three of which we might be able to (more easily) automate?
• Has Copy; Is Copy Of; Has Part
• Wherein (I suggest) lies the issue...
6. Department of Internal Affairs
Archives NZ Context
In 2011 Archives New Zealand developed its new conceptual model and metadata
schema for archival description.
Designed to accommodate description of born-digital records.
There was much discussion among archivists about the practicalities of describing relationships
between items.
It was acknowledged that, given the volumes of digital records likely to be in each
transfer, neither agency nor Archives staff were likely to examine the content of items
visually one-by-one to determine which other items they referred to...
~ Talei Masters
7. Department of Internal Affairs
What then do we do?
• Mathematical properties of digital files...
• Signals ->
• Numbers ->
• Encoding Schemes (UTF8, ASCII) ->
• Data Structures ->
• File Formats -> User Content.
• Reduce again to a series of numbers that we can interpret to use
numerical properties:
• Greater than; less than; equal to; not equal to...
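The reduction to numbers can be made concrete with standard tools. A minimal sketch, assuming od, tr and sed are available (GNU versions were assumed here); the function name bytes_as_numbers is hypothetical and used only for illustration. It prints the first N bytes of a file as decimal numbers, which can then be compared like any other numbers.

```shell
# Hypothetical helper: print the first N bytes of FILE as decimal numbers.
# Usage: bytes_as_numbers FILE N
bytes_as_numbers () {
    # od prints unsigned one-byte decimals; tr and sed tidy the spacing.
    od -An -v -tu1 -N "$2" "$1" | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Example: the two bytes of the text "AB" are the numbers 65 and 66.
demo=$(mktemp)
printf 'AB' > "$demo"
bytes_as_numbers "$demo" 2
rm -f "$demo"
```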
8. Department of Internal Affairs
In the relationship between numbers we can find the
relationships between records
9. Department of Internal Affairs
Relations we might be able to create...
• Relationship One: Is Identical
• Relationship Two: Is Similar
• Relationship Three: Contains Hyperlink
• Relationship Four: Contains CMS Reference
• Relationship Five: Contains Embedded Digital Objects
• Relationship Six: Contains Intra-Item Relationships
• Relationship Seven: Contains Object References
• Relationship Eight: Item Mentions
10. Department of Internal Affairs
Relationship One: Is Identical
• We often have checksums available in a digital repository
• First comparison in a digital transfer... Does Checksum A still equal Checksum A?
• If yes, accept, continue to transfer...
• If no... reject! Inspect!
• Expose this information in the catalogue and compare; what happens?
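Comparing checksums across a whole set of files, rather than one file against itself, might look like the sketch below. It assumes GNU coreutils (md5sum, and uniq with -w and --all-repeated); the function name find_identical is hypothetical. It lists only those files whose checksum is shared with at least one other file.

```shell
# Hypothetical helper: list files under DIR that share a checksum.
# Output: "checksum  path" lines, with a blank line between groups.
# Usage: find_identical DIR
find_identical () {
    # An MD5 digest is 32 hex characters, so uniq compares only that prefix.
    find "$1" -type f -exec md5sum {} + | sort | uniq -w 32 --all-repeated=separate
}
```

Each group in the output is a candidate "Is Identical" relation between items, even where they sit in different archival contexts.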
11. Department of Internal Affairs
Relationship One: Is Identical
Archival Context A
Record Keeping System A
Archival Context B
Record Keeping System B
12. Department of Internal Affairs
Relationship Two: Is Similar
• MD5 (Rivest, 1992) and similar cryptographic hashes (the 40-character digests below have the length of SHA-1):
• File A (Zero changes):
8c69dc0668c4c73092a7042df45e756adb170742
• File B (1 Byte Removed):
6b75b8f235c148efd1b03d9c113664895b5aa7cd
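The effect shown above can be reproduced on any file. A minimal sketch, assuming GNU coreutils (md5sum, and head with a negative byte count); the function name digest_before_and_after is hypothetical. It prints the digest of a file and then the digest of the same content with its final byte removed.

```shell
# Hypothetical helper: print the MD5 digest of FILE, then the digest of
# the same content with its last byte removed.
# Usage: digest_before_and_after FILE
digest_before_and_after () {
    md5sum < "$1" | cut -d' ' -f1
    # GNU head: "-c -1" means "everything except the last byte".
    head -c -1 "$1" | md5sum | cut -d' ' -f1
}
```

The two digests share nothing recognisable, so a cryptographic checksum can say "identical" or "not identical" but never "similar".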
13. Department of Internal Affairs
Relationship Two: Is Similar
• SSDEEP (Kornblum, 2006):
• File A (Zero changes):
1536:tLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:q1aYpYTESSgM2CwQGt9ZBB1U6hP
• File C (First 250 Bytes Removed*):
1536:CLQy16aYRCWYTESSg3yDuBCwclnHpQ/B4kCK7ZBEY0t5vykp6CYP:B1aYpYTESSgM2CwQGt9ZBB1U6hP
*Less than two tweets (140 bytes each)
14. Department of Internal Affairs
Relationship Two: Is Similar
• First experiments, SSDEEP (Kornblum), TLSH* (Oliver et al.)
• Oliver et al. (2014) Thresholds should be tuned for each application
• First application is item-level sentencing during transfer feasibility
investigations
• Manually sentence... 10 records per hour
• Follow links to those not of archival value...
* Trend Micro Locality Sensitive Hash!
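The idea of a tunable threshold can be sketched without either tool. The example below is a simplified stand-in, not SSDEEP or TLSH: it scores two text files by the percentage of distinct lines they share and compares that score with a threshold. The function names line_similarity and is_similar, and the default threshold of 80, are assumptions made for illustration.

```shell
# Stand-in similarity measure (NOT ssdeep or TLSH): percentage (0-100) of
# distinct lines that two text files have in common.
# Usage: line_similarity FILE_A FILE_B
line_similarity () {
    local shared total
    # Lines present in both files appear twice after de-duplicating each file.
    shared=$( { sort -u "$1"; sort -u "$2"; } | sort | uniq -d | wc -l )
    total=$(cat "$1" "$2" | sort -u | wc -l)
    if [ "$total" -eq 0 ]; then echo 0; return; fi
    echo $(( 100 * $shared / $total ))
}

# Succeeds when the score reaches the threshold (default 80, an arbitrary choice).
# Usage: is_similar FILE_A FILE_B [THRESHOLD]
is_similar () {
    [ "$(line_similarity "$1" "$2")" -ge "${3:-80}" ]
}
```

Raising or lowering the third argument is the tuning step: a low threshold links more records but with more false positives, a high one links fewer.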
18. Department of Internal Affairs
Relationship Two: Is Similar
You liked this record... you might also like...
19. Department of Internal Affairs
Relationship Three: Contains HTTP://
• Burnhill et al. (2015)
• 64,000 e-theses, 46,000 pointed out to external sources
• Websites, external files, etc.
20. Department of Internal Affairs
Relationship Three: Contains HTTP://
#!/bin/bash
set -e
# Location of the files to analyse
FILES='/home/digital-preservation/accessions'
# Print every line of extracted text that contains a hyperlink
dp_analysis ()
{
echo -e $(catdoc "$1" | grep "http://") | tr -d '[:cntrl:]'
echo
}
# Find loop: split on newlines only, so paths containing spaces survive
oIFS=$IFS
IFS=$'\n'
time (find "$FILES" -type f | while read -r file; do
dp_analysis "$file"
done)
IFS=$oIFS
21. Department of Internal Affairs
Relationship Three: Contains HTTP://
• https://gist.github.com/ross-spencer/a6411a021afb7de7e3dc6dd713f7b520
• ~5059 parseable born-digital records
• ~4800 lines contained hyperlinks
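Turning matched lines into explicit relations could look like the sketch below. It assumes the text has already been extracted to a plain-text file (for example with catdoc); the function name link_relations and the output wording are hypothetical, and the URL pattern is deliberately crude.

```shell
# Hypothetical helper: print one "ITEM -> references -> URL" line for each
# distinct URL found in a plain-text file.
# Usage: link_relations TEXT_FILE
link_relations () {
    # Crude pattern: a URL runs until whitespace, a quote or an angle bracket,
    # so trailing punctuation such as a full stop is kept as part of the URL.
    grep -oE 'https?://[^[:space:]<>"]+' "$1" | sort -u | while read -r url; do
        printf '%s -> references -> %s\n' "$1" "$url"
    done
}
```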
22. Department of Internal Affairs
Relationship Four: Contains CMS Reference
echo -e $(catdoc "$file" | grep -E "A[0-9]{6}")
• Matches the Archway catalogue reference number, e.g.
A204050; A123456; and not AZ12345
• CMS reference could be sent alongside transfer metadata
for such searches.
• Flag existence (at least) - FYI to the end user – be that the
transfer archivist, to the agency, to the researcher
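To return only the reference numbers, rather than whole matching lines, the pattern can be anchored to whole words. A minimal sketch, assuming GNU grep and text already extracted to a plain-text file; the function name cms_references is hypothetical.

```shell
# Hypothetical helper: list the distinct catalogue references (the letter A
# followed by exactly six digits) mentioned in a plain-text file.
# Usage: cms_references TEXT_FILE
cms_references () {
    # -o prints only the match; -w rejects matches embedded in longer words,
    # so AZ12345 and A1234567 are both ignored.
    grep -owE 'A[0-9]{6}' "$1" | sort -u
}
```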
23. Department of Internal Affairs
Relationship Five: Contains Embedded Object
$ java -jar tika-app-1.13.jar -z --extract-dir=<dirname> <filename>
25. Department of Internal Affairs
Relationship Seven: Contains Object Reference
A digital preservation risk...
26. Department of Internal Affairs
Relationship Seven: Contains Object Reference
Extract files from PPT OLE2 -> Read PowerPoint Document Object ->
Look for:
27. Department of Internal Affairs
Relationship Eight: Item Mentions
Dictionary:
Helen Clark
Helen Elizabeth Clark
John Key
United Nations
Prime Minister
University of Auckland
Jenny Shipley
Labour Party
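A dictionary like the one above can be applied with fixed-string matching. A minimal sketch, assuming GNU grep, one term per line in the dictionary file with no blank lines, and text already extracted to a plain-text file; the function name item_mentions is hypothetical.

```shell
# Hypothetical helper: count how often each dictionary term is mentioned
# in a plain-text file, most frequent first.
# Usage: item_mentions DICTIONARY_FILE TEXT_FILE
item_mentions () {
    # -F treats terms as literal strings, -f reads them from the dictionary,
    # -i ignores case, -o prints each mention on its own line.
    # Mentions are counted as they are written in the text, so differently
    # capitalised spellings of a term are tallied separately.
    grep -oiFf "$1" "$2" | sort | uniq -c | sort -rn
}
```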
29. Department of Internal Affairs
Discussion
• Data structures – support needed in catalogue, and digital
preservation system...
• Extensible, flexible enough not to (need to) know what the
future holds...
• AS/NZS 5478:2015, Recordkeeping metadata property
reference set (RMPRS) states:
“The digital world is increasingly using networked
relationships”.
30. Department of Internal Affairs
Discussion
• Verhoeven (2016) – Devil’s Bridges!
– Ontological, graph/network based infrastructures
– Vernacular ontologies
– Understand, Make, Improve Quality of our Connections
– redistribution of power and the possibilities of world
making (and remaking) in the archive
31. Department of Internal Affairs
Providing the algorithms are transparent, what then
provides a more objective view of the world than machine
generated relations?
32. Department of Internal Affairs
Discussion
• ICA... RiC-R7: ‘is Draft Of’ semantics (A Speech):
– Still a draft if 80% content is different from published?
– Draft because it’s marked as such in metadata?
– Draft when it has been delivered in the wild?
• ICA... RiC-R4: ‘has Subject’ semantics (This Presentation):
– Graph technologies?
– Digital preservation?
– Processing of digital archives?
– Binary trees?!
33. Department of Internal Affairs
“RiC-CM aspires to reflect both facets of the Principle of Provenance, as
these have traditionally been understood and practiced, and at the same
time recognize a more expansive and dynamic understanding of
provenance. It is this more expansive understanding that is embodied in
the word “Contexts.” RiC-CM is intended to enable a fuller, if forever
incomplete, description of the contexts in which records emerge and exist,
so as to enable multiple perspectives and multiple avenues of access.”
34. Department of Internal Affairs
Discussion
• Impact for record keeping; transfer; digital preservation;
discovery...
• Digital preservation – linked objects, hyperlinks, embedded
objects...
• Not all geekery!
• Remember the content of these records...
• Remember the connections...
• Remember the use cases for digital preservation; it does not
operate in and of itself!
35. Department of Internal Affairs
Conclusion
• “Computer forensic examiners are often overwhelmed with
data. Modern hard drives contain more information that
cannot be manually examined in a reasonable time period
creating a need for data reduction techniques.” - Kornblum
(2006)
• So how do we begin?
One relation at a time...
36. Department of Internal Affairs
Links
• Checksum 101: http://www.slideshare.net/RossSpencer/checksum-101
• SSCOMPARE: https://github.com/exponential-decay/sscompare
• TLSH Experiments: https://github.com/exponential-decay/tlsh-experiments
• Parallel Lines Workshop: https://github.com/andreakb/parallel-lines-workshop
• Apache Tika: https://tika.apache.org/
• Full Paper: Hopefully in Archives and Manuscripts sometime soon!