pro-iBiosphere Markup Workshop

Efforts and plans towards
Markup of the BHL Content
William Ulate R.
BHL Technical Director
Missouri Botanical Garden
Berlin, Feb. 10, 2014
BHL Mission and Vision
More Online Content
[Chart: Pages (millions) and volumes (thousands) included in BHL, Oct-08 to Oct-13. Series: Volumes (K), Pages (M)]
Subjects
New Types of Content
Scientific Name Extraction
• TaxonFinder algorithm in production since 2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet Archive

• New collaboration with Global Names project
– Improved algorithm, better precision & recall
– More data with TaxonFinder and Neti Neti!
– http://gnrd.globalnames.org/
Taxon Names
                  BEFORE                          AFTER
Name Instances    101,591,803    101,288,804      151,222,182    150,066,425
Unique Names        7,498,554      7,464,924       29,246,382     29,091,767
Verified Names      1,905,507      1,902,803       10,153,165     10,109,540
EOL Names          63,130,350     62,963,582       87,791,695     87,135,089
EOL Pages          13,579,868     13,532,684       15,466,713     15,342,867
BHL Markup Efforts and Plans
Article-level metadata
Chapter-level metadata
Treatment-level metadata

Part-level metadata
Articles in the BHL UI
See also:
Related Titles
Global Replication & Serving
Replicated Data Center

Portal Application
BHL-Europe Term Expansion
Taxonomic Literature II (TL-2)
BioStor articles marked up with JATS
Art of Life
Macaw

https://github.com/cajunjoel/macaw-book-metadata-tool
Reviewing Metadata
Manually built:
1,693 sets
87,879 images
The Art of Life schema: describing and providing access to natural history
illustrations from the Biodiversity Heritage Library (BHL)
by William Ulate, Trish Rose-Sandler, Gaurav Vaidya, Robert Guralnick
Example of illustration described using Art of Life schema

Title:         Stictospiza formosa
Type:          Illustrations
Date:          Publication: 1898
Agent:         Author: Arthur G. Butler (1844-1925)
               Illustrator: F.W. Frohawk (1861-1946)
Description:   A pair of finches with green and yellow bodies resting on reeds
Subjects:      Scientific name: Amandava formosa (Latham, 1790)
               Vernacular Name: Green Avadavat or Green Munia
               Accepted Name: Amandava formosa (Latham, 1790)
               Birds, finches
Inscriptions:  bottom center: Green Amaduvade Waxbill (Stictospiza formosa)
Source:        Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and
               Clarke, limited, 1889 (2nd edition). This image comes from the Biodiversity Heritage
               Library, and is available online at biodiversitylibrary.org/page/17195895
Rights:        Public domain

Art of Life schema elements (required elements in red)

Element: Agents
Definition: Person or corporate entity involved in the creation, design, production, or
publication of a visual resource.
Repeat: Y
Example:
<vra:agent>
<vra:name type="personal" vocab="LCNAF" refid="89015596">Curtis, John</vra:name>
<vra:dates type="life">
<vra:earliestDate>1791</vra:earliestDate>
<vra:latestDate>1862</vra:latestDate>
</vra:dates>
<vra:role vocab="AAT" refid="300025574">publisher</vra:role>
</vra:agent>

Element: Copyright
Definition: The copyright status of the visual resource.
Repeat: N
Example:
<vra:rights refid="http://creativecommons.org/licenses/by-nc/2.0/deed.en">Creative Commons
Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)</vra:rights>

Element: Date
Definition: Date or range of dates associated with the creation or publication of the visual
resource.
Repeat: Y
Example:
<vra:date type="creation">
<vra:earliestDate>1945</vra:earliestDate>
<vra:latestDate>1955</vra:latestDate>
</vra:date>

Element: Description
Definition: A free-text note about the content of the image, including comments, description,
or interpretation, that gives additional information not recorded in other categories.
Repeat: Y
Example:
<vra:description>This illustration shows a scale, coloured illustration
of Sepsis annulipes (now known as Encita annulipes) beside the
Trifolium ochroleucum plant. Several dissections from Sepsis
cylindrica Fab. (all these details are provided on the next page of this
book and the subsequent page).</vra:description>

Element: Inscriptions
Definition: All marks, captions, or written words added to the object at the time of
production or in its subsequent history, including signatures, dates, dedications,
texts, and colophons, as well as marks such as the stamps of silversmiths,
publishers, or printers.
Repeat: Y
Example:
<vra:inscription>
<vra:position>bottom</vra:position>
<vra:text>Radula of L. souleyetianum on a more reduced scale</vra:text>
</vra:inscription>

Element: Source
Definition: A citation for the book, journal or resource that hosts the visual resource.
Repeat: N
Example:
<vra:source>
<vra:name type="book">Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London:
Brumby and Clarke, limited, 1889 (2nd edition).</vra:name>
<vra:refid type="URI">http://biodiversitylibrary.org/page/17195895</vra:refid>
</vra:source>

Element: Subject
Definition: Terms or phrases that describe, identify, or interpret the visual resource.
Repeat: Y
Example:
<vra:subject><vra:term type="personalName">Carl Linnaeus</vra:term></vra:subject>

Taxon name examples (Darwin Core):
<dwc:scientificName>Plant: Picea abies</dwc:scientificName>
<dwc:acceptedName>Plant: Picea abies</dwc:acceptedName>
<dwc:vernacularName>Plant: Norway spruce</dwc:vernacularName>

Element: Title
Definition: The title or identifying phrase given to an image.
Repeat: Y
Example:
<vra:title xml:lang="la">Sepsis annulipes</vra:title>
<vra:title type="alternate">Orangutan</vra:title>

We welcome your feedback on the schema! http://tinyurl.com/9hm7nsb
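
A minimal Python sketch of how the agent example above might be read programmatically. It is an illustration only, not part of the Art of Life tooling: the namespace URI bound to the `vra:` prefix is an assumption (the slides do not show one), and `summarize_agent` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI for the vra: prefix; the slides do not state one.
VRA_NS = "http://www.vraweb.org/vracore4.htm"

# The agent example from the schema table, wrapped with a namespace declaration
# so that a standard XML parser accepts the vra: prefix.
AGENT_XML = f"""
<vra:agent xmlns:vra="{VRA_NS}">
  <vra:name type="personal" vocab="LCNAF" refid="89015596">Curtis, John</vra:name>
  <vra:dates type="life">
    <vra:earliestDate>1791</vra:earliestDate>
    <vra:latestDate>1862</vra:latestDate>
  </vra:dates>
  <vra:role vocab="AAT" refid="300025574">publisher</vra:role>
</vra:agent>
"""


def summarize_agent(xml_text):
    """Hypothetical helper: pull name, role and life dates out of one agent."""
    ns = {"vra": VRA_NS}
    agent = ET.fromstring(xml_text)
    return {
        "name": agent.findtext("vra:name", namespaces=ns),
        "role": agent.findtext("vra:role", namespaces=ns),
        "life": (
            agent.findtext("vra:dates/vra:earliestDate", namespaces=ns),
            agent.findtext("vra:dates/vra:latestDate", namespaces=ns),
        ),
    }


if __name__ == "__main__":
    print(summarize_agent(AGENT_XML))
```

Parsing also acts as a cheap well-formedness check: an unclosed attribute quote or tag raises `ET.ParseError`.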
[Sample of raw, uncorrected OCR output; the text is unreadable]
OCR Improvements
• Gaming
• Transcription
OCR Improvements
• Transcription
• Purposeful Gaming
• Looking at…
– Crowdsource Markup
Purposeful Gaming
DIGITALKOOT

• Joint project run by the National Library of Finland and Microtask to index the library's enormous archives so that they are searchable on the Internet, for easier access to the Finnish cultural heritage.
Purposeful Gaming
DIGITALKOOT
• Launched on Feb. 8, 2011; by Nov. 29, 2012, nearly 110,000 participants had completed over 8 million word-fixing tasks.
• DigiTalkoot enabled volunteers to take part in this correction work by playing games.
Purposeful gaming and BHL:
engaging the public in improving and
enhancing access to digital texts
• IMLS Grant Program:
National Leadership Grants for Libraries
• Partners:
– Missouri Botanical Garden
– Harvard University
– Cornell University
– New York Botanical Garden

• P.I.: Trish Rose-Sandler, Missouri Botanical Garden
• Dates: Dec 2013 – Nov. 2015
Project objectives and benefits
• Test new means of crowdsourcing to support the
enhancement of content in BHL
• Demonstrate whether digital games are an effective tool for
analyzing and improving digital outputs from OCR and
transcription
• Benefits of gaming include:
– improved access to content by providing richer and more
accurate data;
– an extension of limited staff resources; and
– exposure of library content to communities who may not
know about the collections otherwise.
OCR Improvements

German text interpreted by the OCR process as:
“unb auf ben ©elnrgen be6 fublic{)en”
OCR Improvements

#   IA OCR            OCR 2          Transcription 1   Transcription 2
1   unb               und            und               und               Ok
2   den               ben            den               den               Ok
3   ©elnrgen          ©ebirgen       Bebirgen          Gebirgen          X
4   be6               des            de5               des               Chk
5   fublic{)en        fublichen      Füdlichen         Südlichen         X
6   £)eittfc{)(anb6   Deutfchlanbs   Deutfchlands      Deutschlands      X

Different resulting texts from parsing the phrase:
“und auf den Gebirgen des südlichen Deutschlands”
(“and on the mountains of southern Germany”)
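
The last column of the comparison can be read as an agreement check across the four sources. The Python sketch below shows one rule that reproduces those flags. The rule itself (a strict majority gives "Ok", a two-way agreement gives "Chk", no agreement gives "X") is an assumption inferred from the six rows, and `classify` is a hypothetical name, not a BHL function.

```python
from collections import Counter


def classify(readings):
    """Return (flag, consensus) for the readings of one word position.

    Assumed rule, inferred from the table:
      "Ok"  - more than half of the sources agree on one reading
      "Chk" - at least two sources agree, but without a majority
      "X"   - every source gives a different reading
    """
    reading, count = Counter(readings).most_common(1)[0]
    if count * 2 > len(readings):
        return "Ok", reading
    if count >= 2:
        return "Chk", reading
    return "X", None


# Rows copied from the table: IA OCR, OCR 2, Transcription 1, Transcription 2.
ROWS = [
    ["unb", "und", "und", "und"],
    ["den", "ben", "den", "den"],
    ["©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"],
    ["be6", "des", "de5", "des"],
    ["fublic{)en", "fublichen", "Füdlichen", "Südlichen"],
    ["£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"],
]

if __name__ == "__main__":
    for number, row in enumerate(ROWS, start=1):
        print(number, *classify(row))
```

Under this rule only the "X" rows need a human decision, and the "Chk" rows carry a proposed reading to confirm.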
Purposeful Gaming
iDigBio’s aOCR Hackathon
• Improve OCR parsing of labels with clear metrics
(datasets, output formats, scoring algorithm)
• Libraries of regular expr. to clean up each field
(different error correction for latitude/longitude
coordinates than personal names or herbarium
catalog numbers)
• Tool for classifying segments of the image before
submitting to OCR

• Do a first pass of OCR to clean images before
sending them to a second, 'real' pass of OCR
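
The idea of a separate library of regular expressions per field can be sketched as follows. The field names and every rule here are hypothetical illustrations, not the hackathon's actual rule sets: digit look-alikes are repaired only in coordinates, while names get a different correction.

```python
import re

# Hypothetical per-field rule libraries: (compiled pattern, replacement) pairs.
FIELD_RULES = {
    "coordinate": [
        # Letter O next to a digit is almost certainly a zero.
        (re.compile(r"(?<=\d)[Oo]|[Oo](?=\d)"), "0"),
        # Lowercase l or capital I next to a digit is almost certainly a one.
        (re.compile(r"(?<=\d)[lI]|[lI](?=\d)"), "1"),
    ],
    "person_name": [
        # Collapse runs of whitespace left over from label layout.
        (re.compile(r"\s+"), " "),
        # Digit 1 between lowercase letters is almost certainly an l.
        (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),
    ],
}


def clean_field(field_type, text):
    """Apply only the rules registered for this field type, in order."""
    for pattern, replacement in FIELD_RULES.get(field_type, []):
        text = pattern.sub(replacement, text)
    return text.strip()


if __name__ == "__main__":
    print(clean_field("coordinate", "38°3O' N"))
    print(clean_field("person_name", "Wil1iam   Smith"))
```

Keeping the rules in a table keyed by field type means a correction that is safe for coordinates is never applied to a collector name, which is the point the slide makes.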
iDigBio’s CITScribe Hackathon
1. Interoperability between public participation
tools and biodiversity data systems,
2. Transcription quality assessment/quality
control (QA/QC) and the reconciliation of
replicate transcriptions,
3. Integration of optical character recognition
(OCR) into the transcription workflow
4. User engagement
NfN & iDigBio’s CITScribe Hackathon
• Jason Best’s DarwinScore
• Ben Brumfield’s Handwriting Gibberish Detector
• Dictionaries to improve crowdsourcing consensus
(e.g., names of collectors, scientific names)
• Word clouds created using n-gram scoring and
faceting, with Solr for indexing and Carrot2 for
specimen selection (pick a word of interest from the
word cloud to visualize and explore its use), plus a
data cleaning step (the system highlights
infrequent words).
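
One way a dictionary could help crowdsourced transcriptions converge is to snap near-miss spellings to a known entry before counting votes. The sketch below uses Python's standard `difflib` module. The function names, the 0.85 cutoff and the sample dictionary are illustrative assumptions, not the hackathon's implementation.

```python
import difflib
from collections import Counter


def snap_to_dictionary(value, dictionary, cutoff=0.85):
    """Return the closest dictionary entry, or the value unchanged if none is close."""
    matches = difflib.get_close_matches(value, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else value


def consensus(transcriptions, dictionary):
    """Snap each transcription to the dictionary, then return (winner, votes)."""
    snapped = [snap_to_dictionary(t, dictionary) for t in transcriptions]
    return Counter(snapped).most_common(1)[0]


if __name__ == "__main__":
    # Hypothetical dictionary of scientific names.
    names = ["Quercus alba", "Quercus rubra"]
    print(consensus(["Quercus alba", "Quercus albba", "Quercus allba"], names))
```

Without the dictionary step the three volunteers above would disagree; with it they cast three votes for the same name.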
NESCent EOL-BHL Research Sprint
There is no place like home: Defining “habitat” for
biodiversity science
Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston,
MA 02125-3393
Carl Nordman (Natureserve) and
Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion,
71003, Crete, Greece
NESCent EOL-BHL Research Sprint
Assessing Risk Status of Mexican Amphibians Through Data
Mining.
Esther Quintero and Bárbara Ayala
National Commission for Knowledge and Use of Biodiversity
(CONABIO)
and
Anne Thessen
Marine Biological Laboratory and Arizona State University
NESCent EOL-BHL Research Sprint
Evolution in the usage of anatomical concepts in the
biodiversity literature
Todd Vision (tjv@bio.unc.edu),
Prashanti Manda (manda.prashanti@gmail.com), and
Dongye Meng
University of North Carolina at Chapel Hill
MiBIO: Mining Biodiversity
• Mining Biodiversity: Enriching Biodiversity Heritage
with Text Mining and Social Media
• One of the international projects that won in the
third round of the 2013 Digging Into Data Challenge
• Promote the development of innovative
computational techniques to apply to big data in
the humanities and social sciences
– The National Centre for Text Mining (UK)
– Missouri Botanical Garden (US)
– Dalhousie University's Big Data Analytics
Institute (Canada)
– Social Media Lab (Canada)
MiBIO: Mining Biodiversity
1. Automatic error correction of OCR text errors.
2. Crowdsource annotation of legacy texts with semantic metadata.
3. Adapt text mining techniques to extract terminology, entities and
significant events automatically and to track terminology evolution
over time.
4. Use interactive visualization techniques to help users manage
search results through next generation browsing capabilities,
assisted by a semantic similarity network of important terms and
entities.
5. Design of a social media layer, serving as an environment for
diverse users to interact and collaborate on science, public
education, awareness and outreach.
Crowdsource Markup
Display text                                 Species Profile Model category
General/summary                              TaxonBiology
Geographic range                             Distribution
Habitat                                      Habitat
Food sources and feeding behavior            TrophicStrategy
Physical description (general)               Description
Physical description (detailed morphology)   DiagnosticDescription
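
The mapping above is a plain lookup table. A minimal Python sketch, assuming a markup tool shows the display text to volunteers and stores the Species Profile Model category; the names `SPM_CATEGORY` and `to_spm_category` are hypothetical.

```python
# Display text shown to volunteers -> Species Profile Model category (from the table).
SPM_CATEGORY = {
    "General/summary": "TaxonBiology",
    "Geographic range": "Distribution",
    "Habitat": "Habitat",
    "Food sources and feeding behavior": "TrophicStrategy",
    "Physical description (general)": "Description",
    "Physical description (detailed morphology)": "DiagnosticDescription",
}

# Lowercased index so lookups tolerate case and spacing differences.
_LOOKUP = {key.lower(): value for key, value in SPM_CATEGORY.items()}


def to_spm_category(display_text):
    """Return the SPM category for a display label, or None if it is unknown."""
    normalized = " ".join(display_text.split()).lower()
    return _LOOKUP.get(normalized)


if __name__ == "__main__":
    print(to_spm_category("Geographic range"))
```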
Thank you
William Ulate
Global BHL Project Manager / Technical Director
Missouri Botanical Garden
william.ulate@mobot.org
Skype: william_ulate_r

General views of Histopathology and step
General views of Histopathology and stepGeneral views of Histopathology and step
General views of Histopathology and step
 
Prelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quizPrelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quiz
 
Presentation on the Basics of Writing. Writing a Paragraph
Presentation on the Basics of Writing. Writing a ParagraphPresentation on the Basics of Writing. Writing a Paragraph
Presentation on the Basics of Writing. Writing a Paragraph
 
Practical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptxPractical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptx
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.
 

BHL Markup Efforts and Plans

  • 1. pro-iBiosphere Markup Workshop Efforts and plans towards Markup of the BHL Content William Ulate R. BHL Technical Director Missouri Botanical Garden Berlin, Feb. 10, 2014
  • 3. More Online Content [chart: pages (millions) and volumes (thousands) included in BHL, Oct-08 to Oct-13]
  • 5. New Types of Content
  • 6. New Types of Content
  • 7. Scientific Name Extraction • TaxonFinder algorithm in production since 2008 – More than 100 million candidate name strings – More than 1.5 million unique, verified names – Available through UI, APIs, Data Exports & Internet Archive • New collaboration with Global Names project – Improved algorithm, better precision & recall – More data with TaxonFinder and Neti Neti! – http://gnrd.globalnames.org/
  • 8. Taxon Names
      BEFORE
        Name Instances    101,591,803    101,288,804
        Unique Names        7,498,554      7,464,924
        Verified Names      1,905,507      1,902,803
        EOL Names          63,130,350     62,963,582
        EOL Pages          13,579,868     13,532,684
      AFTER
        Name Instances    151,222,182    150,066,425
        Unique Names       29,246,382     29,091,767
        Verified Names     10,153,165     10,109,540
        EOL Names          87,791,695     87,135,089
        EOL Pages          15,466,713     15,342,867
  • 12. Articles in the BHL UI
  • 16. Global Replication & Serving Replicated Data Center Portal Application
  • 19. BioStor articles marked up with JATS
  • 31. The Art of Life schema: describing and providing access to natural history illustrations from the Biodiversity Heritage Library (BHL), by William Ulate, Trish Rose-Sandler, Gaurav Vaidya, Robert Guralnick
      Example of illustration described using Art of Life schema
        Title: Stictospiza formosa
        Type: Illustrations
        Date: Publication: 1898
        Agent: Author: Arthur G. Butler (1844-1925); Illustrator: F.W. Frohawk (1861-1946)
        Description: A pair of finches with green and yellow bodies resting on reeds
        Subjects: Scientific name: Amandava formosa (Latham, 1790); Vernacular Name: Green Avadavat or Green Munia; Accepted Name: Amandava formosa (Latham, 1790); Birds, finches
        Inscriptions: bottom center: Green Amaduvade Waxbill (Stictospiza formosa)
        Source: Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited, 1889 (2nd edition). This image comes from the Biodiversity Heritage Library, and is available online at biodiversitylibrary.org/page/17195895
        Rights: Public domain
      Art of Life schema elements (required elements in red); each entry gives Element, Definition, Example, Repeat (Y/N)
        Agents. Person or corporate entity involved in the creation, design, production, or publication of a visual resource. Example: <vra:agent> <vra:name type="personal" vocab="LCNAF" refid="89015596">Curtis, John</vra:name> <vra:dates type="life"> <vra:earliestDate>1791</vra:earliestDate> <vra:latestDate>1862</vra:latestDate> </vra:dates> <vra:role vocab="AAT" refid="300025574">publisher</vra:role> </vra:agent> Repeat: Y
        Copyright. The copyright status of the visual resource. Example: <vra:rights refid="http://creativecommons.org/licenses/by-nc/2.0/deed.en">Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)</vra:rights> Repeat: N
        Date. Date or range of dates associated with the creation or publication of the visual resource. Example: <vra:date type="creation"> <vra:earliestDate>1945</vra:earliestDate> <vra:latestDate>1955</vra:latestDate> </vra:date> Repeat: Y
        Description. A free-text note about content of the image, including comments, description, or interpretation, that gives additional information not recorded in other categories. Example: <vra:description>This illustration shows a scale, coloured illustration of Sepsis annulipes (now known as Encita annulipes) beside the Trifolium ochroleucum plant. Several dissections from Sepsis cylindrica Fab. (all these details are provided on the next page of this book and the subsequent page).</vra:description> Repeat: Y
        Inscriptions. All marks, captions, or written words added to the object at the time of production or in its subsequent history, including signatures, dates, dedications, texts, and colophons, as well as marks, such as the stamps of silversmiths, publishers, or printers. Example: <vra:inscription> <vra:position>bottom</vra:position> <vra:text>Radula of L. souleyetianum on a more reduced scale</vra:text> </vra:inscription> Repeat: Y
        Source. A citation for the book, journal or resource that hosts the visual resource. Example: <vra:source> <vra:name type="book">Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited, 1889 (2nd edition).</vra:name> <vra:refid type="URI">http://biodiversitylibrary.org/page/17195895</vra:refid> </vra:source> Repeat: N
        Subject. Terms or phrases that describe, identify, or interpret the visual resource. Examples: <vra:subject><vra:term type="personalName">Carl Linnaeus</vra:term></vra:subject>; <dwc:scientificName>Plant: Picea abies</dwc:scientificName> <dwc:acceptedName>Plant: Picea abies</dwc:acceptedName> <dwc:vernacularName>Plant: Norway spruce</dwc:vernacularName> Repeat: Y
        Title. The title or identifying phrase given to an image. Examples: <vra:title xml:lang="la">Sepsis annulipes</vra:title>; <vra:title type="alternate">Orangutan</vra:title> Repeat: Y
      We welcome your feedback on the schema! http://tinyurl.com/9hm7nsb
  • 32. [slide image: page set in blackletter type; the OCR output for it is unreadable]
  • 34. OCR Improvements • Transcription • Purposeful Gaming • Looking at… – Crowdsource Markup
  • 35. Purposeful Gaming DIGITALKOOT • Joint project run by the National Library of Finland and Microtask to index the library's enormous archives so that they are searchable on the Internet for easier access to the Finnish cultural heritage.
  • 36. Purposeful Gaming DIGITALKOOT • Launched on Feb 8 2011, nearly 110 000 participants completed over 8 million word fixing tasks by Nov 29 2012 • DigiTalkoot enabled volunteers to participate in this fixing work by playing games.
  • 37. Purposeful gaming and BHL: engaging the public in improving and enhancing access to digital texts • IMLS Grant Program: National Leadership Grants for Libraries • Partners: – Missouri Botanical Garden – Harvard University – Cornell University – New York Botanical Garden • P.I.: Trish Rose-Sandler, Missouri Botanical Garden • Dates: Dec 2013 – Nov. 2015
  • 38. Project objectives and benefits • Test new means of crowdsourcing to support the enhancement of content in BHL • Demonstrate if digital games are an effective tool for analyzing and improving digital outputs from OCR and transcription • Benefits of gaming include: – improved access to content by providing richer and more accurate data; – an extension of limited staff resources; and – exposure of library content to communities who may not know about the collections otherwise.
  • 39. OCR Improvements German text interpreted by the OCR process as: “unb auf ben ©elnrgen be6 fublic{)en”
  • 40. OCR Improvements
      Different resulting texts from parsing the phrase: “und auf den Gebirgen des südlichen Deutschlands” (“and on the mountains of southern Germany”)
        #   IA OCR            OCR 2          Transcription 1   Transcription 2
        1   unb               und            und               und               Ok
        2   den               ben            den               den               Ok
        3   ©elnrgen          ©ebirgen       Bebirgen          Gebirgen          X
        4   be6               des            de5               des               Chk
        5   fublic{)en        fublichen      Füdlichen         Südlichen         X
        6   £)eittfc{)(anb6   Deutfchlanbs   Deutfchlands      Deutschlands      X
  • 42. iDigBio’s aOCR Hackathon • Improve OCR parsing of labels with clear metrics (datasets, output formats, scoring algorithm) • Libraries of regular expr. to clean up each field (different error correction for latitude/longitude coordinates than personal names or herbarium catalog numbers) • Tool for classifying segments of the image before submitting to OCR • Do a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR
  • 43. iDigBio’s CITScribe Hackathon 1. Interoperability between public participation tools and biodiversity data systems, 2. Transcription quality assessment/quality control (QA/QC) and the reconciliation of replicate transcriptions, 3. Integration of optical character recognition (OCR) into the transcription workflow, 4. User engagement
  • 44. NfN & iDigBio’s CITScribe Hackathon • Jason Best’s DarwinScore • Ben Brumfield’s Handwriting Gibberish Detector • Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names) • Word Clouds created using n-gram scoring, faceting, and Solr for indexing + Carrot2 for specimen selection (visualize and explore of the use with a word of interest from the word cloud) and a data cleaning step (highlight infrequent words by the system).
  • 45. NESCent EOL-BHL Research Sprint There is no place like home: Defining “habitat” for biodiversity science Robert D. Stevenson UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393 Carl Nordman (Natureserve) and Evangelos Pafilis Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece
  • 46. NESCent EOL-BHL Research Sprint Assessing Risk Status of Mexican Amphibians Through Data Mining. Esther Quintero and Bárbara Ayala National Commission for Knowledge and Use of Biodiversity (CONABIO) and Anne Thessen Marine Biological Laboratory and Arizona State University
  • 47. NESCent EOL-BHL Research Sprint Evolution in the usage of anatomical concepts in the biodiversity literature Todd Vision (tjv@bio.unc.edu), Prashanti Manda (manda.prashanti@gmail.com), and Dongye Meng University of North Carolina at Chapel Hill
  • 48. MiBIO: Mining Biodiversity • Mining Biodiversity: Enriching Biodiversity Heritage with Text Mining and Social Media • One of the international projects that won in the third round of the 2013 Digging Into Data Challenge • Promote the development of innovative computational techniques to apply into big data in the humanities and social sciences – The National Centre for Text Mining (UK) – Missouri Botanical Garden (US) – Dalhousie University's Big Data Analytics Institute (Canada) – Social Media Lab (Canada)
  • 49. MiBIO: Mining Biodiversity 1. Automatic error correction of OCR text errors. 2. Crowdsource annotation of legacy texts with semantic metadata. 3. Adapt text mining techniques to extract terminology, entities and significant events automatically and to track terminology evolution over time. 4. Use Interactive visualization techniques to help users manage search results through next generation browsing capabilities, assisted by a semantic similarity network of important terms and entities. 5. Design of a social media layer, serving as an environment for diverse users to interact and collaborate on science, public education, awareness and outreach.
  • 51. Crowdsource Markup
        Display text                                  Species Profile Model category
        General/summary                               TaxonBiology
        Geographic range                              Distribution
        Habitat                                       Habitat
        Food sources and feeding behavior             TrophicStrategy
        Physical description (general)                Description
        Physical description (detailed morphology)    DiagnosticDescription
  • 52. Thank you William Ulate Global BHL Project Manager / Technical Director Missouri Botanical Garden william.ulate@mobot.org Skype: william_ulate_r
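
The word-by-word comparison on slide 40 and the CITScribe goal of reconciling replicate transcriptions can be illustrated with a simple majority vote. This is only a sketch under stated assumptions: the slides do not say how the Ok / Chk / X labels were derived, and the function name `reconcile` and the 0.75 agreement threshold are hypothetical choices that happen to reproduce those labels for the six words shown.

```python
from collections import Counter

# Readings copied from the slide 40 table, one tuple per word:
# (IA OCR, OCR 2, Transcription 1, Transcription 2)
READINGS = [
    ("unb", "und", "und", "und"),
    ("den", "ben", "den", "den"),
    ("©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"),
    ("be6", "des", "de5", "des"),
    ("fublic{)en", "fublichen", "Füdlichen", "Südlichen"),
    ("£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"),
]


def reconcile(readings, min_share=0.75):
    """Pick a consensus word from replicate readings by majority vote.

    Returns (word, status). The threshold is an assumption, not BHL's rule:
      "ok"       - the top reading reaches min_share of all readings
      "check"    - some readings agree, but fewer than min_share
      "conflict" - every reading is different, so no word is proposed
    """
    word, votes = Counter(readings).most_common(1)[0]
    if votes / len(readings) >= min_share:
        return word, "ok"
    if votes > 1:
        return word, "check"
    return None, "conflict"


if __name__ == "__main__":
    for number, row in enumerate(READINGS, start=1):
        print(number, reconcile(row))
```

Words flagged "check" or "conflict" are the natural candidates to hand to a game or a human transcriber, which is where the purposeful gaming work described on slides 35 to 38 would come in.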

Editor's Notes

  1. I would especially like you to prepare a presentation of about 10-15 min about markup efforts at BHL (US and global), including some remarks on Rod Page's proposed workflow. A representative of NCBI will attend the workshop.
     Markup efforts at BHL:
     - What use cases for markup of biodiversity literature do you see as promising but not yet well explored?
     - Where should resources for markup of biodiversity literature, or biodiversity research materials more generally, be directed in the coming years?
     - Who is doing interesting work around markup of biodiversity literature but not present here?
     - What aspects of markup would be suitable projects for the pro-iBiosphere hackathon in March?
     - Do you have ideas around markup of biodiversity literature for which you are looking for a partner?
     - Anything else you'd like to communicate about markup of biodiversity literature or the workshop?
     Identify use cases for structured data generated on the basis of biodiversity literature.
     Outline of the structure and focus of D3.3.2 - report on progress during the coordination process of partners and non consortium partners.
     “Connecting the Plazi repository with BHL would close the loop to get even more contextual information and continue the literature research. The basic workflow is already in place, but it is not fully automated, due to the technical challenges involved in automatically recognising text at the high accuracy required for taxonomic work.”
  2. The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.”
  3. Loosely based on Library of Congress Subject Headings
  4. New types of content:- Handwritten Text from Field Notes, Letters
  5. New types of content:- Specially formatted content like Commercial Seed Catalogues and Seed Lists
  6. Global Names recognition and discovery tools and services. Find scientific names on web pages, PDFs, Microsoft Office documents, images, or in freeform text. Encrypted or image-based PDFs and image files first pass through an OCR routine using Tesseract prior to using the excellent TaxonFinder and NetiNeti names discovery engines. The language of incoming content is determined using unsupervised language detection. If found to be other than English, TaxonFinder is preferentially used. Found names can be optionally resolved against a number of resources.
  7. On legacy literature, what your plans are with BHL, and especially your move into content?
  8. On legacy literature, what your plans are with BHL, and especially your move into content? Growth; More Global Content; Taxon Names; Article Metadata; Microcitations and COiNS; API; Zoobank; OCR improvements through Gaming; Crowdsource Markup; WFO?
  9. BHL-Europe has used several webservices for search term expansion to facilitate searches for more than just species names. The Virtual International Authority Files (VIAF) webservice makes different spellings of author names accessible to facilitate the search for author names (e.g. Linne, Linnaeus, etc). The webservice of the Zeitschriftendatenbank (ZDB, serial database) is used to also extend the search for serial abbreviations.
  10. The Smithsonian Institution Libraries (BHL partner), in collaboration with the International Association for Plant Taxonomy (IAPT), have produced an online version of the Taxonomic Literature II (TL-2), a guide to the systematic botany literature published until 1940. Initially, a basic website is being offered that is searchable by keyword, author name, title identification number, author name abbreviation and title abbreviation (http://www.sil.si.edu/DigitalCollections/TL-2/). In a second step, the entire TL-2 dataset is to be provided as Linked Open Data. This way, each author and publication will get a permanent and authoritative URI on the web. These URIs will contain information in both human-readable form (via HTML) and computer-readable form (via RDF/XML). A SPARQL endpoint will also be provided for querying the linked data.
  11. Towards BioStor articles marked up using Journal Archiving Tag Set
      I've made some progress on putting this together, as well as expanded the goal somewhat. In fact, there are several goals:
      - BioStor articles need to be archived somewhere. At the moment they live on my server, and metadata is also served by BHL (as the "parts" you see in a scanned volume). Long term maybe PubMed Central is a possibility (BHL essentially becomes a publisher). Imagine PubMed Central becoming the primary archival repository for biodiversity literature.
      - BioStor articles could be more useful if the OCR text was cleaned up and marked up (e.g., highlighting taxon names, localities, extracting citations, etc.).
      - If BioStor articles were marked up to the same extent as ZooKeys then we could use tools developed for ZooKeys (see Towards an interactive taxonomic article: displaying an article from ZooKeys) for a richer reading experience.
      - Cleaned OCR text could also be used to generate searchable PDFs, which are still the most popular way for people to read articles (see Why do scientists tend to prefer PDF documents over HTML when reading scientific journals?). BioStor already generates PDFs, but these are simply made by wrapping page images in a PDF. Searchable PDFs would be much friendlier.
      For BioStor articles to be archived in PubMed Central they would need to be marked up using the Journal Archiving and Interchange Tag Suite (formerly the NLM DTDs). This is the markup used by many publishers, and also the tag suite that TaxPub builds upon.
      The idea of having BioStor marked up in JATS is appealing, but on the face of it impossible because all we have is page scans and some pretty ropey OCR. But because the NLM has also been heavily involved in scanning the historical literature they are used to dealing with scanned literature, and JATS can accommodate articles ranging from scans to fully marked up text.
      For example, take a look at the article "Microsporidian encephalitis of farmed Atlantic salmon (Salmo salar) in British Columbia" which is in PubMed Central (PMC1687123). PMC has basic metadata for the article, scans of the pages, and two images extracted from those pages. This is pretty much what BioStor already has (minus the extracted images).
  12. Initially, software tools will help discover visual resources (illustrations, maps, and other works of art) in BHL’s corpus, and basic metadata will be recorded. These resources will then be shared on multiple image delivery systems, including Flickr and the Wikimedia Commons, where citizen scientists will be able to add further annotations. Because of the wide diversity of information that a citizen scientist can add to any image, a comprehensive yet manageable schema is needed to help standardize inputs and enable synchronization and seamless import back into the BHL databases.
  15. Demo Account: macaw.joelrichard.com User: demo Password: demo http://macawup01.up.ac.zai macaw.mobot.org
  16. Reviewing Metadata        Thumbnail view of the pages (06-enter-metadata.png)        Large version of the image has a magnifier
  17. Reviewing Metadata        List view to see more metadata at once (07-metadata.png)        Standard Metadata is suitable for BHL use        No additional metadata modules are needed
  18. The authors have worked on the development of an effective metadata schema for such natural history illustrations, but instead of developing yet another schema from scratch, they have identified existing schemas that meet the needs of the project and integrated a solution that combines the best in biodiversity informatics and image curation standards and best practices. This schema needs to support three main objectives: (1) to enable the discovery, description and use of the identified images by artists, biologists, humanities scholars, and educators; (2) to make BHL’s metadata and images available to other platforms; and (3) to import crowdsourced metadata generated in other platforms back into BHL. A preliminary schema version will be presented to the TDWG community, explaining how we addressed metadata challenges specific to biodiversity data, in order to obtain feedback on the final version.
  19. Natural history illustrations from the Biodiversity Heritage Library seem to leap across boundaries while being catalogued, emerging simultaneously as history, science and art. As historic documents, they paint a vibrant picture of the first time European scientists and explorers encountered exotic plants and animals in the 17th and 18th centuries, drawn by some of the finest illustrators of the world. Also, as biodiversity records, they provide valuable documentation of when, where, and who first observed a species, and some of them are our only surviving representations of extinct species. Finally, as aesthetic elements, they communicate human emotions and other values toward nature by exemplifying the mimesis in art and providing a vivid expression of human creativity and imagination. This year, the Missouri Botanical Garden received a grant from the National Endowment for the Humanities (NEH) to support a project called The Art of Life: Data Mining and Crowdsourcing the Identification and Description of Natural History Illustrations from the Biodiversity Heritage Library (BHL).
  20. Title: The Art of Life Schema: describing and providing access to natural history illustrations from the Biodiversity Heritage Library (BHL). Authors: William Ulate (Missouri Botanical Garden): William.Ulate@mobot.org; Trish Rose-Sandler (Missouri Botanical Garden): trish.rose-sandler@mobot.org; Gaurav Vaidya (University of Colorado Boulder): gaurav@ggvaidya.com; Robert Guralnick (University of Colorado): robgur@gmail.com
  21. You can see from this slide that accuracy goes way down when processing older blackletter-type typefaces.
  26. The Missouri Botanical Garden and partners at Harvard University, Cornell University, and New York Botanical Garden will test new means of crowdsourcing to support the enhancement of content in the Biodiversity Heritage Library (BHL). The BHL is an international consortium of the world’s leading natural history libraries that have collaborated to digitize the public domain literature documenting the world’s biological diversity, resulting in the single, largest, open-licensed source of biodiversity literature. The project will demonstrate whether or not digital games are an effective tool for analyzing and improving digital outputs from optical character recognition and transcription. The anticipated benefits of gaming include improved access to content by providing richer and more accurate data; an extension of limited staff resources; and exposure of library content to communities who may not know about the collections otherwise.
  27. - The Missouri Botanical Garden and partners at Harvard University, Cornell University, and New York Botanical Garden will test new means of crowdsourcing to support the enhancement of content in the Biodiversity Heritage Library (BHL). - The project will demonstrate whether or not digital games are an effective tool for analyzing and improving digital outputs from optical character recognition and transcription. - The anticipated benefits of gaming include improved access to content by providing richer and more accurate data; an extension of limited staff resources; and exposure of library content to communities who may not know about the collections otherwise.
  28. (improve OCR parsing of specimen labels) with clear metrics (these datasets, these output formats, this scoring algorithm). One participant had created separate libraries of regular expressions to clean up each kind of field, having discovered that latitude/longitude coordinates require different error correction than personal names or herbarium catalog numbers do. Another group had built a touch-screen tool for classifying segments of the image before submitting them to OCR. My own project required a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR. Jason Best’s DarwinScore. Ben Brumfield’s Handwriting Gibberish Detector.
  29. Here we develop a framework that connects the general needs for survival and reproduction with the descriptions of habitats of species using text mining approaches of EOL data and BHL literature to map species specific-habitat relationships within and across lineages, a comparison of established controlled vocabularies/ontologies to better understand how the location aspect of habitats can be defined, and spatial data queries of species observation records to quantify which habitats species use.
  30. Amphibians are one of the most threatened groups of vertebrates, while at the same time a keystone to the ecosystem. Therefore, it is important to assess their risk of extinction to take appropriate steps towards their conservation. In this project we propose to use data mined from the EOL and BHL projects in order to obtain data to statistically assess the species of Mexican amphibians in a cost-effective method to identify their risk status.
  31. Observations on the morphology, anatomy and other properties of biological taxa have been accumulating in the scientific literature for centuries. Our understanding of the taxonomic and phylogenetic diversity in the natural world has changed dramatically over this time. How much has scientific usage of the anatomical concepts that are accepted today evolved, as well? We propose to address this by comparison of word frequency in the vicinity of anatomical terms as sampled from texts in the Biodiversity Heritage Library. We will extract the frequency of words (not including common ‘stop’ words) in the neighborhood of (stemmed) words that have matches in a large ontology of vertebrate skeletal terms [1]. Vectors of word frequencies (aka ‘context vectors’) have limitations as summaries of the meaning of a text, but are amenable to statistical comparison and are the workhorse of state-of-the-art word sense disambiguation algorithms [2]. Some of the questions we would like to address with these context vectors in hand include: (1) How rapidly does the context of an anatomical term drift? (2) How much variation in drift is there among concepts? (3) How is the context affected by the taxonomic focus? (4) by the geographic origin of the author? (5) Is there more or less variation in older or contemporary usage? (6) How has the frequency with which different terms are used evolved?
  32. Automatic correction of errors in text extracted automatically from legacy biodiversity literature via optical character recognition (OCR). Development of a crowdsourcing facility that will encourage users to annotate legacy texts with semantic metadata. Adaptation of text mining technologies to extract metadata (i.e., terminology, entities and significant events) automatically and to track terminology evolution over time. This will facilitate semantic search, allowing users to explore search results according to multiple information dimensions or facets. Interactive visualisation techniques will be used to help users to make sense of search results through the integration of next generation browsing capabilities, assisted by a semantic similarity network of important terms and entities. Design of a social media layer, serving as an environment for diverse users to interact and collaborate on science, public education, awareness and outreach.
  34. Aims to transform the Biodiversity Heritage Library (BHL) into a next-generation social digital library resource to facilitate the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community and to raise awareness of the changes in biodiversity over time in the general public. The project integrates novel text mining (TM) methods, visualisation, crowdsourcing and social media into the BHL. The resulting digital resource will provide fully interlinked and indexed access to the full content of BHL library documents, via semantically enhanced and interactive browsing and searching capabilities, allowing users to locate precisely the information of interest to them in an easy and efficient manner.
  35. On legacy literature, what your plans are with BHL, and especially your move into content? Growth; More Global Content; Taxon Names; Article Metadata; Microcitations and COinS; API; ZooBank; OCR improvements through Gaming; Crowdsource Markup; WFO?
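
The context-vector idea described in note 31 can be illustrated with a minimal Python sketch. It is not the project's implementation: the stop-word list, the window size, the function names and the sample sentence are all assumptions made for illustration, and stemming and ontology matching are left out. It counts the non-stop words that occur within a few tokens of a target anatomical term, then compares two such vectors by cosine similarity, one simple way to quantify how far a term's context has drifted between two texts.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real study would use a fuller one.
STOP_WORDS = {"the", "of", "and", "a", "is", "in", "to", "with", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def context_vector(text, term, window=3):
    """Count non-stop words within `window` tokens of each occurrence of `term`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != term:
            continue
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(
            w for w in neighbours if w not in STOP_WORDS and w != term
        )
    return counts


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample sentence, not taken from BHL.
sample = "The femur of the frog is long and the femur articulates with the pelvis."
vector = context_vector(sample, "femur")
```

Computing one vector per term for texts from different periods, and comparing them with `cosine`, gives a rough per-term drift measure; matching stemmed words against the skeletal ontology would slot in at the `token != term` check.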