SlideShare a Scribd company logo
1 of 20
text + annotations
represented for
slicing and dicing - collaboration - versioning
Dirk Roorda
SBL San Antonio
2016-11-18
...
<clause id="3" rela="Objc">
<phrase id="27" type="VP">
<word id="153" sp="verb" lex="BR&gt;[" phono="bārˈā"
lex_heb="‫א‬ ָּ‫ר‬ ָּ‫"ב‬ trans="B.@R@74&gt;"
gloss="create" partofspeech="verb"
trailer=" ">‫א‬ ָָּ֣‫ר‬ ָּ‫/<ב‬word>
<word ...>...</word>
...
</phrase>
<phrase ...>...</phrase>
...
</clause>
...
straightforward XML
- bloated
- hierarchy problems
- inferior tooling (xslt)
- no separation of concerns
- a horrible mess
- even more bloated
+ graph (nodes/edges)
+ LAF-Fabric (python)
+ separation of concerns
~ an awful load of megabytes
LAF-Fabric binary
~ reasonably compact
- a bit opaque
+ column storage
+ python (pickle + gzip)
+ separation of concerns
~ starting to look decent
Text Fabric
+ compact
- transparent
+ column storage
+ pure python
+ separation of concerns
~ with the syntax
bureaucracy out of the way,
we can start to look into the
real problems:
* slicing and dicing
* collaboration
* versioning
B
R>CJT/
BR>[
>LHJM/
>T
H
CMJM/
W
>T
H
>RY/
W
H
>RY/
HJH[
THW/
W
BHW/
W
XCK/
first 20 words from 200000
200000 BN/
KL/
H
JWM/
B
JWM/
PC<[
>DWM/
MN
TXT/
JD/
JHWDH/
W
MLK[
<L
MLK/
W
<BR[
JWRM/
Y<JR=/
Text Fabric (inside)
LAF-Fabric, with the LAF replaced by Text
Inside TF, the LAF-Fabric-API just works:
for n in NN():
if F.otype.v(n) == 'verse':
label = T.passage(n)
text = T.words(
L.d('word', n),
fmt='pf',
)
print('{}t{}n'.format(
label, text,
))
Text Fabric (skeleton)
A TF dataset is a bunch of text files
with at least otype and monads
otype
0-426580 word
426581-514580 clause
514581-605142 clause_atom
605143-858316 phrase
858317-1125831 phrase_atom
1125832-1189401 sentence
1189402-1253740 sentence_atom
1253741-1367532 subphrase
1367533-1367571 book
1367572-1368500 chapter
1368501-1413680 half_verse
1413681-1436893 verse
monads
1367533 1-28762
1367534 28763-52510
1367572 1-673
1367573 674-1167
1413681 1-11
1413682 12-31
1368501 1-4
1368502 5-11
1125832 1-11
1125833 12-18
1189402 1-11
1189403 12-18
426581 1-11
426582 12-18
514581 1-11
514582 12-18
605143 1-2
605144 3
858317 1-2
858318 3
1253741 5-7
1253742 9-11
This is the
skeleton:
• the positions
• the containment
of all text objects
text_full
ְּ‫ב‬
‫ית‬ ִׁ֖‫אש‬ ֵ‫ר‬
‫א‬ ָָּ֣‫ר‬ ָּ‫ב‬
‫ים‬ ִ֑‫ֹלה‬ֱ‫א‬
‫ת‬ ֵֵ֥‫א‬
ְּ‫ה‬
‫ם‬‫י‬ ִׁ֖‫מ‬ ָּ‫ש‬
ְּ‫ו‬
‫ת‬ ֵֵ֥‫א‬
ְָּּ‫ה‬
‫ץ‬ ֶ‫ר‬ ָָּֽ‫א‬
ְּ‫ו‬
ְָּּ‫ה‬
‫ץ‬ ֶ‫ר‬ ָָּ֗‫א‬
‫ה‬ ֵָּ֥‫ת‬‫י‬ ָּ‫ה‬
ְּ‫הו‬ ֹ֨‫ת‬
ְָּּ‫ו‬
‫הו‬ ֹ֔‫ב‬
ְּ‫ו‬
‫ְך‬ ֶ‫ש‬ ִׁ֖‫ח‬
parent
0-1,858317605143
2,858318 605144
3,858319 605145
4-10,858320 605146
426581,1189402 1125832
514581,605143-605146426581
information content
trailer
_
_
_
_
_
_
‫_׃‬
_
_
_
_
_
p-s-p
prep
subs
verb
subs
prep
art
subs
conj
prep
art
subs
conj
art
subs
verb
subs
conj
subs
conj
subs
Text Fabric (flesh)
parent
0-1,858317605143
2,858318 605144
3,858319 605145
4-10,858320 605146
514581,605143-605146426581
426581,1189402 1125832
Edges (c't'd)
0 1 2 3 4 5 6 7 8 9 10 word
858317 858318 858319 858320 phrase_atom
bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ
605143 605144 605145 605146 phrase
426581 clause
1125832 sentence
flesh
text_full
ְּ‫ב‬
‫ית‬ ִׁ֖‫אש‬ ֵ‫ר‬
‫א‬ ָָּ֣‫ר‬ ָּ‫ב‬
‫ים‬ ִ֑‫ֹלה‬ֱ‫א‬
‫ת‬ ֵֵ֥‫א‬
ְּ‫ה‬
‫ם‬‫י‬ ִׁ֖‫מ‬ ָּ‫ש‬
ְּ‫ו‬
‫ת‬ ֵֵ֥‫א‬
ְָּּ‫ה‬
‫ץ‬ ֶ‫ר‬ ָָּֽ‫א‬
ְּ‫ו‬
ְָּּ‫ה‬
‫ץ‬ ֶ‫ר‬ ָָּ֗‫א‬
‫ה‬ ֵָּ֥‫ת‬‫י‬ ָּ‫ה‬
ְּ‫הו‬ ֹ֨‫ת‬
ְָּּ‫ו‬
‫הו‬ ֹ֔‫ב‬
ְּ‫ו‬
‫ְך‬ ֶ‫ש‬ ִׁ֖‫ח‬
trailer
_
_
_
_
_
_
‫_׃‬
_
_
_
_
_
p-s-p
prep
subs
verb
subs
prep
art
subs
conj
prep
art
subs
conj
art
subs
verb
subs
conj
subs
conj
subs
typ
PP
NP
VP
NP
-
Objc
Resu
skeleton
otype
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
phrase
phrase
phrase
phrase
clause
clause
clause
monads
600000 0-1
600001 2
600002 3
600003 4-10
400001 0-10
400002 11-21
400003 22-34
vertical:
feature
selection
horizontal:
object
selection
both:
modules
slicing 'n dicing
collaborationflesh
text_full
ְּ‫ב‬
‫ית‬ ִׁ֖‫אש‬ ֵ‫ר‬
‫א‬ ָָּ֣‫ר‬ ָּ‫ב‬
‫ים‬ ִ֑‫ֹלה‬ֱ‫א‬
‫ת‬ ֵֵ֥‫א‬
ְּ‫ה‬
‫ם‬‫י‬ ִׁ֖‫מ‬ ָּ‫ש‬
ְּ‫ו‬
‫ת‬ ֵֵ֥‫א‬
ְָּּ‫ה‬
‫ץ‬ ֶ‫ר‬ ָָּֽ‫א‬
ְּ‫ו‬
ְָּּ‫ה‬
‫ץ‬ ֶ‫ר‬ ָָּ֗‫א‬
‫ה‬ ֵָּ֥‫ת‬‫י‬ ָּ‫ה‬
ְּ‫הו‬ ֹ֨‫ת‬
ְָּּ‫ו‬
‫הו‬ ֹ֔‫ב‬
ְּ‫ו‬
‫ְך‬ ֶ‫ש‬ ִׁ֖‫ח‬
p-s-p
prep
subs
verb
subs
prep
art
subs
conj
prep
art
subs
conj
art
subs
verb
subs
conj
subs
conj
subs
skeleton
otype
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
phrase
phrase
phrase
phrase
clause
clause
clause
monads
600000 0-1
600001 2
600002 3
600003 4-10
400001 0-10
400002 11-21
400003 22-34
module:
feature strong
for words
module
strong
8675
7225
1254 a
430
853
8676
8064
8678
853
8676
776
8678
8676
776
1961
8414
8678
922
8678
2822
dependency:
on the implicit
monad order!
@20170101 @20161118
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
10 9
11 10
12 11
13 12
14 13
15 14
16 15
17 16
18 17
19 18
20 19
21 20
600001 600000
600002 600001
600003 600002
600004 600003
400002 400001
400003 400002
400004 400003
⇐
flesh
@20170101
text_full
ְּ‫ב‬
‫ית‬ ִׁ֖‫אש‬ ֵ‫ר‬
‫א‬ ָָּ֣‫ר‬ ָּ‫ב‬
‫ים‬ ִ֑‫ֹלה‬ֱ‫א‬
‫ת‬ ֵֵ֥‫א‬
ְּ‫ה‬
‫ם‬‫י‬ ִׁ֖‫מ‬ ָּ‫ש‬
ְּ‫ו‬
‫ת‬ ֵֵ֥‫א‬
‫ש‬
ְָּּ‫ה‬
‫ץ‬ ֶ‫ר‬ ָָּֽ‫א‬
ְּ‫ו‬
ְָּּ‫ה‬
‫ץ‬ ֶ‫ר‬ ָָּ֗‫א‬
‫ה‬ ֵָּ֥‫ת‬‫י‬ ָּ‫ה‬
ְּ‫הו‬ ֹ֨‫ת‬
ְָּּ‫ו‬
‫הו‬ ֹ֔‫ב‬
ְּ‫ו‬
‫ְך‬ ֶ‫ש‬ ִׁ֖‫ח‬
otype
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
phrase
phrase
phrase
phrase
clause
clause
clause
skeleton
@20170101
monads
9
600001 0-1
600002 2
600003 3
600004 4-11
400002 0-11
400003 12-22
400004 23-35
strong
8675
7225
1254 a
430
853
8676
8064
8678
853
8676
776
8678
8676
776
1961
8414
8678
922
8678
2822
module
@20161118versioning
old module
works as is on
new data
the end of the beginning
etcbc Stephen Ku David van
Acker
James
Cuénod
skeleton
@20160101 <=
@20170101
module strong
@20161118
flesh
@20160101
flesh
@20170101
module accent
@20170202
skeleton
@20170101 <=
@20180101
module strong
@20171118
flesh
@20180101
module assoc
@20180303
skeleton
@20180101 <=
@20190101
flesh
@20190101
skeleton
@20190101 <=
@20200101
dirk.roorda@dans.knaw.nl shebanq@ancient-data.org

More Related Content

What's hot

The Error of Our Ways
The Error of Our WaysThe Error of Our Ways
The Error of Our WaysKevlin Henney
 
Seistech SQL code
Seistech SQL codeSeistech SQL code
Seistech SQL codeSimon Hoyle
 
Web Application Security 101 - 05 Enumeration
Web Application Security 101 - 05 EnumerationWeb Application Security 101 - 05 Enumeration
Web Application Security 101 - 05 EnumerationWebsecurify
 
Representing Material Culture Online: Historic Clothing in Omeka
Representing Material Culture Online: Historic Clothing in OmekaRepresenting Material Culture Online: Historic Clothing in Omeka
Representing Material Culture Online: Historic Clothing in OmekaArden Kirkland
 
Xcode Survival Guide
Xcode Survival GuideXcode Survival Guide
Xcode Survival GuideKristina Fox
 

What's hot (11)

The Error of Our Ways
The Error of Our WaysThe Error of Our Ways
The Error of Our Ways
 
Unit vii wp ppt
Unit vii wp pptUnit vii wp ppt
Unit vii wp ppt
 
Artsy ♥ ASCII ART
Artsy ♥ ASCII ARTArtsy ♥ ASCII ART
Artsy ♥ ASCII ART
 
Auxiliary
AuxiliaryAuxiliary
Auxiliary
 
Seistech SQL code
Seistech SQL codeSeistech SQL code
Seistech SQL code
 
Sass
SassSass
Sass
 
Introduction to tibbles
Introduction to tibblesIntroduction to tibbles
Introduction to tibbles
 
Web Application Security 101 - 05 Enumeration
Web Application Security 101 - 05 EnumerationWeb Application Security 101 - 05 Enumeration
Web Application Security 101 - 05 Enumeration
 
Representing Material Culture Online: Historic Clothing in Omeka
Representing Material Culture Online: Historic Clothing in OmekaRepresenting Material Culture Online: Historic Clothing in Omeka
Representing Material Culture Online: Historic Clothing in Omeka
 
php string-part 2
php string-part 2php string-part 2
php string-part 2
 
Xcode Survival Guide
Xcode Survival GuideXcode Survival Guide
Xcode Survival Guide
 

Similar to Text fabric

Open course(programming languages) 20150225
Open course(programming languages) 20150225Open course(programming languages) 20150225
Open course(programming languages) 20150225JangChulho
 
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)codin9cafe
 
20170516 hug france-warp10-time-seriesanalysisontopofhadoop
20170516 hug france-warp10-time-seriesanalysisontopofhadoop20170516 hug france-warp10-time-seriesanalysisontopofhadoop
20170516 hug france-warp10-time-seriesanalysisontopofhadoopMathias Herberts
 
pa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processingpa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text ProcessingRodrigo Senra
 
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/ibrettflorio
 
Python tutorial
Python tutorialPython tutorial
Python tutorialnazzf
 
Python tutorial
Python tutorialPython tutorial
Python tutorialShani729
 
Text Display (when it gets tricky)
Text Display (when it gets tricky)Text Display (when it gets tricky)
Text Display (when it gets tricky)Dirk Roorda
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataAnne Nicolas
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesMatt Harrison
 
Python fundamentals - basic | WeiYuan
Python fundamentals - basic | WeiYuanPython fundamentals - basic | WeiYuan
Python fundamentals - basic | WeiYuanWei-Yuan Chang
 
Abusing Erlang compilation pipeline for Fun and Profit
Abusing Erlang compilation pipeline for Fun and ProfitAbusing Erlang compilation pipeline for Fun and Profit
Abusing Erlang compilation pipeline for Fun and ProfitWojciech Gawroński
 
RではじめるTwitter解析
RではじめるTwitter解析RではじめるTwitter解析
RではじめるTwitter解析Takeshi Arabiki
 
Poetry with R -- Dissecting the code
Poetry with R -- Dissecting the codePoetry with R -- Dissecting the code
Poetry with R -- Dissecting the codePeter Solymos
 
Haskell retrospective
Haskell retrospectiveHaskell retrospective
Haskell retrospectivechenge2k
 
HBaseCon2017 Warp 10, a novel approach to managing and analyzing time series ...
HBaseCon2017 Warp 10, a novel approach to managing and analyzing time series ...HBaseCon2017 Warp 10, a novel approach to managing and analyzing time series ...
HBaseCon2017 Warp 10, a novel approach to managing and analyzing time series ...HBaseCon
 
Juggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBJuggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBDavid Golden
 
dplyr and torrents from cpasbien
dplyr and torrents from cpasbiendplyr and torrents from cpasbien
dplyr and torrents from cpasbienRomain Francois
 

Similar to Text fabric (20)

Open course(programming languages) 20150225
Open course(programming languages) 20150225Open course(programming languages) 20150225
Open course(programming languages) 20150225
 
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)
codin9cafe[2015.02.25]Open course(programming languages) - 장철호(Ch Jang)
 
20170516 hug france-warp10-time-seriesanalysisontopofhadoop
20170516 hug france-warp10-time-seriesanalysisontopofhadoop20170516 hug france-warp10-time-seriesanalysisontopofhadoop
20170516 hug france-warp10-time-seriesanalysisontopofhadoop
 
pa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processingpa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processing
 
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
PEP 498: The Monologue
PEP 498: The MonologuePEP 498: The Monologue
PEP 498: The Monologue
 
Text Display (when it gets tricky)
Text Display (when it gets tricky)Text Display (when it gets tricky)
Text Display (when it gets tricky)
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 Minutes
 
Python fundamentals - basic | WeiYuan
Python fundamentals - basic | WeiYuanPython fundamentals - basic | WeiYuan
Python fundamentals - basic | WeiYuan
 
Abusing Erlang compilation pipeline for Fun and Profit
Abusing Erlang compilation pipeline for Fun and ProfitAbusing Erlang compilation pipeline for Fun and Profit
Abusing Erlang compilation pipeline for Fun and Profit
 
RではじめるTwitter解析
RではじめるTwitter解析RではじめるTwitter解析
RではじめるTwitter解析
 
Poetry with R -- Dissecting the code
Poetry with R -- Dissecting the codePoetry with R -- Dissecting the code
Poetry with R -- Dissecting the code
 
Haskell retrospective
Haskell retrospectiveHaskell retrospective
Haskell retrospective
 
HBaseCon2017 Warp 10, a novel approach to managing and analyzing time series ...
HBaseCon2017 Warp 10, a novel approach to managing and analyzing time series ...HBaseCon2017 Warp 10, a novel approach to managing and analyzing time series ...
HBaseCon2017 Warp 10, a novel approach to managing and analyzing time series ...
 
Juggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDBJuggling Chainsaws: Perl and MongoDB
Juggling Chainsaws: Perl and MongoDB
 
Lettering js
Lettering jsLettering js
Lettering js
 
dplyr and torrents from cpasbien
dplyr and torrents from cpasbiendplyr and torrents from cpasbien
dplyr and torrents from cpasbien
 

More from Dirk Roorda

General Missives
General MissivesGeneral Missives
General MissivesDirk Roorda
 
Quran and Text-Fabric
Quran and Text-FabricQuran and Text-Fabric
Quran and Text-FabricDirk Roorda
 
Ancient corpora analysis
Ancient corpora analysisAncient corpora analysis
Ancient corpora analysisDirk Roorda
 
Verbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsVerbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsDirk Roorda
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchersDirk Roorda
 
Annotating the Hebrew Bible
Annotating the Hebrew BibleAnnotating the Hebrew Bible
Annotating the Hebrew BibleDirk Roorda
 
20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissenDirk Roorda
 
Text as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleText as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleDirk Roorda
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDirk Roorda
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDirk Roorda
 
Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Dirk Roorda
 
Data Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleData Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleDirk Roorda
 
Auto ingest demo-werklunch 2013-11-05
Auto ingest demo-werklunch 2013-11-05Auto ingest demo-werklunch 2013-11-05
Auto ingest demo-werklunch 2013-11-05Dirk Roorda
 
Shebanq roma-2013-10-01
Shebanq roma-2013-10-01Shebanq roma-2013-10-01
Shebanq roma-2013-10-01Dirk Roorda
 

More from Dirk Roorda (20)

TF-FAIR.pdf
TF-FAIR.pdfTF-FAIR.pdf
TF-FAIR.pdf
 
Textpy
TextpyTextpy
Textpy
 
General Missives
General MissivesGeneral Missives
General Missives
 
Tf in-context
Tf in-contextTf in-context
Tf in-context
 
Quran and Text-Fabric
Quran and Text-FabricQuran and Text-Fabric
Quran and Text-Fabric
 
Ancient corpora analysis
Ancient corpora analysisAncient corpora analysis
Ancient corpora analysis
 
Qdf2tf
Qdf2tfQdf2tf
Qdf2tf
 
Verbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsVerbal Valency in Hebrew Verbs
Verbal Valency in Hebrew Verbs
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchers
 
Annotating the Hebrew Bible
Annotating the Hebrew BibleAnnotating the Hebrew Bible
Annotating the Hebrew Bible
 
20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen
 
Text as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleText as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew Bible
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case Study
 
Award
AwardAward
Award
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case Study
 
Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Laf fabric-dh benelux2014
Laf fabric-dh benelux2014
 
Data Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleData Analysis in the Hebrew Bible
Data Analysis in the Hebrew Bible
 
LAF Fabric
LAF FabricLAF Fabric
LAF Fabric
 
Auto ingest demo-werklunch 2013-11-05
Auto ingest demo-werklunch 2013-11-05Auto ingest demo-werklunch 2013-11-05
Auto ingest demo-werklunch 2013-11-05
 
Shebanq roma-2013-10-01
Shebanq roma-2013-10-01Shebanq roma-2013-10-01
Shebanq roma-2013-10-01
 

Recently uploaded

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 

Recently uploaded (20)

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 

Text fabric

  • 1. text + annotations represented for slicing and dicing - collaboration - versioning Dirk Roorda SBL San Antonio 2016-11-18
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8. ... <clause id="3" rela="Objc"> <phrase id="27" type="VP"> <word id="153" sp="verb" lex="BR&gt;[" phono="bārˈā" lex_heb="‫א‬ ָּ‫ר‬ ָּ‫"ב‬ trans="B.@R@74&gt;" gloss="create" partofspeech="verb" trailer=" ">‫א‬ ָָּ֣‫ר‬ ָּ‫/<ב‬word> <word ...>...</word> ... </phrase> <phrase ...>...</phrase> ... </clause> ... straightforward XML - bloated - hierarchy problems - inferior tooling (xslt) - no separation of concerns - a horrible mess
  • 9. - even more bloated + graph (nodes/edges) + LAF-Fabric (python) + separation of concerns ~ an awful load of megabytes
  • 10. LAF-Fabric binary ~ reasonably compact - a bit opaque + column storage + python (pickle + gzip) + separation of concerns ~ starting to look decent
  • 11. Text Fabric + compact - transparent + column storage + pure python + separation of concerns ~ with the syntax bureaucracy out of the way, we can start to look into the real problems: * slicing and dicing * collaboration * versioning B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ W H >RY/ HJH[ THW/ W BHW/ W XCK/ first 20 words from 200000 200000 BN/ KL/ H JWM/ B JWM/ PC<[ >DWM/ MN TXT/ JD/ JHWDH/ W MLK[ <L MLK/ W <BR[ JWRM/ Y<JR=/
  • 12. Text Fabric (inside) LAF-Fabric, with the LAF replaced by Text Inside TF, the LAF-Fabric-API just works: for n in NN(): if F.otype.v(n) == 'verse': label = T.passage(n) text = T.words( L.d('word', n), fmt='pf', ) print('{}t{}n'.format( label, text, ))
  • 13. Text Fabric (skeleton) A TF dataset is a bunch of text files with at least otype and monads otype 0-426580 word 426581-514580 clause 514581-605142 clause_atom 605143-858316 phrase 858317-1125831 phrase_atom 1125832-1189401 sentence 1189402-1253740 sentence_atom 1253741-1367532 subphrase 1367533-1367571 book 1367572-1368500 chapter 1368501-1413680 half_verse 1413681-1436893 verse monads 1367533 1-28762 1367534 28763-52510 1367572 1-673 1367573 674-1167 1413681 1-11 1413682 12-31 1368501 1-4 1368502 5-11 1125832 1-11 1125833 12-18 1189402 1-11 1189403 12-18 426581 1-11 426582 12-18 514581 1-11 514582 12-18 605143 1-2 605144 3 858317 1-2 858318 3 1253741 5-7 1253742 9-11 This is the skeleton: • the positions • the containment of all text objects
  • 14. text_full ְּ‫ב‬ ‫ית‬ ִׁ֖‫אש‬ ֵ‫ר‬ ‫א‬ ָָּ֣‫ר‬ ָּ‫ב‬ ‫ים‬ ִ֑‫ֹלה‬ֱ‫א‬ ‫ת‬ ֵֵ֥‫א‬ ְּ‫ה‬ ‫ם‬‫י‬ ִׁ֖‫מ‬ ָּ‫ש‬ ְּ‫ו‬ ‫ת‬ ֵֵ֥‫א‬ ְָּּ‫ה‬ ‫ץ‬ ֶ‫ר‬ ָָּֽ‫א‬ ְּ‫ו‬ ְָּּ‫ה‬ ‫ץ‬ ֶ‫ר‬ ָָּ֗‫א‬ ‫ה‬ ֵָּ֥‫ת‬‫י‬ ָּ‫ה‬ ְּ‫הו‬ ֹ֨‫ת‬ ְָּּ‫ו‬ ‫הו‬ ֹ֔‫ב‬ ְּ‫ו‬ ‫ְך‬ ֶ‫ש‬ ִׁ֖‫ח‬ parent 0-1,858317605143 2,858318 605144 3,858319 605145 4-10,858320 605146 426581,1189402 1125832 514581,605143-605146426581 information content trailer _ _ _ _ _ _ ‫_׃‬ _ _ _ _ _ p-s-p prep subs verb subs prep art subs conj prep art subs conj art subs verb subs conj subs conj subs Text Fabric (flesh)
  • 15. parent 0-1,858317605143 2,858318 605144 3,858319 605145 4-10,858320 605146 514581,605143-605146426581 426581,1189402 1125832 Edges (c't'd) 0 1 2 3 4 5 6 7 8 9 10 word 858317 858318 858319 858320 phrase_atom bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ 605143 605144 605145 605146 phrase 426581 clause 1125832 sentence
  • 16. flesh text_full ְּ‫ב‬ ‫ית‬ ִׁ֖‫אש‬ ֵ‫ר‬ ‫א‬ ָָּ֣‫ר‬ ָּ‫ב‬ ‫ים‬ ִ֑‫ֹלה‬ֱ‫א‬ ‫ת‬ ֵֵ֥‫א‬ ְּ‫ה‬ ‫ם‬‫י‬ ִׁ֖‫מ‬ ָּ‫ש‬ ְּ‫ו‬ ‫ת‬ ֵֵ֥‫א‬ ְָּּ‫ה‬ ‫ץ‬ ֶ‫ר‬ ָָּֽ‫א‬ ְּ‫ו‬ ְָּּ‫ה‬ ‫ץ‬ ֶ‫ר‬ ָָּ֗‫א‬ ‫ה‬ ֵָּ֥‫ת‬‫י‬ ָּ‫ה‬ ְּ‫הו‬ ֹ֨‫ת‬ ְָּּ‫ו‬ ‫הו‬ ֹ֔‫ב‬ ְּ‫ו‬ ‫ְך‬ ֶ‫ש‬ ִׁ֖‫ח‬ trailer _ _ _ _ _ _ ‫_׃‬ _ _ _ _ _ p-s-p prep subs verb subs prep art subs conj prep art subs conj art subs verb subs conj subs conj subs typ PP NP VP NP - Objc Resu skeleton otype word word word word word word word word word word word word word word word word word word word phrase phrase phrase phrase clause clause clause monads 600000 0-1 600001 2 600002 3 600003 4-10 400001 0-10 400002 11-21 400003 22-34 vertical: feature selection horizontal: object selection both: modules slicing 'n dicing
  • 17. collaborationflesh text_full ְּ‫ב‬ ‫ית‬ ִׁ֖‫אש‬ ֵ‫ר‬ ‫א‬ ָָּ֣‫ר‬ ָּ‫ב‬ ‫ים‬ ִ֑‫ֹלה‬ֱ‫א‬ ‫ת‬ ֵֵ֥‫א‬ ְּ‫ה‬ ‫ם‬‫י‬ ִׁ֖‫מ‬ ָּ‫ש‬ ְּ‫ו‬ ‫ת‬ ֵֵ֥‫א‬ ְָּּ‫ה‬ ‫ץ‬ ֶ‫ר‬ ָָּֽ‫א‬ ְּ‫ו‬ ְָּּ‫ה‬ ‫ץ‬ ֶ‫ר‬ ָָּ֗‫א‬ ‫ה‬ ֵָּ֥‫ת‬‫י‬ ָּ‫ה‬ ְּ‫הו‬ ֹ֨‫ת‬ ְָּּ‫ו‬ ‫הו‬ ֹ֔‫ב‬ ְּ‫ו‬ ‫ְך‬ ֶ‫ש‬ ִׁ֖‫ח‬ p-s-p prep subs verb subs prep art subs conj prep art subs conj art subs verb subs conj subs conj subs skeleton otype word word word word word word word word word word word word word word word word word word word phrase phrase phrase phrase clause clause clause monads 600000 0-1 600001 2 600002 3 600003 4-10 400001 0-10 400002 11-21 400003 22-34 module: feature strong for words module strong 8675 7225 1254 a 430 853 8676 8064 8678 853 8676 776 8678 8676 776 1961 8414 8678 922 8678 2822 dependency: on the implicit monad order!
  • 18. @20170101 @20161118 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 10 9 11 10 12 11 13 12 14 13 15 14 16 15 17 16 18 17 19 18 20 19 21 20 600001 600000 600002 600001 600003 600002 600004 600003 400002 400001 400003 400002 400004 400003 ⇐ flesh @20170101 text_full ְּ‫ב‬ ‫ית‬ ִׁ֖‫אש‬ ֵ‫ר‬ ‫א‬ ָָּ֣‫ר‬ ָּ‫ב‬ ‫ים‬ ִ֑‫ֹלה‬ֱ‫א‬ ‫ת‬ ֵֵ֥‫א‬ ְּ‫ה‬ ‫ם‬‫י‬ ִׁ֖‫מ‬ ָּ‫ש‬ ְּ‫ו‬ ‫ת‬ ֵֵ֥‫א‬ ‫ש‬ ְָּּ‫ה‬ ‫ץ‬ ֶ‫ר‬ ָָּֽ‫א‬ ְּ‫ו‬ ְָּּ‫ה‬ ‫ץ‬ ֶ‫ר‬ ָָּ֗‫א‬ ‫ה‬ ֵָּ֥‫ת‬‫י‬ ָּ‫ה‬ ְּ‫הו‬ ֹ֨‫ת‬ ְָּּ‫ו‬ ‫הו‬ ֹ֔‫ב‬ ְּ‫ו‬ ‫ְך‬ ֶ‫ש‬ ִׁ֖‫ח‬ otype word word word word word word word word word word word word word word word word word word word word phrase phrase phrase phrase clause clause clause skeleton @20170101 monads 9 600001 0-1 600002 2 600003 3 600004 4-11 400002 0-11 400003 12-22 400004 23-35 strong 8675 7225 1254 a 430 853 8676 8064 8678 853 8676 776 8678 8676 776 1961 8414 8678 922 8678 2822 module @20161118versioning old module works as is on new data
  • 19. the end of the beginning etcbc Stephen Ku David van Acker James Cuénod skeleton @20160101 <= @20170101 module strong @20161118 flesh @20160101 flesh @20170101 module accent @20170202 skeleton @20170101 <= @20180101 module strong @20171118 flesh @20180101 module assoc @20180303 skeleton @20180101 <= @20190101 flesh @20190101 skeleton @20190101 <= @20200101