ContentMine Architecture
1. RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific literature
Science
Plugins
Science
Volunteers
Collaboration with Open Access Button
6. Raw HTML
Not well-formed
Bad character semantics
ScholarlyHTML
Well-formed XHTML
PNG
Tagged sections
Captioned figures
Tables
Captioned tables
XML
HtmlTidy
Jsoup
HtmlUnit
XSLT1/2
NORMA
Per-journal stylesheets
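The "tagged sections" step on this slide can be illustrated with a minimal sketch. It is not Norma's implementation (the slide names XSLT and per-journal stylesheets for that job); it only shows the idea of taking well-formed XHTML and labelling sections by their headings. The rule table, the `class` attribute values, the `tag_sections` function and the sample document are all assumptions made for illustration, and the sample omits the XHTML namespace to keep the lookups short.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical tagging rules: heading pattern -> section tag.
# A real per-journal stylesheet would carry its own rules.
SECTION_RULES = [
    (re.compile(r"^\s*abstract", re.I), "abstract"),
    (re.compile(r"^\s*(materials and )?methods", re.I), "methods"),
    (re.compile(r"^\s*results", re.I), "results"),
    (re.compile(r"^\s*references", re.I), "references"),
]


def tag_sections(xhtml: str) -> ET.Element:
    """Parse well-formed XHTML and set class="..." on recognised sections."""
    # ET.fromstring raises ET.ParseError on input that is not well-formed,
    # which is why raw HTML has to be tidied before this step.
    root = ET.fromstring(xhtml)
    for div in root.iter("div"):
        heading = div.find("h2")
        if heading is None or heading.text is None:
            continue
        for pattern, tag in SECTION_RULES:
            if pattern.search(heading.text):
                div.set("class", tag)
                break
    return root


SAMPLE = """<html><body>
<div><h2>Abstract</h2><p>We study finches.</p></div>
<div><h2>Materials and Methods</h2><p>Birds were counted.</p></div>
<div><h2>Acknowledgements</h2><p>Thanks.</p></div>
</body></html>"""

tagged = tag_sections(SAMPLE)
classes = [div.get("class") for div in tagged.iter("div")]
print(classes)
```

Sections whose heading matches no rule are left untagged, so a downstream tool can tell "unknown" apart from a recognised section.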
7. End points
• Norma -> CMDir(OpenSHTML-SVG)
• Norma -> CMDir(sHTML sections) -> AMI -> (all text + species, chemistry, sequences)
• Norma -> CMDir(TXT (unsectioned)) -> AMI -> bagOfWords, regex
• Norma -> CMDir(PNG) -> AMI -> phylo, bar/xy-plots
• Norma -> CMDir(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
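The end points above amount to a routing table from the kind of content in a CMDir to the AMI analyses that can run on it. The sketch below writes that table down as data and adds toy versions of the two text analyses named for unsectioned TXT. It is an illustration only: the `ROUTES` dictionary is transcribed from the slide, while `bag_of_words`, `regex_hits`, the year pattern and the sample sentence are assumptions and not AMI's actual code or interface.

```python
import re
from collections import Counter

# Routes transcribed from the slide: CMDir content -> AMI analyses.
# The first end point lists no AMI step, hence the empty list.
ROUTES = {
    "OpenSHTML-SVG": [],
    "sHTML sections": ["all text", "species", "chemistry", "sequences"],
    "TXT (unsectioned)": ["bagOfWords", "regex"],
    "PNG": ["phylo", "bar/xy-plots"],
    "SVG": ["phylo", "bar/xy-plots", "chemistry"],
}


def bag_of_words(text):
    """Count lower-cased alphabetic words, ignoring order and position."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def regex_hits(text, pattern):
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


TEXT = "Zika virus was detected. The virus spread in 2015 and 2016."
bag = bag_of_words(TEXT)
years = regex_hits(TEXT, r"\b(?:19|20)\d{2}\b")
print(bag["virus"], years)
```

Both analyses work on plain text with no section markup, which is consistent with the slide placing them on the unsectioned TXT route.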
8. PDF
Non-Unicode
Pixel glyphs
No words
No structures
ScholarlyHTML
SVG
High-level graphics
PDF2SVG
characters
Sentences
Paras
tables
PNG OCR
Tagged sections
SVGBuilder
Captioned figures
NORMA
XSLT1/2
9. NORMALIZE
Norma
Convert PDF, XML to sHTML
Tag sections
Normalized scientific literature
AMI
Index
Transform
Extract
Search
PDF2SVG
XSL stylesheets
Taggers
Normalization parameters
“Permanent” filestore
Temporary filestore
Extracted facts
indexes
Plugins
Regex
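The AMI half of this slide (plugins, extracted facts, indexes, search) can be sketched as a small plugin registry that runs over normalized text and builds an index from each fact to the documents containing it. This is a hedged illustration of the architecture, not AMI's code: the plugin names, the two regular expressions, the `extract_facts` and `search` functions and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# A plugin takes normalized text and returns the facts it found.
Plugin = Callable[[str], List[str]]


def make_regex_plugin(pattern: str) -> Plugin:
    """Build a plugin that reports every match of a regular expression."""
    compiled = re.compile(pattern)
    return lambda text: [m.group(0) for m in compiled.finditer(text)]


# Hypothetical plugin registry; the patterns are deliberately simple.
PLUGINS: Dict[str, Plugin] = {
    "year": make_regex_plugin(r"\b(?:19|20)\d{2}\b"),
    "sequence": make_regex_plugin(r"\b[ACGT]{6,}\b"),
}


def extract_facts(docs: Dict[str, str]):
    """Run every plugin on every document: plugin -> fact -> [doc ids]."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for name, plugin in PLUGINS.items():
            for fact in plugin(text):
                index[name][fact].append(doc_id)
    return index


def search(index, plugin_name: str, fact: str) -> List[str]:
    """Look up which documents contain a given extracted fact."""
    return list(index.get(plugin_name, {}).get(fact, []))


DOCS = {
    "doc1": "Primer GATTACA was used in 2014.",
    "doc2": "Sampling ran from 2014 to 2015.",
}
index = extract_facts(DOCS)
print(search(index, "year", "2014"))
print(search(index, "sequence", "GATTACA"))
```

Keeping plugins as plain callables means a new extractor can be registered without touching the indexing or search code, which is one plausible reading of the slide's separation between plugins and indexes.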