Text as Data
TextPy ⊂ Text-Fabric
Dirk Roorda

2021-02-25

1 year after the Lorentz Workshop "Processing Ancient Text Corpora"
How to Analyze Data with Python, Pandas & Numpy - 10 Hour Course
• Lesson 1: Python & Jupyter Fundamentals

• Lesson 2: Numpy for data processing

• Lesson 3: Pandas for working with tabular data

• Lesson 4: Visualization with Matplotlib and Seaborn

• Lesson 5: Exploratory Data Analysis: A Case Study

• Course Project - Exploratory Data Analysis

• Find a real-world dataset of your choice online

• Use Numpy & Pandas to parse, clean & analyze data

• Use Matplotlib & Seaborn to create visualizations

• Ask and answer interesting questions about the data
codecamp
How to Analyze Text with Python, TextPy and Text-Fabric - 10 Hour Course
• Lesson 1: Python & Jupyter Fundamentals

• Lesson 2: TextPy for text processing

• Lesson 3: Text-Fabric for working with annotated corpora

• Lesson 4: Visualization with Matplotlib and Seaborn

• Lesson 5: Exploratory Data Analysis: A Case Study

• Course Project - Exploratory Data Analysis

• Find a real-world corpus of your choice online

• Use Walker to convert data

• Use TextPy for quantitative analysis

• Use Text-Fabric to query text and find interesting pieces

• Use Matplotlib & Seaborn to create visualizations
tf-docs
What to expect
TextPy is not smart
• no linguistic knowledge

• no AI

• not an annotation tool

• not a citation finder / parallel passage detector

• not a crowdsourcing application
TextPy works with a text-oriented data structure
• positions in a sequence

• embedding and overlap

• linking and connecting

• annotations

• efficient operations on this data structure
textpy
Example: NumPy vs OpenCV
• Image of Arabic text: open it with OpenCV

• Under the hood it is a NumPy 2-dimensional array of pixels

• Produce histograms and line boundaries by algorithms expressed in NumPy

• Show the results in the image with OpenCV
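The division of labour on this slide can be sketched in a few lines. This is a minimal illustration, not the fusus code: it uses a small synthetic page instead of a real image of Arabic text, and the function name `line_boundaries` and its `threshold` parameter are assumptions made for the example. With OpenCV, the array would typically come from `cv2.imread`, which returns a NumPy array; everything after that is plain NumPy.

```python
import numpy as np


def line_boundaries(ink, threshold=0):
    """Find text lines in a 2-D array where non-zero pixels are ink.

    Returns the horizontal projection histogram and a list of
    (top, bottom) row ranges, with bottom exclusive.
    """
    # Horizontal projection histogram: number of ink pixels per pixel row.
    histogram = (ink > 0).sum(axis=1)
    has_ink = histogram > threshold
    # Pad with False so that lines touching the page edge are also closed.
    padded = np.concatenate(([False], has_ink, [False]))
    changes = np.diff(padded.astype(np.int8))
    tops = np.flatnonzero(changes == 1)      # blank row -> inked row
    bottoms = np.flatnonzero(changes == -1)  # inked row -> blank row
    return histogram, list(zip(tops.tolist(), bottoms.tolist()))


# Synthetic "page": two bands of ink on a blank background.
page = np.zeros((12, 20), dtype=np.uint8)
page[2:4, 3:17] = 255
page[7:10, 5:15] = 255

histogram, lines = line_boundaries(page)
print(lines)
```

The specific module (OpenCV) only opens and shows the image; the generic model (the array) carries the actual computation.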
fusus
Generous Python Modules
Generous, because they do so much work in so many situations
Basic models: set, list, tree, dictionary

• standard library of the Python language

• flimsy operations

• ubiquitous use
Generic models: n-dim array, dataframe, RDF

• utility Python modules

• hard work inside the model

• usable wherever the domain can be expressed in the model
Specific models: HTML, PDF, TEI, NLTK

• domain-specific Python modules

• substantial operations

• only usable for that domain
A generic model for text
A text is

• a graph (basic)

with

• the first N nodes ordered in a sequence (slots)

• all other nodes mapped to subsets of slots

• any number of mappings between nodes/edges and values (annotations)
tf-model
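The model above is small enough to write down directly. The following is a minimal sketch in plain Python, not the Text-Fabric implementation: the class name `TextGraph`, its method names, the numbering of nodes from 1, and the sample features are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextGraph:
    """A text as a graph: slots first, then nodes mapped to sets of slots."""

    n_slots: int
    # Non-slot node -> sorted tuple of the slots it covers.
    node_slots: dict = field(default_factory=dict)
    # Feature name -> {node: value}.
    node_features: dict = field(default_factory=dict)
    # Feature name -> {(from_node, to_node): value}.
    edge_features: dict = field(default_factory=dict)

    def add_node(self, slots):
        """Create a non-slot node covering the given slots; return its number."""
        slots = tuple(sorted(set(slots)))
        if not slots or slots[0] < 1 or slots[-1] > self.n_slots:
            raise ValueError("a node must cover at least one existing slot")
        node = self.n_slots + len(self.node_slots) + 1
        self.node_slots[node] = slots
        return node

    def slots_of(self, node):
        """A slot covers itself; any other node covers its mapped slots."""
        if 1 <= node <= self.n_slots:
            return (node,)
        return self.node_slots[node]

    def annotate(self, feature, node, value):
        self.node_features.setdefault(feature, {})[node] = value

    def annotate_edge(self, feature, from_node, to_node, value):
        self.edge_features.setdefault(feature, {})[(from_node, to_node)] = value


# Three word slots, one phrase and one sentence on top of them.
text = TextGraph(n_slots=3)
for slot, word in enumerate(["In", "the", "beginning"], start=1):
    text.annotate("word", slot, word)

phrase = text.add_node([2, 3])
sentence = text.add_node([1, 2, 3])
text.annotate("type", phrase, "phrase")
text.annotate("type", sentence, "sentence")
text.annotate_edge("role", phrase, sentence, "complement")

print(phrase, text.slots_of(phrase))
```

Everything beyond the slot sequence and the node-to-slots mapping is an annotation, so new layers of information never disturb the existing ones.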
Supported operations
Micro
• high-speed walking through the textual sequence

• navigating between embedders and embeddees

• accessing feature values and weaving them into text

• displaying text structures

• querying on the combination of content and spatial relationships
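A few of the micro operations can be illustrated on a toy corpus. This is a sketch under stated assumptions: the data, the node numbers, and the function names `embedders`, `embeddees` and `weave` are hypothetical and do not reproduce the Text-Fabric API. Embedding is taken here to mean slot-set inclusion, and the linear scans stand in for the precomputed indexes a real implementation would use.

```python
# Hypothetical toy corpus: five word slots plus three higher nodes.
words = {1: "the", 2: "quick", 3: "fox", 4: "runs", 5: "home"}
spans = {
    6: frozenset({1, 2, 3}),        # a phrase
    7: frozenset({4, 5}),           # another phrase
    8: frozenset({1, 2, 3, 4, 5}),  # the sentence
}


def slots_of(node):
    """A slot covers itself; other nodes cover their mapped slots."""
    return frozenset({node}) if node in words else spans[node]


def embedders(node):
    """Nodes whose slots include all slots of `node` (the node itself excluded)."""
    mine = slots_of(node)
    return sorted(n for n, s in spans.items() if n != node and mine <= s)


def embeddees(node):
    """Nodes whose slots all lie within the slots of `node`."""
    mine = slots_of(node)
    candidates = list(words) + list(spans)
    return sorted(n for n in candidates if n != node and slots_of(n) <= mine)


def weave(node, feature=words):
    """Join the feature values of a node's slots, in textual order."""
    return " ".join(feature[s] for s in sorted(slots_of(node)))


# Walking through the textual sequence is a loop over the slots.
walked = [words[s] for s in sorted(words)]

print(embedders(2))
print(embeddees(6))
print(weave(7))
```

Note that two nodes with identical slot sets would count as embedding each other in this sketch; a real implementation needs a tie-breaking rule for that case.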
Macro
• convert from arbitrary XML / TEI

• convert from arbitrary TSV

• compose / modify corpora

• export - process - re-import
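As an illustration of the TSV conversion, the sketch below turns a small table into slots, features and sentence nodes using only the standard library. It is not the Text-Fabric converter: the column names (`word`, `pos`, `sentence`), the function name `tsv_to_model` and the output layout are assumptions chosen for the example.

```python
import csv
import io

# Hypothetical input: one row per word, with a sentence number per row.
TSV = "word\tpos\tsentence\nHello\tintj\t1\nworld\tnoun\t1\nBye\tintj\t2\n"


def tsv_to_model(handle, feature_columns=("word", "pos"), group_column="sentence"):
    """Read rows as slots and build one higher node per group value."""
    features = {}  # feature name -> {slot: value}
    groups = {}    # group value -> list of slots, in order of appearance
    n_slots = 0
    reader = csv.DictReader(handle, delimiter="\t")
    for n_slots, row in enumerate(reader, start=1):
        for name in feature_columns:
            features.setdefault(name, {})[n_slots] = row[name]
        groups.setdefault(row[group_column], []).append(n_slots)
    # Higher nodes are numbered right after the last slot.
    node_slots = {
        n_slots + i: tuple(slots)
        for i, slots in enumerate(groups.values(), start=1)
    }
    return n_slots, features, node_slots


n_slots, features, node_slots = tsv_to_model(io.StringIO(TSV))
print(n_slots, node_slots)
```

The same shape of output (slot count, per-feature mappings, node-to-slots mapping) could be produced from XML or TEI input, which is what makes the target model generic.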
To do
To make it happen
• Split Text-Fabric into the TextPy core and the Text-Fabric additions

• Optimize TextPy (Cythonize, indexing)

• Distribute "wheels" for Linux, macOS, Windows

• Support Pandas-ish text access

• F.gender.v(n) becomes corpus.gender[n]
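One way the proposed access style could sit on top of the existing one is a thin wrapper. This is only a sketch of the idea, not a planned design: `Corpus` and `FeatureColumn` are hypothetical names, and `StubFeature` and `StubF` are stand-ins that mimic the `F.gender.v(n)` call shape so the example runs without Text-Fabric installed.

```python
class StubFeature:
    """Stand-in for a feature object offering a .v(node) lookup."""

    def __init__(self, values):
        self._values = values

    def v(self, node):
        # In this stub, nodes without a value yield None.
        return self._values.get(node)


class StubF:
    """Stand-in for the F object: one attribute per feature."""

    gender = StubFeature({1: "m", 2: "f"})


class FeatureColumn:
    """Wraps a feature so that column[node] calls feature.v(node)."""

    def __init__(self, feature):
        self._feature = feature

    def __getitem__(self, node):
        return self._feature.v(node)


class Corpus:
    """Hypothetical facade: corpus.gender[n] instead of F.gender.v(n)."""

    def __init__(self, F):
        self._F = F

    def __getattr__(self, name):
        # Only called when normal lookup fails; keep private names out of it
        # to avoid recursion before _F is set.
        if name.startswith("_"):
            raise AttributeError(name)
        return FeatureColumn(getattr(self._F, name))


corpus = Corpus(StubF())
print(corpus.gender[2])
```

Because the wrapper only forwards calls, both notations can coexist while users migrate.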
To build on it
• Add volume support: working per volume in a corpus

• Add operations that address multiple volumes

• Add operations that address multiple corpora

• intertextuality