Towards TextPy, a module for processing text.
If we define annotated text as a graph with additional structure, we can make text processing more efficient, in the same way that Pandas makes processing dataframes more efficient.
TextPy
1. Text as Data
TextPy ⊂ Text-Fabric
Dirk Roorda
2021-02-25
1 year after the Lorentz Workshop "Processing Ancient Text Corpora"
2. How to Analyze Data with Python, Pandas & Numpy - 10 Hour Course
• Lesson 1: Python & Jupyter Fundamentals
• Lesson 2: Numpy for data processing
• Lesson 3: Pandas for working with tabular data
• Lesson 4: Visualization with Matplotlib and Seaborn
• Lesson 5: Exploratory Data Analysis: A Case Study
• Course Project - Exploratory Data Analysis
• Find a real-world dataset of your choice online
• Use Numpy & Pandas to parse, clean & analyze data
• Use Matplotlib & Seaborn to create visualizations
• Ask and answer interesting questions about the data
codecamp
3. How to Analyze Text with Python, TextPy and Text-Fabric - 10 Hour Course
• Lesson 1: Python & Jupyter Fundamentals
• Lesson 2: TextPy for text processing
• Lesson 3: Text-Fabric for working with annotated corpora
• Lesson 4: Visualization with Matplotlib and Seaborn
• Lesson 5: Exploratory Data Analysis: A Case Study
• Course Project - Exploratory Data Analysis
• Find a real-world corpus of your choice online
• Use Walker to convert data
• Use TextPy for quantitative analysis
• Use Text-Fabric to query text and find interesting pieces
• Use Matplotlib & Seaborn to create visualizations
tf-docs
4. What to expect
TextPy is not smart
• no linguistic knowledge
• no AI
• not an annotation tool
• not a citation finder / parallel passage detector
• not a crowdsourcing application
TextPy works with a text-oriented data structure
• positions in a sequence
• embedding and overlap
• linking and connecting
• annotations
• efficient operations on this data structure
textpy
5. Example: NumPy vs OpenCV
• Image of Arabic text: open it with OpenCV
• Under the hood it is a NumPy 2-dimensional array of pixels
• Produce histograms and line boundaries by algorithms expressed in NumPy
• Show the results in the image with OpenCV
fusus
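The histogram step can be sketched in plain NumPy. This is a minimal illustration, not the code behind the example on this slide: the function name `line_boundaries` and the tiny synthetic page are assumptions, and the page is built by hand rather than read from an image file. In practice OpenCV's `cv2.imread` returns the image as a NumPy array, so the same function would apply to it after binarization.

```python
import numpy as np


def line_boundaries(ink, threshold=0):
    """Find text lines in a binary page image.

    ink: 2-D array, nonzero where there is ink.
    Returns a list of (top, bottom) row ranges, bottom exclusive.
    """
    # Horizontal histogram: number of ink pixels in each row.
    profile = (ink > 0).sum(axis=1)
    has_ink = profile > threshold
    # Pad with False so every run of ink rows has a start and an end.
    padded = np.concatenate(([False], has_ink, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends)]


# Synthetic 10 x 8 page with two "lines" of ink (hypothetical data).
page = np.zeros((10, 8), dtype=np.uint8)
page[1:3, 2:6] = 1
page[5:8, 1:7] = 1
lines = line_boundaries(page)
```

The division of labour is the point: OpenCV opens and shows the image, while the analysis itself is ordinary array arithmetic.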
6. Generous Python Modules
Generous, because they do so much work in so many situations
Basic models: set, list, tree, dictionary:
• standard library of the Python language
• flimsy operations
• ubiquitous use
Generic models: n-dim array, dataframe, RDF
• utility Python modules
• hard work inside the model
• usable wherever the domain can be expressed in the model
Specific models: HTML, PDF, TEI, NLTK
• domain specific Python modules
• substantial operations
• only usable for that domain
7. A generic model for text
A text is
• a graph (basic)
with
• the first N nodes ordered in a sequence (slots)
• all other nodes mapped to subsets of slots
• any number of mappings between nodes/edges and values (annotations)
tf-model
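The model above can be written down in a few lines of Python. This is a sketch under stated assumptions, not the Text-Fabric implementation: the class `TextGraph`, its method names and the three-word example text are all hypothetical, and nodes are numbered from 1 so that nodes 1..N are the slots.

```python
from dataclasses import dataclass, field


@dataclass
class TextGraph:
    n_slots: int                                   # nodes 1..n_slots are the slots
    slot_sets: dict = field(default_factory=dict)  # non-slot node -> frozenset of slots
    features: dict = field(default_factory=dict)   # feature name -> {node or edge: value}

    def add_node(self, slots):
        """Add a non-slot node covering the given slots; return its node number."""
        if not all(1 <= s <= self.n_slots for s in slots):
            raise ValueError("a node can only cover existing slots")
        node = self.n_slots + len(self.slot_sets) + 1
        self.slot_sets[node] = frozenset(slots)
        return node

    def slots(self, node):
        """Slots covered by a node; a slot covers just itself."""
        if 1 <= node <= self.n_slots:
            return frozenset({node})
        return self.slot_sets[node]

    def annotate(self, name, target, value):
        """Attach a value to a node (int) or to an edge (pair of nodes)."""
        self.features.setdefault(name, {})[target] = value


# Hypothetical three-word text: "the cat sleeps".
text = TextGraph(n_slots=3)
for slot, word in enumerate(["the", "cat", "sleeps"], start=1):
    text.annotate("word", slot, word)
phrase = text.add_node({1, 2})        # covers "the cat"
sentence = text.add_node({1, 2, 3})   # covers the whole text
text.annotate("function", phrase, "subject")       # annotation on a node
text.annotate("parent", (phrase, sentence), True)  # annotation on an edge
```

Everything beyond the slot sequence is a set of slots plus annotations, which is what keeps the model generic.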
8. Supported operations
Micro
• high-speed walking through the textual sequence
• navigating between embedders and embeddees
• accessing feature values and weaving them to text
• display text structures
• query on the combination of content and spatial relationships
Macro
• convert from arbitrary XML / TEI
• convert from arbitrary TSV
• compose / modify corpora
• export - process - re-import
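Two of the micro operations, navigating between embedders and embeddees and weaving feature values to text, can be sketched with plain sets. The corpus, the node numbers and the function names below are hypothetical, and the linear scan over all nodes is for illustration only; an efficient implementation would precompute indexes.

```python
# Hypothetical toy corpus: slots 1..5 carry words,
# nodes 6..9 are mapped to subsets of those slots.
words = {1: "the", 2: "cat", 3: "sleeps", 4: "on", 5: "mats"}
slot_sets = {
    6: frozenset({1, 2}),            # phrase "the cat"
    7: frozenset({4, 5}),            # phrase "on mats"
    8: frozenset({3, 4, 5}),         # phrase "sleeps on mats"
    9: frozenset({1, 2, 3, 4, 5}),   # sentence
}
all_nodes = list(words) + list(slot_sets)


def slots_of(node):
    """Slots covered by a node; a slot covers just itself."""
    return slot_sets.get(node, frozenset({node}))


def embedders(node):
    """Nodes whose slots include all slots of the given node."""
    return [m for m in all_nodes if m != node and slots_of(node) <= slots_of(m)]


def embeddees(node):
    """Nodes whose slots all lie within the slots of the given node."""
    return [m for m in all_nodes if m != node and slots_of(m) <= slots_of(node)]


def text_of(node):
    """Weave the word feature of the covered slots into text."""
    return " ".join(words[s] for s in sorted(slots_of(node)))
```

With this data, `embedders(7)` gives the larger phrase and the sentence, and `text_of(8)` gives the words of slots 3 to 5 in order.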
9. To do
To make it happen
• Split Text-Fabric into the TextPy core and the Text-Fabric additions
• Optimize TextPy (Cythonize, indexing)
• Distribute "wheels" for Linux, macOS, Windows
• Support Pandas-ish text access
• F.gender.v(n) becomes corpus.gender[n]
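One way the Pandas-ish access could look is sketched below. TextPy does not exist yet, so the classes `Corpus` and `FeatureColumn` and the sample data are assumptions made for illustration; the sketch only shows that attribute lookup plus indexing is enough to support both spellings side by side.

```python
class FeatureColumn:
    """Values of one feature, addressed by node number (hypothetical class)."""

    def __init__(self, values):
        self._values = dict(values)

    def v(self, node):
        # Current style: F.gender.v(n)
        return self._values.get(node)

    def __getitem__(self, node):
        # Proposed style: corpus.gender[n]
        return self._values.get(node)


class Corpus:
    """Exposes every feature as an attribute (hypothetical class)."""

    def __init__(self, features):
        self._features = {name: FeatureColumn(vals) for name, vals in features.items()}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        features = self.__dict__.get("_features", {})
        if name in features:
            return features[name]
        raise AttributeError(name)


# Hypothetical data: a gender feature on nodes 1..3.
corpus = Corpus({"gender": {1: "m", 2: "f", 3: "m"}})
```

In this sketch both spellings return `None` for a node without a value; whether indexing should raise an error instead, as Pandas does for a missing label, is a design choice left open here.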
To build on it
• Add volume support: working per volume in a corpus
• Add operations that address multiple volumes
• Add operations that address multiple corpora
• intertextuality