texttaglib’s documentation!¶
Warning
⚠️ THIS PROJECT HAS BEEN RENAMED AND ARCHIVED. ALL FUTURE DEVELOPMENT WILL BE ON speach LIBRARY ⚠️
Migration to speach¶
Migration from texttaglib to speach should be trivial:
# just change import statements from something like
from texttaglib import elan
# to the new package name
from speach import elan
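For code that must keep working while both packages are still in circulation, one option is to try the new name first and fall back to the old one. The sketch below is only an illustration: `import_first_available` is a hypothetical helper, not part of texttaglib or speach, and it assumes both packages expose the same `elan` submodule, as the import change above implies.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported.

    Hypothetical helper for illustration; not part of texttaglib or speach.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # this candidate is not installed, try the next one
    raise ImportError(f"None of these modules could be imported: {module_names}")
```

It could then be called as `elan = import_first_available(["speach.elan", "texttaglib.elan"])`, which prefers speach when it is installed.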
Installation
# change
pip install texttaglib
# into
pip install speach
For more information, please visit: https://speach.readthedocs.io/
(Legacy) Introduction¶
texttaglib is a Python library for managing and annotating text corpora in different formats.
Main features are:
Multiple storage formats (text files, JSON files, SQLite databases)
TTLIG - A human-friendly interlinear gloss format for linguistic documentation
Manipulating transcription files directly in ELAN Annotation Format (eaf)
Basic usage¶
>>> from texttaglib import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()
The script above will generate the following corpus files:
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_concepts.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_links.txt
-rw-rw-r--. 1 tuananh tuananh 20 3月 29 13:10 mydoc_sents.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_tags.txt
-rw-rw-r--. 1 tuananh tuananh 58 3月 29 13:10 mydoc_tokens.txt
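The `<start:end>` pairs printed next to each token in the session above are character offsets into the sentence text. As a rough illustration of how such offsets can be derived, here is a minimal sketch that matches tokens left to right with `str.find`. The function name `align_tokens` is an assumption made for this example, and this is not necessarily how texttaglib computes the offsets internally.

```python
def align_tokens(text, tokens):
    """Return (start, end) character offsets of each token within text.

    Tokens are matched left to right, so repeated words map to successive
    occurrences. Raises ValueError if a token cannot be found.
    """
    spans = []
    cursor = 0  # position in text where the search for the next token begins
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"Token {token!r} not found after position {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


spans = align_tokens("I am a sentence.", ["I", "am", "a", "sentence", "."])
print(spans)  # [(0, 1), (2, 4), (5, 6), (7, 15), (15, 16)]
```

The printed pairs match the offsets in the token listing, for example `sentence` spanning characters 7 to 15.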
ELAN support¶
The texttaglib library includes a command-line tool for converting EAF files into CSV.
python -m texttaglib eaf2csv input_elan_file.eaf -o output_file_name.csv
For more complex analyses, texttaglib Python scripts can be used to extract metadata and annotations from ELAN transcripts, for example:
from texttaglib import elan
# Test ELAN reader function in texttaglib
eaf = elan.open_eaf('./data/test.eaf')
# accessing metadata
print(f"Author: {eaf.author} | Date: {eaf.date} | Format: {eaf.fileformat} | Version: {eaf.version}")
print(f"Media file: {eaf.media_file}")
print(f"Time units: {eaf.time_units}")
print(f"Media URL: {eaf.media_url} | MIME type: {eaf.mime_type}")
print(f"Media relative URL: {eaf.relative_media_url}")
# accessing tiers & annotations
for tier in eaf.tiers():
    print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
    for ann in tier.annotations:
        print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")
SQLite support¶
TTL data can be stored in a SQLite database for better corpus analysis. Sample code will be added soon.
Getting Started¶
Introduction to texttaglib
Installation¶
texttaglib is available on PyPI and can be installed using pip.
pip install --user texttaglib
Common Recipes¶
Here are code snippets for common use cases of texttaglib.
Open an ELAN file¶
>>> from texttaglib import elan
>>> eaf = elan.open_eaf('./data/test.eaf')
>>> eaf
<texttaglib.elan.ELANDoc object at 0x7f67790593d0>
Parse an existing text stream¶
>>> from texttaglib import elan
>>> with open('./data/test.eaf') as eaf_stream:
...     eaf = elan.parse_eaf_stream(eaf_stream)
...
>>> eaf
<texttaglib.elan.ELANDoc object at 0x7f6778f7a9d0>
Accessing tiers & annotations¶
for tier in eaf.tiers():
    print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
    for ann in tier.annotations:
        print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")
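For further processing it is often handier to collect the annotations into plain Python records than to print them. The sketch below uses only the attributes that appear in the loop above (`tier.ID`, `tier.participant`, `ann.ID`, `ann.from_ts.ts`, `ann.to_ts.ts`, `ann.value`). The function name `collect_annotations` and the dictionary keys are choices made for this example, not part of the texttaglib API.

```python
def collect_annotations(eaf):
    """Flatten all annotations of an ELAN document into a list of dicts.

    eaf is expected to behave like the object returned by elan.open_eaf():
    it must provide tiers(), and each tier must provide annotations.
    """
    records = []
    for tier in eaf.tiers():
        for ann in tier.annotations:
            records.append({
                "tier": tier.ID,
                "participant": tier.participant,
                "annotation": ann.ID,
                "from": ann.from_ts.ts,
                "to": ann.to_ts.ts,
                "value": ann.value,
            })
    return records
```

With a document opened as in the earlier recipe, `collect_annotations(eaf)` would return one dictionary per annotation, ready to be filtered or sorted.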
Accessing nested tiers in ELAN¶
eaf = elan.open_eaf('./data/test_nested.eaf')
# accessing nested tiers
for tier in eaf.roots:
    print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
    for child_tier in tier.children:
        print(f" | {child_tier.ID} | Participant: {child_tier.participant} | Type: {child_tier.type_ref}")
        for ann in child_tier.annotations:
            print(f" |- {ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")
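The loop above visits root tiers and their direct children only. If a transcript nests tiers more deeply, a recursive generator can visit every level. This is a sketch under one assumption: that child tiers expose a `children` attribute in the same way root tiers do. The name `walk_tiers` is hypothetical and not part of texttaglib.

```python
def walk_tiers(tiers, depth=0):
    """Yield (depth, tier) pairs for every tier, parents before children.

    Assumes each tier exposes its sub-tiers through a `children` attribute;
    a missing or empty attribute is treated as "no children".
    """
    for tier in tiers:
        yield depth, tier
        children = getattr(tier, "children", None) or ()
        yield from walk_tiers(children, depth + 1)
```

Iterating over `walk_tiers(eaf.roots)` gives the nesting depth of each tier alongside the tier itself, which can be used to indent the output.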
Converting ELAN files to CSV¶
texttaglib includes a command line tool to convert an EAF file into CSV.
python -m texttaglib eaf2csv my_transcript.eaf -o my_transcript.csv
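Once the file has been converted, it can be loaded with Python's standard `csv` module. The sketch below makes no assumption about which columns eaf2csv writes, since the layout is not described here. It simply returns every row as a list of strings. The helper name `read_table` and the default delimiter are assumptions; pass a different `delimiter` (for example `"\t"`) if the output turns out to be delimited differently.

```python
import csv


def read_table(path, delimiter=","):
    """Read a delimited text file into a list of rows (lists of strings).

    The delimiter is a parameter because the exact layout written by
    eaf2csv is not documented here; adjust it to match your file.
    """
    with open(path, newline="", encoding="utf-8") as infile:
        return list(csv.reader(infile, delimiter=delimiter))
```

For example, `rows = read_table("my_transcript.csv")` loads the file produced by the command above, after which `len(rows)` gives the number of exported lines.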
texttaglib APIs¶
An overview of texttaglib modules.
ELAN support¶
texttaglib supports reading and manipulating multi-tier transcriptions from ELAN directly.
TTL Interlinear Gloss Format¶
TTLIG is a human friendly interlinear gloss format that can be edited using any text editor.
TTL SQLite¶
TTL supports SQLite as a storage format for managing large-scale corpora.
Useful Links¶
texttaglib documentation: https://texttaglib.readthedocs.io/
texttaglib on PyPI: https://pypi.org/project/texttaglib/
Source code: https://github.com/letuananh/texttaglib/