texttaglib’s documentation!

Warning

⚠️ THIS PROJECT HAS BEEN RENAMED AND ARCHIVED. ALL FUTURE DEVELOPMENT WILL BE ON THE speach LIBRARY ⚠️

Migration to speach

Migrating from texttaglib to speach should be trivial: only the package name changes.

# just change import statements from something like
from texttaglib import elan
# to the new package name
from speach import elan

Installation

# change
pip install texttaglib
# into
pip install speach
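
During the transition, some scripts may need to run on machines where only one of the two packages is installed. The following is a minimal sketch of an import fallback using only the standard library. The helper name `import_first_available` is hypothetical and is not part of texttaglib or speach.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported.

    Hypothetical helper for illustration; not part of texttaglib or speach.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # this candidate is not installed, try the next one
    raise ImportError(f"None of these modules could be imported: {module_names}")
```

With this helper, `elan = import_first_available(["speach.elan", "texttaglib.elan"])` would prefer the new package and fall back to the archived one.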

(Legacy) Introduction

texttaglib is a Python library for managing and annotating text corpora in different formats.


Main features are:

  • Multiple storage formats (text files, JSON files, SQLite databases)

  • TTLIG - A human-friendly interlinear gloss format for linguistic documentation

  • Manipulating transcription files directly in ELAN Annotation Format (eaf)

Installation

texttaglib is available on PyPI.

pip install texttaglib

Basic usage

>>> from texttaglib import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()

The script above will generate this corpus:

-rw-rw-r--.  1 tuananh tuananh       0  3 29 13:10 mydoc_concepts.txt
-rw-rw-r--.  1 tuananh tuananh       0  3 29 13:10 mydoc_links.txt
-rw-rw-r--.  1 tuananh tuananh      20  3 29 13:10 mydoc_sents.txt
-rw-rw-r--.  1 tuananh tuananh       0  3 29 13:10 mydoc_tags.txt
-rw-rw-r--.  1 tuananh tuananh      58  3 29 13:10 mydoc_tokens.txt
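
The `<start:end>` pairs printed for each token in the session above are consistent with character offsets into the sentence text. The sketch below shows one way such offsets can be derived with plain Python. It is an illustration only, not texttaglib's implementation, and the function name `locate_tokens` is hypothetical.

```python
def locate_tokens(text, tokens):
    """Return (token, start, end) triples, scanning text from left to right."""
    spans = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"Token {token!r} not found after position {cursor}")
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans


spans = locate_tokens("I am a sentence.", ["I", "am", "a", "sentence", "."])
print(spans)
# [('I', 0, 1), ('am', 2, 4), ('a', 5, 6), ('sentence', 7, 15), ('.', 15, 16)]
```

Searching from the end of the previous token keeps repeated words from being matched to an earlier occurrence.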

ELAN support

The texttaglib library contains a command-line tool for converting EAF files into CSV.

python -m texttaglib eaf2csv input_elan_file.eaf -o output_file_name.csv
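
To give a rough idea of what such a conversion involves, here is a standard-library sketch that flattens time-aligned annotations from EAF XML into CSV rows. It does not reproduce the output of the `eaf2csv` command: the column layout and the function names are assumptions made for illustration, and reference annotations (`REF_ANNOTATION`) are skipped.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A tiny hand-written EAF fragment used only for this illustration.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT AUTHOR="" DATE="2020-01-01T00:00:00+00:00" FORMAT="3.0" VERSION="3.0">
  <HEADER TIME_UNITS="milliseconds"/>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER LINGUISTIC_TYPE_REF="default-lt" PARTICIPANT="P1" TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""

# Hypothetical column layout; the real tool may use different columns.
COLUMNS = ["tier", "participant", "annotation", "start", "end", "value"]


def eaf_to_rows(eaf_text):
    """Collect one row per time-aligned annotation."""
    root = ET.fromstring(eaf_text)
    # Map each time slot ID to its time value (milliseconds in this sample).
    slots = {
        slot.get("TIME_SLOT_ID"): slot.get("TIME_VALUE")
        for slot in root.iter("TIME_SLOT")
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append([
                tier.get("TIER_ID"),
                tier.get("PARTICIPANT", ""),
                ann.get("ANNOTATION_ID"),
                slots.get(ann.get("TIME_SLOT_REF1"), ""),
                slots.get(ann.get("TIME_SLOT_REF2"), ""),
                ann.findtext("ANNOTATION_VALUE", default=""),
            ])
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(eaf_to_rows(SAMPLE_EAF)))
```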

For more complex analyses, texttaglib Python scripts can be used to extract metadata and annotations from ELAN transcripts, for example:

from texttaglib import elan

# Test ELAN reader function in texttaglib
eaf = elan.open_eaf('./data/test.eaf')

# accessing metadata
print(f"Author: {eaf.author} | Date: {eaf.date} | Format: {eaf.fileformat} | Version: {eaf.version}")
print(f"Media file: {eaf.media_file}")
print(f"Time units: {eaf.time_units}")
print(f"Media URL: {eaf.media_url} | MIME type: {eaf.mime_type}")
print(f"Media relative URL: {eaf.relative_media_url}")

# accessing tiers & annotations
for tier in eaf.tiers():
    print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
    for ann in tier.annotations:
        print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")

SQLite support

TTL data can be stored in a SQLite database for better corpus analysis. Sample code will be added soon.
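
Until official sample code is available, the sketch below illustrates the general idea with Python's built-in `sqlite3` module. The table layout, column names, and function names are hypothetical and do not reflect the schema texttaglib actually uses.

```python
import sqlite3

# Hypothetical schema for illustration; not texttaglib's actual schema.
SCHEMA = """
CREATE TABLE sentence (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE token (
    sent_id INTEGER NOT NULL REFERENCES sentence(id),
    position INTEGER NOT NULL,
    text TEXT NOT NULL,
    cfrom INTEGER NOT NULL,
    cto INTEGER NOT NULL
);
"""


def store_sentence(conn, sent_id, text, tokens):
    """Store one sentence and its (text, cfrom, cto) token triples."""
    conn.execute("INSERT INTO sentence (id, text) VALUES (?, ?)", (sent_id, text))
    conn.executemany(
        "INSERT INTO token (sent_id, position, text, cfrom, cto) VALUES (?, ?, ?, ?, ?)",
        [(sent_id, i, tok, cfrom, cto) for i, (tok, cfrom, cto) in enumerate(tokens)],
    )
    conn.commit()


def find_sentences_with_token(conn, word):
    """Return (id, text) for every sentence containing the given token."""
    query = (
        "SELECT DISTINCT s.id, s.text FROM sentence s "
        "JOIN token t ON t.sent_id = s.id WHERE t.text = ? ORDER BY s.id"
    )
    return conn.execute(query, (word,)).fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.executescript(SCHEMA)
store_sentence(
    conn, 1, "I am a sentence.",
    [("I", 0, 1), ("am", 2, 4), ("a", 5, 6), ("sentence", 7, 15), (".", 15, 16)],
)
print(find_sentences_with_token(conn, "sentence"))
```

Keeping tokens in their own table makes it possible to search a corpus by token with an ordinary SQL join instead of scanning every text file.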

Getting Started

Introduction to texttaglib

Installation

texttaglib is available on PyPI and can be installed using pip.

pip install --user texttaglib

Common Recipes

Here are code snippets for common use cases of texttaglib.

Open an ELAN file

>>> from texttaglib import elan
>>> eaf = elan.open_eaf('./data/test.eaf')
>>> eaf
<texttaglib.elan.ELANDoc object at 0x7f67790593d0>

Parse an existing text stream

>>> from texttaglib import elan
>>> with open('./data/test.eaf') as eaf_stream:
...     eaf = elan.parse_eaf_stream(eaf_stream)
...
>>> eaf
<texttaglib.elan.ELANDoc object at 0x7f6778f7a9d0>
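
Because EAF is an XML format, a stream can also be inspected with the standard library alone. The sketch below reads document metadata from an in-memory text stream using `xml.etree.ElementTree`. It does not use texttaglib; the function name `read_eaf_metadata` and the dictionary keys are hypothetical labels chosen for this example.

```python
import io
import xml.etree.ElementTree as ET

# A tiny hand-written EAF header used only for this illustration.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT AUTHOR="demo" DATE="2020-01-01T00:00:00+00:00" FORMAT="3.0" VERSION="3.0">
  <HEADER MEDIA_FILE="" TIME_UNITS="milliseconds">
    <MEDIA_DESCRIPTOR MEDIA_URL="file:///tmp/demo.wav" MIME_TYPE="audio/x-wav" RELATIVE_MEDIA_URL="./demo.wav"/>
  </HEADER>
</ANNOTATION_DOCUMENT>
"""


def _attr(element, name):
    """Return an attribute value, or None when the element is missing."""
    return element.get(name) if element is not None else None


def read_eaf_metadata(stream):
    """Read document-level metadata from an open EAF stream."""
    root = ET.parse(stream).getroot()
    header = root.find("HEADER")
    media = header.find("MEDIA_DESCRIPTOR") if header is not None else None
    return {
        "author": root.get("AUTHOR"),
        "date": root.get("DATE"),
        "format": root.get("FORMAT"),
        "version": root.get("VERSION"),
        "time_units": _attr(header, "TIME_UNITS"),
        "media_url": _attr(media, "MEDIA_URL"),
        "mime_type": _attr(media, "MIME_TYPE"),
        "relative_media_url": _attr(media, "RELATIVE_MEDIA_URL"),
    }


metadata = read_eaf_metadata(io.StringIO(SAMPLE_EAF))
print(metadata["author"], metadata["time_units"], metadata["mime_type"])
```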

Accessing tiers & annotations

for tier in eaf.tiers():
    print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
    for ann in tier.annotations:
        print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")

Accessing nested tiers in ELAN

eaf = elan.open_eaf('./data/test_nested.eaf')
# accessing nested tiers
for tier in eaf.roots:
    print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
    for child_tier in tier.children:
        print(f"    | {child_tier.ID} | Participant: {child_tier.participant} | Type: {child_tier.type_ref}")
        for ann in child_tier.annotations:
            print(f"    |- {ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")
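
In an EAF file, a dependent tier names its parent through the `PARENT_REF` attribute of its `TIER` element. The standard-library sketch below shows how a root/children structure like the one traversed above could be rebuilt from those attributes. It is not texttaglib's implementation, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written tier layout used only for this illustration.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="utterance" LINGUISTIC_TYPE_REF="default-lt"/>
  <TIER TIER_ID="words" PARENT_REF="utterance" LINGUISTIC_TYPE_REF="word-lt"/>
  <TIER TIER_ID="gloss" PARENT_REF="words" LINGUISTIC_TYPE_REF="gloss-lt"/>
  <TIER TIER_ID="notes" LINGUISTIC_TYPE_REF="default-lt"/>
</ANNOTATION_DOCUMENT>
"""


def build_tier_tree(eaf_text):
    """Return (root tier IDs, mapping of tier ID to child tier IDs)."""
    root = ET.fromstring(eaf_text)
    children = {}
    roots = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        children.setdefault(tier_id, [])
        parent_id = tier.get("PARENT_REF")
        if parent_id is None:
            roots.append(tier_id)  # no parent: this is a root tier
        else:
            children.setdefault(parent_id, []).append(tier_id)
    return roots, children


def format_tree(tier_ids, children, depth=0):
    """Return indented lines, one per tier, depth first."""
    lines = []
    for tier_id in tier_ids:
        lines.append("    " * depth + tier_id)
        lines.extend(format_tree(children[tier_id], children, depth + 1))
    return lines


roots, children = build_tier_tree(SAMPLE_EAF)
print("\n".join(format_tree(roots, children)))
```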

Converting ELAN files to CSV

texttaglib includes a command line tool to convert an EAF file into CSV.

python -m texttaglib eaf2csv my_transcript.eaf -o my_transcript.csv

texttaglib APIs

An overview of texttaglib modules.

ELAN support

texttaglib supports reading and manipulating multi-tier transcriptions from ELAN directly.

TTL Interlinear Gloss Format

TTLIG is a human-friendly interlinear gloss format that can be edited using any text editor.

TTL SQLite

TTL supports a SQLite storage format for managing large-scale corpora.
