
Document Model

For converting documents and annotations from one format to another, bconv stores the information in a custom interchange representation. This representation is an attempt at a unified interface compatible with all formats. However, every format has its own limitations and requirements, which cannot always be translated easily into a general representation.

In a simple A-to-B format conversion, users can ignore bconv's document representation. In some cases, however, accessing the Collection or Document objects may be useful, such as:

  • manipulating data during conversion, eg. to fix incompatibilities or remove unwanted elements;
  • using the loader only, eg. for reading annotated data into vectors for training a model;
  • using the formatter only, eg. to export automatically generated annotations.

Examples

Adjust entity annotations during reformatting

Load a collection in BioC JSON, prefix the concept identifiers with "MESH:", then store every document in Brat stand-off format.

import bconv

coll = bconv.load('path/to/collection.json', 'bioc_json')
for entity in coll.iter_entities():
    # Add a prefix to all IDs.
    entity.metadata['cui'] = 'MESH:{}'.format(entity.metadata['cui'])
for doc in coll:
    # Store annotations and text in separate files.
    bconv.dump(doc, '{}.ann'.format(doc.id), fmt='brat', cui='cui')
    bconv.dump(doc, '{}.txt'.format(doc.id), fmt='txt')

Load tokens and labels into vectors

Load a collection of annotated documents given in BioC JSON format and create two NumPy arrays: a matrix of token indices and a vector of binary sentence-level labels.

import itertools as it
from collections import defaultdict
import numpy as np
import bconv

coll = bconv.load('path/to/training_data.json', 'bioc_json')
n_sents = sum(1 for _ in coll.units('sentence'))
sentences = np.zeros((n_sents, 100), dtype=int)  # matrix of token indices
labels = np.zeros(n_sents, dtype=int)            # vector of sentence labels
vocabulary = defaultdict(it.count().__next__)
for i, sentence in enumerate(coll.units('sentence')):
    # Note: you probably want to use the tokenizer and vocabulary associated
    # with your word embeddings instead.
    sentence.tokenize()
    tokens = [vocabulary[tok.text] for tok in sentence]
    sentences[i, :len(tokens)] = tokens
    labels[i] = any(sentence.iter_entities())  # True if there is any entity

Note: when vectorizing token-level annotations (eg. for training an NER system), it is probably easier to process the tabular data in CoNLL format than to work with bconv's document objects.

Export automatic annotations to CoNLL

Construct Document objects to be passed to bconv.dump.

import re
from pathlib import Path
import bconv

sequence = re.compile(r'\b((?P<DNA>[GACT]{5,})|(?P<RNA>[GACU]{5,}))\b')
def nucleotide_sequences(text):
    """Regex-based tagger for literal DNA/RNA sequences."""
    for match in sequence.finditer(text):
        yield bconv.Entity(
            id=None,
            text=match.group(),
            spans=[match.span()],  # must be a list of start/end pairs
            type='DNA' if match.group('DNA') else 'RNA')

def make_document(path):
    with open(path, encoding='utf8') as f:
        title = f.readline()  # assume title is the first line
        body = f.read()
        doc = bconv.Document(path.stem)
        for text, type_ in ((title, 'title'), (body, 'body')):
            doc.add_section(
                type_, text, entities=list(nucleotide_sequences(text)))
    return doc

paths = Path('path/to/examples/').glob('*.txt')
coll = bconv.Collection.from_iterable(map(make_document, paths), id='example')
bconv.dump(coll, 'examples.conll')

API

Document Hierarchy

The document representation is implemented as a hierarchy of units corresponding to Python classes as follows:

Collection
  Document
    Section
      Sentence
        Token

Collections are always organized into documents, which are in turn divided into sections, sentences, and tokens.

Entity annotations are stored in Entity objects, which are always anchored at the sentence level (though they can be iterated over from any higher level). Relation objects can be anchored at document, section, or sentence level.

Public Methods and Attributes

Shared Interface

The following methods and attributes are shared by the Collection, Document, Section and Sentence units (with some exceptions as indicated).

TextUnit.__iter__() -> Iterator[SubUnit]
TextUnit.__len__() -> int
TextUnit.__getitem__(index: int) -> SubUnit

Immutable-sequence methods. Every unit is a sequence of subunits from the next-lower level, ie. a Collection is a sequence of Document objects, every Document is a sequence of Section objects etc. The sequences support iteration (for doc in collection), length check (len(collection)) and access by index (collection[1]).
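
For illustration, a brief sketch of walking the hierarchy with these sequence methods (the input path and format name are placeholders):

import bconv

coll = bconv.load('path/to/collection.json', 'bioc_json')
print(len(coll))            # number of documents
doc = coll[0]               # first document, accessed by index
for section in doc:         # sections of the document
    for sentence in section:
        print(sentence.text)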

TextUnit.units(level: str) -> Iterator[LevelType]

Iterate over units of the specified level, a case-insensitive string naming the desired unit type, eg. "sentence". The level can be the same or lower than that of this unit: collection.units("collection") yields just collection, whereas document.units("sentence") iterates over all sentences of all sections in document.
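
A minimal sketch, assuming coll has been loaded as above:

for sentence in coll.units('sentence'):
    print(sentence.start, sentence.text)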

TextUnit.add_entities(entities: Iterable[Entity], offset: int = None)

Add entity annotations to this unit. If offset is None (the default), character offsets (spans) are recalculated relative to the beginning of the document. If the spans are already relative to the document origin, specify offset=0. Since entities are always anchored at the sentence level, the target sentence is identified based on the entity spans. Note: this method is not defined for the Collection unit.
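
A brief sketch, assuming doc is a Document and the entity spans are already relative to the document origin (text, offsets and metadata fields are placeholders):

entity = bconv.Entity(id=1, text='insulin', spans=[(120, 127)], type='protein')
doc.add_entities([entity], offset=0)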

TextUnit.iter_entities(
    split_discontinuous=False, avoid_gaps=None, avoid_overlaps=None
) -> Iterator[Entity]

Iterate over all Entity objects at this unit and all its subunits. Discontinuous and overlapping annotations may be flattened on the fly (non-permanently) with the avoid_gaps and avoid_overlaps parameters (see Entity-Flattening). The legacy flag split_discontinuous is kept for backwards compatibility; setting it to True is equivalent to specifying avoid_gaps="split". The entities are yielded in occurrence order; sorting is applied after flattening.
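
For example, a sketch that splits discontinuous annotations on the fly while leaving overlapping ones untouched:

for entity in doc.iter_entities(avoid_gaps='split'):
    print(entity.start, entity.end, entity.text)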

TextUnit.iter_relations() -> Iterator[Relation]

Iterate over all Relation objects at this unit and all its subunits. The iteration order is deterministic, but not connected to the relation members' position in the document.

TextUnit.text: str

Read-only attribute for the entire text of this unit in a single string.

TextUnit.metadata: Dict[str, str]

Read/write attribute for arbitrary metadata. Note that some keys are interpreted specially in some formats, eg. collection-level "source", "date", and "key" in BioC; most formats, however, largely ignore metadata.
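
For example, the BioC-specific collection-level keys mentioned above could be set like this (the values are placeholders):

coll.metadata['source'] = 'PubMed'
coll.metadata['date'] = '20210524'
coll.metadata['key'] = 'collection.key'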

TextUnit.relations: Iterable[Relation]

Read/write attribute for Relation objects anchored at this unit. Unlike iter_relations(), this does not touch relations from subunits. Note: this attribute is not defined for the Collection unit.

Collection

Collection(id: int|str, filename: str|Path = None, **metadata)

Constructor for an empty collection that may be populated later.

Collection.from_iterable(
    documents: Iterable[Document], id: int|str, filename: str|Path = None
) -> Collection

Classmethod for creating a collection from an iterable of Document objects. More documents may be appended later.

Collection.add_document(document: Document) -> Document

Append a Document object to the end of this collection.

Collection.get_document(id: int|str) -> Document

Retrieve a document by its ID. If IDs are non-unique, the last-added document will be returned. If there is no document with the given ID, a KeyError is raised.
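
A brief sketch (the document ID is a placeholder):

try:
    doc = coll.get_document('12345')
except KeyError:
    doc = None  # no document with this ID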

Collection.id: int|str
Collection.filename: Optional[str|Path]

Read/write attributes corresponding to the constructor arguments.

Document

Document(id: int|str, filename: str|Path = None, **metadata)

Constructor for an empty document to be populated later.

Document.add_section(
    type: str,
    text: str|Iterable[str],
    offset: int = None,
    entities: Sequence[Entity] = (),
    entity_offset: int = None,
    **metadata) -> Section

Add a section to this document.
The section type is something like "Title", "Abstract", "Introduction" and is stored in the section's metadata.
The text can be given in different forms: if it is a single str, bconv performs sentence splitting (cf. the tokenization documentation); if the text has already been split into sentences, an iterable of str may be provided instead.
The start offset of the new section can be set through offset; if offset is None, it is set by bconv based on the length of the preceding sections.
A sequence of Entity objects can be provided as well, which will be added to the corresponding sentences.
The entity_offset argument works the same as offset in add_entities(): use it if the entity spans are not relative to the beginning of the added section text.
Arbitrary key-value pairs can be passed as metadata.
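
A minimal sketch, adding a title and a pre-split abstract with one entity whose spans are relative to the added section text (all contents are placeholders):

doc = bconv.Document('12345')
doc.add_section('title', 'Insulin signalling in fruit flies.\n')
abstract = ['Insulin is a peptide hormone. ', 'It regulates glucose uptake. ']
entities = [bconv.Entity(id=1, text='Insulin', spans=[(0, 7)], type='protein')]
doc.add_section('abstract', abstract, entities=entities)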

Document.id: int|str
Document.filename: Optional[str|Path]

Read/write attributes corresponding to the constructor arguments.

Section

The Section unit represents any division of a document, such as a paragraph or an article section. Do not directly instantiate Section objects from their constructor, but use Document.add_section() instead.

Section.start: int
Section.end: int

Read-only attributes for the text range (character offsets) relative to the document start.

Sentence

Sentences are automatically created from sections through sentence-boundary detection (cf. the tokenization documentation). To create Sentence units from a list of strings, pass it to Document.add_section() as the text parameter. Do not directly instantiate Sentence objects from their constructor.

Sentence.start: int
Sentence.end: int

Read-only attributes for the text range (character offsets) relative to the document start.

Token

Tokens are the smallest textual units (think: a word) created by bconv by splitting sentences at whitespace and before/after punctuation characters (cf. the tokenization documentation). Token units are minimal objects with a few attributes but none of the methods of the other units.
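
A small sketch printing tokens with their offsets, assuming the sentence has been tokenized (cf. Sentence.tokenize() in the vectorization example above):

sentence.tokenize()
for token in sentence:
    print(token.start, token.end, token.text)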

Token.text: str

Read-only attribute for the value of this token as a str.

Token.start: int
Token.end: int

Read-only attributes for the text range (character offsets) relative to the document start.

Entity

Entity annotations assign metadata (eg. a concept identifier or type) to a textual expression. The textual expression may contain gaps and be arbitrarily long, but it must not cross sentence boundaries.

Entity(
    id: Optional[int|str],
    text: str,
    spans: List[Tuple[int, int]],
    meta: Optional[Dict[str, str]],
    **metadata)

Constructor for an entity annotation.
The id is an identifier for each particular annotation instance within the enclosing document (i.e. it is not a concept identifier/reference to a controlled vocabulary). Its value is used by some output formats (eg. BioC), but ignored by others (eg. Brat, which requires consecutive IDs prefixed with "T"). If relations are present, the entity ID should be unique throughout the document.
The text value must exactly match the annotated span in the document (for multi-span entities, gap symbols like "..." are permitted).
The spans must be a sequence of start–end pairs, even for single-span entities.
Any additional information, such as a concept identifier or entity type, can be passed as a mapping of key–value pairs to the meta parameter or directly as keyword arguments. bconv is agnostic with regard to the extent and spelling of metadata fields; however, many output formats may require some configuration to get the desired result (eg. the meta parameter for PubTator).
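
A minimal sketch of a single-span entity whose concept identifier and type are passed as keyword metadata (field names and values are placeholders):

entity = bconv.Entity(
    id=7,
    text='insulin',
    spans=[(42, 49)],
    type='protein',
    cui='MESH:D000001')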

Entity.id: Optional[int|str]
Entity.text: str
Entity.spans: List[Tuple[int, int]]
Entity.metadata: Dict[str, str]

Read/write attributes corresponding to the constructor arguments. Note: once an entity has been added to a document, do not modify its spans value.

Entity.start: int
Entity.end: int

Read-only attributes for the outer boundaries of an entity, ie. entity.spans[0][0] and entity.spans[-1][1].

Relation

Relations describe a connection between a number of members, which are either entities or other relations. bconv does not restrict the number of members; as in BioC, relations can have more than two members, or even be unary or member-less. However, some formats are more restrictive, eg. PubAnnotation JSON only allows binary relations.

Relation(id: Optional[int|str], members: Iterable[Tuple[int|str, str]], **metadata)

Constructor for a relation. As for entities, the ID is optional in general, but needs to be defined and unique if this relation is referenced in another relation. The relation members are expected as an iterable of <RefID, Role> pairs.
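
A minimal sketch of a binary relation between two previously created entities (the entity IDs, role names and metadata key are placeholders):

relation = bconv.Relation(
    id='R1',
    members=[(7, 'agent'), (8, 'theme')],
    type='interaction')

The relation can then be anchored by way of the relations attribute of a document, section, or sentence (cf. TextUnit.relations above).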

Relation.add_member(refid: int|str, role: str) -> RelationMember

Add a member to this relation. The reference ID refid must refer to an existing entity or another relation. The role is a free-form string describing the function of this member within the relation (eg. "cause").

Relation.__iter__() -> Iterator[RelationMember]
Relation.__len__() -> int
Relation.__getitem__(index: int) -> RelationMember

Immutable-sequence methods: a relation is a sequence of RelationMember objects.

Relation.id: Optional[int|str]

Read/write attribute for the relation ID.

Relation.metadata: Dict[str, str]

Read/write attribute for arbitrary key–value pairs. Like for the text units, metadata are ignored by most formats.

RelationMember

RelationMember objects have two attributes, refid and role.

RelationMember.refid: int|str

Read-only attribute referencing an entity or another relation by ID.

RelationMember.role: str

Read-only attribute describing this member's role in the relation.