Doc Model
For converting documents and annotations from one format to another, bconv stores the information in a custom interchange representation.
This representation is an attempt at a unified interface compatible with all formats.
However, every format has different limitations and requirements, which cannot always be translated easily into a general representation.
In a simple A-to-B format conversion, users can ignore bconv's document representation.
In some cases, however, accessing the Collection or Document objects may be useful, such as:
- manipulating data during conversion, e.g. to fix incompatibilities or remove unwanted elements;
- using the loader only, e.g. for reading annotated data into vectors for training a model;
- using the formatter only, e.g. to export automatically generated annotations.
Load a collection in BioC JSON, prefix the concept identifiers with "MESH:", then store every document in Brat stand-off format.
import bconv

coll = bconv.load('path/to/collection.json', 'bioc_json')
for entity in coll.iter_entities():
    # Add a prefix to all IDs.
    entity.metadata['cui'] = 'MESH:{}'.format(entity.metadata['cui'])
for doc in coll:
    # Store annotations and text in separate files.
    bconv.dump(doc, '{}.ann'.format(doc.id), fmt='brat', cui='cui')
    bconv.dump(doc, '{}.txt'.format(doc.id), fmt='txt')
Load a collection of annotated documents given in BioC JSON format and create two numpy arrays, a matrix of token indices and a vector of binary sentence-level labels.
import itertools as it
from collections import defaultdict

import numpy as np
import bconv

MAX_LEN = 100  # sentences with more tokens are truncated

coll = bconv.load('path/to/training_data.json', 'bioc_json')
n_sents = sum(1 for _ in coll.units('sentence'))
sentences = np.zeros((n_sents, MAX_LEN), dtype=int)  # matrix of token indices
labels = np.zeros(n_sents, dtype=int)                # vector of sentence labels
# Start counting at 1 so that index 0 stays reserved for padding.
vocabulary = defaultdict(it.count(1).__next__)
for i, sentence in enumerate(coll.units('sentence')):
    # Note: you probably want to use the tokenizer and vocabulary associated
    # with your word embeddings instead.
    sentence.tokenize()
    tokens = [vocabulary[tok.text] for tok in sentence][:MAX_LEN]
    sentences[i, :len(tokens)] = tokens
    labels[i] = any(sentence.iter_entities())  # True if there is any entity
Note: when vectorizing token-level annotations (e.g. for training an NER system), it is probably easier to process the tabular data in CoNLL format rather than working with bconv's document objects.
Construct Document objects to be passed to bconv.dump.
import re
from pathlib import Path

import bconv

sequence = re.compile(r'\b((?P<DNA>[GACT]{5,})|(?P<RNA>[GACU]{5,}))\b')

def nucleotide_sequences(text):
    """Regex-based tagger for literal DNA/RNA sequences."""
    for match in sequence.finditer(text):
        yield bconv.Entity(
            id=None,
            text=match.group(),
            spans=[match.span()],  # must be a list of start/end pairs
            type='DNA' if match.group('DNA') else 'RNA')

def make_document(path):
    with open(path, encoding='utf8') as f:
        title = f.readline()  # assume title is the first line
        body = f.read()
    doc = bconv.Document(path.stem)
    for text, type_ in ((title, 'title'), (body, 'body')):
        doc.add_section(
            type_, text, entities=list(nucleotide_sequences(text)))
    return doc

paths = Path('path/to/examples/').glob('*.txt')
coll = bconv.Collection.from_iterable(map(make_document, paths), id='example')
bconv.dump(coll, 'examples.conll')
The document representation is implemented as a hierarchy of units corresponding to Python classes as follows:
- Collection
  - Document
    - Section
      - Sentence
        - Token
Collections are always organized into documents, which are recursively divided into sections, sentences, and tokens.
Entity annotations are stored in Entity objects, which are always anchored at the sentence level (though they can be iterated over from any higher level).
Relation objects can be anchored at document, section, or sentence level.
The following methods and attributes are shared by the Collection, Document, Section and Sentence units (with some exceptions as indicated).
TextUnit.__iter__() -> Iterator[SubUnit]
TextUnit.__len__() -> int
TextUnit.__getitem__(index: int) -> SubUnit
Immutable-sequence methods. Every unit is a sequence of subunits from the next-lower level, i.e. a Collection is a sequence of Document objects, every Document is a sequence of Section objects, etc. The sequences support iteration (for doc in collection), length check (len(collection)) and access by index (collection[1]).
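The sequence behaviour described above can be illustrated without bconv itself. The following is a minimal sketch using a hypothetical stand-in class named Unit (not part of bconv) built on Python's collections.abc.Sequence; it only mirrors the three operations named in the text.

```python
from collections.abc import Sequence


class Unit(Sequence):
    """Hypothetical stand-in for a text unit: an immutable sequence of subunits."""

    def __init__(self, name, subunits=()):
        self.name = name
        self._subunits = tuple(subunits)

    def __getitem__(self, index):
        return self._subunits[index]

    def __len__(self):
        return len(self._subunits)


# Build a tiny collection > document > section tree.
collection = Unit('coll', [
    Unit('doc-1', [Unit('title'), Unit('body')]),
    Unit('doc-2', [Unit('title')]),
])

doc_names = [doc.name for doc in collection]  # iteration
n_docs = len(collection)                      # length check
second = collection[1]                        # access by index
```

With real bconv objects, the same three operations apply to Collection, Document, Section and Sentence units alike.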
TextUnit.units(level: str) -> Iterator[LevelType]
Iterate over units of the specified level, a case-insensitive string naming the desired unit type, e.g. "sentence". The level can be the same as or lower than that of this unit: collection.units("collection") yields just collection, whereas document.units("sentence") iterates over all sentences of all sections in document.
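To make the level-based traversal concrete, here is a rough sketch of how such a method could work on a toy hierarchy. The Unit class and the LEVELS list are hypothetical illustrations, not bconv's implementation; in particular, raising ValueError for a higher level is an assumption of this sketch.

```python
LEVELS = ['collection', 'document', 'section', 'sentence', 'token']


class Unit:
    """Hypothetical stand-in for a text unit (not bconv's implementation)."""

    def __init__(self, level, subunits=()):
        self.level = level
        self.subunits = list(subunits)

    def units(self, level):
        """Yield this unit or its descendants at the requested level."""
        target = LEVELS.index(level.lower())   # case-insensitive level name
        current = LEVELS.index(self.level)
        if target < current:
            # Assumption of this sketch: higher levels are rejected.
            raise ValueError('cannot iterate over a higher level')
        if target == current:
            yield self
        else:
            for sub in self.subunits:
                yield from sub.units(level)


sentences = [Unit('sentence') for _ in range(3)]
document = Unit('document', [
    Unit('section', sentences[:2]),
    Unit('section', sentences[2:]),
])
collection = Unit('collection', [document])

same_level = list(collection.units('collection'))   # just the collection itself
all_sentences = list(document.units('Sentence'))    # across both sections
```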
TextUnit.add_entities(entities: Iterable[Entity], offset: int = None)
Add entity annotations to this unit. If offset is None (the default), character offsets (spans) are recalculated relative to the beginning of the document. If the spans are already relative to the document origin, specify offset=0. Since entities are always anchored at the sentence level, the target sentence is identified based on the entity spans. Note: this method is not defined for the Collection unit.
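The two steps mentioned here, shifting spans to document-relative offsets and locating the enclosing sentence, can be sketched in plain Python. The helper names to_document_spans and find_sentence are hypothetical, and the sketch assumes that sentences are given as (start, end) character ranges; it does not reproduce bconv's internal logic.

```python
def to_document_spans(spans, offset):
    """Shift (start, end) pairs by `offset` to make them document-relative."""
    return [(start + offset, end + offset) for start, end in spans]


def find_sentence(sentence_ranges, spans):
    """Return the index of the sentence that fully contains the entity."""
    start, end = spans[0][0], spans[-1][1]   # outer boundaries of the entity
    for i, (s_start, s_end) in enumerate(sentence_ranges):
        if s_start <= start and end <= s_end:
            return i
    raise ValueError('entity does not fit into a single sentence')


section_start = 120                  # hypothetical start of the unit
local_spans = [(4, 9), (15, 20)]     # spans relative to the unit
doc_spans = to_document_spans(local_spans, section_start)
unchanged = to_document_spans(local_spans, 0)   # spans already document-relative

sentence_ranges = [(100, 118), (120, 160)]      # hypothetical sentence offsets
target = find_sentence(sentence_ranges, doc_spans)
```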
TextUnit.iter_entities(
    split_discontinuous=False, avoid_gaps=None, avoid_overlaps=None
) -> Iterator[Entity]
Iterate over all Entity objects at this unit and all its subunits. Discontinuous and overlapping annotations may be flattened on the fly (non-permanently) with the avoid_gaps and avoid_overlaps parameters (see Entity-Flattening). The legacy flag split_discontinuous is kept for backwards compatibility; setting it to True is equivalent to specifying avoid_gaps="split". The entities are yielded in occurrence order; sorting is applied after flattening.
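As a rough illustration of what splitting a discontinuous annotation could look like, the sketch below breaks every multi-span entity into one single-span piece per span and sorts the result afterwards. The dict-based entities and the function iter_split are hypothetical; the exact behaviour of avoid_gaps="split" is described in the Entity-Flattening documentation and may differ in detail.

```python
def iter_split(entities, text):
    """Return single-span copies of all entities, sorted by position."""
    pieces = []
    for entity in entities:
        for start, end in entity['spans']:
            pieces.append({
                'spans': [(start, end)],
                'text': text[start:end],
                'type': entity['type'],
            })
    # Sorting happens after flattening, since splitting can change the order.
    return sorted(pieces, key=lambda piece: piece['spans'][0])


text = 'left and right ventricle, aorta'
entities = [
    {'spans': [(0, 4), (15, 24)], 'type': 'anatomy'},   # "left ... ventricle"
    {'spans': [(9, 24)], 'type': 'anatomy'},            # "right ventricle"
]
flat = iter_split(entities, text)
flat_texts = [piece['text'] for piece in flat]
```

The input list is left untouched, which mirrors the non-permanent nature of on-the-fly flattening.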
TextUnit.iter_relations() -> Iterator[Relation]
Iterate over all Relation objects at this unit and all its subunits. The iteration order is deterministic, but not connected to the relation members' position in the document.
TextUnit.text: str
Read-only attribute for the entire text of this unit in a single string.
TextUnit.metadata: Dict[str, str]
Read/write attribute for arbitrary metadata. Note that some keys are interpreted specially in some formats, e.g. collection-level "source", "date", and "key" in BioC. In general, however, most formats ignore metadata.
TextUnit.relations: Iterable[Relation]
Read/write attribute for Relation objects anchored at this unit. Unlike iter_relations(), this does not touch relations from subunits. Note: this attribute is not defined for the Collection unit.
Collection(id: int|str, filename: str|Path = None, **metadata)
Constructor for an empty collection that may be populated later.
Collection.from_iterable(
    documents: Iterable[Document], id: int|str, filename: str|Path = None
) -> Collection
Classmethod for creating a collection from an iterable of Document objects. More documents may be appended later.
Collection.add_document(document: Document) -> Document
Append a Document object to the end of this collection.
Collection.get_document(id: int|str) -> Document
Retrieve a document by its ID. If IDs are non-unique, the last-added document will be returned. If there is no document with the given ID, a KeyError is raised.
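The lookup semantics described here (last-added document wins, KeyError for unknown IDs) match what a plain dictionary index would give. The class DocumentIndex below is a hypothetical stand-in to illustrate that behaviour; it is not how bconv is necessarily implemented.

```python
class DocumentIndex:
    """Hypothetical stand-in for ID-based document lookup (not bconv code)."""

    def __init__(self):
        self._docs = []     # documents in insertion order
        self._by_id = {}    # ID -> most recently added document

    def add_document(self, doc_id, doc):
        self._docs.append(doc)
        self._by_id[doc_id] = doc   # a later duplicate ID overwrites the entry
        return doc

    def get_document(self, doc_id):
        return self._by_id[doc_id]  # raises KeyError for unknown IDs


index = DocumentIndex()
index.add_document('PMID-1', 'first version')
index.add_document('PMID-1', 'second version')
found = index.get_document('PMID-1')

try:
    index.get_document('PMID-2')
    missing_raises = False
except KeyError:
    missing_raises = True
```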
Collection.id: int|str
Collection.filename: Optional[str|Path]
Read/write attributes corresponding to the constructor arguments.
Document(id: int|str, filename: str|Path = None, **metadata)
Constructor for an empty document to be populated later.
Document.add_section(
    type: str,
    text: str|Iterable[str],
    offset: int = None,
    entities: Sequence[Entity] = (),
    entity_offset: int = None,
    **metadata) -> Section
Add a section to this document.
The section type is something like "Title", "Abstract", "Introduction" and is stored in the section's metadata.
The text can be given in a variety of types: if it is a single str, sentence splitting is performed by bconv (cf. the tokenization documentation). However, if the text has already been split into sentences, an iterable of str may be provided.
The start offset of the new section can be set through offset; if offset is None, it is set by bconv based on the length of the preceding sections.
A sequence of Entity objects can be provided as well, which will be added to the corresponding sentences.
The entity_offset argument works the same as offset in add_entities(): use it if the entity spans are not relative to the beginning of the added section text.
Arbitrary key-value pairs can be passed as metadata.
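The automatic start offset can be pictured as a running total over the preceding section texts. The sketch below uses a hypothetical helper section_ranges and assumes that a section without an explicit offset starts exactly where the previous one ended, with no separator in between; bconv's actual rule may differ.

```python
def section_ranges(texts, offsets=None):
    """Return a (start, end) pair for each section text.

    An explicit offset is used where given; otherwise the section starts
    where the previous one ended (an assumption of this sketch).
    """
    offsets = offsets or [None] * len(texts)
    ranges = []
    previous_end = 0
    for text, offset in zip(texts, offsets):
        start = previous_end if offset is None else offset
        end = start + len(text)
        ranges.append((start, end))
        previous_end = end
    return ranges


texts = ['Title\n', 'Body text.']
automatic = section_ranges(texts)                      # offsets derived from lengths
explicit = section_ranges(texts, offsets=[None, 10])   # second section starts at 10
```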
Document.id: int|str
Document.filename: Optional[str|Path]
Read/write attributes corresponding to the constructor arguments.
The Section unit represents any division of a document, such as a paragraph or an article section.
Do not directly instantiate Section objects from their constructor, but use Document.add_section() instead.
Section.start: int
Section.end: int
Read-only attributes for the text range (character offsets) relative to the document start.
Sentences are automatically created from sections through sentence-boundary detection (cf. the tokenization documentation).
To create Sentence units from a list of strings, pass it to Document.add_section() as the text parameter.
Do not directly instantiate Sentence objects from their constructor.
Sentence.start: int
Sentence.end: int
Read-only attributes for the text range (character offsets) relative to the document start.
Tokens are the smallest textual units (think: a word) created by bconv by splitting sentences at whitespace and before/after punctuation characters (cf. the tokenization documentation).
Token units are minimal objects with a few attributes but none of the methods of the other units.
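A splitting rule of this kind can be approximated with a regular expression. The following sketch is only an approximation under the assumption that every punctuation character becomes its own token; the names TOKEN and tokenize are hypothetical, and bconv's actual tokenizer (see the tokenization documentation) may treat some cases differently.

```python
import re

# A token is either a run of word characters or a single punctuation character.
TOKEN = re.compile(r'\w+|[^\w\s]')


def tokenize(sentence, start=0):
    """Return (text, start, end) triples with document-relative offsets."""
    return [
        (match.group(), start + match.start(), start + match.end())
        for match in TOKEN.finditer(sentence)
    ]


# Hypothetical sentence starting at character 40 of its document.
tokens = tokenize('IL-2 binds (weakly).', start=40)
token_texts = [text for text, _, _ in tokens]
```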
Token.text: str
Read-only attribute for the value of this token as a str.
Token.start: int
Token.end: int
Read-only attributes for the text range (character offsets) relative to the document start.
Entity annotations assign metadata (e.g. a concept identifier or type) to a textual expression. The textual expression may contain gaps and be arbitrarily long, but it must not cross sentence boundaries.
Entity(
    id: Optional[int|str],
    text: str,
    spans: List[Tuple[int, int]],
    meta: Optional[Dict[str, str]],
    **metadata)
Constructor for an entity annotation.
The id is an identifier for each particular annotation instance within the enclosing document (i.e. it is not a concept identifier/reference to a controlled vocabulary). Its value is used by some output formats (e.g. BioC), but ignored by others (e.g. Brat, which requires consecutive IDs prefixed with "T"). If relations are present, the entity ID should be unique throughout the document.
The text value must exactly match the annotated span in the document (for multi-span entities, gap symbols like "..." are permitted).
The spans must be a sequence of start–end pairs, even for single-span entities.
Any additional information, such as concept identifier or entity type, can be passed as a mapping of key–value pairs to the meta parameter or directly as keyword arguments. bconv is agnostic with respect to the extent and spelling of metadata fields; however, many output formats may require some configuration to get the desired result (e.g. the meta parameter for PubTator).
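Before constructing entities, it can help to verify that the text and spans agree with the document text. The check below is a small sketch with a hypothetical helper text_matches; it deliberately covers single-span entities only, since the handling of gap symbols in multi-span entities is not specified here.

```python
def text_matches(doc_text, text, spans):
    """Check a single-span entity's text against the document text."""
    if len(spans) != 1:
        raise ValueError('this sketch only handles single-span entities')
    start, end = spans[0]
    return doc_text[start:end] == text


doc_text = 'Mutations in BRCA1 increase risk.'
good = text_matches(doc_text, 'BRCA1', [(13, 18)])
bad = text_matches(doc_text, 'BRCA1', [(12, 18)])   # span includes a leading space
```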
Entity.id: Optional[int|str]
Entity.text: str
Entity.spans: List[Tuple[int, int]]
Entity.metadata: Dict[str, str]
Read/write attributes corresponding to the constructor arguments. Note: once an entity has been added to a document, do not alter its spans value anymore.
Entity.start: int
Entity.end: int
Read-only attributes for the outer boundaries of an entity, i.e. entity.spans[0][0] and entity.spans[-1][1].
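The relationship between spans and the outer boundaries can be shown with a small stand-in class. SpanHolder is hypothetical and only models the spans attribute; the sketch assumes the spans are listed in ascending order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpanHolder:
    """Hypothetical stand-in holding only the spans of an entity."""
    spans: List[Tuple[int, int]]

    @property
    def start(self) -> int:
        return self.spans[0][0]    # start of the first span

    @property
    def end(self) -> int:
        return self.spans[-1][1]   # end of the last span


entity = SpanHolder(spans=[(12, 16), (27, 36)])   # a discontinuous entity
bounds = (entity.start, entity.end)
```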
Relations describe a connection between a number of members, which are either entities or other relations.
The number of members is not restricted by bconv; as in BioC, relations can have more than two members or even be unary or member-less.
However, some formats are more restrictive; e.g. PubAnnotation JSON only allows binary relations.
Relation(id: Optional[int|str], members: Iterable[Tuple[int|str, str]], **metadata)
Constructor for a relation. As for entities, the ID is optional in general, but needs to be defined and unique if this relation is referenced in another relation. The relation members are expected as an iterable of <RefID, Role> pairs.
Relation.add_member(refid: int|str, role: str) -> RelationMember
Add a member to this relation. The reference ID refid must refer to an existing entity or another relation. The role is a free-form string describing the function of this member within the relation (e.g. "cause").
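Because members can reference other relations, following the references leads to a nested structure. The sketch below models relations as plain lists of (refid, role) pairs in a dictionary and resolves them recursively; the function leaf_entities and all IDs and roles are hypothetical, and any ID that is not a relation key is simply treated as an entity ID.

```python
def leaf_entities(refid, relations):
    """Collect the IDs of all entities reachable from `refid`."""
    if refid not in relations:
        return [refid]            # not a relation, so treat it as an entity ID
    found = []
    for member_id, _role in relations[refid]:
        found.extend(leaf_entities(member_id, relations))
    return found


relations = {
    'R1': [('T1', 'cause'), ('T2', 'effect')],   # binary relation
    'R2': [('R1', 'claim'), ('T3', 'source')],   # references another relation
    'R3': [],                                    # member-less relation
}
nested = leaf_entities('R2', relations)
empty = leaf_entities('R3', relations)
```

This is also why a relation that is referenced elsewhere needs a defined and unique ID.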
Relation.__iter__() -> Iterator[RelationMember]
Relation.__len__() -> int
Relation.__getitem__(index: int) -> RelationMember
Immutable-sequence methods: a relation is a sequence of RelationMember objects.
Relation.id: Optional[int|str]
Read/write attribute for the relation ID.
Relation.metadata: Dict[str, str]
Read/write attribute for arbitrary key–value pairs. As with the text units, metadata are ignored by most formats.
RelationMember objects have two attributes, refid and role.
RelationMember.refid: int|str
Read-only attribute referencing an entity or another relation by ID.
RelationMember.role: str
Read-only attribute describing this member's role in the relation.