CSV
Comma- or tab-separated values (CSV/TSV) is a straightforward tabular format for annotations.
The data are organised into rows delimited by line breaks and fields separated by a specific character, such as comma (CSV) or tab (TSV).
For each annotated entity, there is a separate row with information about its position in the document (section type, character offset) and metadata (entity type, identifier etc.).
CSV is well understood by spreadsheet applications like MS Excel, whereas TSV plays well with Unix command-line tools like `cut` and `sort`.
Besides the `tsv` format for annotations only, there is a `text_tsv` variant which includes the remaining text of the documents.
It features a row for every text token that is not part of an annotation, including positional information, but with empty fields for the entity metadata.
In case of overlapping entities and sub-word annotations, (parts of) tokens will be repeated.
This is the main difference compared to the CoNLL format, where the linearity of the input sequence is preserved, at the cost of annotations potentially being simplified.
The fields are as follows:
doc_id section sent_id entity_id start end term
Additional fields can be added by specifying keys for the entity metadata through the `fields` parameter.
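As a rough illustration of this row layout, the following sketch reads annotation rows back with the standard-library `csv` module. The `Row` type and the `read_rows` helper are hypothetical names introduced here; they are not part of bconv. The sketch assumes the seven default fields come first and ignores any additional metadata columns.

```python
import csv
import io
from typing import NamedTuple


class Row(NamedTuple):
    """One row with the seven default fields (hypothetical helper type)."""
    doc_id: str
    section: str
    sent_id: int
    entity_id: str  # kept as text, since it may be empty in text_* variants
    start: int
    end: int
    term: str


def read_rows(stream, **fmtparams):
    """Yield a Row per line; columns beyond the seventh are ignored."""
    for rec in csv.reader(stream, **fmtparams):
        doc_id, section, sent_id, entity_id, start, end, term = rec[:7]
        yield Row(doc_id, section, int(sent_id), entity_id,
                  int(start), int(end), term)


sample = io.StringIO(
    "354896,Title,1,1,0,9,Lidocaine\n"
    "354896,Title,1,2,18,34,cardiac asystole\n"
)
rows = list(read_rows(sample))
print(rows[1].term, rows[1].start, rows[1].end)
```

For TSV input, passing `delimiter="\t", quoting=csv.QUOTE_NONE` to `read_rows` would be the matching choice, given that the TSV variants are written without quoting.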
For formatting the comma-/tab-separated values, `bconv` relies on the standard-library `csv` module.
The `csv` and `text_csv` formats use the default settings, i.e. commas for field delimiting and double quotes for protecting commas inside field values.
The `tsv` and `text_tsv` formats set the following formatting parameters: `lineterminator="\n"`, `delimiter="\t"`, `quotechar=None`, i.e. tab delimiters and no protection/escaping mechanism for potential separator characters inside values.
Instead, tab and newline characters inside annotations are (irreversibly) replaced with space characters.
All formats accept additional formatting parameters as keyword arguments, which are passed directly to the `csv.writer()` constructor.
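The effect of these settings can be reproduced with the `csv` module alone. The sketch below is not bconv's implementation: `TSV_PARAMS`, `sanitize` and `write_tsv` are hypothetical names, `quoting=csv.QUOTE_NONE` is spelled out here as an assumption to make the no-quoting intent explicit, and replacing carriage returns along with tabs and newlines is likewise an assumption.

```python
import csv
import io

# The three TSV settings named in the text, plus an explicit QUOTE_NONE
# (an assumption added for clarity, not taken from the documentation).
TSV_PARAMS = dict(lineterminator="\n", delimiter="\t", quotechar=None,
                  quoting=csv.QUOTE_NONE)

# Map tab, carriage return and newline to a space each.
_WHITESPACE = str.maketrans("\t\r\n", "   ")


def sanitize(value):
    """Replace tabs and line breaks with spaces (irreversible)."""
    return str(value).translate(_WHITESPACE)


def write_tsv(rows, stream, **fmtparams):
    """Write rows as TSV; extra keyword args override the defaults."""
    writer = csv.writer(stream, **{**TSV_PARAMS, **fmtparams})
    for row in rows:
        writer.writerow([sanitize(v) for v in row])


buf = io.StringIO()
write_tsv([("354896", "Title", 1, 2, 18, 34, "cardiac\tasystole")], buf)
print(repr(buf.getvalue()))

# For contrast: the default CSV dialect protects a comma with double quotes.
csv_buf = io.StringIO()
csv.writer(csv_buf).writerow(["354896", "Title", "a, b"])
print(repr(csv_buf.getvalue()))
```

Without quoting or an escape character, `csv.writer` has no way to represent a tab inside a value, which is why the sketch sanitizes values before writing them.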
354896,Title,1,1,0,9,Lidocaine
354896,Title,1,2,18,34,cardiac asystole
354896,Abstract,2,3,90,99,lidocaine
354896,Abstract,2,4,142,152,depression
354896,Abstract,3,5,331,347,bradyarrhythmias
354896,Abstract,3,6,409,418,lidocaine
354896 Title 1 1 0 9 Lidocaine
354896 Title 1 2 18 34 cardiac asystole
354896 Abstract 2 3 90 99 lidocaine
354896 Abstract 2 4 142 152 depression
354896 Abstract 3 5 331 347 bradyarrhythmias
354896 Abstract 3 6 409 418 lidocaine
354896 Title 1 1 0 9 Lidocaine
354896 Title 1 9 10 -
354896 Title 1 10 17 induced
354896 Title 1 2 18 34 cardiac asystole
354896 Title 1 34 35 .
354896 Abstract 2 36 47 Intravenous
354896 Abstract 2 48 62 administration
354896 Abstract 2 63 65 of
...
An RFC memo (RFC 4180) exists for the general CSV format.
A very short definition of the general TSV format is given by the IANA.
The specific selection of fields used by `bconv` does not follow any standard.
- Document structure: Due to the document ID in the first field, the CSV/TSV format can be used for both single documents and multi-doc collections. The fields "section" (section type) and "sent_id" (sentence counter) provide information on document-internal structuring.
- Metadata: Only the document IDs and the section type (if available) are represented in the format.
- Entity annotations: By default, only positional information and an annotation ID are given. Through the `fields` parameter, additional entity information can be written to the output.
- Whitespace: If annotations span more than one word, the "term" field may contain whitespace. Since line breaks and tab characters are not allowed in the TSV format, they are replaced with spaces. In the CSV format, they are protected with quoting or escaping as appropriate.
- Offsets: Character offsets are included in every row.
- Discontinuous spans: Annotations with multiple spans are subject to entity flattening. By default, sub-spans are split into separate rows, but with a shared entity ID (fourth column).
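
Because split sub-spans share an entity ID, a consumer can reassemble them by grouping rows on the first and fourth columns. The following is a minimal sketch under stated assumptions: `group_spans` is a hypothetical helper, the sample rows are invented for illustration, and rows for plain text tokens (in the `text_csv`/`text_tsv` variants) are assumed to leave the entity ID empty.

```python
import csv
import io
from collections import defaultdict


def group_spans(stream, **fmtparams):
    """Map (doc_id, entity_id) to a list of (start, end, term) sub-spans."""
    entities = defaultdict(list)
    for rec in csv.reader(stream, **fmtparams):
        doc_id, _section, _sent_id, entity_id, start, end, term = rec[:7]
        if entity_id:  # skip rows without an entity ID (assumed: plain tokens)
            entities[doc_id, entity_id].append((int(start), int(end), term))
    return dict(entities)


# Invented rows: entity 7 consists of two sub-spans, entity 8 of one.
sample = io.StringIO(
    "d1,Abstract,1,7,10,14,left\n"
    "d1,Abstract,1,7,25,31,kidney\n"
    "d1,Abstract,1,8,19,31,right kidney\n"
)
spans = group_spans(sample)
print(spans[("d1", "7")])
```

Keying on the document ID as well as the entity ID keeps the grouping safe for multi-document collections, in case entity IDs are not unique across documents.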

fmt | csv |
---|---|
supports text | no |
supports annotations | yes |
stream type | text |

name | type | default | purpose |
---|---|---|---|
`fields` | `Sequence[str]` | `()` | keys in `Entity.metadata` for the additional fields |
`include_header` | `bool` | `False` | add a header line |
`avoid_gaps` | `str` | `'split'` | suppress discontinuous spans |
`avoid_overlaps` | `str` | `None` | suppress annotation collisions |
`**fmtparams` | `Dict[str, str]` | `{}` | keyword args directly passed to `csv.writer()` |

fmt | tsv |
---|---|
supports text | no |
supports annotations | yes |
stream type | text |

name | type | default | purpose |
---|---|---|---|
`fields` | `Sequence[str]` | `()` | keys in `Entity.metadata` for the additional fields |
`include_header` | `bool` | `False` | add a header line |
`avoid_gaps` | `str` | `'split'` | suppress discontinuous spans |
`avoid_overlaps` | `str` | `None` | suppress annotation collisions |
`**fmtparams` | `Dict[str, str]` | `{}` | keyword args directly passed to `csv.writer()` |

fmt | text_csv |
---|---|
supports text | yes |
supports annotations | yes |
stream type | text |

name | type | default | purpose |
---|---|---|---|
`fields` | `Sequence[str]` | `()` | keys in `Entity.metadata` for the additional fields |
`include_header` | `bool` | `False` | add a header line |
`avoid_gaps` | `str` | `'split'` | suppress discontinuous spans |
`avoid_overlaps` | `str` | `None` | suppress annotation collisions |
`**fmtparams` | `Dict[str, str]` | `{}` | keyword args directly passed to `csv.writer()` |

fmt | text_tsv |
---|---|
supports text | yes |
supports annotations | yes |
stream type | text |

name | type | default | purpose |
---|---|---|---|
`fields` | `Sequence[str]` | `()` | keys in `Entity.metadata` for the additional fields |
`include_header` | `bool` | `False` | add a header line |
`avoid_gaps` | `str` | `'split'` | suppress discontinuous spans |
`avoid_overlaps` | `str` | `None` | suppress annotation collisions |
`**fmtparams` | `Dict[str, str]` | `{}` | keyword args directly passed to `csv.writer()` |