TXT
Plain text is a limited, yet universal and robust format for storing textual content.
Many of the formats supported by bconv are technically plain-text files (e.g. PubTator, BioC JSON, CoNLL), but use some mark-up to denote document structure, metadata, or annotations.
The txt format, however, holds only the contents of a document in plain text, precluding the encoding of metadata and annotations, and supporting document structure only to a very limited extent.
The txt.json format is a simple wrapper for the txt format.
It allows representing multiple documents in a single file and supports a document ID.
Lidocaine-induced cardiac asystole.
Intravenous administration of a single 50-mg bolus of lidocaine in a 67-year-old man ...
[
{
"id": "354896",
"text": "Lidocaine-induced cardiac asystole.\n\nIntravenous administration of ..."
}
]
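The txt.json layout shown above is simple enough to produce and consume with a general-purpose JSON library. The following is a minimal sketch using only Python's standard json module; it does not use bconv, and the helper names to_txt_json and from_txt_json are hypothetical, chosen for illustration only.

```python
import json

# Hypothetical input: (document ID, plain text) pairs.
documents = [
    ("354896", "Lidocaine-induced cardiac asystole.\n\nIntravenous administration of ..."),
]


def to_txt_json(docs):
    """Serialise (id, text) pairs as a JSON array of {"id", "text"} objects."""
    records = [{"id": doc_id, "text": text} for doc_id, text in docs]
    return json.dumps(records, indent=2, ensure_ascii=False)


def from_txt_json(payload):
    """Parse a txt.json string back into a list of (id, text) pairs."""
    return [(rec["id"], rec["text"]) for rec in json.loads(payload)]


payload = to_txt_json(documents)
print(payload)
```

Because each record carries its own "id" and "text", several documents can share one file, which the bare txt format cannot express.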
The Wikipedia articles on text files and plain text as a format provide information and further reading about many aspects of the format.
- Document structure: Plain-text files are interpreted as a single document. Blank lines are interpreted as section boundaries, unless the single_section option is set, in which case the entire text is read as a single section. With the sentence_split option, line breaks are interpreted (when loading) or inserted (when serialising) as sentence boundaries; when loading with this option, bconv attempts no further sentence splitting. Multiple documents per file can only be represented in the txt.json format.
- Metadata: The filename (if available) is used as a fallback for inferring the document ID, if none was provided to the load() call.
- Whitespace: Line breaks may be indicative of document structure, depending on the options single_section and sentence_split, as described above. When serialising text alongside stand-off annotations (e.g. bionlp), do not use the sentence_split option, as it is not guaranteed to preserve character offsets.
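The interplay of blank lines, single_section, and sentence_split can be illustrated with a small stand-alone sketch. This is not bconv's implementation: the function split_sections is hypothetical, and the exact treatment of whitespace-only lines is an assumption. It only mirrors the rules described in the list above.

```python
import re


def split_sections(text, single_section=False, sentence_split=False):
    """Return a list of sections, each given as a list of text units.

    Illustrative only: blank lines delimit sections unless single_section
    is set; with sentence_split, every non-empty line is one sentence.
    """
    body = text.strip("\n")
    if single_section:
        blocks = [body]
    else:
        # Assumption: lines holding only spaces or tabs count as blank.
        blocks = [b for b in re.split(r"\n(?:[ \t]*\n)+", body) if b.strip()]
    if sentence_split:
        return [[line for line in b.split("\n") if line.strip()] for b in blocks]
    return [[b] for b in blocks]


sample = "Lidocaine-induced cardiac asystole.\n\nFirst sentence.\nSecond sentence.\n"
print(split_sections(sample))
print(split_sections(sample, single_section=True))
print(split_sections(sample, sentence_split=True))
```

In the default case the sample yields two sections; with single_section it yields one section holding the whole text; with sentence_split the second section is broken into two sentences at the line break.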
fmt | txt |
---|---|
native type | Document |
lazy loading | no |
supports text | yes |
supports annotations | no |
stream type | text |
name | type | default | purpose |
---|---|---|---|
single_section | bool | False | Conflate all content into a single section |
sentence_split | bool | False | Interpret line breaks as given sentence boundaries |
fmt | txt.json |
---|---|
native type | Collection |
lazy loading | no |
supports text | yes |
supports annotations | no |
stream type | text |
name | type | default | purpose |
---|---|---|---|
single_section | bool | False | Conflate all content into a single section |
sentence_split | bool | False | Interpret line breaks as given sentence boundaries |
fmt | txt |
---|---|
supports text | yes |
supports annotations | no |
stream type | text |
name | type | default | purpose |
---|---|---|---|
sentence_split | bool | False | Separate sentences with line breaks |
fmt | txt.json |
---|---|
supports text | yes |
supports annotations | no |
stream type | text |
name | type | default | purpose |
---|---|---|---|
sentence_split | bool | False | Separate sentences with line breaks |
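Writing sentences on separate lines can change the text that stand-off annotations refer to. The sketch below shows one hypothetical way this can happen (a two-space gap between sentences collapsing into a single line break); the sample text, the sentence list, and the annotation are invented for illustration, and no bconv code is involved.

```python
# Original text with two spaces between the sentences.
text = "Aspirin helps.  It is cheap."

# Hypothetical stand-off annotation pointing at "It" in the original text.
annotation = {"start": 16, "end": 18, "text": "It"}
print(text[annotation["start"]:annotation["end"]] == annotation["text"])

# Hypothetical sentence segmentation, written out one sentence per line.
sentences = ["Aspirin helps.", "It is cheap."]
rejoined = "\n".join(sentences)

# The same offsets now select a different span in the rewritten text.
shifted = rejoined[annotation["start"]:annotation["end"]]
print(repr(shifted), rejoined.index("It"))
```

The annotation still matches the original string, but in the rewritten text "It" has moved one character to the left, so the stored offsets no longer line up.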