Odette extracts as much information as possible from a word processor odt file into semantic XML (TEI).
Doc (in French): http://resultats.hypotheses.org/267
Demo: https://obvil.huma-num.fr/odette/
It may be used from the command line:
~/myrepos $ sudo apt install git php php-cli php-xml
~/myrepos $ git clone https://github.com/oeuvres/odette.git
~/myrepos $ cd odette
~/myrepos/odette $ php odette.php
php odette.php (options)? "teidir/*.xml"
Export odt files with styles as XML (ex: TEI)
Parameters:
globs : 1-n files or globs
Options:
-h, --help : show this help message
-f, --force : force deletion of destination file
-d destdir : destination directory for generated files
-t template : a specific template for export among:
delacroix, desc_chine, dramabib, galien, hauy, hurlus, merveilles17, rougemont
--tei : default, export odt as XML/TEI
--html : export odt as html
--odtx : export native odt xml (for debug)
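To drive the command line from a script, the options listed above can be assembled programmatically. The following is a minimal sketch, not part of Odette: the helper name `odette_argv` is hypothetical, and it uses only the flags shown in the help message.

```python
def odette_argv(globs, destdir=None, template=None, fmt="tei", force=False):
    """Build the argument list for `php odette.php` (hypothetical helper).

    Only options documented in the help message are used.
    """
    if fmt not in ("tei", "html", "odtx"):
        raise ValueError(f"unknown export format: {fmt}")
    if not globs:
        raise ValueError("at least one file or glob is required")
    argv = ["php", "odette.php"]
    if force:
        argv.append("--force")
    if destdir is not None:
        argv += ["-d", destdir]
    if template is not None:
        argv += ["-t", template]
    argv.append(f"--{fmt}")
    # Globs are passed unexpanded, like the quoted glob in the usage line.
    argv += list(globs)
    return argv
```

From inside the odette directory, the result could be passed to `subprocess.run(odette_argv(["texts/*.odt"], destdir="tei"), check=True)`; the paths `texts/*.odt` and `tei` are placeholders.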
Odette transposes some word processor direct formatting at paragraph level (left, right, center) and character level (italic, small caps…), but most information is conveyed by user styles.
Word processor styles may be paragraph level (¶) or character level (@). You must make sure each style has the right level in your word processor if you want Odette to work well. Microsoft Office may create linked styles, for example a single style name Quote that applies either to a full paragraph or to a few quoted words inline. This may confuse an automatic converter. It is a good idea to design your style template in LibreOffice; you can save the template in docx format and edit texts with MS Word (but you need to save the files as odt at the end to transform them with Odette).
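One way to check the level of each style before conversion is to read the style declarations stored in the odt file itself. The sketch below is independent of Odette and assumes a standard OpenDocument package, where `styles.xml` declares each named style with a `style:family` of `paragraph` (¶) or `text` (@); the function name `style_levels` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

def style_levels(odt_file):
    """Map each named style of an odt file to its family.

    'paragraph' is a paragraph-level style, 'text' a character-level one.
    """
    with zipfile.ZipFile(odt_file) as odt:
        root = ET.fromstring(odt.read("styles.xml"))
    # Named (user-visible) styles live under <office:styles>.
    named = root.find(f"{{{OFFICE_NS}}}styles")
    levels = {}
    if named is None:
        return levels
    for style in named.findall(f"{{{STYLE_NS}}}style"):
        # Prefer the name displayed to the user when it is recorded.
        name = style.get(f"{{{STYLE_NS}}}display-name") or style.get(f"{{{STYLE_NS}}}name")
        family = style.get(f"{{{STYLE_NS}}}family")
        if name and family:
            levels[name] = family
    return levels
```

A style expected at character level that shows up with the family `paragraph` (or the reverse) is worth fixing in the template before running Odette.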
As an example of what Odette does, if you use the paragraph style <ab>, the paragraph is transformed into this XML:
<ab type="ornament">My para</ab>
Below is a list of the normalized style names Odette knows, with their XML/TEI transposition. Unknown styles are kept in a @rend attribute. Style names are shown here normalized as lower-case ASCII letters, but real-life style names may contain capitals, accents, spaces, or punctuation. For example, quotesalute could appear to the user as <Quote, Salute> in the word processor (a style for the salutation of a letter inside a quotation).
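The normalization described above can be approximated as follows. This is a sketch of the idea, not Odette's actual code: the function name `normalize_style` is hypothetical, and keeping digits is an assumption.

```python
import unicodedata

def normalize_style(name):
    """Reduce a displayed style name to lower-case ASCII letters and digits."""
    # Decompose accented letters so the accent becomes a separate character.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop accents, spaces and punctuation; keep ASCII letters and digits only.
    return "".join(c for c in decomposed.lower() if c.isascii() and c.isalnum())
```

With this sketch, `normalize_style("Quote, Salute")` gives `quotesalute`.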
ab
<ab type="ornament">content ¶</ab>
address
<address>
<addrLine>content ¶</addrLine>
</address>
argument
<argument>
<p>content ¶</p>
</argument>
bibl
<bibl>content ¶</bibl>
byline
<byline>content ¶</byline>
camera
<camera>content ¶</camera>
caption
<caption>content ¶</caption>
castitem
<castList>
<castItem>content ¶</castItem>
</castList>
castlist
<castList>content ¶</castList>
closer
<closer>content ¶</closer>
dateline
<dateline>content ¶</dateline>
def
<entryFree>
<def>content ¶</def>
</entryFree>
desc
<desc>content ¶</desc>
docauthor
<docAuthor>content ¶</docAuthor>
docimprint
<docImprint>content ¶</docImprint>
docdate
<docDate>content ¶</docDate>
eg
<eg>content ¶</eg>
epigraph
<epigraph>
<p rend="right italic…">content ¶</p>
</epigraph>
epigraphl
<epigraph>
<l>content ¶</l>
</epigraph>
entry
<entry>content ¶</entry>
fw
<fw>content ¶</fw>
index
<index>
<item>content ¶</item>
</index>
l
<l rend="center italic…">content ¶</l>
label
<label>content ¶</label>
labeldateline
<label type="dateline">content ¶</label>
labelhead
<label type="head">content ¶</label>
labelsalute
<label type="salute">content ¶</label>
labelspeaker
<label type="speaker">content ¶</label>
lg
<lg>
<l>content ¶</l>
</lg>
opener
<opener>content ¶</opener>
p
<p rend="right italic…">content ¶</p>
pb
<pb n="…"/>
postscript
<postscript>
<p>content ¶</p>
</postscript>
q
<q>content ¶</q>
quote
<quote>
<p rend="right italic…">content ¶</p>
</quote>
quotedateline
<quote>
<dateline>content ¶</dateline>
</quote>
quotel
<quote>
<l>content ¶</l>
</quote>
quotesalute
<quote>
<salute>content ¶</salute>
</quote>
quotesigned
<quote>
<signed>content ¶</signed>
</quote>
role
<castItem>
<role>content ¶</role>
</castItem>
roledesc
<castItem>
<roleDesc>content ¶</roleDesc>
</castItem>
said
<said>content ¶</said>
salute
<salute>content ¶</salute>
salutation
<salute>content ¶</salute>
set
<set>
<p>content ¶</p>
</set>
signed
<signed>content ¶</signed>
speaker
<speaker>content ¶</speaker>
stage
<stage>content ¶</stage>
term
<index>
<term>content ¶</term>
</index>
trailer
<trailer>content ¶</trailer>
abbr
blah… <abbr>@ level</abbr> …blah
add
blah… <add>@ level</add> …blah
actor
blah… <actor>@ level</actor> …blah
author
blah… <author>@ level</author> …blah
affiliation
blah… <affiliation>@ level</affiliation> …blah
age
blah… <age>@ level</age> …blah
bibl
blah… <bibl>@ level</bibl> …blah
c
blah… <c>@ level</c> …blah
code
blah… <code>@ level</code> …blah
corr
blah… <corr>@ level</corr> …blah
date
blah… <date>@ level</date> …blah
del
blah… <del>@ level</del> …blah
distinct
blah… <distinct>@ level</distinct> …blah
email
blah… <email>@ level</email> …blah
emph
blah… <emph>@ level</emph> …blah
geogname
blah… <geogName>@ level</geogName> …blah
gloss
blah… <gloss>@ level</gloss> …blah
name
blah… <name>@ level</name> …blah
num
blah… <num>@ level</num> …blah
pb
blah… <pb>@ level</pb> …blah
persname
blah… <persName>@ level</persName> …blah
placename
blah… <placeName>@ level</placeName> …blah
stage
blah… <stage>@ level</stage> …blah
title
blah… <title>@ level</title> …blah
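The paragraph-level correspondences can be read as a lookup table from normalized style name to TEI element, with an optional wrapper element. The sketch below illustrates that reading with a few entries copied from the list; it is not Odette's implementation. It wraps each paragraph on its own, without merging neighbours, and the `<p rend="…">` fallback for unknown styles is an assumption (the text only says that unknown styles are kept in a @rend attribute).

```python
from xml.sax.saxutils import escape, quoteattr

# style name -> (wrapper element or None, paragraph element); sample entries only
PARAGRAPH_STYLES = {
    "bibl": (None, "bibl"),
    "speaker": (None, "speaker"),
    "argument": ("argument", "p"),
    "quotel": ("quote", "l"),
    "lg": ("lg", "l"),
}

def paragraph_to_tei(style, text):
    """Render one paragraph for a normalized style name (illustrative only)."""
    body = escape(text)
    if style not in PARAGRAPH_STYLES:
        # Assumed fallback: keep the unknown style name in @rend on a <p>.
        return f"<p rend={quoteattr(style)}>{body}</p>"
    wrapper, element = PARAGRAPH_STYLES[style]
    inner = f"<{element}>{body}</{element}>"
    return f"<{wrapper}>{inner}</{wrapper}>" if wrapper else inner
```

For instance, `paragraph_to_tei("quotel", "A line of verse")` yields a `<l>` nested in a `<quote>`, matching the quotel entry of the list.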