tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML
cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip
tei2korapxml is a script to convert TEI P5 and I5 based documents to the KorAP-XML format.
This program is usually called from inside another script.
TEI P5 formatted input with certain restrictions:
mandatory: text-header with integrated textsigle (or convertible identifier), text-body
optional: corp-header with integrated corpsigle, doc-header with integrated docsigle
Tokens inside the primary text must not be newline separated, because newlines are removed (see KorAP::XML::TEI::Data), and converting newlines into blanks between two tokens could introduce blanks where there should be none (e.g. punctuation characters like , or . should not be separated from their predecessor token). See also the code section ~ whitespace handling ~ in script/tei2korapxml.
Header types, like <idsHeader [...] type="document" [...] >, need to be defined on the same line as the header tag.
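The newline restriction can be illustrated with a toy example. This is a minimal sketch in Python, not the converter's own code; the helper names `strip_newlines` and `newlines_to_blanks` are hypothetical, and the real handling in KorAP::XML::TEI::Data may differ in detail.

```python
def strip_newlines(text):
    """Remove newlines, as the converter is described to do (assumed behavior)."""
    return text.replace("\n", "")


def newlines_to_blanks(text):
    """The alternative the restriction argues against."""
    return text.replace("\n", " ")


one_token_per_line = "Hello\n,\nworld\n."
inline_text = "Hello, world."

print(strip_newlines(one_token_per_line))      # Hello,world.
print(newlines_to_blanks(one_token_per_line))  # Hello , world .
print(strip_newlines(inline_text))             # Hello, world.
```

Neither treatment recovers the intended text from newline-separated tokens: removing newlines fuses tokens, and inserting blanks detaches punctuation from its predecessor.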
zip file output (default on stdout) with UTF-8 encoded entries (which together form the KorAP-XML format)
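Since the output is a regular zip archive with UTF-8 encoded entries, it can be inspected with standard tooling. The following is a minimal sketch in Python; the function name `list_utf8_entries` and the entry path used for the stand-in archive are assumptions for illustration, not the layout the converter guarantees.

```python
import io
import zipfile


def list_utf8_entries(zip_bytes):
    """Return {entry name: decoded text} for every file entry in the archive."""
    entries = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            entries[info.filename] = archive.read(info).decode("utf-8")
    return entries


# Build a stand-in archive in memory; the entry path is illustrative only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("CORP/DOC/00001/data.xml", "<text>Größe</text>")

for name, content in list_utf8_entries(buffer.getvalue()).items():
    print(name, len(content))
```

For a real run, the bytes would come from the file written by tei2korapxml, e.g. by reading corpus.korapxml.zip in binary mode.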
tei2korapxml requires libxml2-dev bindings and File::ShareDir::Install to be installed. When these requirements are met, the preferred way to install the script is to use cpanm.
$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git
In case everything went well, the tei2korapxml tool will be available on your command line immediately.
Minimum requirement for KorAP::XML::TEI is Perl 5.16.
- --input|-i
-
The input file to process. If no specific input is defined and a single dash - is passed as an argument, data is read from STDIN.
Instead of using -i, input files can also be defined as trailing arguments to the command:
tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml
- --output|-o
-
The output zip file to be created. If no specific output is defined, data is written to STDOUT.
- --root|-r
-
The root directory for output. Defaults to . (the current directory).
- --help|-h
-
Print help information.
- --version|-v
-
Print version information.
- --tokenizer-korap|-tk
-
Use the standard KorAP/DeReKo tokenizer.
- --tokenizer-internal|-ti
-
Tokenize the data using two embedded tokenizers that take an aggressive and a conservative approach, respectively.
- --tokenizer-call|-tc
-
Call an external tokenizer process that tokenizes from STDIN and outputs the offsets of all tokens.
Texts are separated using \x04\n. The external process should add a new line per text.
If the "--use-tokenizer-sentence-splits" option is activated, sentences are marked by offset as well in new lines.
To use Datok including sentence splitting, call tei2korapxml as follows:
$ cat corpus.i5.xml | tei2korapxml -s \
    -tc 'datok tokenize \
    -t ./tokenizer.matok \
    -p --newline-after-eot --no-sentences \
    --no-tokens --sentence-positions -' - \
    > corpus.korapxml.zip
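The contract for an external tokenizer described above (texts separated by \x04\n, one output line per text) can be sketched with a toy whitespace tokenizer. This is only an illustration in Python: the function name `offsets_per_text` is hypothetical, and the exact output format (space-separated start and end character offsets, counted per text) is an assumption that should be checked against the tokenizer actually used.

```python
import re

EOT = "\x04\n"  # text separator named in the option description


def offsets_per_text(stream):
    """Return one output line per text: 'start end start end ...' (assumed format)."""
    texts = stream.split(EOT)
    if texts and texts[-1] == "":
        texts.pop()  # a trailing separator does not start another text
    lines = []
    for text in texts:
        spans = [match.span() for match in re.finditer(r"\S+", text)]
        lines.append(" ".join("{} {}".format(start, end) for start, end in spans))
    return lines


demo_input = "Ein Test.\x04\nNoch einer!\x04\n"
for line in offsets_per_text(demo_input):
    print(line)
# 0 3 4 9
# 0 4 5 11
```

A real external tokenizer would read the stream from STDIN instead of a demo string and would split punctuation from words; the sketch only shows the separator handling and the one-line-per-text shape.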
- --no-tokenizer
-
Boolean flag indicating that no tokenizer should be used. This is meant to ensure that by default a final token layer always exists. If a separate tokenizer is chosen, this flag is ignored.
- --skip-inline-tokens
-
Boolean flag indicating that inline tokens should not be processed. Defaults to false (meaning inline tokens will be processed).
- --skip-inline-token-annotations
-
Boolean flag indicating that inline token annotations should not be processed. Defaults to true (meaning inline token annotations won't be processed). Can be negated with --no-skip-inline-token-annotations.
-
Expects a comma-separated list of tags to be ignored when the structure is parsed. The content of these tags will still be processed, however.
- --auto-textsigle <textsigle>
-
Expects a text sigle that serves as a fallback if no text sigles are given in the input data. The auto text sigle will be incremented for each text processed.
Example:
tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip
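How the fallback sigle advances from text to text can be pictured as follows. This is a minimal sketch in Python under the assumption that the trailing number is incremented and its zero padding is kept; the source only states that the sigle is incremented per text, and `next_textsigle` is a hypothetical helper, not part of the tool.

```python
import re


def next_textsigle(sigle):
    """Increment the trailing number of a sigle, keeping its zero padding."""
    match = re.search(r"(\d+)$", sigle)
    if match is None:
        raise ValueError("no trailing number in {!r}".format(sigle))
    digits = match.group(1)
    incremented = str(int(digits) + 1).zfill(len(digits))
    return sigle[: match.start()] + incremented


sigle = "ICC/GER.00001"
for _ in range(3):
    print(sigle)
    sigle = next_textsigle(sigle)
# ICC/GER.00001
# ICC/GER.00002
# ICC/GER.00003
```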
- --xmlid-to-textsigle <from-regex>@<to-c/to-d/to-t>
-
Expects a regular replacement expression (separated by @ between the search and the replacement) to convert text id attributes to text sigles with three parts (separated by /).
Example:
tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml
Converts text id ICC.German.DeReKo.WPD17.G11.00238 to sigle ICCGER/DeReKo.WPD17/G11.00238.
- --inline-tokens <foundry>#[<file>]
-
Define the foundry and file (without extension) to store inline token information in. Unless --skip-inline-token-annotations is set, this will contain annotations as well. Defaults to tokens and morpho.
The inline token data will also be stored in the inline structures file (see --inline-structures), unless the inline token foundry is prefixed with an exclamation mark (!), indicating that inline tokens are stored exclusively in the inline tokens file.
Example:
tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip
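The `<foundry>#[<file>]` notation with its optional leading exclamation mark can be read as three pieces of information. The following Python sketch shows one way to take such a value apart; `parse_inline_spec` is a hypothetical helper, and falling back to the documented defaults when a part is omitted is an assumption about the tool's behavior.

```python
def parse_inline_spec(spec, default_foundry="tokens", default_file="morpho"):
    """Split '[!]<foundry>#[<file>]' into foundry, file and exclusivity."""
    exclusive = spec.startswith("!")
    if exclusive:
        spec = spec[1:]
    foundry, _, file_name = spec.partition("#")
    return {
        "foundry": foundry or default_foundry,
        "file": file_name or default_file,
        "exclusive": exclusive,
    }


print(parse_inline_spec("!gingko#morpho"))
# {'foundry': 'gingko', 'file': 'morpho', 'exclusive': True}
```

With `exclusive` set, the inline token data would go only to the inline tokens file and not additionally to the inline structures file.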
- --inline-dependencies <foundry>#[<file>]
-
Define the foundry and file (without extension) to store inline dependency information in. Defaults to the layer of dependency and will be ignored if not set (which means dependency attributes will be stored in the inline tokens file, if not skipped).
The dependency data will also be stored in the inline token file (see --inline-tokens), unless the inline dependencies foundry is prefixed with an exclamation mark (!), indicating that inline dependency data is stored exclusively in the inline dependencies file.
Example:
tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip
- --inline-structures <foundry>#[<file>]
-
Define the foundry and file (without extension) to store inline structure information in. Defaults to struct and structures.
- --base-foundry <foundry>
-
Define the base foundry to store newly generated token information in. Defaults to base.
- --data-file <file>
-
Define the file (without extension) to store primary data information in. Defaults to data.
- --header-file <file>
-
Define the file name (without extension) to store header information on the corpus, document, and text level in. Defaults to header.
- --use-tokenizer-sentence-splits|-s
-
Replace existing sentence boundary information with the boundaries provided by the tokenizer, or add them if none exist. Currently the KorAP tokenizer and certain external tokenizers support these boundaries.
- --tokens-file <file>
-
Define the file (without extension) to store generated token information in (either from the KorAP tokenizer or an externally called tokenizer). Defaults to tokens.
- --log|-l
-
Loglevel for Log::Any. Defaults to notice.
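The --xmlid-to-textsigle rule from the option list above is a search pattern and a replacement joined by @. The following Python sketch re-creates the documented example outside the tool; `xmlid_to_textsigle` is a hypothetical helper, splitting at the first @ is an assumption, and Perl-style `$1` references are translated to Python's backslash form before substitution.

```python
import re


def xmlid_to_textsigle(xml_id, rule):
    """Apply a '<from-regex>@<to>' rule with $1-style group references."""
    pattern, _, replacement = rule.partition("@")
    # Turn Perl-style $1, $2 into Python-style \1, \2.
    python_replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)
    return re.sub(pattern, python_replacement, xml_id)


rule = r"ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2"
print(xmlid_to_textsigle("ICC.German.DeReKo.WPD17.G11.00238", rule))
# ICCGER/DeReKo.WPD17/G11.00238
```

The first capture group collects the two dot-separated parts that form the document level, and the second collects the remainder as the text level, which yields the three-part sigle.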
- KORAPXMLTEI_DEBUG
-
Activate minimal debugging. Defaults to false.
- KORAPXMLTEI_TOKENIZER_HEAP_SIZE
-
Set the heap size for the tokenizer process. Defaults to 512m.
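Because the program is usually called from inside another script, the environment variables are typically set by the caller. The following Python sketch only prepares the argument list and environment for such a call and executes nothing; `build_invocation` is a hypothetical helper, and using the value "1" to activate KORAPXMLTEI_DEBUG is an assumption, since the accepted values are not stated here.

```python
import os


def build_invocation(input_path, heap_size="512m", debug=False):
    """Return (argv, env) for a tei2korapxml run; nothing is executed here."""
    env = dict(os.environ)
    env["KORAPXMLTEI_TOKENIZER_HEAP_SIZE"] = heap_size
    if debug:
        env["KORAPXMLTEI_DEBUG"] = "1"  # assumed truthy value
    argv = ["tei2korapxml", "-tk", input_path]
    return argv, env


argv, env = build_invocation("corpus.i5.xml", heap_size="1g", debug=True)
print(argv)
print(env["KORAPXMLTEI_TOKENIZER_HEAP_SIZE"])
```

The returned pair can be handed to `subprocess.run(argv, env=env, stdout=...)`, with stdout pointing at the zip file to be written.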
Copyright (C) 2021-2025, IDS Mannheim
Author: Peter Harders
Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober
KorAP::XML::TEI is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS), member of the Leibniz-Gemeinschaft.
This program is free software published under the BSD-2 License.