KorAP/KorAP-XML-TEI

NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

SYNOPSIS

cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

DESCRIPTION

tei2korapxml is a script to convert TEI P5 and I5 based documents to the KorAP-XML format.

This program is usually called from inside another script.
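As a rough illustration of calling the converter from another script, the following Python sketch assembles a command line from options documented in this README and runs it. The helper names `build_command` and `convert` are hypothetical and not part of the distribution.

```python
import subprocess


def build_command(input_path, output_path, sentence_splits=False):
    """Assemble an argv list using only options documented in this README."""
    argv = ["tei2korapxml", "--tokenizer-korap"]
    if sentence_splits:
        argv.append("--use-tokenizer-sentence-splits")
    argv += ["--input", input_path, "--output", output_path]
    return argv


def convert(input_path, output_path, **kwargs):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(input_path, output_path, **kwargs), check=True)
```

Calling `convert("corpus.i5.xml", "corpus.korapxml.zip")` assumes that `tei2korapxml` is installed and available on the `PATH`.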

FORMATS

Input restrictions

  • TEI P5 formatted input with certain restrictions:

    • mandatory: text-header with integrated textsigle (or convertible identifier), text-body

    • optional: corp-header with integrated corpsigle, doc-header with integrated docsigle

  • Tokens inside the primary text must not be separated by newlines. Newlines are removed (see KorAP::XML::TEI::Data), and converting them into blanks between two tokens instead could introduce blanks where there should be none (e.g. punctuation characters like , or . should not be separated from their predecessor token). See also the code section ~ whitespace handling ~ in script/tei2korapxml.

  • Header types, like <idsHeader [...] type="document" [...] > need to be defined in the same line as the header tag.
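The newline restriction above can be illustrated with a small Python sketch. It is not the code from KorAP::XML::TEI::Data; the two helper functions are hypothetical and only contrast plain newline removal with a newline-to-blank conversion on primary text that places one token per line.

```python
def strip_newlines(text):
    """Remove newlines without replacement."""
    return text.replace("\n", "")


def newlines_to_blanks(text):
    """Hypothetical alternative: turn every newline into a blank."""
    return text.replace("\n", " ")


# Primary text with one token per line, which the input restrictions rule out.
one_token_per_line = "Hallo\n,\nWelt\n."

# Removing the newlines glues the tokens together ...
glued = strip_newlines(one_token_per_line)        # "Hallo,Welt."

# ... while converting them to blanks detaches the punctuation.
spaced = newlines_to_blanks(one_token_per_line)   # "Hallo , Welt ."
```

Neither result matches the intended text "Hallo, Welt.", which is why the input should not separate tokens by newlines in the first place.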

Notes on the output

  • zip file output (default on stdout) with utf8 encoded entries (which together form the KorAP-XML format)
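Since the output is an ordinary zip archive with UTF-8 encoded entries, it can be inspected with standard tooling. The following Python sketch uses only the standard library; the function names are hypothetical, and no particular entry names are assumed.

```python
import io
import zipfile


def list_entries(zip_bytes):
    """Return the sorted entry names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(archive.namelist())


def read_entry(zip_bytes, name):
    """Return one entry decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(name).decode("utf-8")
```

For a file on disk, the bytes can be obtained with `open("corpus.korapxml.zip", "rb").read()` before calling `list_entries`.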

INSTALLATION

tei2korapxml requires libxml2-dev bindings and File::ShareDir::Install to be installed. When these requirements are met, the preferred way to install the script is to use cpanm.

$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

In case everything went well, the tei2korapxml tool will be available on your command line immediately.

Minimum requirement for KorAP::XML::TEI is Perl 5.16.

OPTIONS

--input|-i

The input file to process. If no specific input is defined and a single dash - is passed as an argument, data is read from STDIN.

Instead of using -i input files can also be defined as trailing arguments to the command:

tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

--output|-o

The output zip file to be created. If no specific output is defined, data is written to STDOUT.

--root|-r

The root directory for output. Defaults to . (the current directory).

--help|-h

Print help information.

--version|-v

Print version information.

--tokenizer-korap|-tk

Use the standard KorAP/DeReKo tokenizer.

--tokenizer-internal|-ti

Tokenize the data using two embedded tokenizers, one taking an aggressive and one a conservative approach.

--tokenizer-call|-tc

Call an external tokenizer process that reads the texts from STDIN and outputs the offsets of all tokens.

Texts are separated using \x04\n. The external process should add a new line per text.

If the "--use-tokenizer-sentence-splits" option is activated, sentences are marked by offset as well in new lines.
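The exchange described above can be sketched in Python as follows. This is only an illustration under stated assumptions: offsets are taken to be character offsets, written as space-separated start and end positions on one line per text, and the regular expression is a toy tokenizer. The exact output format expected by tei2korapxml should be checked against a supported tokenizer such as Datok.

```python
import re

TEXT_SEPARATOR = "\x04\n"                # separator between texts (from this README)
TOKEN = re.compile(r"\w+|[^\w\s]")       # toy rule: words and single punctuation marks


def token_offsets(text):
    """Return a flat list [start1, end1, start2, end2, ...] for one text."""
    return [pos for match in TOKEN.finditer(text) for pos in match.span()]


def tokenize_stream(data):
    """Turn separator-delimited texts into one line of offsets per text."""
    texts = data.split(TEXT_SEPARATOR)
    if texts and texts[-1] == "":
        texts.pop()                      # drop the empty piece after the last separator
    lines = (" ".join(map(str, token_offsets(text))) for text in texts)
    return "".join(line + "\n" for line in lines)
```

A real tokenizer process would read STDIN incrementally and probably needs to flush its output after each text, because the calling program may wait for the offsets of one text before sending the next.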

To use Datok including sentence splitting, call tei2korapxml as follows:

$ cat corpus.i5.xml | tei2korapxml -s \
    -tc 'datok tokenize \
         -t ./tokenizer.matok \
         -p --newline-after-eot --no-sentences \
         --no-tokens --sentence-positions -' - \
    > corpus.korapxml.zip

--no-tokenizer

Boolean flag indicating that no tokenizer should be used. The flag has to be set explicitly, which ensures that by default a final token layer always exists. If a separate tokenizer is chosen, this flag is ignored.

--skip-inline-tokens

Boolean flag indicating that inline tokens should not be processed. Defaults to false (meaning inline tokens will be processed).

--skip-inline-token-annotations

Boolean flag indicating that inline token annotations should not be processed. Defaults to true (meaning inline token annotations won't be processed). Can be negated with --no-skip-inline-token-annotations.

--skip-inline-tags <tags>

Expects a comma-separated list of tags to be ignored when the structure is parsed. The content of these tags, however, will still be processed.

--auto-textsigle <textsigle>

Expects a text sigle that serves as a fallback if no text sigles are given in the input data. The auto text sigle will be incremented for each text processed.
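One plausible reading of "incremented" is that the trailing number of the sigle is increased by one while its zero padding is preserved. The following Python sketch works under that assumption; `next_textsigle` is a hypothetical helper and may differ from what the script actually does.

```python
import re


def next_textsigle(sigle):
    """Increase the trailing number of a sigle by one, keeping its zero padding."""
    match = re.search(r"(\d+)$", sigle)
    if match is None:
        raise ValueError(f"sigle has no trailing number: {sigle!r}")
    digits = match.group(1)
    incremented = str(int(digits) + 1).zfill(len(digits))
    return sigle[:match.start(1)] + incremented
```

Under this assumption, a start value of `ICC/GER.00001` would be followed by `ICC/GER.00002`, `ICC/GER.00003`, and so on.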

Example:

tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
< data.i5.xml > korapxml.zip

--xmlid-to-textsigle <from-regex>@<to-c/to-d/to-t>

Expects a regular replacement expression (separated by @ between the search and the replacement) to convert text id attributes to text sigles with three parts (separated by /).

Example:

tei2korapxml \
  --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
  -tk - < t/data/icc_german_sample.p5.xml

Converts text id ICC.German.DeReKo.WPD17.G11.00238 to sigle ICCGER/DeReKo.WPD17/G11.00238.
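To preview what such a rule does before running a conversion, the replacement can be emulated in Python. This is a sketch, not the script's own code: the hypothetical helper `xmlid_to_textsigle` splits the rule at the first @ and translates Perl-style `$1` references into Python's `\g<1>` syntax. It assumes that the search pattern itself contains no @ and uses only regex features shared by Perl and Python.

```python
import re


def xmlid_to_textsigle(rule, xml_id):
    """Apply a '<from-regex>@<replacement>' rule to a text id."""
    pattern, _, replacement = rule.partition("@")
    # Translate Perl-style group references ($1) into Python's \g<1> form.
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, python_replacement, xml_id)


rule = r"ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2"
sigle = xmlid_to_textsigle(rule, "ICC.German.DeReKo.WPD17.G11.00238")
# sigle == "ICCGER/DeReKo.WPD17/G11.00238"
```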

--inline-tokens <foundry>#[<file>]

Define the foundry and file (without extension) to store inline token information in. Unless --skip-inline-token-annotations is set, this will contain annotations as well. Defaults to tokens and morpho.

The inline token data will also be stored in the inline structures file (see --inline-structures), unless the inline token foundry is prepended by an ! exclamation mark, indicating that inline tokens are stored exclusively in the inline tokens file.
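The option value described above can be read as three parts: an optional ! marker, a foundry, and an optional file after the #. The Python sketch below parses such a value under that reading. `InlineTarget` and `parse_inline_tokens` are hypothetical names, and the fallbacks simply mirror the documented defaults tokens and morpho.

```python
from typing import NamedTuple


class InlineTarget(NamedTuple):
    foundry: str
    file: str
    exclusive: bool   # True if the value was prefixed with "!"


def parse_inline_tokens(spec, default_foundry="tokens", default_file="morpho"):
    """Split a '[!]<foundry>#[<file>]' value into its parts."""
    exclusive = spec.startswith("!")
    if exclusive:
        spec = spec[1:]
    foundry, _, file = spec.partition("#")
    return InlineTarget(foundry or default_foundry, file or default_file, exclusive)
```

The sketch is lenient and also accepts a value without #, which the script itself may or may not allow.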

Example:

tei2korapxml --no-tokenizer --inline-tokens \
  '!gingko#morpho' < data.i5.xml > korapxml.zip

--inline-dependencies <foundry>#[<file>]

Define the foundry and file (without extension) to store inline dependency information in. The file defaults to the dependency layer. If the option is not set, no separate dependency file is written, which means that dependency attributes will be stored in the inline tokens file, if not skipped.

The dependency data will also be stored in the inline token file (see --inline-tokens), unless the inline dependencies foundry is prepended by an ! exclamation mark, indicating that inline dependency data is stored exclusively in the inline dependencies file.

Example:

tei2korapxml --no-tokenizer --inline-dependencies \
  'gingko#dependency' < data.i5.xml > korapxml.zip

--inline-structures <foundry>#[<file>]

Define the foundry and file (without extension) to store inline structure information in. Defaults to struct and structures.

--base-foundry <foundry>

Define the base foundry to store newly generated token information in. Defaults to base.

--data-file <file>

Define the file (without extension) to store primary data information in. Defaults to data.

--header-file <file>

Define the file name (without extension) to store header information on the corpus, document, and text level in. Defaults to header.

--use-tokenizer-sentence-splits|-s

Use the sentence boundary information provided by the tokenizer, either replacing existing sentence boundaries or adding new ones. Currently, the KorAP tokenizer and certain external tokenizers support these boundaries.

--tokens-file <file>

Define the file (without extension) to store generated token information in (either from the KorAP tokenizer or an externally called tokenizer). Defaults to tokens.
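To see how the foundry and file options relate to each other, the following Python sketch derives entry names for a single text from the documented defaults. It rests on two assumptions that this README does not state: that the three parts of a text sigle map to nested directories, and that the extension appended to each file name is .xml. The function `text_level_entries` is hypothetical.

```python
import posixpath


def text_level_entries(textsigle,
                       data_file="data", header_file="header",
                       base_foundry="base", tokens_file="tokens",
                       struct_foundry="struct", struct_file="structures"):
    """List assumed zip entry names for one text, using the documented defaults."""
    text_dir = textsigle  # assumption: "CORPUS/DOC/TEXT" doubles as the directory path
    return [
        posixpath.join(text_dir, data_file + ".xml"),
        posixpath.join(text_dir, header_file + ".xml"),
        posixpath.join(text_dir, base_foundry, tokens_file + ".xml"),
        posixpath.join(text_dir, struct_foundry, struct_file + ".xml"),
    ]
```

Comparing such a list with the actual entries of a generated zip file is a quick way to check whether these assumptions hold for a given version of the script.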

--log|-l

Log level for Log::Any. Defaults to notice.

ENVIRONMENT VARIABLES

KORAPXMLTEI_DEBUG

Activate minimal debugging. Defaults to false.

KORAPXMLTEI_TOKENIZER_HEAP_SIZE

Set the heap size for the tokenizer process. Defaults to 512m.
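When the converter is started from another program, both variables can be passed through the child process environment. The Python sketch below only prepares such an environment. The helper name is hypothetical, and using "1" as the value that activates debugging is an assumption, since this README does not say which values count as true.

```python
import os


def tokenizer_environment(heap_size="512m", debug=False, base=None):
    """Return a copy of the environment with the KORAPXMLTEI_* variables set."""
    env = dict(os.environ if base is None else base)
    env["KORAPXMLTEI_TOKENIZER_HEAP_SIZE"] = heap_size
    if debug:
        env["KORAPXMLTEI_DEBUG"] = "1"   # assumed truthy value
    return env
```

The returned dictionary can be handed to `subprocess.run(..., env=tokenizer_environment("1g"))` when invoking tei2korapxml.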

COPYRIGHT AND LICENSE

Copyright (C) 2021-2025, IDS Mannheim

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

KorAP::XML::TEI is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS), member of the Leibniz-Gemeinschaft.

This program is free software published under the BSD-2 License.