
LogonProcessing_BatchTranslation

StephanOepen edited this page Nov 16, 2005 · 21 revisions

Overview

As is documented in the LOGON EAMT 2005 article (Oepen, et al., 2005), batch fan-out is a central facility in development, regression testing, and end-to-end evaluation. Internally, batch fan-out mode deploys [incr tsdb()] facilities in a layered fashion, i.e. the top layer is comprised of one or more [incr tsdb()] clients that perform translation, and each client internally uses [incr tsdb()] to decompose itself into the trinity of parsing, transfer, and generation clients. [incr tsdb()] failure detection and roll-over apply at both layers, i.e. if a component terminates unexpectedly, it is automatically restarted. Depending on the value of the [incr tsdb()] parameter *process-client-retries* (default: 0), the item that caused the failure may or may not be re-scheduled for processing, until the maximum number of retries is exhausted. Likewise, all standard [incr tsdb()] options for distributed and parallel processing apply to batch fan-out.
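The roll-over behaviour can be sketched as follows; the names (`process_with_rollover`, `ClientCrash`) are ours and purely illustrative — only the restart-and-retry logic mirrors the description above.

```python
class ClientCrash(Exception):
    """Raised when a processing client terminates unexpectedly."""

def process_with_rollover(make_client, items, retries=0):
    """Restart a crashed client and, depending on `retries`
    (cf. *process-client-retries*, default 0), re-schedule the
    offending item until the retry budget is exhausted."""
    client = make_client()
    results = []
    for item in items:
        attempts = 0
        while True:
            try:
                results.append(client.process(item))
                break
            except ClientCrash:
                client = make_client()   # automatic restart of the component
                attempts += 1
                if attempts > retries:   # retry budget exhausted: skip item
                    results.append(None)
                    break
    return results
```

With the default of zero retries, the item that triggered the crash is simply abandoned after the component has been restarted.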

In the standard LOGON set-up, the [incr tsdb()] cpu definition that instantiates the top-level translation client is termed :logon. Thus, the command

  (tsdb :cpu :logon)

will create the four interconnected processes needed for batch translation, i.e. one parsing, transfer, and generation client each, plus the top-level controller. Once loaded, the client will register itself; e.g.

  wait-for-clients(): `ld.uio.no' registered as tid <40044> [1:40].

In order to run batch fan-out from the [incr tsdb()] podium, toggle Process | Switches to Translation and then execute one of the processing commands, e.g. Process | All Items. Fan-out depth, in this mode, is controlled by the value of Process | Variables | Analyses Limit, where a value of, say, 20 would restrict fan-out at each level to at most twenty hypotheses that are pursued in downstream processing (see the --limit command line option to the batch script).

A streamlined way of running [incr tsdb()] batch fan-out is by means of the LOGON batch script. The script resides in the top-level LOGON directory $LOGONROOT and is invoked from a command shell, e.g.

  $LOGONROOT/batch vei

The fan-out batch script requires a functional LOGON installation (see separate instructions; currently on the LOGON workspace, September 2005) and will first load up the [incr tsdb()] environment and then configure one or more translation clients. As a result of running the batch script, a new [incr tsdb()] profile will be stored in the [incr tsdb()] profile repository, and a fan-out log file will be generated in the user home directory. For the example command above, the profile will be called logon/vei/05-09-26 (assuming the current date was 26-sep-05), with its corresponding log file logon.vei.05-09-26.fan.
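The naming scheme can be reproduced as a small helper; the function below is hypothetical and only illustrates the convention, it is not part of the batch script.

```python
from datetime import date

def batch_names(corpus, when):
    """Mirror the naming convention described above: a profile under
    logon/<corpus>/<yy-mm-dd> in the profile repository, and a log file
    logon.<corpus>.<yy-mm-dd>.fan in the user home directory."""
    stamp = when.strftime("%y-%m-%d")
    return f"logon/{corpus}/{stamp}", f"logon.{corpus}.{stamp}.fan"

profile, log = batch_names("vei", date(2005, 9, 26))
# profile: "logon/vei/05-09-26"; log: "logon.vei.05-09-26.fan"
```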

As of LOGON 0.5 (knut), the default behaviour of the batch script has changed slightly: as batch is now using [incr tsdb()] facilities internally, it assumes that by default the intention is to process an [incr tsdb()] skeleton. Thus, the command in the above example will operate on the vei skeleton (for which there is no ASCII fan-out input file, actually). To obtain the original functionality of the batch script, i.e. process an input file in ASCII fan-out input format (see below), do the following:

  $LOGONROOT/batch --ascii $LOGONROOT/ntnu/data/mrs.txt

There will be more to say on the new batch script, but a few (new) options may be relevant immediately: --count n will parallelize processing and start up n full instantiations of the LOGON pipeline; --limit n will prune the fan-out space to at most n alternatives that, at each stage (i.e. post-analysis and post-transfer), get passed on downstream; and --suffix string will append string to the name of the newly created profile, e.g. when more than one run per day needs to be recorded. The --limit option actually has a default value of (currently) 25; thus, in order to get full fan-out (if you think you have the cpu cycles :-), use --limit 0.
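The effect of --limit amounts to the following kind of pruning at each stage; this is an illustrative sketch (the function name and data shapes are ours), assuming hypotheses carry a rank score where lower means better.

```python
def prune(hypotheses, limit):
    """Keep at most `limit` best-ranked hypotheses for downstream
    processing; a limit of 0 means full fan-out, i.e. no pruning.
    Hypotheses are (output, score) pairs; lower scores rank better."""
    if limit == 0:
        return list(hypotheses)
    return sorted(hypotheses, key=lambda h: h[1])[:limit]

outputs = [("reading a", 3.0), ("reading b", 1.0), ("reading c", 2.0)]
# with --limit 2, only the two best-ranked outputs continue downstream
survivors = prune(outputs, 2)
```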

Note that option processing in the batch script is not very robust; hence, please make sure you get the option syntax exactly right.

Fan-Out ASCII Input

When using the --ascii option to the LOGON batch script or the [incr tsdb()] File | Import | Bi-Text Items command, processing will first construct the target profile from an ASCII fan-out input file, essentially a sentence-aligned bi-text.

Jointly with Torbjørn, we confirmed (August 2004) that the fan-out batch script allows running new data sets through the end-to-end system. We constructed the following input file:

  Vi skal møte Ask på mandag.
  Vi shall meet Abrams on Monday.
  We should meet Abrams on Monday.

  Ta båt til Ortnevik.
  Take the boat to Ortnevik.
  Take a boat to Ortnevik.

  Tar du båten til Ortnevik, kan du gå stien samme dagen.
  If you take the boat to Ortnevik, you can walk the path the same day.
  If you take the boat to Ortnevik, you can walk the path on the same day.

where the format is a sequence of blocks; each block has the Norwegian input on the first line, followed by zero or more lines with reference translations. Blocks are separated from each other by two consecutive newlines. Since we are running this on Unix, it is important to produce Unix-style linebreaks, i.e. either create the file in a Unix environment itself or make sure the linebreaks are ^J (linefeed) and not ^M (carriage return). Incidentally, we got reasonable coverage on the above baby test file and stellar BLEU scores.
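A minimal reader for this format might look as follows; the function name is ours, and this is a sketch of the block structure only, not the actual import code.

```python
def parse_fanout(text):
    """Parse the bi-text format described above: blocks separated by a
    blank line; the first line of each block is the source sentence, any
    remaining lines are reference translations (possibly none)."""
    items = []
    for block in text.strip().split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if lines:
            items.append((lines[0], lines[1:]))
    return items
```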

Fan-Out Log File Format

I completed a mode of running the core demonstrator that exhaustively multiplies out ambiguous outputs from the three processing phases. I put a log of exhaustive batch processing the tur corpus into CVS as `tur.fan'. For each of the Norwegian sentences from the input file, there will be one block of lines like these:

  [17:37:37] (10) |Bergensområdet er tett befolket.| --- 1 (0.24|0.00:0.24 s) <:> () [0].
  |
  |-[0.39] # 0 --- 2 (0.08|0.00:0.00 s) <:9> (756.4K 10.7M = 11.5M) [0].
  | |
  | |-[0.92] # 0 --- 4 (0.17|0.00:0.00 s) <:294> {1335:307} (1.6M 53.7M = 55.3M) [1].
  | |   |the bergen area is densely populated| [1310.78]
  | |   |the bergen area is populated densely| [10070.39]
  | |   |the bergen area is populated densely| [10070.39]
  | |   |the bergen area densely is populated| [14501.42]
  | |
  | |-[1.49] # 1 --- 8 (0.17|0.00:0.00 s) <:255> {1312:302} (1.6M 47.1M = 48.7M) [1].
  | |   |the area around bergen is densely populated| [816.89]
  | |   |the area round bergen is densely populated| [1142.83]
  | |   |the area around bergen is populated densely| [4690.06]
  | |   |the area around bergen is populated densely| [4690.06]
  | |   |the area around bergen densely is populated| [6028.07]
  | |   |the area round bergen is populated densely| [6561.42]
  | |   |the area round bergen is populated densely| [6561.42]
  | |   |the area round bergen densely is populated| [8433.31]
  |
  |< |Bergensområdet er tett befolket.| (10) --- 9 [12]
  |> |the area around bergen is densely populated| [816.9] (0:1:4).
  |> |the area round bergen is densely populated| [1142.8] (0:1:6).
  |> |the bergen area is densely populated| [1310.8] (0:0:2).
  |> |the area around bergen is populated densely| [4690.1] (0:1:0).
  |> |the area around bergen densely is populated| [6028.1] (0:1:5).
  |> |the area round bergen is populated densely| [6561.4] (0:1:1).
  |> |the area round bergen densely is populated| [8433.3] (0:1:7).
  |> |the bergen area is populated densely| [10070.4] (0:0:0).
  |> |the bergen area densely is populated| [14501.4] (0:0:3).
  |= 10:0 of 10 {100.0 0.0}; 10:0 of 10:0 {100.0 0.0}; 9:0 of 10:0 {90.0 0.0} @ 9 of 10 {90.0}.

The first line is the input sentence, followed by --- and the number of readings returned by the analysis grammar. The remaining numbers on that line are timing and memory measures; see the [incr tsdb()] manual. Subsequent lines show the results of running each output, in turn, through downstream components, using the `branch' lines and indentation to indicate the flow of control. The third line states that the first parsing output (# 0) had 2 transfer outputs, of which (in turn) the first gave rise to four generator outputs. Upon successful completion of generation, all realizations are presented, one per line, each followed by its MaxEnt realization ranker score. For each new branch, the initial number in square brackets is the elapsed real time since the start of translating this sentence, i.e. at time [1.49] we started generation from the second transfer output for the first (and only) parsing result.

Once all combinatorics have been explored for one input, there follows a block of summary lines. The first (prefixed by |<) repeats the Norwegian input string, followed by two numbers (9 [12] in this case): there were a total of 12 translations output from all branches, of which 9 are actually distinct strings. Next follow nine lines, ordered by cross-perplexity, presenting the various unique output translations, each followed by an index into the branching process in terms of parse, transfer, and realization output identifiers. Finally, the last line in the example above (prefixed by |=) is a running, accumulated coverage summary on the current input file:

  10:0 of 10 {100.0 0.0}; 10:0 of 10:0 {100.0 0.0}; 9:0 of 10:0 {90.0 0.0} @ 9 of 10 {90.0}

All i:j pairs of numbers are in terms of full vs. fragmented analyses, i.e. at this point (translating item # 10 from tur.txt), there were 10 full and 0 fragmented parser outputs, for an analysis coverage of 100%. Following are transfer and generation coverage, each relative to the number of available inputs (full or fragmented) to that component. In the above example, transfer succeeded on all 10 parser outputs, but for one of them we were unable to generate. The final number, following the @ sign, is accumulated end-to-end coverage, i.e. the product of the three individual coverage numbers; in other words, the number of inputs that went all the way through the system successfully.
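The arithmetic behind the summary line can be spelled out as follows; the function name is ours, purely for illustration.

```python
def stage_coverage(full, fragmented, inputs):
    """Per-stage coverage of full vs. fragmented analyses, in per cent,
    relative to the number of inputs available to that component."""
    return 100.0 * full / inputs, 100.0 * fragmented / inputs

# stage by stage, for the example summary line above
parse = stage_coverage(10, 0, 10)      # (100.0, 0.0)
transfer = stage_coverage(10, 0, 10)   # (100.0, 0.0)
generate = stage_coverage(9, 0, 10)    # (90.0, 0.0)

# accumulated end-to-end coverage is the product of the three
# full-coverage ratios, i.e. 90 per cent: `9 of 10 {90.0}'
end_to_end = 100.0 * (10 / 10) * (10 / 10) * (9 / 10)
```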

BLEU Scores

The BLEU scoring script is integrated in the fan-out batch script (and in CVS). In the fan-out log, BLEU scores are printed in angle brackets, e.g.

  |> |be careful about use of an open fire in the backcountry| [71.6] <0.49> (0:4:1).

Additionally, the running summary lines include the average document-level BLEU score, once averaged over all inputs, once averaged over only those for which the system produced one or more outputs, e.g.

  |= 69:21 of 104 {66.3 20.2}; 54:8 of 69:21 {78.3 38.1}; 46:8 of 54:8 {85.2 100.0} @ 54 of 104 {51.9} <0.34 0.66>.

This is to say that, according to the above (August 2004), we produce outputs for 51.9 per cent of the tur items; our BLEU average over these is 0.66 (pretty good), and over the total set it drops to 0.34, as the 50 items with no system output are counted as a BLEU score of 0 (we could probably increase this score by outputting a selection of high-frequency English function words, e.g. `a the of').
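The two averages can be computed along these lines; an illustrative sketch under our own naming, not the actual scoring script.

```python
def bleu_averages(scores):
    """Given per-item BLEU scores, with None marking items the system
    failed to translate, return (average over the total set, counting
    misses as 0.0; average over covered items only)."""
    overall = sum(s if s is not None else 0.0 for s in scores) / len(scores)
    covered = [s for s in scores if s is not None]
    per_output = sum(covered) / len(covered) if covered else 0.0
    return overall, per_output

# two of four items translated: the overall average is dragged down
# by the misses, the per-output average is not
overall, per_output = bleu_averages([0.8, None, 0.4, None])
```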

Resource Consumption

The LOGON demonstrator is comprised of three heavy-duty components: the parser, the transfer component, and the generator will each at times grow to multiple gigabytes in process size. The system will likely fail miserably when computing resources, specifically main memory, are insufficient. The current (September 2005) release of the LOGON software should run on sufficiently modern x86 Linux installations with either 32- or 64-bit kernels (always using 32-bit mode, though). A site-specific Lisp license file is required (see the installation instructions). We would recommend against running the system on machines with less than, say, three gbytes of main memory. In order to use parallelization of batch processing (e.g. using the --count switch to the batch script), four to six gbytes of RAM and at least two cpus should be available.
