Processing TMX with sed

Cryptic and contorted

This entry is a bit complex, but translators with a technical bent might find it useful.

Some time ago I wrote about the TMX format for translation memories. These TMs can be very big files, spanning many thousands or even millions of lines. Processing them with traditional tools can become a problem. For example, most text editors load the whole file into memory before they can do any useful work. In addition, performing a simple search-and-replace task on more than a couple of files quickly becomes tedious and error-prone.

However, there are text editors that process a file one line at a time, or just a handful of lines at a time. They produce their output on the fly and free their resources as they go, so they are immediately ready for the next line. One of the best known of these programs is sed (stream editor). It is available on any Unix system (including, of course, any Linux, and in particular Ubuntu) and there are versions for Windows too (like this one).

The syntax of the sed language is a bit cryptic, but this won’t dishearten people used to understanding Shakespeare, will it? Not only that, the logic of this particular script is slightly contorted. Cryptic and contorted, nice name for a coffee house in the bohemian district.

We will print out all segments of a TMX file that contain any of the words law, laws, Law or Laws. The pattern used will be the following: [Ll]aws?[^a-zA-Z] (I’ve written about patterns and regular expressions here). In plain English it means: “Lower or upper case ‘l’ followed by ‘aw’ followed by an optional ‘s’ followed by any non-letter”. The last restriction is there to rule out words such as lawsuit.
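As a quick check of what the pattern accepts and rejects, here is a small sketch that feeds a few sample lines through sed. The sentences are made up for illustration, and the sketch assumes a sed that understands -E for extended regular expressions (GNU and BSD sed both do).

```shell
# -n suppresses the automatic printing of every line,
# -E enables extended regular expressions, so that ? means "optional".
matches=$(printf '%s\n' \
  'The law, as written' \
  'Laws are made here' \
  'A lawsuit was filed' \
  'He is a lawyer' \
  'the rule of law' \
  'An outlaw, they said' |
  sed -n -E '/[Ll]aws?[^a-zA-Z]/p')

# Show the lines that matched
printf '%s\n' "$matches"
```

Two details are worth noticing. A line that ends in "law" does not match, because the pattern insists on one more character after the word; in a TMX file the text is usually followed by a closing tag on the same line, so this rarely matters. And since nothing anchors the start of the word, "outlaw," matches as well.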

So here we go:

    # work only inside translation units
    /<tu[ >]/ {
        # if the pattern is in the first line, go to the found label
        /[Ll]aws?[^a-zA-Z]/b found
        :next
        # if we reach the end of the translation unit, we are done,
        # and the pattern has not been found
        /<\/tu>/b notfound
        # we are not at the end of the translation unit yet:
        # if the pattern has not been found (note the !),
        # append the next line and test again
        /[Ll]aws?[^a-zA-Z]/!{
            N
            b next
        }
        # if we are here, the pattern has been found
        :found
        # collect the rest of the lines until the end of the translation unit
        /<\/tu>/!{
            N
            b found
        }
        # command to be executed if we found the pattern
        p
        b exit
        :notfound
        # command to be executed if we don't find the pattern
        w not_found
        :exit
    }

A quick explanation of the commands used:

  • b jumps to the label that follows it (labels are defined with :name)
  • /XYZ/ executes the command that follows only if the pattern space matches XYZ
  • ! inverts the meaning of the preceding test
  • N reads the next line and appends it to the pattern space
  • p prints the pattern space
  • w writes the pattern space to the file named after it
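To see these commands working together on something smaller than a TMX file, here is a minimal sketch that collects every line between two markers into the pattern space and prints them as one line. The BEGIN and END markers and the sample lines are invented for this example.

```shell
joined=$(printf '%s\n' 'skip me' 'BEGIN' 'middle' 'END' 'skip me too' |
  sed -n '/BEGIN/ {
    :collect
    # no END in the pattern space yet: append the next line and try again
    /END/!{
      N
      b collect
    }
    # the appended lines are separated by newlines; make them visible
    s/\n/ | /g
    p
  }')

printf '%s\n' "$joined"
```

The lines outside the BEGIN/END block never reach the p command, and because of -n they are not printed automatically either.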

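Once the script is saved as find_in_tmx.sed, it can be run from the command line of a Unix system like this:

    sed -n -E -f find_in_tmx.sed big_memory.tmx

The -f option tells sed to read its commands from the script file; without it, sed would try to interpret the file name itself as a command. The -n option suppresses the default printing of every line, so that only the p command produces output. The -E option enables the extended regular expressions that the ? in the pattern needs (older versions of GNU sed spell it -r). The matching translation units go to standard output, and the others are written to a file called not_found.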
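For a try-out that does not need a real translation memory, the sketch below builds a tiny two-unit TMX file in a scratch directory and runs the script on it. The sample content and the found.txt file name are made up for this example, the script is written with one command per line and the slash in the closing tag escaped, and the -n, -E and -f options are assumed to be available (they are in GNU and BSD sed).

```shell
# Work in a scratch directory so that nothing existing is overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > find_in_tmx.sed <<'EOF'
/<tu[ >]/ {
    /[Ll]aws?[^a-zA-Z]/b found
    :next
    /<\/tu>/b notfound
    /[Ll]aws?[^a-zA-Z]/!{
        N
        b next
    }
    :found
    /<\/tu>/!{
        N
        b found
    }
    p
    b exit
    :notfound
    w not_found
    :exit
}
EOF

# A made-up sample: unit 1 contains "law", unit 2 only "lawsuit"
cat > big_memory.tmx <<'EOF'
<tmx version="1.4">
<body>
<tu tuid="1">
<tuv xml:lang="en"><seg>The law is clear.</seg></tuv>
<tuv xml:lang="es"><seg>La ley es clara.</seg></tuv>
</tu>
<tu tuid="2">
<tuv xml:lang="en"><seg>A lawsuit was filed.</seg></tuv>
<tuv xml:lang="es"><seg>Hubo una demanda.</seg></tuv>
</tu>
</body>
</tmx>
EOF

# Matching units go to found.txt, the rest to the file not_found
sed -n -E -f find_in_tmx.sed big_memory.tmx > found.txt

echo "Matching units:"
cat found.txt
echo "Non-matching units:"
cat not_found
```

Note that the test for the opening tag uses [ >] after tu, so the tuv elements inside a unit do not start a new block, and the test for the closing tag does not stop at the closing tuv tags either.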

This script is just an example, but the basic mechanism can be adapted to other situations, such as finding the segments that don’t contain a certain pattern.
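As one possible adaptation, here is a sketch that prints only the translation units in which the pattern does not appear. It is a simplified variant rather than a rearrangement of the script above: it first collects the whole unit and then tests the pattern once. The sample units are invented, and -E is again assumed to be supported.

```shell
missing=$(printf '%s\n' \
  '<tu tuid="1">' \
  '<tuv xml:lang="en"><seg>The law is clear.</seg></tuv>' \
  '</tu>' \
  '<tu tuid="2">' \
  '<tuv xml:lang="en"><seg>A lawsuit was filed.</seg></tuv>' \
  '</tu>' |
  sed -n -E '/<tu[ >]/ {
    :collect
    # keep appending lines until the closing tag of the unit arrives
    /<\/tu>/!{
      N
      b collect
    }
    # print the whole unit only if the pattern is absent from it
    /[Ll]aws?[^a-zA-Z]/!p
  }')

printf '%s\n' "$missing"
```

Collecting first and testing afterwards makes the logic easier to follow, at the cost of holding every unit in memory in full before deciding what to do with it. For units of a few lines that cost is negligible.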