Power and flexibility in XML

Magical mystery tags

XML has become a widely accepted standard for structuring and exchanging data. It combines power and flexibility, two qualities that usually compete with each other but that XML manages to balance well.

The format itself is deceptively simple: well-nested tags with no predefined meaning, plus optional attributes that can themselves have any name. A typical tag (with its content) looks like this:

<arbitraryName arbitraryAttribute="hello">Some text</arbitraryName>

When the data being handled is itself made of tags between angle brackets (for example HTML, XHTML or XML itself), we have a problem: how do we distinguish XML structure from data? To get around this, angle brackets that belong to the data are replaced with entity references: < becomes &lt; and > becomes &gt; (abbreviations of “less than” and “greater than”, which makes them easy to remember). A literal ampersand is escaped in the same way, as &amp;. These entities are predefined in XML, so documents that contain them remain well-formed.
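The escaping described above can be sketched in Python with the standard-library xml.etree.ElementTree module. This is only an illustration: the element name "description" and the sample fragment are assumptions, not taken from a real project.

```python
import xml.etree.ElementTree as ET

# An HTML fragment that we want to carry as plain data inside an XML field.
html_fragment = "XIX<sup>e</sup> siècle"

# "description" is a hypothetical field name chosen for this example.
field = ET.Element("description")
field.text = html_fragment

# On serialization, the angle brackets in the data become entity references,
# so they cannot be confused with the XML structure itself.
serialized = ET.tostring(field, encoding="unicode")
print(serialized)
# <description>XIX&lt;sup&gt;e&lt;/sup&gt; siècle</description>

# Parsing the document gives back the original fragment unchanged.
parsed = ET.fromstring(serialized)
print(parsed.text)
# XIX<sup>e</sup> siècle
```

The parser sees a single element with text content; the "sup" markup only becomes a tag again when something decodes the field and treats it as HTML.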

When we translate these files with a CAT tool, it is convenient to convert the entities back into angle brackets first. The tags are then interpreted as tags instead of being displayed as part of the text: we get better segmentation and better context matches, and the markup does not clutter the sentences to be translated. In the finished translation, the angle brackets need to be converted into entities again to preserve the original structure.
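As a rough sketch of this round trip, the standard-library helpers xml.sax.saxutils.unescape and escape can do the conversion in each direction. The sample strings, including the translation, are made up for illustration, and a real CAT tool filter would normally handle this step itself.

```python
from xml.sax.saxutils import escape, unescape

# Field content as it is stored in the XML file (sample data).
stored = "Read the &lt;b&gt;manual&lt;/b&gt; first"

# Before translation: turn the entities back into real angle brackets
# so the CAT tool can treat <b>...</b> as inline tags.
for_cat_tool = unescape(stored)
print(for_cat_tool)
# Read the <b>manual</b> first

# Hypothetical translation returned by the translator.
translated = "Lisez d'abord le <b>manuel</b>"

# After translation: escape again so the XML keeps its original structure.
restored = escape(translated)
print(restored)
# Lisez d'abord le &lt;b&gt;manuel&lt;/b&gt;
```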

Before the translated segments are imported, say into a CMS, they need to be processed once again to convert the HTML entities into the corresponding characters, so that the browser interprets the tags correctly. In general, the number and names of the encoded tags in the original document match those in the translated document. However, we have had to deal with a situation in which tags were introduced in fields that originally contained none: “XIX century” is translated into French as “XIXe siècle”, with the “e” in superscript. In HTML this is marked with the “sup” tag: “XIX<sup>e</sup> siècle”. The CMS on the client side was not prepared to convert entities in these fields, so the encoded tags showed up in the displayed text. It was therefore necessary to process the XML files further, removing occurrences of the “sup” tag from the fields that did not support entity conversion.
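A clean-up of this kind might look like the following Python sketch. It is not the script used in the project: the field names "title" and "body", the sample document, and the idea of listing the affected fields in a set are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical names of the fields that the CMS displays without
# converting entities back into tags.
PLAIN_TEXT_FIELDS = {"title"}

# Matches an opening or closing "sup" tag, with or without attributes.
SUP_TAG = re.compile(r"</?sup\b[^>]*>", re.IGNORECASE)

def strip_sup(root):
    """Remove sup tags from the plain-text fields; return how many fields changed."""
    changed = 0
    for elem in root.iter():
        if elem.tag in PLAIN_TEXT_FIELDS and elem.text:
            cleaned, count = SUP_TAG.subn("", elem.text)
            if count:
                elem.text = cleaned
                changed += 1
    return changed

# Sample translated file: the same phrase in a plain-text field and a rich-text field.
translated = (
    "<item>"
    "<title>XIX&lt;sup&gt;e&lt;/sup&gt; siècle</title>"
    "<body>&lt;p&gt;XIX&lt;sup&gt;e&lt;/sup&gt; siècle&lt;/p&gt;</body>"
    "</item>"
)

root = ET.fromstring(translated)
fixed = strip_sup(root)
result = ET.tostring(root, encoding="unicode")

print(fixed)                    # 1
print(root.find("title").text)  # XIXe siècle
```

Only the listed fields are touched, so the encoded "sup" tag in the "body" field survives and is still rendered as a superscript wherever the CMS does decode entities.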

This anecdote illustrates that great power and flexibility also bring pitfalls. In translation projects involving XML, it is not enough to validate the files; it is also necessary to look at the final product and adjust the fine details accordingly.