Character encoding in HTML

Text-based formats

For historical reasons, the letters of the English alphabet and the most common punctuation marks are encoded in the same way on virtually all electronic devices. This encoding is called ASCII (American Standard Code for Information Interchange). However, as soon as we step outside this narrow character set, problems await the unwary.

Any letter that is not part of the English alphabet has to be represented in some encoding that extends ASCII. After years of competing attempts, the consensus that has emerged is to use UTF-8 (8-bit Unicode Transformation Format) as the standard encoding for these characters.
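As a small illustration of how UTF-8 extends ASCII, the following Python sketch encodes one English letter and one accented letter. Python is only a choice made for this example, and the variable names are hypothetical.

```python
# An ASCII character keeps its single ASCII byte when encoded as UTF-8.
ascii_bytes = "A".encode("utf-8")
print(ascii_bytes)        # b'A'

# A letter outside ASCII (here U+00E9, e with acute accent) takes two bytes.
accented_bytes = "\u00e9".encode("utf-8")
print(accented_bytes)     # b'\xc3\xa9'

# Plain ASCII has no code for that letter at all.
try:
    "\u00e9".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
print(ascii_can_encode)   # False
```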

All HTML tags and scripts in a web page are ASCII characters. And if the content of the page (the actual text that ends up being displayed) is plain English, everything should be fine. However, if the content of the page has non-ASCII characters, the web browser needs to be told what encoding is being used for these extra characters.
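To see what can go wrong when the browser is not told the right encoding, the Python sketch below decodes the same bytes twice: once as UTF-8 and once as ISO-8859-1 (Latin-1). The sample word and the variable names are assumptions made for illustration only.

```python
# The word "mañana", written with an escape so the example does not
# depend on the encoding of this file itself.
text = "ma\u00f1ana"

# The bytes as they would be stored in a UTF-8 encoded page.
raw = text.encode("utf-8")

# Decoded with the right encoding, the text comes back intact.
right = raw.decode("utf-8")

# Decoded with the wrong encoding, each byte of the two-byte letter
# is shown as a separate, unrelated character.
wrong = raw.decode("latin-1")

print(right == text)   # True
print(len(wrong))      # 7 (one character more than the 6 in the original)
```

The ASCII letters survive the wrong guess unchanged; only the non-ASCII letter is damaged, which is why the problem goes unnoticed on plain English pages.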

Web pages have a feature that distinguishes them from most other text-based files. Near the very beginning of the file, there is a tag that tells the browser which encoding is being used. If you look at the source of a page, you will see a tag that looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
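For checking many pages, the declared encoding can also be read programmatically. The Python sketch below is a minimal example under stated assumptions: the helper name declared_charset is hypothetical, the declaration is assumed to sit within the first 1024 bytes of the file, and a simple regular expression stands in for a real HTML parser.

```python
import re

# Matches charset=VALUE, with or without quotes around the value.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def declared_charset(html_bytes):
    """Return the charset named in the page, lowercased, or None if absent."""
    # Only the start of the file is searched, where the tag is expected.
    match = CHARSET_RE.search(html_bytes[:1024])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8" /></head></html>')
print(declared_charset(page))   # utf-8
```

Working on raw bytes rather than decoded text is deliberate: the declaration itself is plain ASCII, so it can be read before the encoding of the rest of the page is known.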

When translating web pages from English, it is not unusual to find that the encoding specified in the meta tag is not UTF-8, since for English pages, as mentioned earlier, the declared encoding makes little practical difference.

But when the target language uses an extended character set (and this happens practically all the time), it is important to check carefully that the actual encoding of the page and the one declared in the meta tag match. In case of a mismatch, there are two possible routes. The first is to change the charset=UTF-8 part of the tag to the actual encoding of the page. The other is to re-encode the page into the declared character set.
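Both routes can be sketched in a few lines of Python. This is an illustration under assumptions, not a complete tool: the function names are hypothetical, the sample page is assumed to be stored in ISO-8859-1 while declaring UTF-8, and the check only works in one direction, because a single-byte encoding such as ISO-8859-1 will accept any sequence of bytes without error.

```python
def matches_declared(html_bytes, declared):
    """Return True if the bytes decode without error in the declared encoding."""
    try:
        html_bytes.decode(declared)
        return True
    except UnicodeDecodeError:
        return False

def recode(html_bytes, actual, target="utf-8"):
    """Re-encode a page from its actual encoding into the target encoding."""
    return html_bytes.decode(actual).encode(target)

# A page that declares UTF-8 but is actually stored in ISO-8859-1:
# the byte 0xF1 is "n with tilde" in ISO-8859-1 and invalid here as UTF-8.
page = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8" /><p>ma\xf1ana</p>')
print(matches_declared(page, "utf-8"))      # False

# Route 1: leave the bytes alone and correct the declaration.
route1 = page.replace(b"charset=UTF-8", b"charset=ISO-8859-1")

# Route 2: leave the declaration alone and re-encode the bytes.
route2 = recode(page, "latin-1")
print(matches_declared(route2, "utf-8"))    # True
```

Route 2 is usually the more durable choice when the page will be translated into several languages, since UTF-8 can represent all of them.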

Remember that these encoding issues do not arise only when translating web pages. They have to be taken into account whenever the target language is not English. Only the target language is mentioned because a source text in another language should already be in a suitable encoding, and the target can then be encoded in the same way if needed.