AUTHOR: Alexander E. Patrakov

DATE: 2007-09-30

LICENSE: Public domain

SYNOPSIS: Localized Manual Pages

DESCRIPTION:
The steps needed to view manual pages in languages other than English are
described in more detail than in the LFS book.

This hint describes outdated programs and no longer applies to LFS. LFS-6.2
uses Man-DB, precisely because (as this hint shows) Man is too difficult to
configure correctly. However, it may still be useful for people who want to
understand why LFS switched to Man-DB, or for CLFS users.

PREREQUISITES:
LFS 6.0 or later, or any LFS with locale and terminal properly configured

TODO:
Japanese setup for Ghostscript (help needed)

HINT:

1. INTRODUCTION

Giving users the ability to read manuals in their native language is,
without doubt, an important step in making the computer more user-friendly.
However, manual pages are not always readable out of the box. Possible
causes of their unreadability and ways to eliminate them are discussed
below.

The author intends this hint to be understandable by native English
speakers, so that they can use this information e.g. when building a live
CD. It is, however, probable that the author (a citizen of Russia) made some
assumptions about the reader's knowledge that are true only for people whose
native language is not English. Such assumptions are bugs; please report
them, and send other suggestions related to this hint, to
patrakov@ums.usu.ru

Disclaimer: this hint describes the current situation as it is, not as it
should be.

2. ENCODING MISMATCH: DESCRIPTION OF THE PROBLEM

The most frequently occurring internationalization-related problem is that
some character data are encoded differently from what the receiver expects.
E.g., this happens when a string in the CP1251 encoding is sent to a
terminal that expects KOI8-R. Let's investigate that process in more detail.

A misbehaving application wants to send "Cyrillic Small Letter Em".
Since this letter is encoded with the 0xec byte in CP1251, the application
sends that byte to the terminal. However, the terminal assumes KOI8-R and
interprets the incoming 0xec byte according to that encoding, in which it
means "Cyrillic Capital Letter El", and happily prints that letter. All
subsequent Cyrillic letters are mangled in a similar way. The result is not
readable by any language specialist -- it's whfg pbeehcgrq (go figure out
what this means -- a user will instead just call the devil, a deity, the
local admin or whoever is responsible for this mess). It's much worse than
seeing untranslated English messages.

The rest of this hint discusses how to avoid encoding mismatches when
viewing manual pages. It is assumed that the locale and the terminal are
already set up properly, i.e. that the "ls --help" command produces readable
output.

3. MESSAGES FROM MAN ITSELF

Sometimes man prints error messages, like this:

    No manual entry for this_program

If "+lang language_list" has been passed to the configure script during the
compilation of man, and the language indicated by the LANG variable is in
the list, man will use a translation of the message from its message
catalog. However, man uses the old "catgets" translation mechanism instead
of "gettext". This "catgets" mechanism does not do any recoding of
translated messages and therefore works only if the translator's locale is
the same as the user's locale. This assumption breaks e.g. if a user works
in a UTF-8 based locale while the translator uses a traditional 8-bit
locale. The consequence of this encoding mismatch is that the user will not
be able to read the translated error message.

Thus, if UTF-8 based locales are to be allowed, man has to be compiled with
the "+lang none" switch passed to its configure script, which disables
translated error messages altogether.
Fedora Core re-encoded the translations in UTF-8 instead, but that means
that users who revert to 8-bit locales will not be able to read error
messages from man on those systems.

This "+lang none" switch doesn't prevent the user from viewing localized
manual pages, but translated manual pages that come with man itself should
be installed by hand in this case (see sections 6 and 8 below).

4. HOW MAN FINDS MANUAL PAGES

The process is simple. First, man tries to figure out the wanted languages
based on the values of the LC_ALL, LC_MESSAGES, LANG and LANGUAGE variables.
For each locale found in those variables, it constructs its abbreviated
forms. E.g. if LANG=ru_RU.KOI8-R, man constructs the following strings:
"ru_RU.KOI8-R", "ru_RU" and "ru". To set up the ordered list of language
preferences, use the LANGUAGE variable, like this: LANGUAGE="es:it" (it says
that you prefer Spanish manual pages to Italian ones).

Each of the constructed strings is appended to the value of each MANPATH
statement in /etc/man.conf and to the directories found in the MANPATH
environment variable. The result is a list like the following one:

    /usr/share/man/ru_RU.KOI8-R
    /usr/share/man/ru_RU
    /usr/share/man/ru
    /usr/local/man/ru_RU.KOI8-R
    /usr/local/man/ru_RU
    /usr/local/man/ru
    /usr/X11R6/man/ru_RU.KOI8-R
    /usr/X11R6/man/ru_RU
    /usr/X11R6/man/ru

Finally, just the directories listed in the MANPATH statements and
environment variable are appended to the list, e.g.:

    /usr/share/man
    /usr/local/man
    /usr/X11R6/man

The manual page is searched in man1--man9 and mann subdirectories of
directories in the above list. The first directory where it is found wins.
E.g., the Italian manual page of "cp" lives in /usr/share/man/it/man1/cp.1

Thus,
1) localized manual pages have priority over English ones;
2) the language of the wanted manual pages is determined from locale
   variables.

5. WHEN EVERYTHING WORKS BY DEFAULT

Man does not convert character data itself, it just constructs a pipeline of
commands.
Let's start from the end of this pipeline. At the end, there is a pager,
specified by the PAGER environment variable or the PAGER statement in
/etc/man.conf, typically /bin/less -isR. This pager doesn't convert
characters before sending them to a terminal. Therefore, character data sent
to the pager must be in the encoding that the terminal expects, i.e. the
encoding specified by the current locale.

One step up the pipeline is the formatter. It is determined from the NROFF
statement in /etc/man.conf in all locales except Japanese. In Japanese
locales, JNROFF is used. By default, NROFF is /usr/bin/nroff -Tlatin1
-mandoc. The JNROFF line mentions the -Tnippon switch for nroff, but that
switch is supported only by patched versions of groff. Therefore, Japanese
manual pages cannot be viewed by default in LFS or BLFS.

As mentioned above, man calls /usr/bin/nroff -Tlatin1 -mandoc for formatting
non-Japanese manual pages. Unmodified groff by default expects its input to
be in the Latin-1 (aka ISO-8859-1) encoding. The "latin1" groff device
produces output in the Latin-1 encoding if the input is valid groff input.
So the pipeline in the default setup works without encoding mismatch if the
manual page is in ISO-8859-1 and the terminal accepts ISO-8859-1 (therefore,
the locale must be ISO-8859-1 based).

Starting with groff-1.19, it is possible to specify a different input
encoding (ISO-8859-2 or ISO-8859-15). Details are available in the groff
info page:

    info groff concept "input encoding"

As explained there, it works well only with the "utf8" output device
(described below). There is no clean way to make groff expect its input in
ISO-8859-2 and produce ISO-8859-2 output (in other words, there is no
"latin2" device). That's why this facility is not used by Linux distros for
Central European countries, where ISO-8859-2 is the preferred encoding. So,
from now on, we will ignore this feature and assume that groff input must be
in ISO-8859-1.
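
To see which formatter and pager a given setup will chain together, the two
relevant statements can be pulled out of the configuration file. What follows
is only a sketch: the function name show_pipeline is hypothetical, it assumes
a man.conf-style file in which NROFF and PAGER are separated from their values
by blanks or tabs, and it ignores any other preprocessors that man may insert
into the real pipeline.

```shell
# Print a simplified "formatter | pager" pipeline from a man.conf-style file.
# $1: configuration file (defaults to /etc/man.conf).
# A non-empty PAGER environment variable takes precedence over the file.
show_pipeline() {
    conf=${1:-/etc/man.conf}
    # First NROFF statement, with the keyword and following blanks removed.
    # The pattern is anchored, so JNROFF lines are not matched.
    nroff=$(sed -n 's/^NROFF[[:space:]]\{1,\}//p' "$conf" | head -n 1)
    # Use the environment's pager if there is one, else the PAGER statement.
    pager=${PAGER-}
    if [ -z "$pager" ]; then
        pager=$(sed -n 's/^PAGER[[:space:]]\{1,\}//p' "$conf" | head -n 1)
    fi
    printf '%s | %s\n' "$nroff" "$pager"
}
```

Running "show_pipeline /etc/man.conf" before and after editing the NROFF line
is a quick way to confirm that the intended -T switch (or its absence) is
what man will actually use.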
Besides "latin1", groff knows the terminal devices "ascii" and "utf8".

The "ascii" device accepts input in ISO-8859-1 and uses ASCII approximations
for output. Since ASCII texts can be displayed on any terminal, the only
requirement for this device to work properly is that the manual page is in
ISO-8859-1. Accented characters lose their accents with this device.

The "utf8" device in the unmodified versions of groff accepts input in
ISO-8859-1 and produces valid UTF-8 output. Therefore it works properly in
UTF-8 locales if the manual page is ISO-8859-1 encoded. There is nothing
wrong with having text files (manual pages) on disk in an encoding different
from that of the current locale, as long as there is a program (in this
case, man) that can properly display them.

Thus, if one prepares an ISO-8859-1 encoded manual page, it can be viewed
properly (modulo possibly missing accents) in any locale if the correct
device parameter is passed to groff.

There is a way to automatically pass the correct device parameter to groff
based on the current locale: just don't pass the -T... argument to
/usr/bin/nroff in /etc/man.conf. Look at the /usr/bin/nroff shell script to
find out why and how this works.

So here is the working method of making sure that all installed localized
manual pages can be viewed (and printed via the "man -t the_page | lpr"
command) without any configuration beyond setting up the locale.

1) In /etc/man.conf, edit the NROFF and JNROFF lines to become:

       NROFF   /usr/bin/nroff -mandoc
       JNROFF  /usr/bin/nroff -mandoc

2) Make sure that all installed manual pages are in the ISO-8859-1 encoding.
   This means that one has to remove all manual pages for languages that
   cannot be represented using the ISO-8859-1 encoding, and make sure that
   they don't reappear. In locales corresponding to such languages, English
   manual pages and the "ascii" device will be used automatically in this
   setup. Thus, this setup is unfriendly to users of such locales.
3) Disable creation of cat pages by removing /usr/*/man/*/cat* directories
   if they exist. This is necessary so that a cat page created for an
   ISO-8859-1 based locale does not get reused later in UTF-8 based locales,
   thus creating encoding mismatch problems.

The list of codes for languages that use the ISO-8859-1 encoding:

    "da", Danish
    "de", German
    "en", English
    "es", Spanish
    "fi", Finnish
    "fr", French
    "ga", Irish
    "gl", Galician
    "id", Indonesian
    "is", Icelandic
    "it", Italian
    "nl", Dutch
    "no", Norwegian
    "pt", Portuguese
    "sv", Swedish

(the list is possibly incomplete)

As explained in (2), manual pages for other languages have to be removed in
order for this simple setup to work.

6. INSTALLATION OF LOCALIZED ISO-8859-1 ENCODED MANUAL PAGES

Packages with localized manual pages are usually called manpages-ll or
man-pages-ll, where "ll" is a two-letter language code, and can be found via
Google. Some of them come with an English or translated README or INSTALL
file. If it doesn't exist, or if you can't read it, see the instructions
below. Some programs (e.g. Midnight Commander) also come with translated
manual pages.

As explained above, care should be taken to ensure that installed manual
pages are in the ISO-8859-1 encoding. This may not be the case in the
original tarballs with the manual pages, because some such tarballs are for
Fedora Core systems only (Fedora Core uses a patched version of groff that
accepts UTF-8 as the input encoding).

People who know the language the manual page is in can tell whether it is in
ISO-8859-1 by opening it in a text editor and checking whether the text is
readable. Those who don't know the language but still want to install that
manual page (e.g. distro-builders or live CD makers) can look at the
changelog or README file or use the following method.
First, check if the page uses only ASCII characters:

    iconv -f us-ascii -t us-ascii /path/to/manual/page >/dev/null

If it doesn't, a warning is printed:

    iconv: illegal input sequence at position XXX

ASCII-only manual pages are also valid ISO-8859-1 pages, and thus good.

Then, if the page is not pure ASCII, check if it is in UTF-8:

    iconv -f UTF-8 -t UTF-8 /path/to/manual/page >/dev/null

If the same warning is printed, the page is not in UTF-8, which means that
it is in some 8-bit encoding. For the languages in the list at the end of
the previous section, this 8-bit encoding is ISO-8859-1 (i.e. the page is
good). There may be a few pages that are incorrectly identified as UTF-8 by
this method, so look at the majority of the pages in the package.

If a manual page is in UTF-8, it has to be converted to ISO-8859-1 before
installation:

    iconv -f UTF-8 -t ISO-8859-1 /path/to/manual/page >file.tmp
    mv file.tmp /path/to/manual/page

If the first command fails, try adding the "-c" switch, which drops
characters that can't be converted.

After converting all manual pages in the package to ISO-8859-1, it is safe
to copy them to their final destination.

7. HACKS

The simple setup explained above works, but is unfriendly to people who
can't use ISO-8859-1. Its use is recommended only when configuration steps
other than setting the locale are to be avoided at all costs (e.g. on a live
CD).

The official position of the author of groff is as follows: source encodings
other than ISO-8859-1 will not be supported well by the official groff 1.x
package because groff is a text typesetting system, not just a manual page
formatter. Adding such support would mean that a new formatting algorithm is
needed, since in some languages (e.g. Japanese) spaces are not used for word
separation, and lines can be broken almost anywhere. There are also
different problems with Indic scripts. So groff 2.0 with the promised
Unicode support is probably in some rather distant future.
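
As an aside, the check-and-convert procedure from section 6 can be collected
into two small shell functions, which is convenient when a whole package of
pages has to be examined. This is a sketch, not a tested installer: the names
page_encoding and to_latin1 are hypothetical, and it assumes an iconv that
rejects invalid input when converting an encoding to itself, as the commands
in section 6 do.

```shell
# Classify a manual page with the two iconv checks from section 6.
# Prints "ascii", "utf8" or "8bit".
page_encoding() {
    if iconv -f US-ASCII -t US-ASCII "$1" >/dev/null 2>&1; then
        echo ascii
    elif iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo utf8
    else
        # Neither pure ASCII nor valid UTF-8: some 8-bit encoding.
        echo 8bit
    fi
}

# Convert a page to ISO-8859-1 in place, but only if it was detected as UTF-8.
# Pages that are already ASCII or 8-bit are left untouched.
to_latin1() {
    [ "$(page_encoding "$1")" = utf8 ] || return 0
    if iconv -f UTF-8 -t ISO-8859-1 "$1" >"$1.tmp"; then
        mv "$1.tmp" "$1"
    else
        # Conversion failed (e.g. a character has no ISO-8859-1 equivalent).
        rm -f "$1.tmp"
        return 1
    fi
}
```

A loop such as "for p in man?/*; do echo "$p: $(page_encoding "$p")"; done",
run in the unpacked package, shows at a glance which encoding the majority of
pages use. Compressed pages would have to be uncompressed first.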
So people who can't write manual pages for their language in the ISO-8859-1
encoding, or are worried about the loss of accents with the "ascii" device,
have to use various hacks for now.

The first hack described here is for people who use ISO-8859-15 based
locales and are unhappy with groff losing accents with the "ascii" device.
Using the "latin1" device results in encoding mismatch: a few bytes mean
different characters in ISO-8859-1 and ISO-8859-15. Fortunately, they are
not letters in ISO-8859-1, so the difference can be either ignored or taken
into account. To ignore the difference (and get e.g. Latin Capital Letter Z
With Caron instead of the Acute Accent, or the Latin Small Ligature OE
instead of Vulgar Fraction One Half), pass the "-Tlatin1" switch to nroff in
/etc/man.conf. To account for this difference and replace the affected
characters with their approximate ASCII equivalents, save the following sed
scriptlet as /etc/groff/lat1-to-lat9.sed:

    s@\xa4@x@g
    s@\xa6@|@g
    s@\xb4@'@g
    s@\xb8@,@g
    s@\xbd@1/2@g
    s@\xbc@1/4@g
    s@\xbe@3/4@g

and use the following NROFF line:

    NROFF /usr/bin/nroff -Tlatin1 -mandoc | sed -f /etc/groff/lat1-to-lat9.sed

Of course, with this hack, manual pages will display correctly only in
locales based on the ISO-8859-1 and ISO-8859-15 character sets, as opposed
to all locales in the method without hacks.

FIXME: it seems that it is possible to do the same by modifying files in
/usr/share/groff/1.19.1/font/devlatin1

The second hack is for people who speak languages for which ISO-8859-1 does
not contain all needed characters (e.g. Russian). Manual pages written in
such languages are in language-specific 8-bit encodings. There are also
Fedora-specific packages which use UTF-8 for the encoding of manual pages.
Neither case constitutes valid groff input.

Around the year 2000, such manual pages in 8-bit encodings were processed by
groff using the "latin1" device.
This worked (and still works) because the two instances of encoding mismatch
happening in this case almost cancel each other out.

Details:

Assume that the manual page is encoded in the Russian KOI8-R encoding, and
that the locale is ru_RU.KOI8-R. The manual page author writes the Cyrillic
Capital Letter A using a text editor. Since the editor saves the file in the
KOI8-R encoding, the 0xe1 byte gets written to the file instead of that
letter. When this file is passed to groff, it reads that byte and (wrongly)
interprets it as Latin Small Letter A With Acute (this letter is represented
with the 0xe1 byte in ISO-8859-1). Then it prints this letter to standard
output, assuming (wrongly) that its output is ISO-8859-1. That results in
the 0xe1 byte. The pager copies this byte to the terminal and (since the
terminal accepts KOI8-R) the Cyrillic Capital Letter A appears, as the
author of the manual page intended.

As you can see, this hack depends upon the following facts:

1) The source encoding of the manual page and the locale encoding are the
   same.
2) The formatting rules for this language and encoding are the same as for
   ISO-8859-1 based languages, i.e.: one byte represents one character and
   it occupies one cell; words are separated by spaces.

It works (possibly with ignorable problems as in the ISO-8859-15 case above)
in all 8-bit locales.

This setup also has the following drawbacks:

1) One cannot print manual pages with the "man -t manual_page | lpr"
   command -- the page will be full of ISO-8859-1 characters and thus
   unreadable.
2) Bullets are wrong since the 0xb7 byte that denotes a bullet in ISO-8859-1
   means some other character in other encodings. Other mismatches are
   usually OK because authors know about this effect and try to avoid
   ISO-8859-1 specific characters in groff output.
3) This method does not work for Japanese because the formatting rules are
   different.
4) This method breaks when one tries to switch from an 8-bit locale to a
   UTF-8 based one, because the source encoding of the manual page and the
   locale encoding are no longer the same.

Problem 1 is unsolvable until groff-2.0 comes out. Problems 2 and 3 don't
exist in patched versions of groff-1.18.1.1 (see section 9). Problems 2 and
4 can also be solved by modifying the NROFF line in /etc/man.conf.

Here is a sed script that replaces ISO-8859-1 specific non-letter characters
with their approximate ASCII equivalents, thus partially solving Problem 2.

    s@\xa0@ @g
    s@\xa1@i@g
    s@\xa2@c@g
    s@\xa3@L@g
    s@\xa4@x@g
    s@\xa5@Y@g
    s@\xa6@|@g
    s@\xa7@S@g
    s@\xa8@"@g
    s@\xa9@(C)@g
    s@\xaa@a@g
    s@\xab@<<@g
    s@\xac@~@g
    s@\xad@-@g
    s@\xae@(R)@g
    s@\xaf@-@g
    s@\xb0@o@g
    s@\xb1@+-@g
    s@\xb2@2@g
    s@\xb3@3@g
    s@\xb4@'@g
    s@\xb5@mu@g
    s@\xb6@9|@g
    s@\xb7@o@g
    s@\xb8@,@g
    s@\xb9@1@g
    s@\xba@o@g
    s@\xbb@>>@g
    s@\xbc@1/4@g
    s@\xbd@1/2@g
    s@\xbe@3/4@g
    s@\xbf@c@g

Save it as /etc/groff/remove-iso-chars.sed and edit your /etc/man.conf so
that it contains the line:

    NROFF /usr/bin/nroff -Tlatin1 -mandoc | sed -f /etc/groff/remove-iso-chars.sed

Warning: this script assumes that no letters or other useful characters are
in the 0xa0--0xbf range in the 8-bit encoding. It replaces not only
characters (e.g. bullets) "generated" by groff (that should be replaced),
but also characters passed through by groff from the original manual page
(that should not be replaced). Thus, it is potentially harmful, i.e. just
ignoring Problem 2 may be better in some cases. A solution based on
Debian-patched groff-1.18.1.1 is preferred, because that version of groff
does not "generate" ISO-8859-1 specific characters when the "ascii8" device
is used. The sed script is not needed then.

To solve Problem 4, one has to convert groff output to the locale encoding.
This is achieved by use of this one long NROFF line:

    NROFF /usr/bin/nroff -Tlatin1 -mandoc | sed -f /etc/groff/remove-iso-chars.sed | iconv -c -f 8_BIT_ENCODING

(use of the Debian-patched groff-1.18.1.1 and the "ascii8" device is
preferred over "latin1" + optional sed).

Replace 8_BIT_ENCODING above with the name of the encoding in which manual
pages in your language are stored on disk. This line works for one (your)
language only, but for both 8-bit and UTF-8 locales (in your 8-bit locale
the iconv conversion is a no-op). Alternatively, you can put the iconv
invocation in the PAGER statement, as described in the UTF-8 hint as of
2004-02-25.

Although the /usr/bin/nroff script doesn't support this hack, there is a
program that does: man-db, to be used instead of man. This is the default
manual page viewer on Debian systems. It can be downloaded from:

    http://ftp.debian.org/debian/pool/main/m/man-db/

For the hack to work, you need the man-db_2.4.2.orig.tar.gz tarball, the
latest man-db_2.4.2-XX.diff.gz patch, and Debian-patched groff-1.18.1.1.

8. INSTALLATION OF NON-ISO-8859-1 MANUAL PAGES

The instructions are mostly the same as in Section 6. One has to ensure that
manual pages are in the 8-bit encoding proper for their language (i.e. not
in UTF-8) and copy them to their destination directory. The method for
identifying UTF-8 encoded pages described there still works. The difference
is the action to be taken when a tarball of UTF-8 manual pages is found.
Instead of conversion to ISO-8859-1, one should convert to the encoding
specified in the table below.

    "cs" (Czech):      ISO-8859-2
    "hr" (Croatian):   ISO-8859-2
    "hu" (Hungarian):  ISO-8859-2
    "ja" (Japanese):   EUC-JP
    "ko" (Korean):     EUC-KR
    "pl" (Polish):     ISO-8859-2
    "ru" (Russian):    KOI8-R
    "sk" (Slovak):     ISO-8859-2
    "tr" (Turkish):    ISO-8859-9

(table taken from the source of the man-db program)

For Japanese and Korean, Debian-patched groff-1.18.1.1 is needed because of
the "nippon" output device.

9. AVAILABLE GROFF PATCHES

As explained above, unmodified groff supports only the ISO-8859-1 source
encoding well. Several patches that circumvent this limitation exist in
Linux distributions.

The first one is the Debian patch, available (for groff-1.18.1.1) from:

    http://ftp.debian.org/debian/pool/main/g/groff/

The exact URL is not given because the patch version
(groff_1.18.1.1-7.diff.gz at the time of this writing) changes frequently
and old versions are deleted (but can still be downloaded from
http://snapshot.debian.net/ )

To compile Debian-patched groff-1.18.1.1:

    zcat ../groff_1.18.1.1-7.diff.gz | patch -Np1
    ./configure --prefix=/usr --enable-multibyte
    make -k
    make -k install

The "-k" switch is needed because, if you have enough programs installed to
build the PostScript documentation, the build will fail on some versions of
glibc because of double-free detection. This switch allows the compilation
to proceed even though the documentation failed to build.

This patch adds the "ascii8" and (if the "--enable-multibyte" switch has
been passed to the "configure" script) "nippon" devices.

The "ascii8" device passes 8-bit characters through without modifications
(as "latin1" does) and never produces non-ASCII characters (e.g. for
bullets) by itself. Thus, it is usable by itself if and only if the source
encoding of the manual page and the encoding expected by the terminal are
the same 8-bit encoding. Essentially, it is a better version of the second
hack described in section 7.

The "nippon" device is Japanese-specific. When the current LC_CTYPE locale
is a Japanese one (i.e. begins with "ja"), this device accepts input in the
locale encoding, and produces output in the same encoding as the input.
Japanese formatting rules are respected.

The "utf8" device is also modified: in Japanese locales, it accepts input in
the locale encoding and produces UTF-8 output.
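
After installing a patched groff, it can be useful to confirm that the extra
devices really were installed before pointing /etc/man.conf at them. The
sketch below relies on the layout visible elsewhere in this hint, where a
device NAME lives in a font/devNAME directory under a versioned groff
directory. The function name have_groff_device is hypothetical, and the
default root /usr/share/groff is an assumption based on the
--prefix=/usr used above.

```shell
# Succeed if some installed groff version provides the output device $1.
# $2: root of the groff data directories (defaults to /usr/share/groff).
have_groff_device() {
    dev=$1
    root=${2:-/usr/share/groff}
    # Each groff version keeps its devices in <root>/<version>/font/dev<name>.
    for d in "$root"/*/font/dev"$dev"; do
        [ -d "$d" ] && return 0
    done
    return 1
}
```

For example, "have_groff_device ascii8 || echo 'ascii8 device is missing'"
prints the message when no installed groff version provides the "ascii8"
device.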
For groff-1.19.1, a similar patch is available from:

    http://developer.momonga-linux.org/viewcvs/*checkout*/trunk/pkgs/groff/groff-1.19.1-japanese.patch

To compile groff with this patch:

    patch -Np1 -i ../groff-1.19.1-japanese.patch
    autoheader
    autoconf
    ./configure --prefix=/usr --enable-japanese
    make
    make install
    rm -rf /usr/share/groff/1.19.1/font/devascii8

The /usr/share/groff/1.19.1/font/devascii8 directory is removed after
installation in order to disable the "ascii8" device, which doesn't work
correctly in this version of groff.

Note that for Japanese to show up properly, the PAGER line in /etc/man.conf
should be changed to "less -isr", and for printing Japanese manuals,
standard Japanese fonts must be available. With ghostscript, it seems that
WadaLab fonts may be used. They are available from:

    ftp://mirror.cs.wisc.edu/pub/mirrors/ghost/3rdparty/fonts/kanji/Font

Unfortunately, I don't know the exact installation instructions. If you know
them, please mail them to patrakov@ums.usu.ru

RedHat also offers a patched groff-1.18.1.1, available in source form as an
SRPM only. Get it from:

    http://download.fedora.redhat.com/pub/fedora/linux/core/development/SRPMS/

The exact URL with a version number is not given for the same reasons as
above.

Their patch is based on the Debian one. The difference is that it always (as
opposed to in Japanese locales only) accepts input in the locale encoding
(i.e. UTF-8, the only supported encoding in Fedora Core) and that the only
device that doesn't spit out tons of "can't find character" errors is
"utf8", which produces UTF-8 encoded output. Since this device accepts UTF-8
input in this version of groff, all manual pages in Fedora Core are in
UTF-8.

I recommend avoiding this version of groff because of its bugs. It randomly
inserts spaces in the middle of Russian words or glues them together, and
breaks lines in unpredictable (and usually wrong) places.
Those who want to build it anyway should read the instructions in the
groff.spec file inside the SRPM. In order to get a successful build, one
should either downgrade texinfo to version 4.6 from 4.7 or pass the "-k"
flag to the "make" and "make install" commands to continue the compilation
even though the texinfo documentation fails to build.

10. CAVEAT

Translated manual pages usually lag behind their English equivalents, so be
careful while reading them.

CHANGELOG:
[2004-05-09]
  * Initial hint.
[2007-09-30]
  * Updated the description.