cvs commit: hints utf-8.txt
tushar at linuxfromscratch.org
tushar at linuxfromscratch.org
Wed Mar 3 23:34:33 MST 2004
tushar 04/03/03 23:34:33
Modified: . utf-8.txt
Log:
Modified: utf-8.txt
Revision Changes Path
1.2 +412 -80 hints/utf-8.txt
Index: utf-8.txt
===================================================================
RCS file: /home/cvsroot/hints/utf-8.txt,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -u -r1.1 -r1.2
--- utf-8.txt 8 Nov 2003 06:34:28 -0000 1.1
+++ utf-8.txt 4 Mar 2004 06:34:33 -0000 1.2
@@ -1,19 +1,42 @@
AUTHOR: Alexander E. Patrakov <semzx at newmail.ru>
DATE: 2003-11-06
LICENSE: Public Domain
-SYNOPSIS: Using UTF-8 locales in LFS
+SYNOPSIS: Using UTF-8 locales in [B]LFS
DESCRIPTION:
-This hint explains what should be changed in the LFS instructions curent at the
-time of this writing in order to use locales such as ru_RU.UTF-8. Future
-versions (if any) will also deal with BLFS.
+This hint explains what should be changed in the LFS and BLFS instructions
+curent at the time of this writing in order to use locales such as ru_RU.UTF-8.
-PREREQUISITES: LFS 4.1 or later (because of bash-2.05b)
+PREREQUISITES: LFS 5.1-pre1 or later, good knowledge of C
+CONFLICTS: compressed manual pages
CHANGELOG:
2003-11-06: Initial submission
+2004-02-25: Added some BLFS packages
HINT:
+IMPORTANT INFORMATION
+
+Don't follow this hint unless you are prepared to fix broken things! I never
+had a full BLFS install, and of course because of that some packages that
+are broken in UTF-8 locales may well be missing from this hint.
+
+Also, please don't ask support questions related to this hint on mailing lists
+hosted on linuxfromscratch.org (and of course don't provide support yourself)
+if you can't answer the questions at the end of the hint.
+
+Also note that while our goal is to move to the international UTF-8 encoding,
+we have to disable internationalization completely in some older applications.
+So this hint really becomes an antihint: we gain nothing except compatibility
+with bleeding-edge RedHat-like distros in their default configuration, and
+lost... lost what we were aiming to get --- internationalization.
+
+Once again, don't follow this hint blindly.
+
+BIG WARNING: you will probably have to convert ALL your documents.
+
+Part 1. INTRODUCTION
+
1. Single-byte and double-byte encodings and UTF-8: what's wrong
Most Eropean languages have a relatively short alphabet (less than 40
@@ -61,86 +84,89 @@
multilingual texts. Such character set exists and it is named Unicode.
UTF-8 is a method of representing Unicode text with a stream of
-8-bit bytes. The resulting stream is both ASCII-compatible
+8-bit bytes. The resulting stream is both ASCII-compatible and
reverse-ASCII-compatible. A single character can occypy from 1 to 4 bytes. Many
current distributions of Linux configure locales using the UTF-8 character
encoding by default. This doesn't work with (B)LFS for the same reasons as with
double-byte encodings. However,
1) There is no framebuffer-based terminal that is capable of displaying the
-full range of Unicode characters. Fortunately, it is not needed in most cases.
-Linux console is capable of displaying Latin (including accented), Greek,
-Arabian and Cyrillic characters together even without framebuffer. Also, xterm
-compiled separately from XFree86 distribution works.
+full range of Unicode characters (if one doesn't count Debian-specific bterm
+from the "bogl" package, bogl = Ben's Own Graphics Library).
+Fortunately, it is not needed in most cases. Linux console is capable of
+displaying Latin (including accented), Greek, Arabian and Cyrillic
+characters together even without framebuffer. Also, xterm works just fine.
2) There is one more assumption that breaks with UTF-8. The relation of
on-screen width of a string to the number of bytes in it is very complex.
That's why e.g. Midnight Commander works with double-byte encodings, but
doesn't work with UTF-8.
-2. Suggested changes to the LFS book
+3) Many packages in UTF-8 locale fail to provide compatibility with older
+doculents saved in traditional single-byte or double-byte encoding.
+
+Part 1. LFS PACKAGES
+
+1. Suggested changes to the installation instructions
The following packages should be configured differently in Chapter 6:
-- ncurses (optionally)
+- ncurses
- vim
- man
-Modified Ncurses installation instructions (optional):
+1a. Modified Ncurses installation instructions
+
+First of all, you need NCurses 5.4. Get it from
+
+http://ftp.gnu.org/gnu/ncurses/ncurses-5.4.tar.gz
-The Ncurses has experimental support for wide characters. According to the
-output of ./configure --help, it is activated by passing --enable-widec
-argument to ./configure. I don't know what this support means: Vim works fine
-even with non-wide-character version. The resulting libraries are
-binary-incompatible with "normal" ncurses and therefore a letter "w" is appended
-automatically to their names: libncursesw.so.5.3. For compatibility, we will
-install two versions of ncurses:
-First install the normal version:
+The new Ncurses version has experimental support for wide characters.
+According to the output of ./configure --help, it is activated by passing
+the --enable-widec argument to ./configure. The resulting libraries are
+binary-incompatible with "normal" ncurses and therefore a letter "w" is
+appended automatically to their names: libncursesw.so.5.4. For compatibility
+with precompiled commercial applications, we will install two versions of
+ncurses.
+
+Now we are ready to install the non-wide-character version of ncurses, almost
+by the book:
-patch -Np1 -i ../ncurses-5.3-etip-2.patch
-patch -Np1 -i ../ncurses-5.3-vsscanf.patch
./configure --prefix=/usr --with-shared --without-debug
make
make install
-This installs /usr/lib/libncurses.so.5.3. We will move it to /lib later.
-Then (optionally) install a wide-character-enabled ersion on top of it:
+This installs /usr/lib/libncurses.so.5.4. We will move it to /lib later.
+Then install a wide-character-enabled version:
make distclean
./configure --prefix=/usr --with-shared --without-debug --enable-widec
make
make install
-This installs /usr/lib/libncursesw.so.5.3.
+This installs /usr/lib/libncursesw.so.5.4 and related libraries.
Move important libraries to /lib and correct permissions:
-chmod 755 /usr/lib/*.5.3
+chmod 755 /usr/lib/*.5.4
chmod 644 /usr/lib/libncurses++*.a
mv /usr/lib/libncurses.so.5* /lib
mv /usr/lib/libncursesw.so.5* /lib
-ln -sf ../../lib/libncurses.so.5 /usr/lib/libncurses.so
+
+Make the symbolic links:
+
+ln -sf ../../lib/libncursesw.so.5 /usr/lib/libncurses.so
ln -sf libncurses.so /usr/lib/libcurses.so
ln -sf ../../lib/libncursesw.so.5 /usr/lib/libncursesw.so
ln -sf libncursesw.so /usr/lib/libcursesw.so
-The installation of wide-character-enabled version of ncurses is considered
-optional because AFAIK nothing in LFS and BLFS links now against it, although
-I have not checked carefully. Debian provides the following packages linked
-against libncursesw:
-
-- centericq-utf8: A text-mode multi-protocol instant messenger client. The
- utf-8 version may be buggy
-- latrine: LaTrine is a curses-based LAnguage TRaINEr
-- mutt-utf8: Text-based mailreader. The utf-8 version may be buggy
-- dialog: Displays user-friendly dialog boxes from shell scripts
-- screen: A terminal multiplexor with VT100/ANSI terminal emulation
-- tin: A full-screen easy to use Usenet newsreader
-
-Dialog has a ./configure option to link with -lncursesw. This results in a
-dialog executable that has bugs when used with ru_RU.koi8r locale on Linux
-console. I have not tested other packages. I don't use them.
+Note the first command. Now all applications trying to link at compile time
+against -lncurses will actually link to the wide-character version,
+/lib/libncursesw.so.5. This works, because the two libraries are
+source-compatible. At runtime, the linker will happily resolve the dependency
+upon libncursesw.so.5. And for precompiled commercial applications that
+depend on the ordinary version of ncurses there is /lib/libncurses.so.5.
-Modified Vim instructions:
+1b. Modified Vim instructions
For Vim to work correctly in double-byte encodings and in UTF-8, the
--enable-multibye switch has to be added to the ./configure command line. Note
@@ -183,7 +209,7 @@
For more information, read /usr/share/vim/vim62/doc/mbyte.txt
-Modified Man instructions:
+1c. Modified Man instructions
Since Man internationalization does not work at all in UTF-8 locales (the
messages are still output in single-byte or double-byte encodings, appearing
@@ -192,27 +218,55 @@
prevent you from viewing manual pages in your native language. It just means
that messages like "What manual page do you want?" will remain untranslated.
-patch -Np1 -i ../man-1.5m2-manpath.patch
-patch -Np1 -i ../man-1.5m2-80cols.patch
+Install the "man" package with the followiing commands:
-We skipped the "-pager" patch because we will do the same manually and
-differently. Creating a modified patch is not an option because it will be
-different for each country.
+patch -Np1 -i ../man-<version>-manpath.patch
+patch -Np1 -i ../man-<version>-80cols.patch
+patch -Np1 -i ../man-<version>-pager.patch
DEFS="-DNONLS" ./configure -default -confdir=/etc +lang all
make
make install
-After installation of man, search for the line in /etc/man.conf that starts
-with "PAGER". Replace it with something like the following:
+Now we have to decide what to do with manual pages in your native language.
+They are provided with the corresponding packages in the single-byte or
+double-byte encoding, but not in UTF-8. Therefore, they won't display properly.
+There are two solutions to this problem.
+
+The first solution is to store them in the single-byte or double-byte encoding,
+i.e. as they come with the corresponding packages, and convert them into UTF-8
+on the fly. To do this, search for the line in /etc/man.conf that starts with
+"PAGER". Replace it with something like the following:
PAGER /usr/bin/iconv -c -f koi8-r | /usr/bin/less -isR
(replace koi8-r with your 8-bit or double-byte encoding). Note that this change
does not hurt you if you later switch back to the usual encoding: iconv will
-be a no-op.
+be a no-op. Unfortunately, this doesn't work well with graphical man page
+viewers like Yelp (from GNOME-2.4) or Konqueror, since they just ignore the
+"PAGER" variable in /etc/man.conf (if they read /etc/man.conf at all) and
+assume that manual pages are stored in the character set of the current locale.
-3. Actually setting up UTF-8 based locale
+The second solution would be to convert manual pages to UTF-8. Unfortunately,
+I had no success with this. RedHat provides some patches for groff-1.18.1.
+I tried to convert all manual pages into UTF-8 and changed man.conf to have
+the line
+
+# WRONG!
+NROFF /usr/bin/iconv -c -t koi8-r | /usr/bin/nroff -Tlatin1 -mandoc | /usr/bin/iconv -c -f koi8-r
+
+This didn't work well because some manual pages contain just
+
+.so filename
+
+and don't display properly.
+
+In fact, the *roff specification says that the input must be in iso8859-1
+encoding, there is no way to typeset anything except Latin and Greek according
+to the specification, and all localized manual pages (even in the single-byte
+Cyrillic KOI8-R encoding!) are really a hack and violate the specification.
+
+2. Setting up UTF-8 based locale and environment variables
Some UTF-8 locales (e.g. se_NO.UTF-8) are installed during the
@@ -225,16 +279,65 @@
The role of the -c switch is to continue the creation of the locale even though
warnings are issued. After the creation of the locale, it is needed to tell
-applications to use it. All that is needed is to set some environment
-variables. Add this to your /etc/profile:
+applications to use it. All that is required is to set some environment
+variables. An easy "solution" is to add this to your /etc/profile:
+# WRONG!!!
export LC_ALL=ru_RU.UTF-8
+export LANG=ru_RU.UTF-8
+
+This "solution" is wrong because these variables will be available to processes
+started from your login shell, but will not be available to the readline
+library that the shell uses. The readline library uses this information e.g. to
+determine how many bytes to remove from the input buffer (must be one UTF-8
+character) and how many character cells to erase on the screen (again, one
+full character) if you press Backspace or Delete key.
+
+Yes, if you _type_ export LC_ALL=ru_RU.UTF-8 in the login shell, then it will
+pass this setting to the readline library. But this doesn't work in the shell
+startup files. This is a bug in bash. So the correct LC_ALL variable must be
+already in the environment when the login shell starts.
+
+If one adds the above LC_ALL and LANG variables into /etc/environment, it will
+work for login shells started by the "login" program, but will not work for
+shells started by "su" or "sshd" programs. This approach also requires you to
+place these variables into /etc/profile so that they will be available from
+KDE (the "startkde" script from KDE 3.2.0 sources /etc/profile).
+
+Another approach is to make the login shell set the correct locale variables
+and reexecute itself. To accompilsh this, add the following snippet at the
+very beginning of your /etc/profile:
+
+if [ "x$LC_ALL" = "x" ]
+then
+ export LC_ALL=ru_RU.UTF-8
+ export LANG=ru_RU.UTF-8
+ if ( echo $- | grep -q i )
+ then
+ exec -a "$0" /bin/bash "$@"
+ fi
+fi
+
+The $- check is there because /etc/profile is sometimes sourced by other
+scripts that run in noninteractive shells. Such shells don't need to be
+reexecuted, since you don't want to replace a script that sourced /etc/profile
+with an instance of /bin/bash called with the same parameters as the script.
+
+Of course, you will have to replace ru_RU above with something more
+appropriate.
+
+If you are using xdm, you also want to include the following lines into the
+beginning of /etc/X11/xdm/Xsession:
+
+[ -r /etc/profile ] && . /etc/profile
+[ -r $HOME/.bash_profile ] && . $HOME/.bash_profile
-Of course, you will have to replace ru_RU with something more appropriate. If
-you are using X, you will also have to include that string in
-/etc/X11/xinit/xinitrc.
+Consult the documentation of other display managers for the means to set the
+environment in the started session.
-Then, we will modify the /etc/rc.d/init.d/loadkeys script.
+3. Setting up Linux console
+
+We will modify the /etc/rc.d/init.d/loadkeys script.
#!/bin/bash
# Begin $rc_base/init.d/loadkeys - Loadkeys Script
@@ -246,30 +349,41 @@
source /etc/sysconfig/rc
source $rc_functions
-echo "Loading keymap..."
-kbd_mode -u &&
-loadkeys ru 2>/dev/null &&
-dumpkeys -ckoi8-r | loadkeys --unicode 2>/dev/null
+echo -n "Setting screen font..."
+for console in /dev/tty[1-6]
+do
+ (
+ setfont
+ setfont LatArCyrHeb-16
+ )<$console >$console 2>&1
+done
evaluate_retval
-echo "Setting screen font..."
-setfont LatArCyrHeb-16
+echo -n "Loading keymap..."
+kbd_mode -u
+loadkeys ru1 2>/dev/null &&
+dumpkeys -c koi8-r | loadkeys --unicode
evaluate_retval
+
# End $rc_base/init.d/loadkeys
-Some comments about this script.
-1) The kbd package does not provide ready-to-use keymaps for UTF-8 locales,
-except for Ukrainian one. First, we load the now-wrong ru keymap (it contains
-numbers valid only in koi8-r character set), then we dump it replacing numbers
-with human-readable descriptions of characters (e.g.
-"cyrillic_small_letter_e"). The resulting keymap is usable in UTF-8 mode, so
-we load it with loadkeys --unicode.
+Some comments concerning this script.
+
+1) The empty "setfont" command works around a bug in 2.6 kernels.
+
2) We don't switch the console output to UTF-8 here. We will do that in
/etc/issue (the idea is stolen from "redhat-style-logon" hint). This is
necessary because otherwise this switching will affect only the first console.
As an alternative, you can write a "for" loop here sending <ESC>%G to all
virtual consoles.
+3) The kbd package does not provide ready-to-use keymaps for UTF-8 locales,
+except for Ukrainian one. First, we load the now-wrong ru1 keymap (the numeric
+character codes there are valid only for koi8-r character set), then we dump
+it replacing numeric codes with human-readable descriptions of characters (e.g.
+"cyrillic_small_letter_e"). The resulting keymap is usable in UTF-8 mode, so
+we load it with loadkeys --unicode.
+
Let's create /etc/issue:
echo -e '\033[2J\033[f\033%GWelcome to Linux From Scratch\n' >/etc/issue
@@ -292,11 +406,229 @@
From your next login, you will use UTF-8 based locale, with all its benefits
and drawbacks.
-5. Known bugs
+Known bugs:
+
+- The Caps Lock key does not work on Linux console for national characters.
+ The guilty package is kbd.
+
+- Some packages don't display line drawing characters in UTF-8 mode on Linux
+ console. This is a bug in the packages themselves. See ALSA section below
+ for more detailed discussion.
+
+Part 2. BLFS PACKAGES
+
+1. GnuPG
+
+The package itself is internationalized well and supports UTF-8 out of the box.
+Unfortunately, some applications (e.g. Enigmail) assume that the output of gpg
+is in iso8859-1. For applications that cannot be fixed easily, create the
+following script:
+
+#!/bin/sh
+export LC_ALL=C
+export LANG=C
+exec /usr/bin/gpg "$@"
+
+Save it as /usr/bin/gpg-nolocale, give it the "executable" bit and configure
+the offending application to use this script instead of the real gpg binary.
+
+2. Emacs
+
+I don't use Emacs at all, but your comments are welcome. Don't expect
+any console-based editor except Vim, Emacs and Yudit to work in UTF-8 locale.
+
+3. Slang
+
+Get the patch
+
+http://www.linuxfromscratch.org/patches/downloads/slang/slang-1.4.9-utf8.patch
+
+Install Slang using the following instructions:
+
+patch -Np1 -i ../lang-1.4.9-utf8.patch
+./configure --prefix=/usr
+make CFLAGS="-O2 -pipe -DUTF8"
+make install
+make CFLAGS="-O2 -pipe -DUTF8" ELF_CFLAGS="-O2 -pipe -DUTF8" elf
+make install-elf
+make install-links
+chmod 755 /usr/lib/libslang.so.1.4.9
+
+WARNING: you should pass -DUTF-8 in CFLAGS to all applications that depend
+on Slang.
+
+4. Aspell
+
+To be done.
+
+5. GPM
+
+GPM cannot cut/paste non-ASCII characters. It is really a limitation of Linux
+console. You can google for a kernel patch named
+
+unicode_copypaste_2.4.19.patch.gz
+
+but I would recommend against it. I had crashes and repeatable kernel panics
+with it.
+
+6. Zip/Unzip
+
+If you put a file with non-ASCII characters in its name into the archive, you
+will be unable to get that name correctly under Windows.
+
+7. Midnight Commander
+
+First, install Slang. Then, get the patch
+
+http://www.linuxfromscratch.org/patches/downloads/mc/mc-4.6.0-utf8.patch
+
+Install Midnight Commander with the following instructions:
+
+patch -Np1 -i ../mc-4.6.0-utf8.patch
+CFLAGS="-O2 -pipe -DUTF8" ./configure \
+ --prefix=/usr --with-screen=slang \
+
+ --what-else-you want, e.g.
+
+ --with-vfs --with-samba --enable-charset --without-ext2undel \
+ --with-configdir=/etc/samba --with-codepagedir=/usr/share/samba/codepages
+make
+make install
+
+Unfortunately, this patch is not sufficient. In particular, it is impossible
+to view and edit files containing non-ASCII characters using the internal
+viewer and editor. Configure Midnight Commander to use an external editor,
+e.g. Vim.
+
+8. w3m
+
+You need w3m-m17n, not just a bare w3m. Unfortunately, w3m-m17n-0.4.2 does not
+exist yet.
+
+9. Mutt, Pine
+
+I don't use them at all, but Debian has a patch for Mutt.
+Your comments are welcome.
+
+10. GTK+-1.2.10
+
+This package's default style files in /etc/gtk don't work in UTF-8 locales.
+Changing "koi8-r" to "iso10646-1" fonts in /etc/gtk/gtkrc.ru fixes the problem
+with improper fonts for Russians. Beware that KDE also sets GTK styles (in
+~/.kde/share/config/gtkrc and ~/.gtkrc), so these files also may need some
+manual editing.
+
+11. LessTif
+
+This package does not support UNICODE well.
+
+12. KDEMultimedia
+
+The players show ID3 tags with national characters improperly.
+
+13. Yelp
+
+The problems with manual pages have already been mentioned in Man section.
+
+14. ALSA
+
+Alsamixer 1.0.2 won't show the line drawing characters on Linux console in
+UTF-8 mode. This is a bug in alsamixer. The problem is that NCurses must
+know whether the Linux console is in UTF-8 mode or not. To do that, NCurses
+checks the current locale setting (in the order: LC_ALL, LC_CTYPE, LANG).
+Also, it has to compute how many cells a given character occupies. This
+requires a valid LC_CTYPE setting.
+
+But this means that a program that links to ncurses must call
+
+setlocale(LC_CTYPE, "")
+
+before initscr(). This patch fixes the issue in alsamixer
+
+http://www.linuxfromscratch.org/patches/downloads/alsa-utils/alsa-utils-1.0.2-locale.patch
+
+After reading the text above and looking at the alsamixer patch, you should
+be able to fix this kind of a problem with other packages. Please send
+patches to patches at linuxfromscratch.org.
+
+Don't send a patch for the "lxdialog" program that comes with the kernel
+sources and is used during "make menuconfig", since that will break Question
+2 in the quiz below and I will no longer be able to check whether others are
+ready to follow this hint.
+
+15. XMMS
+
+This package will not show ID3 tags properly out of the box, because they are
+usually in the windowsish single-byte or double-byte encoding and not in UTF-8.
+The patch from http://rusxmms.sourceforge.net/ helps.
+
+16. Dillo
+
+This package does not support UNICODE.
+
+17. XSane
+
+The gtk+-1.2.10 version is affected by a bug in gtk+ style support
+and does not work properly even in ru_RU.koi8r locale. To work around
+the problem, don't build the GIMP plugin --- then XSane will link against GTK2.
+
+18. Xpdf
+
+Since this package depends on LessTif, the support of UTF-8 in the GUI is
+rather poor. E.g., the filenames in the fileselector show improperly. But
+the non-GUI tool, pstotext, works flawlessly and can extract text in the UTF-8
+encoding from PDF files.
+
+19. A2ps
+
+This package does not support UNICODE.
+
+20. TeX
+
+To use UTF-8 as an input encoding with TeX, you should download the following
+package:
+
+http://www.unruh.de/DniQ/latex/unicode/unicode.tgz
+
+Just unpack it into /usr/share/texmf/tex, remove all files except
+
+ucs/*.sty, ucs/*.def, ucs/data/*
+
+and then run mktexlsr. Then you will be able to write
+
+\usepackage[utf-8]{inputenc}
+
+in the document preamble, but I doubt that anyone else will be able to TeX
+your documents.
+
+If you want someone else to be able to extract text in UTF-8 encoding from
+your PDF files generated by PDFTeX or dvipdfm, you should also install
+the "cm-super" font package from CTAN.
+
+Part 3. CONCLUSIONS
+
+Probably you understand from reading the above that UTF-8 causes more trouble
+than merit. If you followed this hint, I hope that I didn't damage your system
+irreversibly.
+
+Please post your deviations and report other broken packages to
+
+patrakov at ums.usu.ru
+
+APPENDIX A. QUIZ
+
+You should follow the hint only if you know all the answers.
+
+1) The non-wide character version of ncurses 5.4 uses poor-man line-drawing
+ characters on Linux console in UTF-8 mode.
-The relevant package is denoted in brackets
+ What other terminal type is affected by this? Where (which file and line)
+ is the check? Where is the piece of code that substitutes these poor-man
+ line-drawng characters instead of those which came from the terminfo
+ database? Where does ncurses 5.4 check the current locale?
-- The BLFS modifications are not in the hint yet
-- The Caps Lock key does not work on Linux console for national characters [kbd]
-- Applications cannot display line drawing characters on Linux console,
- although the font contains them; xterm works [ncurses]
+2) Linux kernel build process uses the "lxdialog" program during the
+ "make menuconfig" step. Unfortunately, lxdialog has the same bug as
+ alsamixer (see the hint).
+
+ Can you make a patch for lxdialog yourself?
More information about the hints
mailing list