All text that is handled by computers
must be encoded. Every letter in a text has to be represented
by a numeric value. For a long time, it was assumed that 7
bits would provide enough values to encode all necessary
letters; this was the basis for the ASCII character set.
However, with the spread of computers all over the world, it
became clear that this was not enough. A whole host of
different encodings were designed, varying from the obscure
(TSCII) to the pervasive (Latin-1). Of course, this leads to
problems when you are trying to exchange texts. A
Western European Latin-1 user cannot easily read a Russian
KOI8 text on his system. Another problem is that those small,
one-byte character sets don't have room for useful
stuff, such as extensive mathematical symbols. The solution
has been to create a monster character set with more than
65,000 code points, including every character
someone might want to use. This is ISO/IEC 10646. The Unicode
standard (http://www.unicode.org) is the official
implementation of ISO/IEC 10646.
Unicode is an essential feature of any
modern application. It is mandatory for every e-mail
client, for instance, but also for all XML processing, web
browsers, many modern programming languages, all Windows
applications (such as Word), and KDE 2.0 translation
files.
Unicode is not perfect, though. Some
programmers, such as Jamie Zawinski of XEmacs and Netscape
fame, lament the extra bytes that Unicode needs — two
bytes for every character instead of one. Japanese experts
oppose the unification of Chinese characters and Japanese
characters. Japanese characters are derived from Chinese
characters, historically, and even their modern meaning is
often identical, but there are some slight visual differences.
These complainers are often very vociferous, but Unicode is
the best solution we have for representing the wide variety of
scripts humanity has invented.
There are a few other practical problems
concerning Unicode. Since the character set is so very large,
there are no fonts that include all characters. The most
complete font available is Microsoft's Arial Unicode MS, which
can be downloaded for free. The Unicode character set also includes
interesting scripts such as Devanagari, a script where single
letters combine to form complicated ligatures. The total
number of Devanagari letters is fairly small, but the set of
ligatures runs into the hundreds. Those ligatures are not
defined in the character set, but have to be present in fonts.
Scripts like Arabic or Burmese are even more complicated. For
those scripts, special rendering engines have to be written
to display text correctly.
From version 3, Qt includes capable rendering engines for
a number of scripts, such as Arabic, and promises to include
more. With Qt 3, you can also combine several fonts to form a
more complete set of characters, which means that you no
longer have to use one monster font with tens of thousands
of glyphs.
The next problem is inputting those
texts. Even with remappable keyboards, it's still a monster
job to support all scripts. Japanese, for instance, needs a
special-purpose input mechanism with dictionary lookups that
decide which combination of sounds must be represented using
Kanji (Chinese-derived characters) or one of the two syllabic
scripts, hiragana and katakana.
There are still more complications that
have to do with sort order and bidirectional text (Hebrew runs
from right to left, Latin from left to right). Then there are
the vexing problems of determining which language the user
prefers and which country he is in (I prefer to write in
English, but want dates to show up in the Dutch format, for
instance). All these issues have a bearing on programming
with Unicode, but they are so complicated that a separate book
would be needed to deal with them.
However, both Python strings and Qt
strings support Unicode, and both support conversion from
Unicode to legacy character sets such as the widespread
Latin-1, and vice versa. As said above, both Python and Qt
store a single Unicode character in two
bytes. Of course, this doubles memory requirements compared to
single-byte character sets such as Latin-1. This can be
circumvented by encoding Unicode with a variable number of
bytes per character, a scheme known as UTF-8. In this scheme,
Unicode characters that are equivalent to ASCII characters
take just one byte, while other characters take up to three
bytes. UTF-8 is a widespread standard, and both Qt and Python
support it.
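To make the byte counts concrete, here is a minimal sketch (my
example, not from the original text) that encodes a two-character
Unicode string as utf-8 and counts the bytes:
#
# utf8len.py - a quick sketch of UTF-8's variable width
#
u=u"a\u0411"                  # 'a' (ASCII) plus the Cyrillic letter Be
print len(u)                  # 2 characters
print len(u.encode("utf-8"))  # 3 bytes: 'a' takes one, the Cyrillic letter two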
I'll first describe the pitfalls of
working with Unicode from Python, and then bring in the Qt
complications.
Python and Unicode
Python actually makes a distinction
between Unicode strings and 'normal' strings, that is,
strings in which every byte represents one character. Plain
Python strings are often used as immutable arrays of binary
data. In fact, plain strings are semantically very similar to
Java's byte arrays or Qt's
QByteArray class: they represent
a simple sequence of bytes, where every byte
may represent a character, but could also
represent something quite different, not human-readable text
at all.
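A minimal sketch of the distinction (my example, not the book's):
#
# plainstring.py - plain byte strings versus Unicode strings
#
s="spam"        # a plain string: a sequence of bytes
u=u"spam"       # a Unicode string: a sequence of characters
print type(s)   # the plain string type
print type(u)   # the separate unicode type
print s==u      # prints 1: the plain string is coerced to Unicode for comparison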
Creating a Unicode string is a
bootstrapping problem. Whether you use BlackAdder's Scintilla
editor or another editor, it will probably not support Unicode
input, so you cannot type Chinese characters directly.
However, there are clever ways around this problem: you can
either type hex codes, or construct your strings from other
sources. In the third part of this book we will create a small
but fully functional Unicode editor.
String literals
You can create a Unicode string literal
by prefixing the string with the letter
u, or convert a plain string to Unicode
with the built-in unicode() function. You
cannot, however, write Python code using anything but ASCII. If
you look at the following script, you will notice that there is
a function defined in Chinese characters (yin4shua1 means
print) that tries to print the opening words of the Nala,
a Sanskrit epic. Python cannot handle this, so all
actual code must be in ASCII.
A Python script written in Unicode.
Of course, it would be nice if we could
at least type the strings directly in UTF-8, as shown in the
next screenshot:
A Python script with the strings written in
Unicode.
Unfortunately, this won't work either.
Hidden deep in the bowels of the Python startup process, a
default encoding is set for all strings. This encoding is
used to convert from Unicode whenever a Unicode string has
to be presented to components of the outside world that don't
talk Unicode, such as print. By default,
this encoding is 7-bit ASCII. Running the script gives the
following error:
boudewijn@maldar:~/doc/opendoc/ch4 > python unicode2.py
Traceback (most recent call last):
  File "unicode2.py", line 4, in ?
    nala()
  File "unicode2.py", line 2, in nala
    print u"आसीद् राजा नलो नाम"
UnicodeError: ASCII encoding error: ordinal not in range(128)
The default ASCII encoding that Python
assumes when creating Unicode strings means that you cannot
create Unicode strings directly from non-ASCII text without
explicitly telling Python what is happening. Python tries to
interpret the bytes as ASCII, and every byte with a value
greater than the highest value ASCII knows (127) leads to the
above error. The solution is to use an explicit encoding. The
following script works better:
Explicitly telling Python that a string
literal is in the utf-8 encoding.
If you run this script in a
Unicode-enabled terminal, like a modern xterm, you will see
the first line of the Nala neatly printed. Quite an
achievement!
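Since the script itself appears only as a screenshot, the
following sketch shows the same technique, with the Devanagari
text replaced by two Cyrillic characters (my substitution, for
illustration only):
#
# unicode3.py (sketch) - explicitly decode a utf-8 encoded byte string
#
s="\xd0\x91\xd0\x92"      # the utf-8 bytes for u"\u0411\u0412"
u=unicode(s, "utf-8")     # tell Python explicitly that this is utf-8
print u.encode("utf-8")   # re-encode for a utf-8 capable terminal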
You can find out which encodings your
version of Python supports by looking in the encodings
folder of your Python installation. It will certainly
include mainstays such as ascii, iso8859-1 to iso8859-15,
utf-8 and latin-1, as well as a host of Macintosh and MS-DOS
codepage encodings. Simply substitute a dash for
every underscore in the filename to arrive at the string you
can use in the encode() and
decode() functions.
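A rough sketch (not in the original text) that lists the codec
names this way:
#
# listcodecs.py - list the codec modules shipped with this Python
#
import os, encodings
folder=os.path.dirname(encodings.__file__)
for name in os.listdir(folder):
    if name.endswith(".py") and name!="__init__.py":
        # substitute a dash for every underscore in the module name
        print name[:-3].replace("_", "-")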
The same problem occurs when
reading text from a file. Python's file object reads files as
bytes and returns a plain string; if the contents are in an
encoding different from the default (ASCII), you have to tell
Python so explicitly. Let's try reading the preceding script,
unicode3.py, which was saved in utf-8 format.
Example 8-6. Loading a utf-8 encoded text
#
# readutf8.py - read a utf-8 file into a Python Unicode string
#
import sys, codecs

def usage():
    print """
Usage:
    python readutf8.py file1 file2 ... filen
"""

def main(args):
    if len(args) < 1:
        usage()
        return
    files=[]
    print "Reading",
    for arg in args:
        print arg,
        f=open(arg)
        s=f.read()
        u=unicode(s, 'utf-8')
        files.append(u)
    print
    files2=[]
    print "Reading directly as Unicode",
    for arg in args:
        print arg,
        f=codecs.open(arg, "rb", "utf-8")
        u=f.read()
        files2.append(u)
    print
    # the two methods should yield identical Unicode strings
    for i in range(len(files)):
        if files[i]==files2[i]:
            print "OK"

if __name__=="__main__":
    main(sys.argv[1:])
As you can see, you can either read the
text into a plain string and convert that to a Unicode string,
or use the special open function defined
in the codecs module. The latter option
lets you specify the encoding once, when opening the file,
after which every read returns Unicode strings directly.
Other ways of getting Unicode characters into
Python string objects
We've now seen how to get Unicode data
into our strings, either from literal text entered in the
Python code or from files. There are several other ways of
constructing Unicode strings. You can build strings using
the Unicode escape codes, or from a sequence of Unicode
characters.
For this purpose, Python offers
unichr, which returns a Unicode string
of exactly one character when called with a numerical
argument between 0 and 65535. This can be useful when
building tables. The resulting character can, of course,
only be printed when encoded with the right encoding.
Example 8-7. Building a string from single Unicode
characters
#
# unichar.py - building strings from single characters
#
import string, codecs

CYRILLIC_BASE=0x0400

# the Cyrillic block runs from U+0400 to U+04FF: 256 characters
uList=[]
for c in range(256):
    uList.append(unichr(CYRILLIC_BASE + c))

# Combine the characters into a string - this is
# faster than doing u=u+unichr(c) in the loop
u=u"" + string.join(uList, "")

# write the string twice: once through a codecs writer,
# once by encoding it by hand
f=codecs.open("cyrillic1.ut8", "w", "utf-8")
f.write(u)
f.flush()

f=open("cyrillic2.ut8", "w")
f.write(u.encode("utf-8"))
f.flush()
Note that even if you construct your
Unicode string from separate Unicode characters, you still
need to provide an encoding when printing (utf-8, in this
case). Note also that when writing text to a file, you need
to tell Python explicitly that you are not using
ASCII.
Another way of adding the occasional
Unicode character to a string is by using the
\uXXXX escape codes. Here XXXX is a
hexadecimal number between 0x0000 and 0xFFFF:
Python 2.1 (#1, Apr 17 2001, 20:50:35)
[GCC 2.95.2 19991024 (release)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> u=u"\u0411\u0412"
>>> u
u'\u0411\u0412'
About codecs and locales: with all this messing about with
codecs, you will no doubt have wondered why Python can't
figure out that you live in, say, Germany, and want the
iso-8859-1 codec by default, just as the rest of your
system (your mail client, your word processor and
your file system) does. The answer is twofold. Python
does have the ability to determine
from your system which codec it should use by default.
This feature, however, is disabled, because it is not
one hundred percent reliable. You can enable that code, or
change the default codec system-wide for all Python
programs you use, by hacking the
site.py file in your Python library
directory:
# Set the string encoding used by the Unicode implementation. The
# default is 'ascii', but if you're willing to experiment, you can
# change this.
encoding = "ascii" # Default value set by _PyUnicode_Init()
if 0:
    # Enable to support locale aware default string encodings.
    import locale
    loc = locale.getdefaultlocale()
    if loc[1]:
        encoding = loc[1]
...
if encoding != "ascii":
    sys.setdefaultencoding(encoding)
Either change the line
encoding = "ascii" to name the codec
associated with your locale, or enable the locale-aware
default string encodings by changing the line
if 0: to if
1:.
It would be nice if you could call
sys.setdefaultencoding(encoding) to
set a default encoding for your application, such as
utf-8. But, and you won't want to hear this, this useful
function is intentionally deleted from the
sys module during startup, at the end of
the site.py startup script.
What can one do? Of course, it's all very
well to assume that all users on a system work with one
encoding and never stray into other encodings, or that
developers don't need to set a default encoding per
application because the system will take care of that,
but I'd still like to have the power.
Fortunately, there's a solution. I'll
probably get drummed out of the regiment for suggesting
it, but it's so useful that I'll share it anyway. Create a
file called sitecustomize.py as
follows:
Example 8-8. sitecustomize.py — saving a useful
function from wanton destruction
#
# sitecustomize.py - saving a useful function. Copy this file to
# somewhere on the Python path, like the site-packages directory.
#
import sys

# keep a reference to setdefaultencoding before site.py deletes it
sys.setappdefaultencoding=sys.setdefaultencoding
Make this file part of your
application distribution and put it somewhere on the
Python path used by your application. It is imported
automatically during startup, before
site.py deletes
setdefaultencoding, and saves that
useful function under another name. Since a function is
simply a reference to an object, and an object is only
deleted when its last reference disappears, the function
remains available for use in your applications.
Now you can set UTF-8 as the default
encoding for your application by calling the function as
soon as possible in the initialization part of your
application:
Example 8-9. uniqstring3.py - messing with Unicode strings
using utf-8 as default encoding
#
# uniqstring3.py - messing with Unicode strings, using utf-8 as
# the default encoding (requires the sitecustomize.py shown above)
#
import sys
sys.setappdefaultencoding("utf-8")
s="A string that contains just ASCII characters"
u=u"\u0411\u0412 - a string with a few Cyrillic characters"
print s
print u
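With the default encoding set to utf-8, the final print
statement no longer trips over the ASCII codec: the Cyrillic
characters are encoded as utf-8 bytes, which a utf-8 capable
terminal displays correctly.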
Qt and Unicode
As mentioned earlier,
QString is the equivalent of a Python
Unicode string. You can coerce any Python string or any Python
Unicode object into a QString, and vice
versa: you can convert a QString to
either a Python string object, or to a Python Unicode
object.
If you want to create a plain Python
string from a QString object, you can
simply apply the str() function to it:
this is done automatically when you print
a QString.
Unfortunately, there's a snake in the
grass. If the QString contains characters outside the ASCII
range, you will hit the limits dictated by the default ASCII
codec defined in Python's site.py.
Example 8-10. uniqstring1.py - coercing Python strings into
and from QStrings
#
# uniqstring1.py - coercing Python strings into and from QStrings
#
from qt import QString
s="A string that contains just ASCII characters"
u=u"\u0411\u0412 - a string with a few Cyrillic characters"
qs=QString(s)
qu=QString(u)
print str(qs)
print str(qu)
boud@calcifer:~/doc/opendoc/ch4 > python uniqstring1.py
A string that contains just ASCII characters
Traceback (most recent call last):
  File "uniqstring1.py", line 13, in ?
    print qu
  File "/usr/local/lib/python2.1/site-packages/qt.py", line 954, in __str__
    return str(self.sipThis)
UnicodeError: ASCII encoding error: ordinal not in range(128)
If there's a chance that there are non-ASCII characters
in the QString you want to convert
to Python, you should create a Python unicode object,
instead of a string object, by applying
unicode to the
QString.
Example 8-11. uniqstring2.py - coercing Python strings into and from
QStrings
#
# uniqstring2.py - coercing Python strings into and from QStrings
#
from qt import QString
s="A string that contains just ASCII characters"
u=u"\u0411\u0412 - a string with a few Cyrillic characters"
qs=QString(s)
qu=QString(u)
print unicode(qs)
print unicode(qu)
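Note that the conversion itself always succeeds; printing the
result, however, still goes through Python's default codec, so
to actually see the Cyrillic characters you need the utf-8
default encoding trick from the previous section, or an
explicit encode() call.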