A Look at EUC Encoding
The November Help Desk focused on the basics of PC character sets and
JIS and Shift-JIS encoding, and in December we looked at Unicode, a relatively
new 16-bit encoding method already implemented on Microsoft's Windows NT
operating system. This month, we round out our tutorial on character sets
and encoding methods by examining EUC (Extended UNIX Code), the encoding
method used in most UNIX environments.
by Steven Myers
You will recall from the November Help Desk that the high number of (often
lengthy) escape sequences used to signify a character set change in JIS
encoding makes for highly inefficient internal storage. The Shift-JIS encoding
method avoids this problem by designating certain byte values to always
be the first byte of a two-byte character code. A major drawback of Shift-JIS,
however, is that its available encoding space is severely limited; in practice
it is used primarily to encode only the 6,355 kanji in the JIS X 0208-1990
character set.
The best of both worlds
EUC (Extended UNIX Code), the encoding standard found on most UNIX workstations,
attempts to capture the best of both worlds by providing for the inclusion
of four different character sets without requiring the use of escape sequences.
Furthermore, whereas JIS and Shift-JIS are Japanese-specific encoding methods,
EUC can be "localized" for a particular country simply by using
appropriate character sets from the language of that country. Code set 0
is always set to the local version of ASCII (this would be JIS-Roman for
Japan), while use of the remaining three code sets -- and their implementation
-- is left up to each country.
In this sense, EUC conforms to the code page model used on DOS/Windows
PCs. That is, the user can specify which "version" of
EUC to use, such as EUC-J for Japanese or EUC-KR for Korean. The EUC locale
tells the system which character sets to use for code sets 1, 2, and 3.
Whereas EUC-J would use code set 1 to encode the JIS X 0208-1990 character
set (Japanese kanji), for example, EUC-KR would employ the same code
set 1 for the Korean KS C 5601-1992 character set.
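The locale dependence can be illustrated with a short Python sketch. It assumes that Python's built-in "euc_jp" and "euc_kr" codecs are acceptable stand-ins for the EUC-J and EUC-KR locales named above (the codec names are Python's, not this article's). The same two bytes from code set 1 are decoded under each locale:

```python
# The same two bytes from code set 1, interpreted under two EUC locales.
# Assumption: Python's standard "euc_jp" and "euc_kr" codecs stand in for
# the EUC-J and EUC-KR locales discussed in the text.
raw = bytes([0xB0, 0xA1])

as_japanese = raw.decode("euc_jp")  # code set 1 read as JIS X 0208 (kanji)
as_korean = raw.decode("euc_kr")    # code set 1 read as KS C 5601 (hangul)

# Code set 0 (the ASCII range) is read the same way in both locales.
ascii_jp = b"EUC".decode("euc_jp")
ascii_kr = b"EUC".decode("euc_kr")

if __name__ == "__main__":
    print(as_japanese, as_korean, ascii_jp == ascii_kr)
```

The byte values are identical in both cases; only the locale decides whether they name a kanji or a hangul syllable, which is why the locale must travel with the data.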
Note, however, that this scheme does not escape the problem of having to
deal with multiple code pages -- which can be a real headache for developers
and programmers, many of whom have long called for a "unified"
coding system such as Unicode. Yet efforts at making Unicode more widespread
are meeting with considerable resistance from countries that fear their
particular rendering of a character will die out, as it is merged with similar-looking
characters from other countries into a single code point. (For more discussion
of this, see "A Unicode Tutorial" (page 9) and "Will Unicode
Kill Japanese Kanji" (page 15) in our December issue.)
There is little doubt that the coding scheme of the future must be truly
international and allow for the inclusion of the characters from virtually
all languages within a fixed-width encoding space. In practical terms, though,
EUC appears at this point to be the most efficient of the "non-controversial"
encoding methods.
Japanese implementations of EUC
Two different methods are used to implement EUC in the encoding of Japanese
character sets. The most widely used of these is called "packed format";
it includes not only 1- and 2-byte characters, but also 3-byte values. Figure
1 shows the distribution of code space for Japanese packed format EUC. Note
that, as in Shift-JIS, the value of a byte determines whether it is to be
taken as a single-byte character code or as the first byte of a 2- or 3-byte
value. Also note that the 3-byte codes are used to encode characters from
the JIS X 0212-1990 standard (kanji that are encountered less frequently).
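As a rough illustration of how a lead byte selects the character length in packed format, here is a minimal Python sketch. The byte ranges used (0x8E and 0x8F as the single-shift bytes for code sets 2 and 3, and 0xA1 through 0xFE for code set 1) are the commonly documented ones for Japanese EUC and are an assumption here, not a transcription of Figure 1. The function names char_length and split_euc are hypothetical.

```python
from typing import List

# Assumed lead-byte values for packed-format Japanese EUC.
SS2 = 0x8E  # single shift 2: introduces code set 2 (half-width katakana)
SS3 = 0x8F  # single shift 3: introduces code set 3 (JIS X 0212-1990)


def char_length(lead: int) -> int:
    """Return how many bytes the character starting with `lead` occupies."""
    if lead < 0x80:
        return 1  # code set 0: ASCII / JIS-Roman, one byte
    if lead == SS2:
        return 2  # code set 2: 0x8E plus one byte
    if lead == SS3:
        return 3  # code set 3: 0x8F plus two bytes
    if 0xA1 <= lead <= 0xFE:
        return 2  # code set 1: JIS X 0208-1990, two bytes
    raise ValueError(f"invalid EUC lead byte 0x{lead:02X}")


def split_euc(data: bytes) -> List[bytes]:
    """Split a packed-format EUC byte string into per-character slices."""
    chars = []
    i = 0
    while i < len(data):
        n = char_length(data[i])
        if i + n > len(data):
            raise ValueError("truncated multi-byte character")
        chars.append(data[i:i + n])
        i += n
    return chars
```

For example, split_euc(b"A\xb0\xa1") separates the single-byte "A" from the two-byte kanji code that follows it. No escape sequences are needed, because each lead byte announces its own code set.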
The other Japanese implementation of EUC, known as "complete two-byte
format," is much less common than packed format. Like Unicode, all
the values in complete two-byte-format EUC are 16 bits wide, even the JIS-Roman/ASCII
values of code set 0. Figure 2 shows the range of code values assigned to
Japanese complete two-byte format EUC.
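The relationship between the two formats can be sketched in Python as follows. This is a sketch under stated assumptions: the 16-bit ranges in the comments follow the layout usually given in reference works for complete two-byte format (code set 3 distinguished from code set 1 by a cleared high bit in the second byte) and are not taken from Figure 2. The function name packed_to_two_byte is hypothetical.

```python
def packed_to_two_byte(ch: bytes) -> int:
    """Map one packed-format EUC character to an assumed 16-bit value."""
    if len(ch) == 1 and ch[0] < 0x80:
        # Code set 0: padded with a zero high byte (0x0021-0x007E).
        return ch[0]
    if len(ch) == 2 and ch[0] == 0x8E:
        # Code set 2: the 0x8E shift byte is dropped (0x00A1-0x00DF).
        return ch[1]
    if len(ch) == 2:
        # Code set 1: both bytes kept as they are (0xA1A1-0xFEFE).
        return (ch[0] << 8) | ch[1]
    if len(ch) == 3 and ch[0] == 0x8F:
        # Code set 3: shift byte dropped, high bit of the second
        # byte cleared (0xA121-0xFE7E).
        return (ch[1] << 8) | (ch[2] & 0x7F)
    raise ValueError("not a packed-format EUC character")
```

Under these assumptions every character occupies exactly 16 bits, at the cost of doubling the storage needed for plain ASCII text.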
Conclusion
Over a period of three months, we have examined the basic encoding methods
used to process Japanese text: JIS, Shift-JIS, EUC, and the Unicode international
encoding standard. Because the relative positions of characters in the Japanese
national character sets are kept somewhat consistent in the JIS, Shift-JIS
(excluding JIS X 0212), and EUC encoding methods, conversion among these
three methods is fairly straightforward (and has become a standard inclusion
in Japanese-capable applications). Conversion to/from Unicode is not quite
so simple, though, requiring the use of mapping tables (which can be obtained
from the Unicode Consortium).
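To show why conversion among the three Japanese methods needs only arithmetic, here is a minimal Python sketch for two-byte JIS X 0208 characters. The function names are hypothetical, and the Shift-JIS formula is the widely published one rather than anything specified in this article. Escape sequences (JIS) and single-byte characters are left out for brevity.

```python
from typing import Tuple


def euc_to_jis(e1: int, e2: int) -> Tuple[int, int]:
    """EUC code set 1 is the JIS code with the high bit set on each byte."""
    return e1 - 0x80, e2 - 0x80


def jis_to_euc(j1: int, j2: int) -> Tuple[int, int]:
    """Inverse of euc_to_jis: set the high bit on each byte."""
    return j1 + 0x80, j2 + 0x80


def jis_to_sjis(j1: int, j2: int) -> Tuple[int, int]:
    """Fold two JIS rows into one Shift-JIS lead byte."""
    # Two consecutive JIS rows share one Shift-JIS first byte.
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        # Odd row: lower half of the second-byte range, skipping 0x7F.
        s2 = j2 + (0x1F if j2 < 0x60 else 0x20)
    else:
        # Even row: upper half of the second-byte range.
        s2 = j2 + 0x7E
    return s1, s2


if __name__ == "__main__":
    jis = euc_to_jis(0xB0, 0xA1)  # EUC bytes of one kanji
    print([hex(b) for b in jis], [hex(b) for b in jis_to_sjis(*jis)])
```

No lookup table appears anywhere in this sketch, which is the contrast with Unicode conversion: there the character order differs, so a mapping table is unavoidable.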
At present, the main issue concerning the future of character sets and encoding
methods appears to be whether or not the benefits of using an international
scheme (such as Unicode) are strong enough to offset the potential conflicts
among countries that perceive a threat to their language and culture. The
efforts of Unicode supporters notwithstanding, this problem may not be completely
resolved until 32-bit character code values and a virtually unlimited encoding
area become more practical. Until then, users will likely have to make do
with existing, "partial-solution" methods.
(c) Copyright 1996 by Computing Japan magazine