Japanese Character Sets and
Encoding Methods for PCs
Computing Japan will soon present a series of articles that deal with
such technical topics as Japanese font technology, Unicode, front-end processors,
and the handling of kanji on the World Wide Web. In order to make this material
more accessible to our readers, this month's Help Desk offers part one of
a brief tutorial on some of the basic concepts of Japanese text processing.
by Steven Myers
For many otherwise Japanese-literate computer users, the process of trying
to use both Japanese and English software on the same machine can be a confusing
and frustrating experience. This tutorial introduces some of the basic concepts
involved in the process: an overview of Japanese character sets, the methods
used to encode these sets, and the relationships between the various encoding
methods.
Character sets
A character set is nothing more than a collection of characters, usually
all belonging to the same language. On most PCs, the characters that you
can display and print are determined by the character set that you select.
On DOS/Windows computers, for example, character sets are implemented using
a "code page" model (also called a "locale" model).
In this scheme, the user chooses the code page for a particular country
by using the "country" command, which sets the default for not
only the characters of the language, but also for culture-specific information
such as time and date format, currency, and decimal separators.
For users who use only English or other Western European languages, this
is a relatively simple matter. The total number of characters used in these
languages is relatively small, thus allowing each character to be represented
by a unique one-byte code. Not so for Japanese, which requires two-byte
codes to represent all of the kanji characters.
In each code page, the characters with code values 32 through 127 are the
same; these comprise the 7-bit set termed ASCII (American Standard Code
for Information Interchange). Characters that have code values greater than
127 are called extended characters; these vary from code page to code page,
as they are generally used to represent the special (such as "accented")
characters of a particular language. (Note that a single character set can
often support more than one language, and that these character sets all
include the ASCII character set. The character set used by Windows 3.1 and
Windows 95 to support Western European languages is called Latin 1, or ANSI.)
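The effect of switching code pages can be seen in a short Python sketch. It decodes the same byte values under two code pages. The choice of cp437 (the original IBM PC code page) and cp1252 (the Windows Latin 1 code page) is an assumption made purely for illustration.

```python
# Decode the same bytes under two different code pages.
# cp437 and cp1252 are illustrative choices, not the only possibilities.
CODE_PAGES = ["cp437", "cp1252"]

ascii_bytes = b"Help Desk"      # every byte value is below 128
extended_byte = bytes([0xE9])   # 233 decimal, an extended character

for page in CODE_PAGES:
    text = ascii_bytes.decode(page)
    extended = extended_byte.decode(page)
    print(f"{page}: {text!r} and {extended!r}")
```

The ASCII portion comes out identical under both code pages, while the single byte E9h is displayed as a different character in each.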
The two most important character sets used for processing Japanese are defined
in separate documents published by the Japanese Standards Association (JSA)
that define national character set standards: JIS X 0208-1990, which lists
6,879 characters (6,355 kanji), and JIS X 0212-1990, which lists 6,067 characters
(5,801 kanji not contained in JIS X 0208-1990). Each character is represented
as a single point on a 94x94-point grid, allowing for a total of 8,836 characters
to be represented in each document. (Why 94 points? To represent a printable
character, a byte value must fall within the range 21h to 7Eh, or 33 to
126 decimal; 126 - 32 = 94.) Values below 21h are used to represent control
characters, such as carriage return and line feed.
A single character on the grid is referred to by its kuten (区点),
or "row-column" value. The various character sets contained within
JIS X 0208-1990 are divided as follows:
Rows 1-2     symbols
Row 3        numerals and Latin alphabet characters
Row 4        hiragana
Row 5        katakana
Row 6        Greek alphabet
Row 7        Cyrillic alphabet
Row 8        line-drawing characters
Rows 16-47   JIS Level 1 kanji (the most commonly used kanji)
Rows 48-84   JIS Level 2 kanji (less frequently used kanji)
other rows   unassigned
In order to make it easier to find individual characters within a document,
each row of the grid is usually represented as shown in the figure, which
presents all of the kanji characters listed in row 16. (Note that this long
line of characters is wrapped into 5 subrows.) The kuten value for each
character is a 4-digit number: the first two digits designate the row, the
last two the column. For example, 蔭 -- the last character on
the bottom line -- is in row 16; to get the column value, add the subrow
value (80) and the column heading value (14), for a column value of 94.
Concatenating the two numbers gives a kuten code of 1694.
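The link between a kuten value and the byte range 21h to 7Eh can be sketched in a few lines of Python. Adding 20h (32 decimal) to the row and to the column moves each of them into the printable range, giving the pair of byte values used for that character in JIS encoding. The function names kuten_to_jis and jis_to_kuten are hypothetical helpers invented for this illustration.

```python
def kuten_to_jis(kuten: int) -> bytes:
    """Convert a 4-digit kuten value such as 1694 to two JIS byte values."""
    row, column = divmod(kuten, 100)
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError(f"kuten {kuten} is outside the 94x94 grid")
    # Printable bytes start at 21h, so grid position 1 maps to 20h + 1 = 21h.
    return bytes([0x20 + row, 0x20 + column])


def jis_to_kuten(pair: bytes) -> int:
    """Convert two JIS byte values back to a 4-digit kuten value."""
    first, second = pair
    return (first - 0x20) * 100 + (second - 0x20)


print(kuten_to_jis(1694).hex())   # 307e
print(jis_to_kuten(b"\x30\x21"))  # 1601
```

Row 16, column 94 therefore becomes the byte pair 30h 7Eh, and the pair 30h 21h maps back to kuten 1601, the first kanji in row 16.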
Character encoding
With a fairly complete Japanese character set (the characters of JIS X 0208-1990)
and a way of referring to each individual character (kuten) that allows
a character to be represented using two bytes, processing Japanese on a
computer might seem relatively straightforward. In reality, however, it
is often inefficient and impractical for PCs to use two bytes for every
character; many characters can be represented with a single byte (such as
half-width katakana, romaji, and numerals). Therefore, a method
of character encoding is used to allot one byte per character whenever possible,
and two bytes per character only when absolutely necessary.
JIS
JIS (Japanese Industrial Standard) encoding allows the use of several different
character sets in much the same way that DOS uses different code pages.
With JIS encoding, an escape sequence signals a change in character sets.
For example, to encode the phrase "two 漢字,"
a three-byte escape sequence is first stored to signal that the ASCII character
set is to be used. Next come the three ASCII-encoded byte values for the
characters "t," "w," and "o." This is followed
by another escape sequence to signal a change to the JIS X 0208-1990 character
set, which must precede the two double-byte values for "漢字."
Finally, an escape sequence to signal a change back to a single-byte encoding
scheme must end the line if the last character is a double-byte character.
(This ensures that a transmission error in one line of the file does not
affect the following lines.)
Such coding based on escape sequences is called modal encoding, since the
escape sequences designate a change in modes (character sets). JIS encoding
uses only 7 bits of each byte; the eighth bit is still there, but isn't
used. This feature makes JIS encoding a good choice for data sent across
networks via e-mail, etc., since the eighth bit of each byte is often stripped
from transmitted characters.
Shift-JIS
The encoding method now used on most MS-DOS/Windows PCs is called Shift-JIS,
and was originally developed by ASCII Corporation in conjunction with Microsoft. Instead of using
escape sequences to signal a change in character sets, Shift-JIS encoding
determines which character set to use by checking the byte value. If the
value is in the range 21h to 7Eh (33 to 126 decimal), the ASCII/JIS-Roman
character set is used. Likewise, all half-width katakana characters have
values in the range A1h to DFh (161 to 223 decimal). A byte value that falls
in the range 81h to 9Fh or E0h to EFh is taken to be the first byte (or
lead byte) of a double-byte character. The following byte is then treated
as the second byte, or trailing byte, of the character.
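The byte-range test described above can be sketched as a small Python function. It is a simplified illustration, not a full Shift-JIS parser: split_shift_jis is a hypothetical helper, it uses only the ranges given in the text, and it treats every other byte value (including space and control characters) as a single-byte character.

```python
def split_shift_jis(data: bytes) -> list[tuple[str, bytes]]:
    """Split Shift-JIS bytes into (kind, raw bytes) pairs, one per character."""
    result = []
    i = 0
    while i < len(data):
        byte = data[i]
        if 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xEF:
            # Lead byte: the following byte is the trailing byte.
            if i + 1 >= len(data):
                raise ValueError("lead byte without a trailing byte")
            result.append(("double-byte", data[i:i + 2]))
            i += 2
        elif 0xA1 <= byte <= 0xDF:
            result.append(("half-width katakana", data[i:i + 1]))
            i += 1
        else:
            # Simplification: 21h-7Eh plus space, controls, and anything else.
            result.append(("single-byte", data[i:i + 1]))
            i += 1
    return result


sample = "twoｶ漢字".encode("shift_jis")
for kind, raw in split_shift_jis(sample):
    print(f"{raw.hex():<6}{kind}")
```

No escape sequences appear anywhere in the sample: the value of each byte alone decides whether it stands for one character or begins a two-byte character.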
Note that there is never a need to explicitly signal a character set change
in Shift-JIS encoding -- the character set is determined solely by the byte
value. Shift-JIS is thus a non-modal encoding scheme. However, since the
coding space is limited, Shift-JIS encoding does not allow the use of characters
defined in JIS X 0212-1990 (the 5,801 kanji that are less frequently used).
This problem does not exist with JIS encoding, since new character sets
can be added simply by defining a new escape sequence.
In closing
There are many different versions and flavors of the JIS and Shift-JIS encoding
schemes that have been defined by corporations such as NTT and NEC. Each
definition contains the same main set of characters, however, with only
slight additions/modifications. Keep in mind, too, that kuten is not a true
encoding method (although it has been used as such by some systems); rather,
it is a method used to index the JIS X 0208-1990 and JIS X 0212-1990 documents.
Next month: An introduction to EUC
encoding and Unicode
[Figure: The ANSI Windows 3.1 character set]
[Figure: Row 16 of the 94 x 94-point grid]