The Help Desk

Japanese Character Sets and Encoding Methods for PCs

Computing Japan will soon present a series of articles that deal with such technical topics as Japanese font technology, Unicode, front-end processors, and the handling of kanji on the World Wide Web. In order to make this material more accessible to our readers, this month's Help Desk offers part one of a brief tutorial on some of the basic concepts of Japanese text processing.

by Steven Myers

For many otherwise Japanese-literate computer users, the process of trying to use both Japanese and English software on the same machine can be a confusing and frustrating experience. This tutorial introduces some of the basic concepts involved in the process: an overview of Japanese character sets, the methods used to encode these sets, and the relationships between the various encoding methods.

Character sets

A character set is nothing more than a collection of characters, usually all belonging to the same language. On most PCs, the characters that you can display and print are determined by the character set that you select. On DOS/Windows computers, for example, character sets are implemented using a "code page" model (also called a "locale" model). In this scheme, the user chooses the code page for a particular country by using the "country" command, which sets the default for not only the characters of the language, but also for culture-specific information such as time and date format, currency, and decimal separators.

For users who use only English or other Western European languages, this is a relatively simple matter. The total number of characters used in these languages is relatively small, thus allowing each character to be represented by a unique one-byte code. Not so for Japanese, which requires two-byte codes to represent all of the kanji characters.

In each code page, the characters with code values 32 through 126 are the same; together with the control codes (values 0 through 31, and 127), these make up the 7-bit set termed ASCII (American Standard Code for Information Interchange). Characters that have code values greater than 127 are called extended characters; these vary from code page to code page, as they are generally used to represent the special (such as "accented") characters of a particular language. (Note that a single character set can often support more than one language, and that these character sets all include the ASCII character set. The character set used by Windows 3.1 and Windows 95 to support Western European languages is called Latin 1, or ANSI.)
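The code page behavior described above can be illustrated with a short Python sketch. The three code pages chosen here (cp437, cp850, and cp1252) and the sample byte E9h are assumptions picked for illustration; they are not taken from the text.

```python
# Illustrative sketch: the same byte values decoded under different code pages.
# The code pages and the sample byte are assumptions chosen for this example.
CODE_PAGES = ("cp437", "cp850", "cp1252")

# Printable ASCII values (32 through 126) decode identically everywhere.
printable = bytes(range(32, 127))
ascii_text = printable.decode("ascii")
same_below_128 = all(printable.decode(cp) == ascii_text for cp in CODE_PAGES)

# An extended value (greater than 127) depends on the code page in use.
extended = {cp: b"\xe9".decode(cp) for cp in CODE_PAGES}

print("printable ASCII identical in all code pages:", same_below_128)
for cp, ch in extended.items():
    print(cp, "decodes E9h as", repr(ch))
```

The point of the sketch is only that a byte above 127 carries no meaning by itself; the reader must know which code page was in effect when the text was written.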

The two most important character sets used for processing Japanese are defined in two national character set standards published by the Japanese Standards Association (JSA): JIS X 0208-1990, which lists 6,879 characters (6,355 kanji), and JIS X 0212-1990, which lists 6,067 characters (5,801 kanji not contained in JIS X 0208-1990). Each character is represented as a single point on a 94x94-point grid, allowing for a total of 8,836 characters to be represented in each document. (Why 94 points? To represent a printable character, a byte value must fall within the range 21h to 7Eh, or 33 to 126 decimal, and that range contains 126 - 33 + 1 = 94 values.) Values below 21h are used for the space character (20h) and for control characters, such as carriage return and line feed.
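The arithmetic behind the grid can be checked in a few lines of Python. This is only a sketch of the numbers quoted above; the variable names are made up for the example.

```python
# Sketch of the 94x94 grid arithmetic; all names here are illustrative.
FIRST_PRINTABLE = 0x21   # 33 decimal
LAST_PRINTABLE = 0x7E    # 126 decimal

values_per_byte = LAST_PRINTABLE - FIRST_PRINTABLE + 1   # inclusive range
grid_positions = values_per_byte * values_per_byte

# Character counts quoted in the text for each standard.
JIS_X_0208_CHARS = 6879
JIS_X_0212_CHARS = 6067

unassigned_0208 = grid_positions - JIS_X_0208_CHARS
unassigned_0212 = grid_positions - JIS_X_0212_CHARS

print(values_per_byte, grid_positions, unassigned_0208, unassigned_0212)
```

Both standards fit comfortably inside one grid each, with a large number of positions left over.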

A single character on the grid is referred to by its kuten (区点), or "row-column" value. The various character sets contained within JIS X 0208-1990 are divided as follows:

Rows 1 and 2    symbols

Row 3           numerals and Latin alphabet characters

Row 4           hiragana

Row 5           katakana

Row 6           Greek alphabet

Row 7           Cyrillic alphabet

Row 8           line-drawing characters

Rows 16-47      JIS Level 1 kanji (the most commonly used kanji)

Rows 48-84      JIS Level 2 kanji (less frequently used kanji)

Other rows      unassigned

In order to make it easier to find individual characters within a document, each row of the grid is usually represented as shown in the figure, which presents all of the kanji characters listed in row 16. (Note that this long line of characters is wrapped into 5 subrows.) The kuten value for each character is a 4-digit number: the first two digits designate the row, the last two the column. For example, 蔭 (the last character on the bottom line) is in row 16; to get the column value, add the subrow value (80) and the column heading value (14), for a column value of 94. Concatenating the two numbers gives a kuten code of 1694.
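The kuten lookup can be sketched in Python as follows. The function names are hypothetical. The second function relies on two things the text above does not state: the conventional relationship that a 7-bit JIS byte equals the kuten row or column plus 20h, and Python's built-in iso2022_jp codec, which is used here only as a convenient way to turn the resulting two bytes into a displayable character.

```python
# Sketch only: function names are hypothetical, and the "+ 0x20" mapping from
# kuten to 7-bit JIS bytes is an assumption not stated in the article itself.

def kuten_from_figure(row: int, subrow: int, heading: int) -> str:
    """Build the 4-digit kuten code from a row, subrow value, and column heading."""
    column = subrow + heading
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must both lie in 1..94")
    return f"{row:02d}{column:02d}"

def kuten_to_char(kuten: str) -> str:
    """Look up the JIS X 0208 character at a kuten position."""
    row, column = int(kuten[:2]), int(kuten[2:])
    jis_bytes = bytes([0x20 + row, 0x20 + column])
    # Wrap the two bytes in escape sequences so the codec reads them as kanji.
    return (b"\x1b$B" + jis_bytes + b"\x1b(B").decode("iso2022_jp")

code = kuten_from_figure(16, 80, 14)
print(code, kuten_to_char(code))
```

Under these assumptions, position 1601 is the first kanji in row 16 (亜) and position 1694 is the last (蔭).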

Character encoding

With a fairly complete Japanese character set (the characters of JIS X 0208-1990) and a way of referring to each individual character (kuten) that allows a character to be represented using two bytes, processing Japanese on a computer might seem relatively straightforward. In reality, however, it is often inefficient and impractical for PCs to use two bytes for every character; many characters can be represented with a single byte (such as half-width katakana, romaji, and numerals). Therefore, a method of character encoding is used to allot one byte per character whenever possible, and two bytes per character only when absolutely necessary.

JIS

JIS (Japanese Industrial Standard) encoding allows the use of several different character sets in much the same way that DOS uses different code pages. With JIS encoding, an escape sequence signals a change in character sets. For example, to encode the phrase "two 漢字," a three-byte escape sequence is first stored to signal that the ASCII character set is to be used. Next come the three ASCII-encoded byte values for the characters "t," "w," and "o." This is followed by another escape sequence to signal a change to the JIS X 0208-1990 character set, which must precede the two double-byte values for "漢字." Finally, an escape sequence to signal a change back to a single-byte encoding scheme must end the line if the last character is a double-byte character. (This ensures that a transmission error in one line of the file does not affect the following lines.)

Such coding based on escape sequences is called modal encoding, since the escape sequences designate a change in modes (character sets). JIS encoding uses only 7 bits of each byte; the eighth bit is still there, but isn't used. This feature makes JIS encoding a good choice for data sent across networks via e-mail, etc., since the eighth bit of each byte is often stripped from transmitted characters.
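To see modal encoding in practice, the phrase "two 漢字" can be run through Python's iso2022_jp codec, which is one implementation of JIS encoding (the choice of codec is an assumption made for this sketch). Note one difference from the description above: this codec treats ASCII as the starting mode, so it emits no escape sequence before "two".

```python
# Sketch using Python's iso2022_jp codec as a stand-in for JIS encoding.
text = "two 漢字"
jis = text.encode("iso2022_jp")

ESC = 0x1B
escape_positions = [i for i, b in enumerate(jis) if b == ESC]
seven_bit_only = all(b < 0x80 for b in jis)

print(jis)
print("escape sequences start at byte offsets:", escape_positions)
print("every byte fits in 7 bits:", seven_bit_only)
```

The two escape sequences bracket the double-byte kanji, and no byte in the result has its eighth bit set, which is why this form survives channels that strip that bit.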

Shift-JIS

The encoding method now used on most MS-DOS/Windows PCs is called Shift-JIS, and was originally developed by Microsoft Corporation. Instead of using escape sequences to signal a change in character sets, Shift-JIS encoding determines which character set to use by checking the byte value. If the value is in the range 21h to 7Eh (33 to 126 decimal), the ASCII/JIS-Roman character set is used. Likewise, all half-width katakana characters have values in the range A1h to DFh (161 to 223 decimal). A byte value that falls in the range 81h to 9Fh or E0h to EFh is taken to be the first byte (or lead byte) of a double-byte character. The following byte is then treated as the second byte (or trailing byte) of the character.
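The byte ranges above translate directly into a small classifier. The following Python sketch is an illustration, not a complete Shift-JIS decoder: the function name and labels are made up, trailing bytes are not validated, and bytes of 7Fh or below that fall outside 21h to 7Eh (such as the space, 20h) are simply treated as single-byte characters.

```python
# Sketch: classify Shift-JIS bytes using only the ranges given in the text.
# The function name and the labels are hypothetical.

def classify_shift_jis(data: bytes):
    """Return a list of (kind, raw_bytes) pairs for a Shift-JIS byte string."""
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
            # Lead byte: the next byte is the trailing byte of the same character.
            if i + 1 >= len(data):
                raise ValueError("lead byte without a trailing byte")
            result.append(("double-byte", data[i:i + 2]))
            i += 2
        elif 0xA1 <= b <= 0xDF:
            result.append(("half-width katakana", data[i:i + 1]))
            i += 1
        elif b <= 0x7F:
            # Assumption: control codes and space are lumped in with ASCII/JIS-Roman.
            result.append(("ascii/jis-roman", data[i:i + 1]))
            i += 1
        else:
            raise ValueError(f"byte {b:02X}h is outside the ranges described")
    return result

# "\uff76" is a half-width katakana KA, added to exercise the A1h-DFh range.
sample = "two 漢字\uff76".encode("shift_jis")
kinds = [kind for kind, _ in classify_shift_jis(sample)]
for kind, raw in classify_shift_jis(sample):
    print(f"{kind:20s} {raw.hex().upper()}")
```

No state is carried from one character to the next; each decision depends only on the byte currently being examined.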

Note that there is never a need to explicitly signal a character set change in Shift-JIS encoding -- the character set is determined solely by the byte value. Shift-JIS is thus a non-modal encoding scheme. However, since the coding space is limited, Shift-JIS encoding does not allow the use of characters defined in JIS X 0212-1990 (the 5,801 kanji that are less frequently used). This problem does not exist with JIS encoding, since new character sets can be added simply by defining a new escape sequence.
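The JIS X 0212-1990 limitation can be demonstrated with Python's codecs, under some assumptions: the character U+4E02 is taken as an example of a kanji found in JIS X 0212-1990 but not in JIS X 0208-1990, the shift_jis codec stands in for Shift-JIS, and the iso2022_jp_1 codec stands in for a JIS encoding extended with an escape sequence for JIS X 0212-1990. None of these names come from the text above.

```python
# Sketch: a JIS X 0212 kanji is rejected by Shift-JIS but accepted by an
# escape-sequence-based JIS variant. Codec names and the sample character
# are assumptions chosen for this example.

def can_encode(text: str, codec: str) -> bool:
    """Report whether the codec is able to represent the text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

rare_kanji = "\u4e02"   # assumed to be in JIS X 0212-1990 only

in_shift_jis = can_encode(rare_kanji, "shift_jis")
in_extended_jis = can_encode(rare_kanji, "iso2022_jp_1")

print("Shift-JIS can encode it:", in_shift_jis)
print("JIS with a JIS X 0212 escape sequence can encode it:", in_extended_jis)
if in_extended_jis:
    print(rare_kanji.encode("iso2022_jp_1"))
```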

In closing

Many different versions and flavors of the JIS and Shift-JIS encoding schemes have been defined by corporations such as NTT and NEC. Each definition contains the same main set of characters, however, with only slight additions or modifications. Keep in mind, too, that kuten is not a true encoding method (although it has been used as such by some systems); rather, it is a method used to index the JIS X 0208-1990 and JIS X 0212-1990 documents.

Next month: An introduction to EUC encoding and Unicode

The ANSI Windows 3.1 character set

Row 16 of the 94 x 94-point grid