How Computers Speak Japanese
A review of Ken Lunde's Understanding Japanese Information Processing
The revolutionary shift by the Japanese PC market to the DOS/V standard
has at last provided a common, familiar platform for hardware and software
developers alike. As a (perhaps unintended) side effect, it has also opened
the door for foreign software firms to start developing and localizing their
products for the Japanese market. Programmers eyeing opportunities in the
Japanese market, as well as users who would like to know more about how
Japanese text is handled electronically, will find Ken Lunde's book a valuable
source of information and advice.
New opportunities in the Japanese market have created increased interest
in Japanese computing and the methods used to electronically process text
in the Japanese language. For those who are doing, or thinking about doing,
any kind of work in this area, Ken Lunde's Understanding Japanese Information
Processing provides a wealth of detailed information about the encoding
methods and processing tools/techniques that are necessary for this kind
of development. It also includes a valuable bibliography for those who want or need
to delve deeper into this complex topic.

Japanese character sets

The first few chapters of Lunde's book provide an introduction to the Japanese
writing system and give a general overview of Japanese information processing
techniques. Some basic terminology is defined (see sidebar for a sampling,
and the evolution of Japanese character sets (both electronic and non-electronic)
is described in detail.
In addition to ASCII, four other electronic character sets (JIS-Roman, half-width
katakana, JIS X 0208-1990, and JIS X 0212-1990) are used heavily in Japan
and considered to be "national" character sets. Lunde covers each
of these sets in detail, as well as some other Asian character sets (primarily
those used in China and Korea). He also covers international sets, including
Unicode. (Unicode is essentially an amalgamation of character set standards
from all over the world into a single, unified set with 65,536 character
positions, which can be visualized as a 256-row by 256-cell matrix. Lunde
notes that the actual number of characters in Unicode is constantly changing,
and that further changes can be expected in the future.)
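As a concrete illustration of that row-and-cell view (an example constructed here, not one taken from the book), consider the kanji meaning "sun/day", which occupies Unicode position 0x65E5; its high byte gives the row and its low byte the cell:

    #include <stdio.h>

    int main(void)
    {
        unsigned int codepoint = 0x65E5;       /* Unicode position of the kanji meaning "sun/day" */
        unsigned int row  = codepoint >> 8;    /* high byte: row in the 256 x 256 matrix */
        unsigned int cell = codepoint & 0xFF;  /* low byte: cell within that row */

        printf("U+%04X -> row 0x%02X, cell 0x%02X\n", codepoint, row, cell);
        return 0;
    }
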
Encoding methods

The second section of the book deals with the numerous encoding methods
used to process Japanese text. Lunde reduces these to three basic schemes:
JIS, Shift-JIS, and EUC. JIS (Japanese Industrial Standard) encoding, which
uses seven bits to encode characters, is modal: escape sequences
are used to signal a change between character sets, or modes. While not
very efficient for internal representation, this method is widely used for
electronic transmission such as e-mail. Also, because the seven bits used
for the actual data encoding correspond to the printable ASCII character
set, each kanji character can be represented as a sequence of two ASCII
characters. Lunde provides a specific coding illustration (see figure on
page 42) to show how JIS encoding with escape sequences works.
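To make the modal structure concrete, here is a small illustrative sketch (byte values taken from the standard JIS X 0208 tables, not reproduced from the book's figure) of the three-kanji word nihongo, "the Japanese language", in JIS encoding. One escape sequence shifts into the two-byte character set, each kanji follows as two bytes drawn from the printable ASCII range, and a second escape sequence shifts back to one-byte ASCII:

    /* The word "nihongo" (three kanji) in JIS encoding.  Read as raw ASCII,
       the six kanji bytes spell the harmless-looking string "F|K\8l". */
    unsigned char jis_example[] = {
        0x1B, 0x24, 0x42,   /* ESC $ B : shift into the two-byte JIS X 0208 set */
        0x46, 0x7C,         /* 日  ("F" and "|" in ASCII) */
        0x4B, 0x5C,         /* 本  ("K" and "\" in ASCII) */
        0x38, 0x6C,         /* 語  ("8" and "l" in ASCII) */
        0x1B, 0x28, 0x42    /* ESC ( B : shift back to one-byte ASCII */
    };
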
Shift-JIS encoding was developed by Microsoft Corporation and is implemented as the
internal code for a wide assortment of platforms. In contrast to JIS, Shift-JIS
(SJIS) is non-modal: if the numeric value of a character falls within
a particular range (81-9F or E0-EF hex for SJIS encoding), then it is treated
as the first byte, and the next character as the second byte, of a double-byte
character. Because no escape sequences are necessary, this type of non-modal
encoding is generally considered much more efficient for internal processing.
(The figure on page 42 shows the same example in SJIS coding. Note that
the range of the first byte falls entirely within the extended, eight-bit region
beyond standard ASCII, where there is no single standard representation; the
characters assigned to those code positions vary from one implementation to another.)
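For illustration, a minimal sketch of the first-byte test just described, using only the two ranges cited above (later vendor extensions to SJIS are ignored):

    /* Returns nonzero if b can begin a two-byte Shift-JIS character. */
    int sjis_is_first_byte(unsigned char b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
    }

Encoded in Shift-JIS, the three-kanji word from the JIS sketch becomes 0x93 0xFA 0x96 0x7B 0x8C 0xEA; each lead byte (0x93, 0x96, 0x8C) passes the test above and lies outside the seven-bit ASCII range.
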
EUC (Extended UNIX Code) encoding, sometimes called UNIXized JIS or UJIS,
was developed as a method for processing multiple character sets in general,
not just Japanese. This scheme is used as the internal code for
most UNIX workstations that are set up to support Japanese. Like SJIS, EUC
is a non-modal encoding method and resembles SJIS in its internal representation.
It consists of four separate code sets: the primary code set (which is the
ASCII character set) and three additional code sets that can be specified
by the user (and are generally used for non-Roman characters).
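In the Japanese instance of EUC (commonly called EUC-JP), those code sets are conventionally assigned to ASCII, JIS X 0208 kanji, half-width katakana, and JIS X 0212 kanji. A sketch of classifying a character by its lead byte, based on that conventional layout rather than on code from the book:

    /* Classify the code set of the next EUC-JP character by its lead byte:
         0: ASCII                (one byte, 0x00-0x7F)
         1: JIS X 0208 kanji     (two bytes, each 0xA1-0xFE)
         2: half-width katakana  (SS2 prefix 0x8E, then one byte)
         3: JIS X 0212 kanji     (SS3 prefix 0x8F, then two bytes) */
    int euc_code_set(unsigned char lead)
    {
        if (lead < 0x80)                  return 0;
        if (lead == 0x8E)                 return 2;
        if (lead == 0x8F)                 return 3;
        if (lead >= 0xA1 && lead <= 0xFE) return 1;
        return -1;   /* not a valid EUC-JP lead byte */
    }

The relationship to JIS is direct: a code set 1 character consists of the two seven-bit JIS bytes with their high bits set, so the three-kanji example used earlier becomes 0xC6 0xFC 0xCB 0xDC 0xB8 0xEC.
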
Japanese input/output

Lunde goes on in the following section to discuss the software and hardware
used for Japanese input, including FEPs (Front End Processors) and the various
types of Japanese keyboards currently in use. One chapter focuses on Japanese
output and offers a detailed analysis of both printer output and display
monitor output. Processing techniques, such as code conversion algorithms
and text stream handling algorithms, are also covered. Lunde provides the
C source code for several routines of interest.
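As a flavor of what such a conversion involves, here is a sketch of the widely published JIS-to-Shift-JIS arithmetic (an independent rendering, not the code printed in the book):

    /* Convert one two-byte character from seven-bit JIS to Shift-JIS in place.
       j1 and j2 point to the two JIS bytes, each in the range 0x21-0x7E. */
    void jis_to_sjis(unsigned char *j1, unsigned char *j2)
    {
        unsigned char c1 = *j1, c2 = *j2;

        *j1 = ((c1 + 1) >> 1) + (c1 < 0x5F ? 0x70 : 0xB0);   /* lead byte */
        if (c1 & 1)                                           /* odd JIS row  */
            *j2 = c2 + (c2 > 0x5F ? 0x20 : 0x1F);
        else                                                  /* even JIS row */
            *j2 = c2 + 0x7E;
    }

Applied to the JIS bytes 0x46 0x7C from the earlier sketch, the routine produces 0x93 0xFA, the same Shift-JIS value shown above.
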
Tools for Japanese text processing

The final two sections of the book contain a survey of the Japanese text
processing tools currently available and a look at Japanese e-mail and network
domains. The chapter on processing tools covers a wide variety of existing
software: operating systems, input software, text editors, word processors,
page layout software, online dictionaries, machine translation software,
and terminal software.

A good starting point

Understanding Japanese Information Processing is an excellent and well-organized
guide for anyone wishing to learn the basics of processing Japanese text
on a computer. Lunde's writing style is clear and concise, and he includes
several diagrams to aid in visualizing the various encoding methods. One
feature of the book that potential developers should find particularly useful
is an extensive set of appendices that contain code conversion tables, Japanese
corporate character-set standards, corporate encoding methods, character
lists and mapping tables, software and document sources, and mailing lists.
Lunde also provides helpful "advice to developers" sections at
the end of several chapters; these contain his personal recommendations
and tips on localizing products for the Japanese market.
As Lunde himself notes, readers should not expect to find much information
here on specific design or market issues. Developers hoping to design their
own Japanese applications will need to consult other sources for information
on internationalization and localization (some reference manuals of this
type are mentioned in the bibliography). Also, the often problematic cultural
aspects of software localization (issues such as kanji sorts and Japan's
date and time formats) are not addressed in this book. Lunde's focus is
not the Japanese software market, but rather the fundamentals of Japanese
text processing. This book is best considered a general starting point for
potential developers and others interested in Japanese information processing.
Publication information: Lunde, Ken. Understanding Japanese Information
Processing. Sebastopol, CA: O'Reilly & Associates, Inc., 1993. ISBN 1-56592-043-0.

A terminology sampler

Selected terms from the glossary of Understanding
Japanese Information Processing
Bitmapped font. A font whose character shapes are defined by arrays of bits.
Code position. The numeric code within an encoding method that is used to
refer to a specific character.
Encoding. The correspondence between numerical character codes and the final
printable glyphs.
Escape character. The control character (0x1B) that is used as part of an
escape sequence. Escape sequences are used in JIS encoding to switch between
one- and two-byte-per-character modes.
Escape sequence. A string of characters that contains one or more escape
characters, and is used to signify a shift in mode of some sort. In the
case of the Japanese character set, they are used to shift between one-
and two-byte-per-character modes, and to shift between different character
sets or different versions of the same character set.
JIS. Japanese Industrial Standard. The name of the standards established
by JISC. Also the name of the encoding method used for the JIS X 0208-1990
and JIS X 0212-1990 character set standards.
JISC. Japanese Industrial Standards Committee. The name of the organization
that establishes the JIS standards.
JIS Level 1 kanji. The name given to the 2,965 characters that constitute
the first set of kanji in JIS X 0208-1990. Ordered by pronunciation.
JIS Level 2 kanji. The name given to the 3,390 characters that constitute
the second set of kanji in JIS X 0208-1990. Ordered by radical, then by
total number of strokes.
JIS X 0208-1990. The latest version of the document that describes the Japanese
character set standard; 6,879 characters are enumerated.
Outline font. A font whose characters are described mathematically in terms
of lines and curves. Often referred to as scalable fonts, because they
can be scan converted to bitmaps of any desired size and orientation.
Wide character. A character represented by 16 bits.
In addition to providing a thorough, platform-independent discussion of
Japanese text-processing issues, Lunde also describes some as-yet-unsolved
problems involving Japanese output/text transmission. One of the more interesting
of these issues is the gaiji problem. Many companies and users have defined
their own characters and fonts, and they run into problems when they try
to transmit these user- and corporate-defined characters to systems that
do not support those characters. Lunde observes that while there is currently
no elegant solution to this problem, a necessary step in finding an answer
might be to "embed character data, both bitmapped and outline, into
files when they are transmitted. This includes a mechanism for detecting
which characters are user-defined." The first person who can offer
a viable, platform-independent solution to this problem, says Lunde, will
be "rewarded well by the Japanese computer industry."