Interview: Dr. Nick Campbell

Research Director, Expressive Speech Processing Project, ATR Information Sciences Division.

by Alex Stewart

Dr. Nick Campbell is a very British researcher in a very green corner of Japan. Research director of the Expressive Speech Processing Project at the ATR Information Sciences Division, he first came to Japan in 1975 as an English language instructor. Before joining the ATR labs in 1991, he was a research fellow at the IBM (UK) Scientific Center, and then a Senior Linguist at Edinburgh University's Speech Technology Research Center. He became the first foreign head of an ATR department in 1997.

His research field is one of those "over-the-horizon" areas that could shape the future of telecommunications. His goal, broadly speaking, is to get computers to talk like humans, and to understand how intonation and voice quality express subtleties of meaning. Alex Stewart interviewed Dr. Campbell at the ATR Labs.

Can you start by giving some background on what kind of research you're doing at ATR?
Broadly, the goal of our work is to understand the intention behind the information in the words. Technology can translate words, but it is much less able to translate the nuances of speech. For example, if I say, "Ye-e-e-e-s" in a long, interrogative way, it means, "Wait, I have to think about that," not "Yes."

What is CHATR? How can a company use it?
CHATR is a voice synthesis system. We manage the underlying data -- the phonemes -- and this enables us to recreate the voice. We control access to this data so that the voice is not used in an improper way that would violate the rights of the owner of the voice. If we license the technology to private companies, they have to maintain the same controls.

We now have good enough technology to capture the nuances of a person's speech in their native language and retain those same nuances when the speech is translated into another language.

What are some examples of applications for your technology?
The immediate applications are in things like automated call centers -- for example, hotel reservations. You can apply the technology quite easily because the conversation parameters are well defined. Another application is weather forecasting, where you can call an automated service and have a conversation about the weather.

With CHATR, we used the voice of a famous dancer as the personality voice of the weather forecaster on Hankyu Railway's Web page. We can synthesize a person's voice so that it will be instantly recognizable. We can't quite get the exact flow of natural speech yet, as the sounds get swallowed occasionally, but we are very close.

I can see, at a popular level, how kids who like to surf the mobile Web and listen to music or audio messages could enjoy hearing synthesized voices. For example, you can use the voices of well-known TV personalities and have them talk in different contexts, such as the weather.

If you can create conversations that are cute or familiar, it's likely to be a big success with young people. Another example is voice synthesis in car navigation systems, which is going to be really big.

You are clearly very enthusiastic about your work: What drives you?
Curiosity. I want to understand how speech really works. The information carried by speech is a wonderful, mysterious puzzle. I used to think linguistics was quite simple when I was a teacher. It was only when I started trying to make machines copy speech that I discovered how complicated it is! We don't yet have a true understanding of how speech signals communicate meaning. Despite this -- or because of this -- our goal here is to create speech that is absolutely human-like.

What do you expect to be able to show in five years' time?
We should have a better understanding of how meaning is expressed in speech, so I would expect some form of "intention recognition" -- in addition to speech recognition -- that filters the spoken words to add an extra layer of interpretation. I also expect that we'll hear more lively speech synthesis -- expressing hesitation and laughter, for example -- that is better able to reproduce the sounds of conversational speech, rather than being limited to the announcer style of speech that current machines produce.

I think that if we are going to live with Web-based information services, then some kind of voice access to that information will be essential -- and a friendly conversational style of synthesis will be needed if the technology isn't going to drive people crazy.

My Expressive Speech Processing project will have come to an end in five years' time, so we'll have some interesting demos and prototypes to show you, but it will be a lot longer before the technology is able to perform as well as the people on the street expect.

But if you look at the explosion in portable phone use, and at the rapid growth of the Internet as a source of information, and then think how many of those people would be happier using their phone than using a computer keyboard to access the information, you'll see how important it is that we crack the code of speech communication, and get some friendly and fun technology out there soon. There are a lot of people waiting for it!