THE INTELLIGENT WIRELESS WEB
Overview
The 21st century is becoming ever more dependent upon wireless communication for
phones, tablets, and notebooks. Wireless delivery of video and Internet
service is rapidly expanding worldwide.
Ideally, the wireless communication process could start with a
user interface based on speech recognition where we merely talk to a personal
mobile device that recognizes our words and commands. The personal mobile device
would connect seamlessly to embedded devices in the
environment. The message would be relayed to a server residing on a network with
the necessary processing power and software to analyze the contents of the
message. The server would link to additional Web resources that could then draw
necessary supplemental knowledge from around the world through the Internet.
Finally, the synthesized message would be delivered to the appropriate parties
in their own language on their own personal mobile device.
This process requires us to explore the
following communications relationships:
- Connecting People to Devices (the user interface): Currently we rely on the
mouse, keyboard, or touch screen display. Speech recognition and software logic
for mobile devices may be the key for the future.
- Connecting Devices to Devices: Future
smart applications require improvement of wireless infrastructure and
intelligent machine-to-machine software.
- Connecting Devices to People: To deliver useful information to the globally
mobile user, systems require speech synthesis and language
translation. Information services
and communications may then shift to broadband global delivery.
There are a number of challenges to the development and deployment of scalable,
intelligent wireless Web applications.
These include device proliferation, bandwidth and interface limitations,
the demand for applications with greater capabilities, and upgraded wireless standards.
Perhaps the most daunting challenge is the integration,
synthesis and interfacing of these elements. So, just how will the Web become
smart enough to fulfill the vision of a robust global mobile system providing
increasingly relevant and intelligent applications? The development of the
physical components and software necessary to implement the Intelligent Wireless
Web requires insight into the compatibility, integration and synergy of the
following five emerging technology areas:
- User Interface - to transition from the mouse click, keyboard, and touch screen to speech as
the primary method of communication between people and devices;
- Personal Space - to transition from connection of devices by tangled wires to
multifunction wireless devices;
- Networks - to transition to a global
wireless system of interconnections;
- Protocols - to transition from the current IP to the new
Mobile IP; and
- Web Architecture - to transition from dumb and static applications to new
applications that are intelligent, dynamic and constantly improving.
As the Web matures, the information technology community seems to be viewing the
Web as a global database with a knowledge representation system. While a
database management system is simply a collection of procedures for retrieving,
storing, and manipulating data, it is also possible to view the Web in terms of
applied "learning algorithms": an algorithm takes data from a database as input,
performs appropriate operations (based upon statistics, experiment, or other
approaches), and returns output containing enhanced data, thereby representing
learning. In building the Intelligent Wireless Web, we are seeking to create a
Web that learns, yielding continuously improved applications and information (a
minimal sketch of this learn-and-improve loop follows the list below). We will highlight
the innovative processes underway in each of these technological areas:
- USER INTERFACE - from click to speech
- PERSONAL SPACE - from wired to wireless
- NETWORKS - from wired to integrated wired/wireless
- PROTOCOLS - from IP to Mobile IP
- WEB ARCHITECTURE - from dumb and static to intelligent and dynamic
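As promised above, here is a minimal sketch of the learn-and-improve loop: observe user behavior, update stored statistics, and return enhanced output. The class, method names, and scoring rule are invented for illustration.

```python
from collections import Counter

class LearningStore:
    """Toy knowledge store that 'learns' user preferences from queries."""

    def __init__(self):
        self.documents = {}           # url -> text
        self.preferences = Counter()  # term -> observed user interest

    def observe(self, query):
        # Learning step: update term statistics from user input.
        for term in query.lower().split():
            self.preferences[term] += 1

    def rank(self, urls):
        # Enhanced output: order results by learned preference overlap.
        def score(url):
            text = self.documents.get(url, "").lower()
            return sum(self.preferences[t] for t in set(text.split()))
        return sorted(urls, key=score, reverse=True)

store = LearningStore()
store.documents["http://example.org/wireless"] = "wireless speech networks"
store.documents["http://example.org/cooking"] = "recipes and cooking tips"
store.observe("wireless speech recognition")
print(store.rank(list(store.documents)))  # wireless page now ranks first
```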
From click to speech
Communication between humans and their machines has been the subject of a large
amount of technical research. Work on the "man-machine interface" has been
conducted since the development of complex machinery, such as locomotive trains,
automobiles and washing machines. The need to efficiently provide information to
machines, to control their functions, and receive information from machines to
inform the human operators of their status, has increased dramatically over many
decades. How should we converse with a computer, its connected devices and the
machines they may control? If talking is the most natural way humans communicate
- why not communicate with computers through ordinary language? After all, we
learn to speak before we learn to read and write. Speech is also a highly
efficient form of communications - people speak about five times faster than
they type.
Today, there are two basic approaches to deciphering spoken commands. One
approach uses matching algorithms that compare bit patterns of the spoken
commands to standard bit patterns stored in the speech recognition software's
library of patterns. The commands and related bit patterns are matched for
appropriate actions. This approach is most often used in discrete speech
applications. A library of bit patterns is created by averaging the patterns of
a large sampling of pronunciations for a specific vocabulary. In the second
approach users "train" the speech software by providing speech patterns. This
approach uses statistical modeling and libraries of word and grammar rules to
increase the accuracy and responsiveness of the speech software. Most of today's
speech recognition applications have a vocabulary database of up to 200,000
words with appropriate grammar rules.
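A hypothetical sketch of the first, pattern-matching approach follows. Real systems compare sequences of acoustic features rather than raw bit patterns, and the commands, feature vectors, and threshold below are invented for illustration.

```python
import math

# Illustrative reference library: command -> averaged feature pattern.
LIBRARY = {
    "call": [0.8, 0.1, 0.3, 0.5],
    "stop": [0.2, 0.9, 0.4, 0.1],
}

def distance(a, b):
    """Euclidean distance between two fixed-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(pattern, threshold=0.5):
    """Return the closest command, or None if no pattern is close enough."""
    best = min(LIBRARY, key=lambda cmd: distance(pattern, LIBRARY[cmd]))
    return best if distance(pattern, LIBRARY[best]) <= threshold else None

print(recognize([0.7, 0.2, 0.3, 0.4]))  # -> "call"
```

Speech recognition (speech-to-text)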
transforms human voice inputs into commands and characters. Speech recognition
would be a highly desirable capability for handheld mobile devices if the many
obstacles can be overcome, allowing machines to recognize the user's comments
and respond in context. There are two modes of speech recognition:
command-and-control and dictation. Voice input in a command-and-control
environment can trigger specific actions and can be used to navigate an
application. For example, a user speaks the command "Call George" to launch the
application that automatically dials George's cell phone number (a minimal
dispatch sketch appears below). Rather than initiating an action from a single
word or phrase, dictation transcribes the user's speech into text. The text
file can then be saved just like any other file and sent as an attachment over
the Web.
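The command-and-control mode thus reduces to mapping recognized phrases onto actions. Here is a minimal sketch of such a dispatcher; the dial function and command table are hypothetical stand-ins rather than any particular product's API.

```python
import re

def dial(name):
    # Placeholder for the real dialing application.
    print(f"Dialing {name}'s cell phone number...")

# Command table: pattern over recognized text -> action to trigger.
COMMANDS = [
    (re.compile(r"^call (\w+)$", re.IGNORECASE), dial),
]

def handle(utterance):
    """Route a recognized utterance to its action; False if no match."""
    for pattern, action in COMMANDS:
        match = pattern.match(utterance.strip())
        if match:
            action(*match.groups())
            return True
    return False

handle("Call George")  # -> Dialing George's cell phone number...
```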
The Speech Interface Framework working group of the World Wide Web Consortium
(W3C) is developing standards to enable access to the Web using spoken language
(see www.w3.org/Voice). The Speech Synthesis Markup Language (SSML)
specification is part of a set of new markup specifications for voice browsers.
It is an XML-based markup language for assisting the generation of synthetic
speech on the Web. It gives authors a standard way to mark up content for
synthesis and to control aspects of speech such as pronunciation, volume,
pitch, and rate across different synthesis-capable platforms. The VoiceXML
Forum is an industry organization established to promote VoiceXML as the
standard for speech-enabled Web applications. Current computer systems use
video displays and the keyboard or mouse (to point and click) as the primary
methods of user interface. What is needed is a transition from the
point-and-click method to the more natural use of human speech as a primary
(though not the only) user interface. The main technology requirements are
centered on speech recognition, speech understanding, converting text to speech,
language translation, speech synthesis, and speech markup language.
REQUIREMENTS: Speech Recognition, Language Understanding, Text-to-Speech,
Translation
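To make the SSML discussion above concrete, here is a small sketch that builds an SSML document with Python's standard library. The element and attribute names follow the W3C SSML specification; the prosody settings and text are illustrative.

```python
import xml.etree.ElementTree as ET

# Root <speak> element per the W3C Speech Synthesis Markup Language spec.
speak = ET.Element("speak", {
    "version": "1.0",
    "xmlns": "http://www.w3.org/2001/10/synthesis",
    "xml:lang": "en-US",
})

# <prosody> controls aspects of speech such as rate and pitch.
prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "low"})
prosody.text = "Your flight departs at "
emphasis = ET.SubElement(prosody, "emphasis")
emphasis.text = "nine thirty"

print(ET.tostring(speak, encoding="unicode"))
```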
When speech recognition reliability begins to reach the necessary level of
acceptable performance, we can expect to see a rapid adoption of the
technology.
For many different purposes, the use of natural speech greatly improves the
effectiveness of our communications; this is especially true for wireless
applications since mobile devices are small and awkward to manipulate, and, in
addition, have limited capacity for receiving and delivering text or graphic
information. Is this to say that future systems will rely exclusively on the
speech interface to wireless handheld devices? Of course not. Future systems
should be expected to implement a variety of user interfaces and terminal
devices. The mouse and its associated graphic displays will be with us for some
time, and there will always be the need to display highly data-intensive visual
outputs. Speech-based interfaces represent a new dimension that does not exclude
but rather builds upon the methods currently used to interact with computer
systems and networks.
The final element in the desired communication process for the Intelligent
Wireless Web is delivering recognizable speech output to the recipient through
speech synthesis. Speech Synthesis is the automatic generation of speech from
text data or from other non-text applications. Text-to-Speech (TTS) refers to
audible responses from the computer. The actual voice responses can use recorded
human speech phrases or audio generators that produce a natural human sound.
However, a large amount of memory is needed to store the recorded voice
vocabulary.
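As a concrete example of generated speech output, the sketch below uses the third-party pyttsx3 package (pip install pyttsx3), one possible wrapper that drives the platform's local synthesis engine. The property names follow pyttsx3's documented API; the rate and volume values are illustrative.

```python
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)    # speaking rate in words per minute
engine.setProperty("volume", 0.9)  # volume from 0.0 to 1.0

engine.say("Your message has been delivered.")
engine.runAndWait()                # block until the speech finishes
```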
Man-machine interaction applications will require a virtually unlimited
vocabulary of speech output and a wide-ranging sound analysis and generation
capability from text-to-speech (TTS) systems in order to produce ever more
human-sounding speech. At the same time, current applications place practical
constraints on TTS system parameters, with limitations on factors such as memory
size, software flexibility, and processor performance. For example, many
embedded systems require small (and therefore limited)
TTS systems for slower processors; some wireless applications require small
speech generation components on the wireless client coupled with low data
transmission rates. An increasing number of applications, however (e.g., dialog
systems, aids for disabled individuals, and Web page readers), will benefit from
control over voices, pitch, and other aspects of the speech output. TTS systems
today do not yet meet the fundamental goal of producing speech indistinguishable
from that of a human. A system that is natural-sounding in one area, such as
overall voice quality or the quality of individual speech sounds, may be
unnatural-sounding in another, such as prosody (pitch patterns and timing); or a
system that is generally more natural-sounding may be less intelligible. In some
cases, a system that excels in overall voice quality may require an unacceptable
amount of memory or execution time. The ultimate goal, then, is a system that
not only faithfully replicates human speech, but also meets the needs of
applications in terms of flexibility, memory usage, and performance. Thus the
most significant issue related to speech synthesis is producing high-quality
speech while minimizing the hardware and software requirements, including
memory, algorithmic complexity, and speed of computation.
From dumb and static to intelligent and dynamic
We have said that fundamentally, our vision for the future of an Intelligent
Wireless Web is very simple - it is a network that provides anytime, anywhere
access to information resources with efficient user interfaces and applications
that learn and thereby provide increasingly useful services whenever and
wherever we need them. For the Web to learn, it requires learning algorithms
and mechanisms for the self-organization of a hypertext network. This means
developing algorithms that would allow the Web to autonomously change its
structure and organize the knowledge it contains by "learning" the ideas and
preferences of its users.
One way to move toward these goals has been suggested by the World Wide Web
Consortium (W3C): the use of better semantic information as part of Web
documents and of next-generation Web languages like XML and RDF. The
Semantic Web Architecture will move from IP to Mobile IP combined with an XML
Layer, an RDF and Schema layer, and a Logic layer. Facilities to put
machine-understandable data on the Web are becoming a high priority for many
communities.
The Web can reach its full potential only if its data can be shared and
processed by automated tools.
Tomorrow's programs must be able to share and process data even when designed
totally independently. The Semantic Web is a vision of having data on the web
defined and linked in a way that it can be used by machines not just for display
purposes, but for automation, integration and reuse of data.
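A short sketch of what machine-understandable data looks like in practice, using the third-party rdflib package (pip install rdflib). The person and URIs are invented for illustration; FOAF is a real RDF vocabulary bundled with rdflib.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()
george = URIRef("http://example.org/people/george")

# Triples: machine-readable statements of subject, predicate, object.
g.add((george, RDF.type, FOAF.Person))
g.add((george, FOAF.name, Literal("George")))
g.add((george, FOAF.mbox, URIRef("mailto:george@example.org")))

print(g.serialize(format="turtle"))
```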
So what are the key needs to enable the transition from the current dumb and
static systems to the intelligence and flexibility of the Intelligent Wireless
Web? We believe the key is the Semantic Web. Key technology requirements
include XML schema, RDF schema, logic layering, distributed Artificial
Intelligence (AI), and AI server farms. In addition, information registration
and validation will be an essential global service to support activities such
as financial transactions.
REQUIREMENTS: XML schema, RDF schema, Logic Layer
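As one example of how XML schema supports validation services such as those mentioned above, the sketch below checks a document against a schema using the third-party lxml package (pip install lxml). The file names are illustrative.

```python
from lxml import etree

# Load the schema and the document to be validated (illustrative paths).
schema = etree.XMLSchema(etree.parse("transaction.xsd"))
doc = etree.parse("transaction.xml")

if schema.validate(doc):
    print("Document is valid against the schema.")
else:
    for error in schema.error_log:
        print(error.message)
```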
REFERENCE:
The Intelligent Wireless Web
by H. Peter Alesso and Craig F. Smith