The 21st Century is becoming ever more dependent upon wireless communication for phones, tablets, and notebooks. Wireless delivery of video and Internet service is rapidly expanding worldwide.

Ideally, the wireless communication process could start with a user interface based on speech recognition where we merely talk to a personal mobile device that recognizes our words and commands. The personal mobile device would connect seamlessly to embedded devices in the environment. The message would be relayed to a server residing on a network with the necessary processing power and software to analyze the contents of the message. The server would link to additional Web resources that could then draw necessary supplemental knowledge from around the world through the Internet. Finally, the synthesized message would be delivered to the appropriate parties in their own language on their own personal mobile device.

This process requires us to explore the following communications relationships:

  • Connecting People to Devices (the user interface): Currently we rely on the mouse, keyboard, or touch screen display. Speech recognition and software logic for mobile devices may be the key for the future.
  • Connecting Devices to Devices: Future smart applications require improvement of wireless infrastructure and intelligent machine-to-machine software.
  • Connecting Devices to People: To deliver useful information to the globally mobile user, systems require speech synthesis and language translation. Information services and communications may be transformed to broadband global delivery.

There are a number of challenges to the development and deployment of scalable, intelligent wireless Web applications.

These include device proliferation, bandwidth and interface limitations, the demand for applications with greater capabilities, and the need for upgraded wireless standards.

Perhaps the most daunting challenge is the integration, synthesis, and interfacing of these elements. So, just how will the Web become smart enough to fulfill the vision of a robust global mobile system providing increasingly relevant and intelligent applications? The development of the physical components and software necessary to implement the Intelligent Wireless Web requires insight into the compatibility, integration, and synergy of the following five emerging technology areas:

  • User Interface - to transition from the mouse click, keyboard, and touch screen to speech as the primary method of communication between people and devices;
  • Personal Space - to transition from connection of devices by tangled wires to multifunction wireless devices;
  • Networks - to transition to a global wireless system of interconnections;
  • Protocols - to transition to the new Mobile IP; and
  • Web Architecture - to transition from dumb and static applications to new applications that are intelligent, dynamic and constantly improving.

As the Web matures, the information technology community increasingly views it as a global database coupled with a knowledge representation system. A database management system is simply a collection of procedures for retrieving, storing, and manipulating data. The Web, however, can also be viewed in terms of applied "learning algorithms": data is taken from a database as input and, after appropriate algorithmic operations (based on statistics, experiment, or other approaches), an output statement containing enhanced data is returned, thereby representing learning. In building the Intelligent Wireless Web, we are seeking to create a Web that learns, yielding continuously improved applications and information. We will highlight the innovative processes underway in each of these technological areas:
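The "database plus learning algorithm" view can be made concrete with a toy sketch: a learning step counts which documents users actually selected, and the enhanced output re-ranks results by those learned scores. All names here (documents, the selection log) are hypothetical illustrations, not part of any real system.

```python
from collections import Counter

def learn_relevance(selection_log):
    """The 'learning' step: count how often each document was
    selected, treating the counts as learned relevance scores."""
    return Counter(selection_log)

def enhanced_results(documents, scores):
    """The 'enhanced data' output: the same documents, re-ranked
    by what the system has learned from prior user selections."""
    return sorted(documents, key=lambda d: scores[d], reverse=True)

# A toy "database" of documents and a log of past user selections.
docs = ["wireless-faq", "speech-intro", "mobile-ip-spec"]
log = ["speech-intro", "speech-intro", "mobile-ip-spec"]

scores = learn_relevance(log)
print(enhanced_results(docs, scores))
# the most frequently selected documents now rank first
```

Each pass through the log improves the output, which is the sense in which the system "learns" from its own data.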

  • USER INTERFACE - from click to speech
  • PERSONAL SPACE - from wired to wireless
  • NETWORKS - from wired to integrated wired/wireless
  • PROTOCOLS - from IP to Mobile IP
  • WEB ARCHITECTURE - from dumb and static to intelligent and dynamic

From click to speech

Communication between humans and their machines has been the subject of a large amount of technical research. Work on the "man-machine interface" has been conducted since the development of complex machinery, such as locomotive trains, automobiles and washing machines. The need to efficiently provide information to machines, to control their functions, and receive information from machines to inform the human operators of their status, has increased dramatically over many decades. How should we converse with a computer, its connected devices and the machines they may control? If talking is the most natural way humans communicate - why not communicate with computers through ordinary language? After all, we learn to speak before we learn to read and write. Speech is also a highly efficient form of communications - people speak about five times faster than they type.

Today, there are two basic approaches to deciphering spoken commands. One approach uses matching algorithms that compare bit patterns of the spoken commands to standard bit patterns stored in the speech recognition software's library of patterns. The commands and related bit patterns are matched for appropriate actions. This approach is most often used in discrete speech applications. A library of bit patterns is created by averaging the patterns of a large sampling of pronunciations for a specific vocabulary. In the second approach, users "train" the speech software by providing speech patterns. This approach uses statistical modeling and libraries of word and grammar rules to increase the accuracy and responsiveness of the speech software. Most of today's speech recognition applications have a vocabulary database of up to 200,000 words with appropriate grammar rules.

Speech recognition (speech-to-text) transforms human voice inputs into commands and characters. Speech recognition would be a highly desirable capability for handheld mobile devices, if the many obstacles can be overcome, allowing machines to recognize the user's comments and respond in context. There are two modes of speech recognition: command-and-control and dictation. Voice input in a command-and-control environment can trigger specific actions and can be used to navigate an application. For example, a user speaks the command "Call George" to launch the application that automatically dials George's cell phone number. Rather than mapping a single word or phrase to an action, dictation transcribes the user's speech into text. The text file can then be saved just like any other file and sent as an attachment over the Web.
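The first, library-matching approach can be sketched in a few lines: an utterance's bit pattern is compared against each stored pattern, and the closest match wins. The 8-bit patterns and the two commands below are made-up stand-ins for real averaged pronunciation templates.

```python
def hamming(a, b):
    """Bit-level distance between two equal-length patterns."""
    return sum(x != y for x, y in zip(a, b))

def recognize(spoken, library):
    """Match a spoken bit pattern against the stored library and
    return the command whose pattern is closest."""
    return min(library, key=lambda word: hamming(spoken, library[word]))

# Hypothetical library: each command averaged into an 8-bit pattern.
library = {
    "call": [1, 0, 1, 1, 0, 0, 1, 0],
    "stop": [0, 1, 0, 0, 1, 1, 0, 1],
}

# A noisy utterance, one bit away from the stored "call" pattern.
utterance = [1, 0, 1, 1, 0, 0, 1, 1]
print(recognize(utterance, library))  # prints "call"
```

Real discrete-speech recognizers match far richer acoustic features than raw bits, but the nearest-template principle is the same.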

The Speech Interface Framework working group of the World Wide Web Consortium (W3C) is developing standards to enable access to the Web using spoken language. The Speech Synthesis Markup Language (SSML) specification is part of a set of new markup specifications for voice browsers. It is an XML-based markup language for assisting the generation of synthetic speech on the Web. It provides authors a standard way to mark up content for synthesis and to control aspects of speech such as pronunciation, volume, pitch, and rate across different synthesis-capable platforms. The VoiceXML Forum is an industry organization established to promote VoiceXML as the standard for speech-enabled Web applications.

Current computer systems use video displays and the keyboard or mouse (to point and click) as the primary methods of user interface. What is needed to transition from the point-and-click method to the more natural use of human speech as a primary (though not the only) user interface? The main technology requirements center on speech recognition, speech understanding, converting text to speech, language translation, speech synthesis, and speech markup languages.
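A minimal SSML document, using elements from the W3C specification (`prosody`, `break`, `say-as`), might look like this; the flight announcement itself is purely illustrative:

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    Your flight departs at
    <prosody rate="slow" volume="loud">9:45 AM</prosody>.
    <break time="500ms"/>
    The gate number is <say-as interpret-as="characters">B12</say-as>.
  </p>
</speak>
```

The markup tells any conforming synthesizer to slow down and raise the volume for the time, pause half a second, and spell out the gate letter by letter, rather than leaving those rendering decisions to chance.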

REQUIREMENTS: Speech Recognition, Language Understanding, Text-to-Speech, Language Translation

When speech recognition reliability reaches an acceptable level of performance, we can expect rapid adoption of the technology.

For many different purposes, the use of natural speech greatly improves the effectiveness of our communications; this is especially true for wireless applications since mobile devices are small and awkward to manipulate, and, in addition, have limited capacity for receiving and delivering text or graphic information. Is this to say that future systems will rely exclusively on the speech interface to wireless handheld devices? Of course not. Future systems should be expected to implement a variety of user interfaces and terminal devices. The mouse and its associated graphic displays will be with us for some time, and there will always be the need to display highly data-intensive visual outputs. Speech-based interfaces represent a new dimension that does not exclude but rather builds upon the methods currently used to interact with computer systems and networks.

The final element in the desired communication process for the Intelligent Wireless Web is delivering recognizable speech output to the recipient through speech synthesis. Speech synthesis is the automatic generation of speech from text data or from other non-text applications. Text-to-Speech (TTS) refers to audible responses from the computer. The actual voice responses can use recorded human speech phrases or audio generators that produce a natural human sound. However, a large amount of memory is needed to store the recorded voice vocabulary.
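The recorded-phrase approach amounts to simple concatenation: each vocabulary word maps to a stored audio clip, and the synthesizer joins clips with short silences. In this sketch the "clips" are tiny placeholder sample lists rather than real recordings, which also makes visible why memory cost grows with vocabulary size.

```python
# Hypothetical recorded vocabulary: word -> audio samples.
# Real systems store digitized waveforms; short lists stand in here.
vocabulary = {
    "your": [0.1, 0.2],
    "call": [0.3, 0.1],
    "is": [0.2],
    "connected": [0.4, 0.5, 0.1],
}

def synthesize(text):
    """Concatenate the stored clip for each word, inserting a short
    silence (a zero sample) between words."""
    silence = [0.0]
    out = []
    for word in text.lower().split():
        out.extend(vocabulary[word])
        out.extend(silence)
    return out[:-1]  # drop the trailing silence

audio = synthesize("your call is connected")
print(len(audio))  # prints 11
```

Every new word the system can speak adds another stored clip, which is exactly the memory pressure the paragraph above describes.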

Man-machine interaction applications will require a virtually unlimited vocabulary of speech output and a wide-ranging sound analysis and generation capability from text-to-speech (TTS) systems in order to produce ever more human-sounding speech. At the same time, current applications place practical constraints on TTS system parameters, with limitations on factors such as memory size, software flexibility, and processor performance. For example, many embedded systems require small (and therefore limited) TTS systems for slower processors; some wireless applications require small speech generation components on the wireless client coupled with low data transmission rates. An increasing number of applications, however (e.g., dialog systems, aids for disabled individuals, and Web page readers), will benefit from control over voices, pitch, and other aspects of the speech output.

TTS systems today do not yet meet the fundamental goal of producing speech indistinguishable from that of a human. A system that is natural-sounding in one area, such as overall voice quality or the quality of individual speech sounds, may be unnatural-sounding in another, such as prosody (pitch patterns and timing); or a system that is generally more natural-sounding may be less intelligible. In some cases, a system that excels in overall voice quality may require an unacceptable amount of memory or execution time. The ultimate goal, then, is a system that not only faithfully replicates human speech, but also meets the needs of applications in terms of flexibility, memory usage, and performance. Thus the most significant issues in speech synthesis are the production of high-quality speech while minimizing the hardware and software requirements, including memory, algorithmic complexity, and speed of computation.

From dumb and static to intelligent and dynamic

We have said that, fundamentally, our vision for the future of an Intelligent Wireless Web is very simple: it is a network that provides anytime, anywhere access to information resources, with efficient user interfaces and applications that learn, thereby providing increasingly useful services whenever and wherever we need them. For the Web to learn, it requires learning algorithms and mechanisms for the self-organization of a hypertext network. Algorithms must be developed that allow the Web to autonomously change its structure and organize the knowledge it contains by "learning" the ideas and preferences of its users.
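One simple self-organization mechanism, sketched here under the assumption that link strengths are adjusted from observed navigation, reinforces a hyperlink each time a user follows it; the page names are invented for illustration.

```python
from collections import defaultdict

# Link weights: (from_page, to_page) -> strength learned from use.
weights = defaultdict(float)

def follow(src, dst, rate=0.1):
    """Reinforce a hyperlink each time a user traverses it."""
    weights[(src, dst)] += rate

def suggested_links(src, links):
    """Order a page's outgoing links by learned strength, so the
    hypertext structure 'reorganizes' around user preferences."""
    return sorted(links, key=lambda dst: weights[(src, dst)], reverse=True)

# Three users follow home -> speech; one follows home -> protocols.
for _ in range(3):
    follow("home", "speech")
follow("home", "protocols")

print(suggested_links("home", ["protocols", "speech"]))
```

Nothing is hand-curated here: the ordering emerges entirely from usage, which is the sense in which the network's structure "learns" its users' preferences.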

One way to move toward these goals has been suggested by the World Wide Web Consortium (W3C): the use of better semantic information as part of Web documents, together with next-generation Web languages such as XML and RDF. The Semantic Web architecture will move from IP to Mobile IP, combined with an XML layer, an RDF and Schema layer, and a Logic layer. Facilities to put machine-understandable data on the Web are becoming a high priority for many communities.

The Web can reach its full potential only if it can be processed by automated tools. Tomorrow's programs must be able to share and process data even when designed totally independently. The Semantic Web is a vision of having data on the Web defined and linked in such a way that machines can use it not just for display purposes, but for automation, integration, and reuse of data.

So what are the key needs to enable the transition from the current dumb and static systems to the intelligence and flexibility of the Intelligent Wireless Web? We believe the key is the Semantic Web. Key technology requirements include XML schema, RDF schema, logic layering, distributed Artificial Intelligence (AI), and AI server farms. In addition, information registration and validation will be an essential global service to support activities such as financial transactions.
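RDF's core data model is the subject-predicate-object triple. A minimal in-memory triple store with wildcard queries (a sketch, not the real RDF stack; the flight data is invented) shows why machine-understandable data enables automation rather than mere display:

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
triples = [
    ("flight42", "departsFrom", "SFO"),
    ("flight42", "arrivesAt", "JFK"),
    ("flight42", "departureTime", "09:45"),
]

def query(s=None, p=None, o=None):
    """Return every triple matching the pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# A program, not a person, can now answer: where does flight42 depart?
print(query(s="flight42", p="departsFrom"))
```

Because meaning lives in the predicates rather than in page layout, independently written programs can share and combine these facts, which is the Semantic Web's central promise.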

REQUIREMENTS: XML schema, RDF schema, Logic Layer


   The Intelligent Wireless Web by H. Peter Alesso and Craig F. Smith