Digital Talking Book Requirements for MML

Markku Hakkinen, The Productivity Works
George Kerscher, Recording for the Blind and Dyslexic
Created: 4 August 1997 Revised: 9 September 1997


Audience / Beneficiaries

General Requirements for Digital Talking Books

Functional Requirements

How MML can Meet These Requirements


We have conducted an analysis of the current draft of the MML specification provided on 13 July 1997. Our analysis is based upon the requirements for the production and delivery of digital talking and hybrid books. These requirements arise directly out of international development efforts in digital talking books, namely DAISY, Recording for the Blind and Dyslexic, and the Library of Congress NLS. These are active projects, all with specific time frames, ranging from the immediate to 18 months.

On 4 August 1997 in Boston, 12 representatives from the Royal National Institute for the Blind (RNIB), DAISY, W3C, RFB&D, NLS, Productivity Works, NCR Corporation, the European Blind Union (EBU), and other organizations met to discuss digital talking book requirements and the need to provide input to W3C's multimedia working group. Over the last year, it has become evident to these groups that the web protocols HTML and RTSP, and lately the multimedia working group draft specification, are highly relevant to digital talking books; in fact, two groups (DAISY and RFB&D) have adopted web-based protocols as the basis for their ongoing talking book efforts.



Digital audio is an integral aspect of book and information delivery and must be viewed simply as an additional method of publishing, typified by products such as Microsoft Encarta. The efforts in digital audio are directed at taking this new method of publishing information into the realms of educational, reference, and leisure materials, where mixed media and universal accessibility combine to provide significant new marketplaces and opportunities for publishers and content providers.

It is clear that this additional paradigm in publishing is here to stay, and that the use of web protocols provides the ideal mechanism for the multiple modalities of delivery platforms required. From CD playback-only devices analogous to the Walkman, through low-cost web access devices, to full-fidelity PCs where audio and textual annotations can be added to the materials by the user, this new form of publishing addresses an enormous potential marketplace.

The audience for digital audio used as an integral part of book and information delivery thus covers all schools, colleges, organizations, and individuals, and all levels of aptitude and physical and visual capability within this vast potential user community. Consider the salesperson listening to competitive information produced by her company while exercising in the morning or driving to work; the worker in a hands-busy, eyes-busy environment using a reference manual; the child at school who needs to call in by telephone and quickly reference something for homework; or the university student reviewing materials while cleaning a dorm room. All are simple uses of digital audio in this new format.

This, of course, does not even begin to address the uses for people learning a new language, people who have learning disabilities, and people who are partially or fully illiterate. For the first time, through the Web and its protocols, we have a chance to reach the billion or more people in the world who are simply illiterate and to begin to deliver information and resources that will improve their lives.

The audience is everyone.


General Requirements for Digital Talking and Hybrid Books

There are several classes of digital talking books:

Type of Book                          Navigation Types Available
------------------------------------  -----------------------------------------
Audio Only                            Time-based navigation
Audio with Page, Section Structure    Time-based and structure navigation
Audio with Full Text*                 Time-based, structure-based, text search
Text Only (synthetic speech)          Structure-based and text search

* Full text may include multimedia elements, such as images, video, and audio.

In addition to the navigation methods, each of the book types has synchronization mechanisms:

Type of Book                          Synchronization Available
------------------------------------  -----------------------------------------
Audio Only                            At-beginning/at-end events
Audio with Page, Section Structure    Structure passage events
Audio with Full Text*                 Structure down to word-level events
Text Only (synthetic speech)          Structure down to word-level events

Functional Requirements

Synchronization is critical to digital talking books for two basic reasons:


Maintaining Text Position During Audio Playback by Structure or Word

Example 1:

When listening to a science book (full audio and text available), a visually disabled student may want to query the spelling of a word just heard. Using her book reader (which may be a web browser with speech, or a specialised talking book device), the student presses a "Spell Function" key, which pauses the natural spoken reading of the book and places her in a word-by-word navigation mode. Using cursor keys, she selects the word in question and presses the "Spell This" function, which spells the word using synthetic or recorded human speech. She then resumes playback of the book.

Example 2:

A dyslexic reader is using a graphical browser on a multimedia PC to read a novel. As each word is read by the recorded human voice, it is highlighted on screen. The user may pause, step word by word, and resume normal reading at will. NOTE: "dyslexic" can easily be replaced with "child learning to read" or "person learning a new language".

The MML synchronization specification must provide sufficient granularity to support word-level event generation to the browser (or playback device) for positioning and synchronized display.


Beginning synchronized playback from a specific position in the document structure

Example 1:

A visually disabled student conducts a text search looking for information on Cnidarians. A search engine returns a reference in an on-line version of Parade of Life: Animals, a hybrid digital book which includes full text, spoken audio, and images. The student selects the link, which opens page 23, paragraph 2, third sentence, and automatically begins audio playback at the sentence containing the word Cnidarians. (NOTE: this assumes an audio browser in which automatic initiation of synchronized audio playback is enabled, where available in the source document.)

Example 2:

A user of a handheld book reader owns a "cookbook" downloaded from a web site to his device. Pressing the next-page button, he hears an ascending list of page numbers until he arrives at the one he desires. He then presses the go-to-page button, and playback begins automatically at that location.

From a digital talking book file (i.e., an HTML document) there must be an efficient mechanism for linking to the MML file. In order to achieve the desired functionality, the user agent will have to support initiation and control of audio (or, more correctly, multimedia) playback based upon the MML file contents and the present document position.

If the user is positioned at a specific paragraph or list item, pressing a "begin playback" key will use the current position in the document as an index into the MML file for starting the synchronized audio playback.
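To make this mechanism concrete, the following is a minimal sketch in Python of a player using the current document position as an index into MML synchronization data. The entry format, function names, and timecode handling are illustrative assumptions, not part of any specification.

```python
# Sketch: start playback from the user's current document position.
# An MML file is modeled here as a list of (id, begin) entries;
# all names and structures are illustrative assumptions.

def parse_timecode(tc):
    """Convert an hh:mm:ss timecode string to seconds."""
    h, m, s = (int(part) for part in tc.split(":"))
    return h * 3600 + m * 60 + s

def playback_start(sync_entries, current_id):
    """Return the audio start offset (in seconds) for the entry
    whose id matches the browser's current position, or None."""
    for entry_id, begin in sync_entries:
        if entry_id == current_id:
            return parse_timecode(begin)
    return None

entries = [("00222", "01:20:10"), ("00223", "01:23:32")]
print(playback_start(entries, "00223"))  # 5012 seconds into the audio
```

A real player would also need to select which audio file the offset applies to when a book is split across chapter files.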

How MML Can Meet Digital Talking Book Requirements

Before proposing specific examples of how MML can meet our requirements, we will review the present state of our development activities:

Physical Characteristics of the Talking Book with Synchronization

Let's examine the physical characteristics of a typical book in the RFB&D library (75,000 titles):

The text version of the book is in HTML format, and, along with the recorded audio, is split into multiple files, based on chapter.

What is the overhead of adding synchronization-level information to both the HTML source and the MML? For a 30-hour full text and audio book, the MML file will have to contain 70,000 synchronization references. The HTML source will have to contain 70,000 anchors that act as targets for the MML synchronization events and also serve as links back to the MML file for audio start-up.
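Simple arithmetic puts these figures in perspective; the sketch below assumes the 30-hour book and 70,000 references cited above.

```python
# Rough estimate of synchronization density for a 30-hour
# full text and audio book carrying 70,000 sync references.
hours = 30
references = 70_000

audio_seconds = hours * 3600              # 108,000 seconds of audio
seconds_per_ref = audio_seconds / references

print(round(seconds_per_ref, 2))          # about 1.54 s per reference
```

One reference roughly every second and a half corresponds to phrase-level granularity, consistent with the phrase-based navigation used in the DAISY project.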

Current Implementation

We have explored several options over the last year for achieving this kind of functionality, based on a combination of pure HTML 3.2 tags and RealAudio synchronization files. There is a book available (Prentice Hall's Parade of Life: Animals) that demonstrates structure-based synchronization down to the paragraph, page, and heading level. Lacking the ability to initiate playback at a given anchor, there are "Audio Start" hrefs at the start of each page. The book functions, using the RealAudio player, with Netscape, Internet Explorer, and the prototype RFB&D Digital Audio Browser.

Implementation with MML

Using the draft MML specification, we have created sample code that demonstrates our desired functionality, assuming that the user agents support the audio playback controls described previously. The Digital Audio Browser will serve as one reference implementation for this functionality.

Using the ID attribute, it is possible to define a named target in the HTML file that, if an MML file is available, also serves as the link target back into the MML file.

For example, in the HTML source:

<H1 ID=00223>South American River Fish</H1><p>

MML files, which are assumed to be external files associated via a meta tag, can be linked on a one-to-one basis to each of the HTML files in a talking book, or there may be a single MML file for the whole book.

In the linked MML file:

<text href=chapter3.htm#00223 begin=01:23:32 end=99:99:99 id=00223>

With audio playback underway, when time 1:23:32 is reached, a URL sync event (goto) is sent to the playback control container or specified browser, with the target being #00223.
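The player-side dispatch can be sketched as follows; the entry tuples and event strings are illustrative assumptions about how a playback control might surface these goto events.

```python
# Sketch: emit URL sync ("goto") events as playback time passes the
# begin time of each MML entry. Times are in seconds; the event
# representation is an illustrative assumption.

def sync_events(entries, window_start, window_end):
    """Yield goto targets for entries whose begin time falls within
    the elapsed playback window [window_start, window_end)."""
    for begin, target in entries:
        if window_start <= begin < window_end:
            yield "goto #" + target

entries = [(5012, "00223"), (5100, "00224")]
# Playback has advanced from 5010 s to 5020 s: one event fires.
print(list(sync_events(entries, 5010, 5020)))  # ['goto #00223']
```

Polling the elapsed window on each timer tick (rather than exact time matches) keeps events from being dropped if the clock skips past a begin time.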

For the user to initiate playback at a given point, it is assumed that the browser tracks the current active anchor/id for the user's position in the document. If the browser recognizes that the current loaded document has an associated MML definition, then when a play function is selected (such as play button or hot key), the player control uses the current anchor/id as an index into the MML file for starting playback.

When it comes to sub-paragraph-level synchronization, i.e., when there is no associated tag such as <P> at the synchronization point, we need to consider an empty tag that can carry the ID attribute, e.g., <ID=00223>.


Pause Information

During playback of audio from digital talking books, the concept of inter-element pauses is important. In the DAISY project, books are navigated by phrases, which are identified and tagged during the production process. During playback, the presentation can be sped up by compressing or discarding inter-phrase pauses. The pause, which physically occurs at the end of the phrase, can be described by its start point or its duration. Adding this pause information to the MML as an attribute will enable player controls to implement pause compression. In the example below, we use the PD attribute to represent the PauseDuration.

<text href=chapter3.htm#00223 begin=01:23:32 end=99:99:99 pd=00:00:02 id=00223>
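Given per-phrase pd values, pause compression in the player control reduces to trimming each trailing pause. The sketch below is a minimal illustration; the phrase tuples and the keep_pause parameter are assumptions, not part of the proposal.

```python
# Sketch: pause compression driven by a per-phrase pd (PauseDuration)
# value. Each phrase is modeled as (speech_seconds, pause_seconds);
# this modeling is an illustrative assumption.

def compressed_duration(phrases, keep_pause=0.0):
    """Total playback time when each phrase's trailing pause is
    replaced by at most keep_pause seconds of silence."""
    total = 0.0
    for speech_seconds, pause_seconds in phrases:
        total += speech_seconds + min(pause_seconds, keep_pause)
    return total

phrases = [(4.0, 2.0), (3.5, 2.0), (5.0, 1.0)]
print(compressed_duration(phrases))       # 12.5 s, all pauses discarded
print(compressed_duration(phrases, 0.5))  # 14.0 s, 0.5 s kept per pause
```

Keeping a short residual pause, rather than discarding pauses entirely, preserves phrase boundaries for the listener while still shortening total playback time.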

Word Level Synchronization

Another requirement of digital talking books is the need for synchronized display of text down to the word level during audio playback. This is analogous to the karaoke requirement of the "bouncing ball" as a song is being played. In the case of talking books, however, the prospect of defining 300,000 sync events in the source document presents a significant challenge.

Though it is possible to implement word-level synchronization with HTML and MML, it is more likely, from our perspective, that word-level synchronization information will be handled more efficiently as a property of the audio file itself and implemented within the playback control. Therefore, at this time, we see no additional requirements in the MML or HTML source to support word-level synchronization.

Required Browser Modifications

The primary change required of browsers to meet the talking book requirements is the addition of position tracking. For non-visual access this is straightforward, as the current position in a document corresponds to the last spoken element. For a graphical browser it is less clear, as both Netscape and IE, for example, separate the selection-point cursor, the current active link, and the display position. With both browsers, it is not presently clear what the current position is. At a minimum, we believe that if the user places the selection cursor, via cursor movement, mouse click, or a text search action, then the current position should be set to the ID anchor that corresponds to that position in the document.

The current browser position must be available to plug-in players or browser scripting via a document property, such as Document.CurrentPosition, as should the name of the associated MML file. The plug-in, when it receives a play command, can then query these properties when initiating playback.
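The query-then-play sequence might look as follows in a plug-in. The Document object, its property names, and the sync index are all hypothetical, since no such object model is yet defined.

```python
# Sketch: a plug-in player querying hypothetical browser-exposed
# document properties before initiating playback. The property
# names mirror the Document.CurrentPosition suggestion above but
# are otherwise assumptions.

class Document:
    def __init__(self, current_position, mml_file):
        self.CurrentPosition = current_position  # current ID anchor
        self.MMLFile = mml_file                  # associated MML file name

def on_play(document, sync_index):
    """Handle a play command: look up the current anchor in the MML
    sync index and return the audio start offset in seconds."""
    return sync_index.get(document.CurrentPosition)

doc = Document("00223", "chapter3.mml")
index = {"00222": 4810, "00223": 5012}
print(on_play(doc, index))  # 5012
```

The key design point is that the browser owns position tracking while the plug-in owns playback; the document properties form the narrow interface between the two.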

Document owner: M. T. Hakkinen