Markku Hakkinen,
The Productivity Works
George Kerscher,
Recording for the Blind and Dyslexic
Created: 4 August 1997. Revised: 9 September 1997.
We have conducted an analysis of the current draft of the MML specification provided on 13 July 1997. Our analysis is based upon the requirements for the production and delivery of digital talking and hybrid books. These requirements arise directly out of the work of international development efforts in digital talking books, namely, DAISY, Recording for the Blind and Dyslexic, and Library of Congress NLS. These are active projects, all with specific time frames, ranging from the immediate to 18 months.
On 4 August 1997 in Boston, 12 representatives from the Royal National Institute for the Blind (RNIB), DAISY, W3C, RFB&D, NLS, The Productivity Works, NCR Corporation, the European Blind Union (EBU), and other organizations met to discuss digital talking book requirements and the need to provide input to the W3C's multimedia working group. Over the last year, it has become evident to these groups that the web protocols (HTML, RTSP, and, lately, the multimedia working group's draft specification) are highly relevant to digital talking books; in fact, two groups (DAISY and RFB&D) have already accepted web-based protocols as the basis for their ongoing talking book efforts.
Digital Audio is an integral aspect of book and information delivery and has to be viewed simply as an additional method of publishing. This new method of publishing is typified by products such as Microsoft Encarta. The efforts in Digital Audio are directed at taking this new method of publishing information into the realms of educational materials, reference materials, and leisure materials, where mixed media and universal accessibility combine to provide significant new market places and opportunities for the publishers and content providers.
It is clear that this additional paradigm in publishing is here to stay, and that the usage of Web protocols provides the ideal mechanism for the multiple modalities of delivery platforms required. From CD playback-only devices, analogous to the Walkman, through low cost web access devices, to full fidelity PCs where audio and textual annotations can be added to the materials by the user, this new form of publishing hits an enormous potential marketplace.
The audience for digital audio used as an integral part of book and information delivery therefore covers all schools, colleges, organizations, and individuals. It also covers all levels of aptitude and of physical and visual capability within this vast potential user community. From the salesperson listening to competitive information produced by their company while exercising in the morning or driving to work, to a worker in a hands-busy, eyes-busy environment using a reference manual, to a child at school who needs to call in on the telephone and quickly reference something for homework, to a university student reviewing materials while cleaning a dorm room: all are simple uses of digital audio in this new format.
This, of course, does not even begin to address the uses for people learning a new language, people who have learning disabilities, and people who are partially or fully illiterate. For the first time, through the Web and its protocols, we have a chance to reach the billion or more people in the world who are simply illiterate and to begin to deliver information and resources that will improve their lives.
The audience is everyone.
There are several classes of digital talking books:
Type of Book | Navigation Types Available
---|---
Audio Only | Time-Based Navigation
Audio with Page, Section Structure | Time-Based and Structure Navigation
Audio with Full Text* | Time-Based, Structure-Based, Text Search
Text Only (synthetic speech) | Structure-Based and Text Search
In addition to the navigation methods, each of the book types has synchronization mechanisms:
Type of Book | Synchronization Available
---|---
Audio Only | At Beginning/At End Events
Audio with Page, Section Structure | Structure Passage Events
Audio with Full Text* | Structure Down to Word-Level Events
Text Only (synthetic speech) | Structure Down to Word-Level Events
Synchronization is critical to digital talking books for two basic reasons:
1. Maintaining text position during audio playback, by structure or word. The MML synchronization specification must provide sufficient granularity to generate word-level events for the browser (or playback device) for positioning and synchronized display.

2. Beginning synchronized playback from a specific position in the document structure. From a digital talking book file (i.e., an HTML document), there must be an efficient mechanism for linking to the MML file. To achieve the desired functionality, the user agent will have to support initiation and control of audio (or, more correctly, multimedia) playback based upon the MML file contents and the present document position. If the user is positioned at a specific paragraph or list item, pressing a "begin playback" key will use the current position in the document as an index into the MML file for starting the synchronized audio playback.
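The lookup this implies can be sketched as follows. This is an illustration only: the record layout and function name are our own assumptions, not part of the MML draft, and each MML text element is modeled as a plain record holding the anchor id it targets and its begin time.

```javascript
// Hypothetical in-memory model of two MML <text> elements.
const mmlEntries = [
  { id: "00222", href: "chapter3.htm#00222", begin: "01:20:07" },
  { id: "00223", href: "chapter3.htm#00223", begin: "01:23:32" },
];

// Given the browser's current anchor/id, look up where synchronized
// audio playback should begin; returns null if the position has no
// corresponding MML entry.
function beginTimeFor(currentAnchorId, entries) {
  const entry = entries.find((e) => e.id === currentAnchorId);
  return entry ? entry.begin : null;
}
```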
Before proposing specific examples of how MML can meet our requirements, we will review the present state of our development activities:
Let's examine the physical characteristics of a typical book in the RFB&D library (75,000 titles):
The text version of the book is in HTML format and, along with the recorded audio, is split into multiple files based on chapter.
What is the overhead of adding synchronization-level information to both the HTML source and the MML? For a 30-hour full text and audio book, the MML file will have to contain 70,000 synchronization references. The HTML source will have to contain 70,000 anchors, which act as targets for the MML synchronization events and also serve as links back into the MML file for audio start-up.
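For scale, a back-of-the-envelope calculation (ours, for illustration) shows that this figure corresponds to roughly phrase-level granularity:

```javascript
// 30 hours of audio carrying 70,000 synchronization references works
// out to roughly one sync point every 1.5 seconds.
const totalSeconds = 30 * 60 * 60; // 108,000 seconds of audio
const syncPoints = 70000;
const avgSpacingSeconds = totalSeconds / syncPoints; // about 1.54 s
```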
We have explored several options over the last year for achieving this kind of functionality, based on a combination of pure HTML 3.2 tags and RealAudio synchronization files. There is a book available (Prentice Hall's Parade of Life: Animals) that demonstrates structure-based synchronization down to the paragraph, page, and heading level. Because playback cannot be initiated at a given anchor, "Audio Start" hrefs are placed at the start of each page. The book functions, using the RealAudio player, with Netscape, Internet Explorer, and the prototype RFB&D Digital Audio Browser.
Using the draft MML specification, we have created sample code that demonstrates our desired functionality, assuming that the user agents support the audio playback controls described previously. The Digital Audio Browser will serve as one reference implementation for this functionality.
Using the ID attribute, it is possible to define a shortname target in the HTML file that, if an MML file is available, also serves as the link target back into the MML file.
For example, in the HTML source:
<H1 ID=00223>South American River Fish</H1><p>
MML files, which are assumed to be external files associated via a meta tag, can be linked on a one-to-one basis to each of the HTML files in a talking book, or there may be one MML file per book.
In the linked MML file:
<text href=chapter3.htm#00223 begin=01:23:32 end=99:99:99 id=00223>
With audio playback underway, when time 01:23:32 is reached, a URL sync event (goto) is sent to the playback control container or specified browser, with the target being #00223.
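The time-to-event dispatch could be sketched as follows. This is a simplified illustration in JavaScript; the record layout and helper names are our own assumptions, not defined by the MML draft.

```javascript
// Convert an hh:mm:ss timestamp to seconds.
function toSeconds(hhmmss) {
  const [h, m, s] = hhmmss.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// As the audio clock advances, return the URL sync ("goto") targets
// whose begin times have been reached, in document order.
function syncTargetsUpTo(clockSeconds, entries) {
  return entries
    .filter((e) => toSeconds(e.begin) <= clockSeconds)
    .map((e) => "#" + e.id);
}

// Two phrases from the running example.
const entries = [
  { id: "00222", begin: "01:20:07" },
  { id: "00223", begin: "01:23:32" },
];
```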
For the user to initiate playback at a given point, it is assumed that the browser tracks the current active anchor/id for the user's position in the document. If the browser recognizes that the current loaded document has an associated MML definition, then when a play function is selected (such as play button or hot key), the player control uses the current anchor/id as an index into the MML file for starting playback.
When it comes to sub-paragraph-level synchronization, i.e., when there is no associated tag such as <P> at the synchronization point, we need to consider an empty tag that can carry the ID attribute, e.g., <ID=00223>.
During playback of audio from digital talking books, the concept of inter-element pauses is important. In the DAISY project, books are navigated by phrases, which are identified and tagged during the production process. During playback, the presentation can be sped up by compressing or discarding inter-phrase pauses. The pause, which physically occurs at the end of the phrase, can be described by its start point or duration. Adding this pause information to the MML as an attribute will enable player controls to implement pause compression. In the example below, we are using the PD attribute to represent the PauseDuration.
<text href=chapter3.htm#00223 begin=01:23:32 end=99:99:99 pd=00:00:02 id=00223>
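As a sketch of how a player control might exploit the proposed PD attribute, consider the following. The data layout is our own illustration; only the pd (PauseDuration) attribute itself comes from the example above.

```javascript
// Parse an hh:mm:ss duration into seconds.
function durationSeconds(hhmmss) {
  const [h, m, s] = hhmmss.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Total playback time saved if every trailing inter-phrase pause,
// as declared by the pd (PauseDuration) attribute, is discarded.
function pauseCompressionSavings(phrases) {
  return phrases.reduce((sum, p) => sum + durationSeconds(p.pd), 0);
}

// Two hypothetical phrases with their declared trailing pauses.
const phrases = [
  { id: "00223", pd: "00:00:02" },
  { id: "00224", pd: "00:00:01" },
];
```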
Another requirement of digital talking books is the need for synchronized display of text down to the word level during audio playback. This is analogous to the karaoke requirement of the "bouncing ball" as a song is being played. In the case of talking books, however, the prospect of defining 300,000 sync events in the source document presents a significant challenge.
Though it is possible to implement word-level synchronization with HTML and MML, it is more likely, from our perspective, that word-level synchronization information will be more efficiently handled as a property of the audio file itself and implemented within the playback control. Therefore, at this time, we see no additional requirements on the MML or HTML source to support word-level synchronization.
The primary change required to browsers to meet the talking book requirements is the addition of position tracking. For non-visual access, this is straightforward, as the current position in a document corresponds to the last spoken element. For a graphical browser, this is less clear: both Netscape and IE, for example, maintain separate notions of the selection-point cursor, the current active link, and the display position, and with both browsers it is not presently clear what the current position is. At a minimum, we believe that if the user places the selection cursor, whether via direct cursor movement, a mouse click, or a text search action, then the current position should be set to the ID anchor that corresponds to this position in the document.
The current browser position must be available to plug-in players or browser scripting via a document property, such as Document.CurrentPosition, as should the name of the associated MML file. The plug-in, when it receives a play command, can then query these properties when initiating playback.
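A sketch of the plug-in side follows. The property names mirror the Document.CurrentPosition idea above, but they, and the index structure, are hypothetical stand-ins rather than any implemented browser API.

```javascript
// On a play command, query the browser-exposed document properties
// and resolve the starting point in the associated MML file;
// returns null if there is no MML file or no matching entry.
function onPlayCommand(doc, mmlIndex) {
  const anchorId = doc.currentPosition; // last active anchor/id
  const mmlFile = doc.mmlHref;          // associated MML file, if any
  if (!mmlFile || !(anchorId in mmlIndex)) return null;
  return { file: mmlFile, begin: mmlIndex[anchorId] };
}

// Hypothetical stand-ins for the browser document and parsed MML.
const doc = { currentPosition: "00223", mmlHref: "chapter3.mml" };
const mmlIndex = { "00223": "01:23:32" };
```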