Chair: Mark Hakkinen
Comments on this Document to: hakkinen
Revised: 24 September 1997
Since the initial NLS/NISO meeting in May 1997, there has been significant activity in the area of file formats for digital talking books (DTBs). This report summarizes those activities and provides commentary that will serve as a basis for discussion within the NLS/NISO process.
At the time of the May meeting, we were looking forward to upcoming meetings of the Daisy Consortium and the W3C that would address various aspects of DTB file formats. Since then, the pace has been non-stop, with significant work ongoing.
The key activities:
Why are these key activities? Let's take the issues one by one:
Whether it is our NLS/NISO requirements, or those of the EBU, RFBD, or Daisy, it has become clear that the Internet will play some role in the delivery of DTBs. A major premise of the web is open, standards-based protocols and languages. HTML is an application of the ISO SGML standard and is maintained as a formal specification by the World Wide Web Consortium (W3C).
Does the web currently offer protocols or languages to enable the creation and delivery of talking books? Work undertaken at RFBD in 1996-1997 indicated that much of what was needed was being put in place.
In 1996, RFBD began a project to develop a DTB that included both full text and audio. The design incorporated HTML as the text format, with audio delivery and synchronization achieved using Real Audio. Though based on web protocols, the entire project was developed for CD-ROM delivery, using standard Windows-based web browsers as the playback device. The project continues to date, with one book, Prentice Hall's Parade of Life: Animals, completed. This book includes full audio, full text (annotated with RFBD image descriptions), and full images. The audio is archived in 16-bit, 22 kHz ADPCM wave files, and CDs are being trialed with Real Audio (ISDN encoding), standard Wave, and Voxware codecs. The text format is standard HTML 3.2, with the basic structural elements of the book represented by HTML tags such as paragraphs, headings, and list items.
Synchronization of audio and text is presently achieved with standard Real Audio, based on RTSP. Each synchronization point in the audio stream is associated with a specific, named anchor position in the HTML source document. As an audio file plays, synchronization events are generated by the Real Audio playback software; these are picked up by the web browser and used to position the text display and highlight the text elements corresponding to the current position in the audio.
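To illustrate the approach (this fragment is hypothetical, not taken from the RFBD book itself), a chapter marked up this way would carry its structure in ordinary HTML 3.2 elements, with named anchors serving as the targets for the synchronization events:

```html
<!-- Hypothetical fragment in the style described above. Each named
     anchor marks a synchronization point; as playback reaches the
     corresponding position in the audio stream, the browser scrolls
     to and highlights that anchor. -->
<h1><a name="ch03">Chapter 3</a></h1>
<p><a name="ch03p01">Animals respond to changes in their
surroundings...</a></p>
<p><a name="ch03p02">Some behaviors are learned, while others
are inherited...</a></p>
```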
The Daisy Consortium, a multi-national effort to design and develop a digital audio book standard, decided early this year to re-examine its file format and came to the same conclusion as RFBD with respect to web-based protocols. The existing Daisy phrase-based file format (published and available at http://www.labyrinten.se) was reviewed at a meeting held in Sigtuna, Sweden, in May. A main finding of this and subsequent meetings was that standard HTML-based implementations could achieve the same results as the Daisy format, with the added benefit of full text and structural information.
The Daisy file format redesign is currently underway and is tied to the HTML/MML work underway at W3C. With this new direction, the stated commitment from the Daisy Consortium is to open, standards-based file formats.
A Japanese-funded development effort, the Sigtuna Project, was initiated in July and is presently designing and implementing web browser extensions based directly on the planned new Daisy MML-based file format to enable creation and web-based delivery of audio documents. The project is being carried out by The Productivity Works and Labyrinten with funding from the Japanese Society for the Rehabilitation of Persons with Disabilities.
In August, RFBD joined the Daisy Consortium to participate in the development of a standard digital audio book format. Also during August, George Kerscher of RFBD was named Project Manager of Daisy. As George will be in attendance at the Washington meeting, he can provide a current update on the project.
In April 1997, the W3C, with the support of the White House, began the Web Accessibility Initiative (WAI). WAI is intended as a working group within W3C to ensure that the requirements of accessibility are addressed in all W3C activities.
George Kerscher and Mark Hakkinen raised the issues of DTBs and convergence with web protocols within WAI, and in August held a half-day meeting in Boston to discuss DTBs prior to the general WAI meeting. Representatives from W3C, including the new IPO Judy Brewer, and others from NLS, RNIB, NCR, EBU, Daisy, and RFBD met to review the status of talking book efforts and to discuss ways to coordinate these activities with W3C. As a result, Productivity Works paid the membership fee, officially joined W3C, and is now represented in the Multimedia Working Group. Other organizations involved in DTBs were encouraged to consider membership in W3C.
Progressive Networks, along with firms such as Voxware, have created mechanisms for the delivery of audio over the Internet. Dramatic improvements in audio compression and quality are being made regularly. The specifics of audio compression codecs and streaming protocols are not covered in our current discussion. The only thing we can safely say is that there will be continuing improvements in quality, and that there will be a variety of options available to us. A detailed survey of present and emerging audio compression file formats will be the subject of a subsequent report.
One of the adjunct capabilities developed by Progressive Networks for Real Audio was the synchronization of audio and text. The primary application of this capability was to enable the creation of web-based slide shows and presentations, during which web pages would be displayed in synchronization with an audio or video track. Upon analysis and testing, this capability appeared attractive for hybrid text and audio DTBs and served as the basis for the RFB&D Audio Plus prototype.
During 1997, the W3C began pursuing the development of a formal specification for multimedia synchronization, called MML. After repeated requests to W3C, George Kerscher and Mark Hakkinen received a draft specification of MML in mid-July. Initial analysis of the MML specification showed that it would meet most requirements for the creation of DTBs. There were several issues of concern, and a document, DTB Requirements for MML, was developed and submitted to the MML working group in early August for review. By that point Productivity Works had also joined W3C, and Mark Hakkinen became a member of the working group.
The MML working group is open only to W3C members and invited experts, and the draft specification is not currently available to the general public. On September 22 and 23, the working group met in Amsterdam, with representatives from Microsoft, DEC, Lucent, Alcatel, France Telecom, Philips, CWI, INRIA, Progressive Networks, and Productivity Works. Mark Hakkinen presented the DTB requirements document and gave demonstrations of several DTB examples, along with an example of an MML-based DTB. The key requirements were discussed thoroughly, and by the conclusion of the meeting it was felt that the specific DTB requirements could be met. Though the MML draft specification is still in progress, it appears that DTBs can be represented, produced, and delivered in MML. The working group is expected to release the specification for public comment during the second half of October 1997.
ANSI is presently working on an HTML standard. If the combination of HTML 4.0 and MML becomes an accepted and viable format for delivering DTBs, then it is important to review how the ANSI effort at standardization impacts the NLS/NISO project.
Assuming acceptance of a content standard, such as HTML and MML, a co-requirement will be for content authoring guidelines for the DTB producers. This is especially important with the introduction of full text versions and the inclusion of structural information. Much in the same way that the WAI is working on accessibility guidelines for web authors (and authoring tools), similar guidelines will have to be developed for DTBs.
SAMI is another approach to audio and text synchronization, proposed by Microsoft's Accessibility Group; it has not been presented to the W3C (or, to our knowledge, any other standards group). Based upon an initial review of a specification draft from David Bolnick at Microsoft, SAMI appears to be geared toward closed-captioning of video streams, with added capabilities to provide text synchronization for audio streams. The capabilities of SAMI appear redundant with elements of MML and ASF (discussed below). As SAMI seems geared to the production of short texts augmenting video presentations, it does not appear appropriate for the creation of DTBs. A clarification of SAMI's scope should be requested from Microsoft.
Active Streaming Format (ASF) is a joint proposal from several firms: Microsoft, Adobe, Intel, Progressive Networks, and Vivo. ASF is an attempt to standardize the internet-based delivery of multimedia documents, which can include a variety of media types occurring at defined points in time. The proposal is in fact a file format, which can include embedded media, hyper-references to URLs containing media, and downloadable components. The specification is up for public review, with comments accepted through September 30, 1997. At present, the specification has not been presented to any standards group.
Based upon initial review, an MML-specified document is definable and deliverable by ASF, and can be considered a compiled package of the HTML version of the book, the MML definition, supporting audio and image files, and even the codecs needed to deliver any of the embedded media files. There are additional capabilities in ASF that are not directly part of MML, such as ratings support (PICS) and the downloadable components.
Software players for ASF files are reportedly under development by Microsoft and Progressive Networks, though release dates are not known. Also, the ASF format reportedly supports only vendor-specific audio and media files, largely due to the streaming requirements. The complexity and breadth of the specification itself may have implications for the development of hand-held playback devices.
ASF appears to be an attractive mechanism for packaging DTBs for delivery via both the internet and fixed media, and deserves serious study as one of the possible delivery formats.
DSMCC (Digital Storage Media Command and Control), MPEG-2 series ISO/IEC 13818 Part 6, was suggested for examination by Warner ten Kate of Philips Research (Eindhoven). DSMCC is designed to control MPEG media delivery and presentation in a broadcast environment. It is presently being implemented in set-top boxes by Philips. Additional information has been requested and should be reviewed to determine whether this format is applicable to DTBs, particularly for delivery and presentation via set-top boxes.
The move toward web-based protocols for DTBs clearly expands the scope of delivery and playback devices into a broader, general consumer marketplace. An early sign of this is the Audible Player from Audible Corporation. This handheld, battery-powered device plays two hours of digital audio downloaded from the internet, offers simple navigation based on defined sections and bookmarking, and costs approximately $250. It is targeted at commuters, who would listen to talking books on their way to and from work. Full details on the player are available at the Audible web site. In many ways, this device designed for the sighted commuter embodies many of the features required for the low-end device discussed at the NISO meeting.
To answer the question about file formats for the Audible Player: it has just been announced (September 15) that Real Audio is a supported file format. It is thus possible to download any Real Audio content from the web into the player. Conversations with Audible indicate that the codec used is, or will be, downloadable, enabling further support for other compression formats.
We should expect other devices to appear in the near future, and to see existing devices such as the Daisy Plextalk prototype moving to support open file formats. Visuaide, for example, has begun development of a web-based DTB playback unit. MML playback capabilities are also likely to appear in mainstream consumer electronics, such as DVD-based CD players.
In addition to portable devices, we are already seeing demonstrations of DTB delivery from Daisy and RFBD using web browsers such as Netscape, Internet Explorer, and pwWebSpeak.
The goal of DTBs being delivered across a variety of devices and platforms seems readily achievable.
One advantage of the web-based approach to DTB delivery is that the choice of codec for audio playback is left open to the requirements of the specific delivery media. Because we envision delivery both via hard media such as CD and over the internet, it is likely that we will see, for example, CDs delivered using MPEG audio and net delivery using Real Audio or Voxware codecs. A key point to highlight is that the HTML/MML approach separates the definition of book structure and content from the codec used to deliver the audio.
There are other developments regarding playback and synchronization, some announced and some not yet announced, which should also be of considerable value to DTB efforts. These include speed-up and slow-down, word recognition, and voice morphing.
With this flexibility of audio delivery format, consideration must be given to the process of creating masters for the various delivery media. DTB producers and providers will benefit from mastering systems that read a standard archive format and generate the required delivery format. Delivery format examples include: CD or DVD masters, file directory layouts and contents for web servers, and audio cassette masters. Note that a CD created with full text and audio would play only the audio on a basic Type 1 audio playback device, and an audio cassette would naturally contain only the audio portion of the book. For Type 1, 2, and 3 devices, there is interoperability of the distribution media.
It should also be noted that the archive format must offer some flexibility for each of the DTB archivers and producers. As technology and storage economics evolve, archivers should be able to convert the archived audio as needed. The production mastering systems, therefore, will have to support additional input formats. This requires that the archivers either utilize standard, published audio formats, or provide the necessary conversion software to translate from any proprietary format to a supported standard.
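The mastering concept described above can be sketched in a few lines. This is a minimal illustration of the idea (a single archived master feeding multiple delivery formats), not a description of any actual mastering system; all names, formats, and codec choices here are hypothetical examples drawn from the report.

```python
# Sketch of the mastering concept: one archived master (full text plus
# archival audio) is transcoded into whatever format a target delivery
# medium requires. All names are illustrative, not part of any actual
# DTB specification or tool.

# Archive-side description of one book.
ARCHIVE = {
    "title": "Parade of Life: Animals",
    "text_format": "HTML 3.2",
    "audio_format": "16-bit 22 kHz ADPCM WAV",
}

# Each delivery medium picks its own codec; the HTML/MML structure is
# carried unchanged, so only the audio needs transcoding.
DELIVERY_CODECS = {
    "cd": "MPEG audio",
    "web": "Real Audio",
    "cassette": "analog audio",
}

def plan_master(archive, medium):
    """Return a simple mastering plan for the given delivery medium."""
    if medium not in DELIVERY_CODECS:
        raise ValueError("unknown delivery medium: %s" % medium)
    return {
        "medium": medium,
        "audio": DELIVERY_CODECS[medium],
        # A cassette naturally carries only the audio portion of the book.
        "text": None if medium == "cassette" else archive["text_format"],
    }
```

For example, `plan_master(ARCHIVE, "cd")` would select MPEG audio while carrying the HTML text through unchanged, whereas the cassette plan drops the text entirely, mirroring the interoperability note above.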
Are the previously mentioned DTB file format efforts going in the right direction? How do these developments fit with the requirements we developed in May? These are the types of questions that our group should now begin to address.
If we look at our basic requirements, how do the DTB developments tie in?
No descriptive documents defining the emerging HTML-based file formats are available yet. The W3C specification of HTML 4.0 is public, though the actual MML specification is not. A description of the MML project is available, and the "DTB requirements document" submitted to W3C is also available, but I ask that it not be shared outside of this group at this time.
Other links of interest:
Audible Corporation, makers of the soon-to-be-released hand-held device. (http://www.audible.com)
The DAISY Consortium (http://www.daisy.org)
The World Wide Web Consortium (http://www.w3.org)
Voxware, audio compression expertise, word synchronization, speed up/slow down. (http://www.voxware.com)
Progressive Networks, audio compression, delivery, and standards like RTSP. (http://www.real.com)
Microsoft SAMI Specification (http://www.microsoft.com/enable/products/multimedia.htm)
Active Streaming Format (http://www.microsoft.com/asf)