Structured Audio Navigation and DAISY

DAISY, the Digital Accessible Information SYstem, has evolved into a widely adopted, open standard for digital talking books. DAISY as it exists today is the result of the convergence of several competing efforts from the 1990s, each striving to create a more capable, next-generation replacement for the analog talking book.

Though the DAISY Consortium Web site is the definitive, official source for information on the current state of the standard and technology, some of the artifacts from the early days of the talking book convergence efforts are less well documented on the Web. Through this artifacts page, I will be sharing some material from my archives that is not currently available elsewhere.

Structured Audio Navigation

During 1997, I began using the term “Structured Audio Navigation” to describe how document structure would be used to provide navigation within an audio recording of a book.  The concept of structured audio was not new (see for example, Roy and Schmandt’s work on NewsComm in 1996, Arons’ work on HyperSpeech from 1991, or even my own work on PARAPET in 1978), but the specific approach being developed for DAISY was new.

Except for technical documents used in the standards process, we did not have much time to publish our various efforts on applying structured audio to talking books. Kjell Hansson, Lars Sönnebo, and Jan Lindholm had presented their pioneering work on what would become DAISY 1.0 at ICCHP in 1994. DAISY 1.0 provided structured navigation of audio based upon a phrase level and a table of contents/page model, relying primarily on the acoustic properties of the audio recording (periods of silence between spoken utterances) rather than the underlying textual content and structure.
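For readers unfamiliar with that acoustic approach, the following minimal sketch (written in Python purely for illustration; it is not DAISY 1.0 production code, and the threshold values are hypothetical) shows how phrase boundaries can be located by scanning a 16-bit mono PCM recording for runs of silence:

    import wave
    import array

    def find_phrase_boundaries(path, frame_ms=20, silence_rms=500, min_silence_ms=300):
        """Return approximate phrase start times (seconds) in a 16-bit mono WAV file."""
        with wave.open(path, "rb") as w:
            rate = w.getframerate()
            samples_per_chunk = int(rate * frame_ms / 1000)
            boundaries, silent_run, chunk_index = [0.0], 0, 0
            while True:
                raw = w.readframes(samples_per_chunk)
                if not raw:
                    break
                samples = array.array("h", raw)
                rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
                if rms < silence_rms:
                    silent_run += 1
                else:
                    # A sufficiently long silence has just ended: mark a phrase boundary here.
                    if silent_run * frame_ms >= min_silence_ms:
                        boundaries.append(chunk_index * frame_ms / 1000.0)
                    silent_run = 0
                chunk_index += 1
        return boundaries

A real production system would also need to handle stereo or compressed audio, tune the thresholds to each narrator, and allow a human operator to correct the detected boundaries.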

Based upon ideas from pwWebSpeak, 1996 saw my first efforts to create an audio book that linked the full text of a publication, including its structure, with the recorded audio narration. This work resulted in a collaboration with Recording for the Blind and Dyslexic, and an eventual introduction to the DAISY and Library of Congress talking book efforts. Though prototypes (which I plan to describe further in another Artifacts article) and working papers were many, none were officially published. Digging in my archives, I did locate two documents that had been hosted on the former Productivity Works Web site. The first is a description of synchronized multimedia requirements for talking books, which I wrote with George Kerscher and subsequently presented to the W3C SYMM working group in September 1997. The second is a report I made, also in September, to the Library of Congress National Library Service Digital Talking Book committee in my capacity as Chair of the File Format Working Group. Both documents give interesting insight into how SMIL would eventually become a cornerstone of the DAISY standard.

Though George Kerscher described the role structured text was playing in the development of the new talking book standards at the IFLA 97 Conference, the earliest conference paper I can find describing the new approach to structured audio navigation and the use of SMIL is a presentation I gave with George at the CSUN conference in March 1998. Unfortunately, the presentation is not listed on the CSUN online proceedings site, but I found a copy of the paper and presentation slides in my archives and am making them available here for posterity.

Structure and Synchronized Multimedia

pwWebSpeak allowed a user to listen to a Web page, and to navigate within the audio presentation of the page, generated by a text-to-speech synthesizer, using keyboard commands. Pages written in HTML could include a basic level of structural semantics, such as headings (e.g., the H1 through H6 tags), lists, paragraphs, and several other semantically useful elements. Of course, not all Web authors built structurally meaningful Web content, but if one could create structured documents, for example a book, using the right combination of elements, a user might be able to easily move between chapters, sections, paragraphs, lists, and so on, and even search for the occurrence of a word and begin listening to the book at the point where it occurs. For talking books, actual recordings of a human narrator were the primary form of audio in use, and we needed a means to link a digital audio recording to a structured HTML document.
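As a simple illustration of the principle (this is not pwWebSpeak code, merely a sketch using Python's standard HTML parser), the following collects the heading elements of a page into the kind of structural index a reader could navigate by chapter and section:

    from html.parser import HTMLParser

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    class HeadingIndexer(HTMLParser):
        """Collect (level, text) pairs for every h1..h6 element in an HTML document."""
        def __init__(self):
            super().__init__()
            self.index = []        # list of (heading level, heading text)
            self._level = None     # level of the heading currently being read, if any

        def handle_starttag(self, tag, attrs):
            if tag in HEADINGS:
                self._level = int(tag[1])
                self.index.append((self._level, ""))

        def handle_endtag(self, tag):
            if tag in HEADINGS:
                self._level = None

        def handle_data(self, data):
            if self._level is not None and data.strip():
                level, text = self.index[-1]
                self.index[-1] = (level, (text + " " + data.strip()).strip())

    # Build and print a nested outline for a (hypothetical) book marked up in HTML.
    indexer = HeadingIndexer()
    indexer.feed("<h1>Parade of Life</h1><p>...</p><h2>Chapter 1</h2><p>...</p>")
    for level, text in indexer.index:
        print("  " * (level - 1) + text)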

In 1996, RealAudio proved to have one key piece of technology we needed: the ability to link a URL to a specific point in an audio file. Using the RealAudio player, prototype books were created, notably "Parade of Life: Animals," which synchronized an HTML version of the popular Prentice Hall science textbook with an audio narration. I developed a software tool, named pwStudio, that would record audio and facilitate creation of the synchronization file needed by RealPlayer; it became the forerunner of later DAISY production tools. The hybrid talking books resulting from this process could be played in the Netscape browser or in pwWebSpeak, though when using Netscape, keyboard-based navigation by structure was not possible.

Though RealAudio provided an implementation path, the proprietary nature of the Real Networks code was a real problem in light of our goal of creating an open, non-proprietary standard. Fortunately, this work coincided with the creation of the SYMM working group within W3C during 1997. I flew from Newark to Amsterdam one night in September that year, walking into the working group meeting at CWI about an hour after stepping off the plane to explain what we were trying to do with synchronizing text and audio in talking books. It was a good match, and that is how SMIL wound up as a key foundation of current-day DAISY.
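To show what that foundation looks like in practice, here is a small illustrative sketch in the style of the SMIL files later used by DAISY 2.02 (the file names, fragment identifier, and timings below are invented for the example): each synchronization point is a parallel time container that pairs a fragment of the HTML text with the corresponding clip of recorded narration.

    # Illustrative sketch only: emit one SMIL <par> pairing a text fragment with an
    # audio clip, roughly in the style later adopted by DAISY 2.02.
    def smil_par(text_uri, audio_uri, begin_s, end_s):
        return (
            "<par>\n"
            f'  <text src="{text_uri}"/>\n'
            f'  <audio src="{audio_uri}" clip-begin="npt={begin_s:.3f}s" clip-end="npt={end_s:.3f}s"/>\n'
            "</par>"
        )

    # Hypothetical synchronization point: the heading "Chapter 1" in the HTML text,
    # narrated between 12.0 and 15.5 seconds of the recording.
    print(smil_par("book.html#chapter1", "narration.mp3", 12.0, 15.5))

A player that understands both the document structure and these time containers can then offer exactly the navigation described above: jump to a chapter in the text and begin playback at the matching point in the narration.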

