Structured Audio: Using Document Structure to Navigate Audio Information

Markku Hakkinen & George Kerscher


Recent developments in non-visual access to information on the web have been extended to the design and delivery of new generation Digital Talking Books (DTBs). Traditionally, audio information has been viewed as a sequential, linear presentation, with navigation allowed only in terms of incremental forward and backward movements. The design principles of non-visual, first order systems (Hakkinen & De Witt, 1997) are being used to create DTBs that combine full text and digital audio. The new DTBs radically alter the way in which talking book users interact with the content, allowing navigation by structural elements such as words, paragraphs, headings, and pages, as well as full-text searching.

What is making structured audio practical and likely to be widely adopted is that it is based on international, standards-based specification languages, such as HTML and the new W3C synchronized multimedia language. These standards will ensure interoperability of content and allow for a broad range of delivery devices, from standard web browsers to specialized, hand-held players.

The new model of structured audio is applicable to a much larger audience than the visually or print disabled. From immigrants learning a new language, to aircraft maintenance workers in hands-busy, eyes-busy situations, to commuters listening to a newspaper as they drive to work, structured audio permits the user to selectively listen to audio information. Applying structured audio to mainstream multimedia products, such as CD-ROM encyclopaedias and books, will be a significant move toward universal accessibility.

This presentation will discuss the principles of structured audio and demonstrate its application in improving information accessibility.


The spoken word has a long tradition in human history as a medium for information transfer. As technology made it possible to store audio and play it back at any time, applications for this medium grew. The visually disabled have been significant beneficiaries of audio recording and playback, as applied in talking books. Yet even as the technology improved, moving from phonographic discs, to magnetic tape, to CD audio, to internet-streamed audio, the basic design of audio presentations has not altered significantly.

Audio is largely considered a linear sequence of sounds, and most presentation (playback) systems allow navigation through this sequence only by forward and backward scanning, by time positioning, and sometimes by track or section. When talking books present an often rich and complex structure through such an extremely flat, linear view, there is an obvious loss of utility, and of actual content, for the visually disabled user.

Projects developing new generation digital talking books provide the opportunity to integrate the structure of a text document with the corresponding audio rendition of that document, enabling the creation of highly navigable audio information.

Defining Structured Audio

The term Structured Audio describes an audio document that can be navigated and listened to in a non-linear fashion. This is perhaps best illustrated through a series of user scenarios:

The Annotated Dante's Inferno

A digital talking book version of Dante Alighieri's Divine Comedy is used by a blind college student. Placing the CD-ROM in his talking book player, the student is presented with an audio menu of reading options: the book may be read without annotations, with line numbers, with annotations, or with both annotations and line numbers. Writing a term paper, the student begins by selecting Canto 3, line 45, because that is where he left off yesterday. Because the full structure of the book is represented and exposed to the reader, it is easy to navigate into the book based on structure. In this instance, the student is listening to the annotations, which are read as they occur in the actual book.

Physical Manipulation of the Spine: Medical Reference

A blind medical student is reviewing a reference work, Physical Manipulation of the Spine, and wants to initially skim the book to understand its contents. The book is structured in 23 parts, with six levels of subsections in each part. To skim the book, the student chooses to play back only level 1 sections, which in effect presents a summary of each section in the book by reading the section heading and the section text. At any time, the student can stop the reading and explore further subsections.

The basic model of structured audio is to link the structured content of the book with the audio rendering of that content. This model permits the audio version of the book to be explored in the same way as the printed version. The key to achieving this linkage is through standard languages of the world wide web.

Enabling Structured Audio with SMIL and Markup Languages

In 1997, the World Wide Web Consortium (W3C) began working on the design of a new language for specifying synchronized, multimedia content for the world wide web. The result was the Synchronized Multimedia Integration Language, SMIL (pronounced "smile"), which was released for public review late in 1997. It is expected that SMIL will become a W3C recommendation sometime during 1998.

HTML offers useful structure for producing DTBs, but is not in itself rich enough to fully define the structural, navigational, and reading elements of a book.

XML offers the flexibility needed, but a lack of broad support at present leads us to look at an evolutionary, intermediate stage which can be developed and evaluated now. This approach offers a solid foundation for adoption of an XML-based solution, should that prove to be the eventual target. Information created in either syntax is equivalent and immediately portable between them.

HTML 4.0 incorporates tags useful for structuring content and for its visual and acoustic formatting. Block-level tags, such as paragraphs, headings, and lists, have direct meaning within books, but the semantic structure of a typical book can rapidly exceed the depth of the available HTML elements.

The introduction of style sheets in HTML creates a standard mechanism by which existing tags may be sub-classed to provide additional visual and acoustic formatting of the document. In addition, the author may define new block-level (DIV) and in-line (SPAN) semantic elements, together with their visual and acoustic formatting behaviors. Using the class attribute, standard on elements in HTML 4.0/CSS1, it is possible to define extensions to HTML that provide the semantic structuring necessary to represent digital talking books.

For example, the present heading tags within HTML (H1 through H6) can, depending upon the book, represent Parts, Chapters, Sections, Sub-Sections, Sub-Headings, Lessons, Cantos, Poems, Recipes, Exercises, Glossary, Appendix, Index, Bibliography, Title Page, and so on, all within the same architectural form, using the class attribute.

Through style sheet definitions, it is possible to define heading levels with specific semantic meaning:
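A style sheet along these lines might read as follows (a sketch; the class names and formatting values are illustrative, not drawn from a particular book):

H1.part       { font-size: xx-large }
H1.chapter    { font-size: x-large }
H2.section    { font-size: large }
H3.subsection { font-size: medium }

A chapter heading would then be marked as <h1 class="chapter">, and every H1 carrying the class "chapter" becomes identifiable as a chapter rather than merely a first-level heading.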


With the advent of content-aware browsers, either through inherent browser understanding of content or through scripting and DOM (document object model), this style-based representation of book structure can become exposed and navigable.
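The scripting route can be sketched briefly. The following JavaScript function is a minimal sketch, not part of any specification: it assumes the player or script has already collected element descriptors (tag name, class, ID) from the document, here represented as plain objects rather than live DOM nodes so the sketch is self-contained.

```javascript
// Build a navigation index from a list of element descriptors,
// grouping element IDs under their semantic class name.
// In a browser, the descriptors would be gathered from the DOM;
// the field names (className, id) mirror DOM element properties.
function buildNavigationIndex(elements) {
  var index = {};
  for (var i = 0; i < elements.length; i++) {
    var el = elements[i];
    if (!el.className) continue;             // skip elements with no semantic class
    if (!index[el.className]) index[el.className] = [];
    index[el.className].push(el.id);         // each ID is a navigation target
  }
  return index;
}
```

A player could then offer, for example, "next canto" by stepping through the IDs recorded under the "canto" class in document order.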

Additional structures, such as Side Bars, Footnotes, Page Breaks, etc., can be represented using DIV and SPAN tags. For example:
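A side bar and a page break might be marked as follows (the class names and page number are illustrative):

<div class="sidebar"> ... </div>
<span class="pagenum">27</span>

with matching style sheet entries, such as DIV.sidebar and SPAN.pagenum, controlling their visual and acoustic formatting.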


Book-specific styles can be defined. In one case, Dante's Divine Comedy, H1 is classed semantically as Cantica, H2 as Canto, and a DIV is used for Line. In another case, a textbook, Personality Theories, H1 is classed as Preface, TitlePage, Contents, Chapter, References, Appendix, Glossary, and Index, while H2 is classed as SubHeading.
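Under this scheme, a fragment of the Divine Comedy might be marked as follows (a sketch; the text and IDs are illustrative):

<h1 class="cantica">Inferno</h1>
<h2 class="canto" id="c003">Canto III</h2>
<div class="line" id="l045"> ... </div>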

The relationship of the HTML/CSS approach to XML follows without much difficulty. Each of the Tag.Class elements would map to a specific, named XML element.

Taking our example of Dante, we would arrive at the following XML tags:

<cantica>, <canto>, <line>

which can have the equivalent HTML/CSS representations:

<div class="cantica"><div class="canto"><span class="line">

What advantages (or disadvantages) are offered by XML? Clearly, XML allows for creation of tag names that may directly match the structural elements of a book. On the surface, this leads to clearer mark-up and can eliminate the ambiguity of overloading the static tag structure of HTML.

Each tag, be it a composite HTML/CSS or XML tag, can have an ID attribute associated with it. This ID provides the unique linkage between a structural point in the document and the audio position as defined in the SMIL document.

This is shown in the following fragments of XML and SMIL:

XML Document:

<canto id="c022">
  <line id="l022">In level twenty 2.</line>
  ...
</canto>

SMIL Document:

<par id="c022">
  <audio src="canto.wav"/>
  <text src="canto2.html#c022"/>
</par>

Enabling Navigation in Playback Systems

The structure of a publication needs to be exposed to the user; to achieve this, the playback software must present that structure in a navigable form.

Common to All Audio Types:

Time-based navigation

Offset to a specific starting point in the audio stream

Index Based Navigation

An audio information source may be indexed, as with the DAISY phrase-based approach, so that the playback position may be determined by phrase or index marker.

Semantic Index Based Navigation

An index may have a class designation, such as Page Number, Heading Level, Chapter, or Paragraph. If the publication contains this semantic information, navigation may occur by page or heading, for example.

A structured audio player will be required to enumerate the available navigation methods for the selected publication. For simple, unstructured audio publications, the only navigation method available will be by time offset. For a hybrid talking book, navigation methods may be time, table of contents, and semantic index. Thus, player applications may be built which range in complexity, based upon the content being played, and upon the needs of the intended user populations.
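The enumeration step described above can be sketched as follows. This is a minimal sketch, assuming a publication descriptor with fields (hasPhraseIndex, semanticClasses) invented here for illustration; they are not part of any player specification:

```javascript
// Enumerate the navigation methods a player can offer for a given
// publication. Time-offset navigation is always available; index and
// semantic-index navigation depend on what the publication provides.
function availableNavigationMethods(publication) {
  var methods = ["time-offset"];
  if (publication.hasPhraseIndex) {
    methods.push("index");                 // e.g. DAISY phrase markers
  }
  if (publication.semanticClasses && publication.semanticClasses.length > 0) {
    methods.push("semantic-index");        // e.g. by page, heading, chapter
  }
  return methods;
}
```

A simple player interface could query this list at load time and present only the controls that the selected publication actually supports.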