Entropic Ltd., 1999, -667 pp.
The HTK Application Programming Interface (HAPI) is a library of functions that provides the programmer with an interface to any speech recognition system supplied by Entropic or developed using the Hidden Markov Model Toolkit (HTK). HTK is a set of UNIX tools used to construct all the components of a modern speech recogniser. One of the principal components that can be produced with HTK is a set of acoustic speech models based on Hidden Markov Models (HMMs). Other components include the pronunciation dictionary, the language models and the grammar used for recognition; these can be produced using HTK or Entropic's grapHvite Speech Recognition Developer System. HAPI encapsulates these components and provides the programmer with a simple and consistent interface through which speech recognition can be integrated into applications.
Given a set of acoustic models, a dictionary, and a grammar, each unknown utterance is recognised using a decoder, the search engine at the heart of every speech recognition system. The core HAPI package may be shipped with a variety of decoders, for example the standard HTK decoder, the professional MVX decoder for medium vocabulary tasks, or the professional LVX decoder for large vocabulary tasks. In addition to the decoder, the recogniser requires an appropriate dictionary and set of HMMs. The HMMs should be trained on speech data similar to the speech that will be recognised. For example, HMMs trained on wideband data collected through a high-quality desktop microphone will not perform well if used to recognise speech over the telephone. The dictionary should provide pronunciations, in terms of models, for every word in the intended recognition vocabulary.
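The dictionary requirement above can be checked before recognition starts. The C sketch below is purely illustrative: `DictEntry`, `dict_lookup` and `count_missing_pronunciations` are hypothetical names, not part of HAPI, and the phone model names used with it are made up. It stores each word's pronunciation as a list of model names and counts the vocabulary words for which no usable pronunciation exists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dictionary entry (not the HAPI data structure): a word and
   its pronunciation as a sequence of acoustic model (HMM) names. */
typedef struct {
    const char *word;
    const char *const *models; /* e.g. {"w", "ah", "n"} for "ONE" */
    size_t n_models;
} DictEntry;

/* Linear search; returns the entry for `word`, or NULL if it is absent. */
const DictEntry *dict_lookup(const DictEntry *dict, size_t n_dict,
                             const char *word)
{
    for (size_t i = 0; i < n_dict; i++) {
        if (strcmp(dict[i].word, word) == 0)
            return &dict[i];
    }
    return NULL;
}

/* Counts vocabulary words with no usable pronunciation (no entry, or an
   entry with zero models). A recogniser needs this count to be zero. */
size_t count_missing_pronunciations(const DictEntry *dict, size_t n_dict,
                                    const char *const *vocab, size_t n_vocab)
{
    assert(vocab != NULL || n_vocab == 0);
    size_t missing = 0;
    for (size_t i = 0; i < n_vocab; i++) {
        const DictEntry *e = dict_lookup(dict, n_dict, vocab[i]);
        if (e == NULL || e->n_models == 0)
            missing++;
    }
    return missing;
}
```

A linear search keeps the sketch short; a dictionary with thousands of entries would more likely use a hash table or a sorted array.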
The developer can either generate their own recognition components or license components from the wide range that Entropic offers. Entropic supplies pre-trained models for a number of languages and environments, together with the corresponding dictionaries. For other requirements, custom models and dictionaries can be built using HTK, which provides a framework for building all major types of recognition system and evaluating their accuracy and performance.
The particular route taken to produce the recogniser components is irrelevant as far as HAPI is concerned. Once the necessary components have been acquired, HAPI is all that is required to produce both prototype and commercial applications. HAPI provides the programmer with a simple programming interface to the chosen components, allowing speech recognition to be incorporated into an application with the minimum of effort. Although HAPI can be viewed as an extension to HTK (since it uses components produced with HTK and shares the same code libraries), it is better viewed as a standalone interface. The underlying recogniser components of a HAPI application can be upgraded without modifying any existing application code.
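Being able to upgrade recogniser components without touching application code follows from programming against a fixed interface. The C sketch below illustrates that general design idea with a table of function pointers; `Decoder`, `app_transcribe` and the two stub back ends are hypothetical names invented for this example and do not describe HAPI's actual objects or functions. The stubs simply echo their input, tagged with their name, so the example runs without any acoustic models.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical decoder interface (not the HAPI API). The application sees
   only this table, so the back end behind it can be replaced. */
typedef struct {
    const char *name;
    /* Writes a transcription into out; returns 0 on success, -1 on error. */
    int (*recognise)(const char *utterance, char *out, size_t out_len);
} Decoder;

/* Shared helper for the stubs: writes "[tag] utterance" into out and
   reports -1 if the buffer was too small to hold the whole string. */
static int tagged_echo(const char *tag, const char *utterance,
                       char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "[%s] %s", tag, utterance);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* Two stand-in back ends; a real decoder would search a recognition network. */
int decoder_a_recognise(const char *utterance, char *out, size_t out_len)
{
    return tagged_echo("a", utterance, out, out_len);
}

int decoder_b_recognise(const char *utterance, char *out, size_t out_len)
{
    return tagged_echo("b", utterance, out, out_len);
}

const Decoder DECODER_A = {"a", decoder_a_recognise};
const Decoder DECODER_B = {"b", decoder_b_recognise};

/* Application code, written once against the interface: switching from
   DECODER_A to DECODER_B requires no change to this function. */
int app_transcribe(const Decoder *dec, const char *utterance,
                   char *out, size_t out_len)
{
    assert(dec != NULL && dec->recognise != NULL);
    return dec->recognise(utterance, out, out_len);
}
```

Because `app_transcribe` only calls through `dec->recognise`, passing `&DECODER_B` instead of `&DECODER_A` changes the back end without altering the application logic; in a real system the choice would more likely come from configuration than from the source code.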
The combination of HAPI, HTK and Entropic's off-the-shelf decoders provides the application developer with the complete set of tools and components required to produce and utilise state-of-the-art speech recognition technology in applications spanning the entire spectrum of current uses of speech recognition.
This book is divided into five main parts. This first part describes the components of a speech recognition system and provides a high-level overview of the capabilities of HAPI in the form of a tutorial on how to build a simple application: a phone dialer. Part 2 provides an in-depth description of the facilities available within HAPI and, for readers who are familiar with HTK and use it to produce their own recognition systems, explains how these facilities relate to HTK. Part 3 describes a few example programs and touches on areas that are not directly related to HAPI but are nonetheless important when producing a recognition system for a real application (such as semantic parsing and dialogue management). Part 4 describes the decoder variations, including the core HTK decoder and the LVX decoder. Finally, the appendices in Part 5 describe the differences between the various flavours of HAPI (which arise from the different programming languages used to implement the specification) and provide a complete reference section.
This book does not cover the production of the recognition components (such as the acoustic models or word networks); these processes are described in some detail in the HTK Book and the grapHvite manual. Although this chapter gives a brief description of speech recognition systems, it is neither detailed nor comprehensive. For more information on the principles and algorithms used in speech recognisers, the reader is advised to read the HTK Book as well as the general literature in the field.
Index
Part I: Tutorial Overview
An Introduction to HAPI
An Overview of HAPI
Using HAPI: An Example Application
Using HAPI: Improving the Dialer
Using HAPI: Speaker Adaptation
Interfacing to a source driver
Part II: HAPI in Depth
Developing HAPI Applications
HAPI, Objects and Configuration
hapiHMMSetObject
hapiTransformObject
hapiSource/CoderObject
hapiDictObject
hapiLatObject
hapiNetObject
hapiRecObject
hapiResObject
Part III: Application Issues
System Design
Extended Results Processing
Part IV: Decoders
Decoder Variations
The Core HTK Decoder
The LVX Decoder
Part V: Appendices
A HAPI Reference
B HAPI from JAVA (JHAPI)
C Error and Warning Codes