Society for Industrial and Applied Mathematics, 2001, -174 pp.
The first part of this monograph's title, Combinatorial Data Analysis (CDA), refers to a wide class of methods for the study of relevant data sets in which the arrangement of a collection of objects is absolutely central. Characteristically, CDA is involved either with the identification of arrangements that are optimal for a specific representation of a given data set (usually operationalized with some specific loss or merit function that guides a combinatorial search defined over a domain constructed from the constraints imposed by the particular representation selected), or with the determination in a confirmatory manner of whether a specific object arrangement given a priori reflects the observed data. As the second part of the title, Optimization by Dynamic Programming, suggests, the sole focus of this monograph is on the identification of arrangements; it is then restricted further, to where the combinatorial search is carried out by a recursive optimization process based on the general principles of dynamic programming. For an introduction to confirmatory CDA without any type of optimization component, the reader is referred to the monograph by Hubert (1987). For the use of combinatorial optimization strategies other than dynamic programming for some (clustering) problems in CDA, the recent comprehensive review by Hansen and Jaumard (1997) provides a particularly good introduction.
Our purpose in writing this monograph is to provide an applied documentation source, as well as an introduction to a collection of associated computer programs that would be of interest to applied statisticians and data analysts but also accessible to a notationally sophisticated but otherwise substantively focused user. Such a person would typically be most interested in analyzing a specific data set by implementing the flexible dynamic programming method for any of a number of seemingly diverse problems encountered in CDA. The background we have tried to assume is at the same level as that required to follow the documentation for good, commercially available optimization subroutines, such as the Numerical Algorithms Group (NAG) Fortran subroutine library, or at the level of one of the standard texts in applied multivariate analysis usually used for a graduate second-year methodology course in the behavioral or social sciences. An excellent example of the latter would be the widely used text now in its fourth edition by Johnson and Wichern (1998). Draft versions of the curix rent monograph have been used as supplementary material for a course relying on the latter text as the primary reference.
The content of the monograph itself and how the various parts are organized can be discussed under a number of headings that serve to characterize both the type of object arrangements to be identified and the form of the data on which the identification is to be based. Chapter 1 is a short preview that introduces the general topic by noting areas in combinatorial data analysis that can be approached by the optimization strategy of dynamic programming, and that presents a number of data sets to be used throughout the remaining chapters. The second chapter introduces the general dynamic programming paradigm (the GDPP, for short) and gives an introductory example of its usage in the well-known linear assignment task. The next two chapters focus the GDPP on topics within Cluster Analysis (Chapter 3) and Object Sequencing and Seriation (Chapter 4). Chapter 3 is further subdivided by several dichotomies: whether the clustering involves a single object partition (partitioning) or a hierarchy of nested partitions (and the associated representing ultrametric); the presence or absence of constraints on the type of partitions sought (typically through subsets contiguous with respect to some object order); the form of the available data with the usual distinction of having proximities between objects from a single set (one-mode) or between objects from two sets (two-mode). Chapter 4 can also be characterized by several dichotomies: whether the one-mode proximities are symmetric or skew-symmetric, with the latter representing dominance information among the objects, or whether the proximities are initially one- or two-mode. In addition, several related topics are introduced: sequencing through the construction of optimal paths (linear and circular); the incorporation of precedence constraints in the construction of an optimal order; and unifying the general areas of clustering and sequencing by identifying optimal partitions of an object set in which the classes are themselves ordered. Chapter 5 extends the GDPP heuristically for use with large(r) object sets in both the clustering and sequencing context, while (unfortunately) removing the absolute guarantee of optimality for the identified object arrangements. Finally, Chapter 6 provides preliminary discussion of a number of areas of extension and generalization that are now being pursued by the current authors and others.
An appendix is included as a user's manual for a collection of programs available as freeware on the World Wide Web (WWW) that carry out the various optimization tasks and which can be used to reproduce all the numerical examples given. We provide both the original code (in FortranQO) and executable programs (for 32-bit Intel-compatible processors running under Windows NT/95/98). Finally, we point out the liberal use throughout of chapter endnotes (rather than the more typographically intrusive footnotes). These serve several purposes: to note how some topic might be approached with one of the programs discussed in the appendix; to provide a little more peripheral comment on a topic; or to respond to a referee of an earlier version of this monograph who called for a more detailed presentation of a specific topic that we didn't include in the actual text.
Introduction.
General Dynamic Programming Paradigm.
Cluster Analysis.
Object Sequencing and Seriation.
Heuristic Applications of the GDPP.
Extensions and Generalizations.
Available Programs.