The advent of 'Big Data' has brought with it a rapid diversification of data sources, requiring analysis that accounts for the fact that these data have often been generated and recorded for different reasons. Data integration involves combining data residing in different sources to enable statistical inference, or to generate new statistical data for purposes that cannot be served by each source on its own. This can yield significant gains for scientific as well as commercial investigations. However, valid analysis of such data should allow for the additional uncertainty due to entity ambiguity, whenever it is not possible to state with certainty that the integrated source is the target population of interest. Analysis of Integrated Data aims to provide a solid theoretical basis for this statistical analysis in three generic settings of entity ambiguity: statistical analysis of linked datasets that may contain linkage errors; datasets created by a data fusion process, where joint statistical information is simulated using the information in marginal data from non-overlapping sources; and estimation of target population size when target units are either partially or erroneously covered in each source.
- Covers a range of topics under an overarching perspective of data integration.
- Focuses on statistical uncertainty and inference issues arising from entity ambiguity.
- Features state-of-the-art methods for analysis of integrated data.
- Identifies the important themes that will define future research and teaching in the statistical analysis of integrated data.
Analysis of Integrated Data is aimed primarily at researchers and methodologists interested in statistical methods for data from multiple sources, with a focus on data analysts in the social sciences, and in the public and private sectors.
Author(s): Li-Chun Zhang, Raymond L. Chambers
Series: Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences
Publisher: Chapman & Hall/CRC
Year: 2019
Language: English
Pages: 273
Tags: Multivariate Analysis, Multiple Imputation, Measurement Uncertainty
Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Preface
Contributors
1.1 Why this book?
1.2 The structure of this book
References
2.1 Introduction
2.1.1 Related work
2.1.2 Outline of investigation
2.2 The linkage data structure
2.2.1 Definitions
2.2.2 Agreement partition of match space
2.3 On maximum likelihood estimation
2.4.1 Linear regression under the linkage model
2.4.2 Linear regression under the comparison data model
2.4.3 Comparison data modelling (I)
2.4.4 Comparison data modelling (II)
2.5.1 Non-informative balanced selection
2.5.2 Illustration for the C-PR data
2.6 Concluding remarks
Bibliography
3. Capture-recapture methods in the presence of linkage errors
3.2 The capture-recapture model: short formalization and notation
3.3.1 The Fellegi and Sunter linkage model
3.3.2 Definition and estimation of linkage errors
3.3.3 Bayesian approaches to record linkage
3.4.1 The Ding and Fienberg estimator
3.4.2 The modified Ding and Fienberg estimator
3.4.3 Some remarks
3.4.4 Examples
3.5.1 Log-linear model-based estimators
3.5.2 An alternative modelling approach
3.5.3 A Bayesian proposal
3.5.4 Examples
3.6 Concluding remarks
Bibliography
4.1 Introduction
4.2 Statistical matching problem: notations and technicalities
4.3 The joint distribution of variables not jointly observed: estimation and uncertainty
4.3.1 Matching error
4.3.2 Bounding the matching error via measures of uncertainty
4.4 Statistical matching for complex sample surveys
4.4.1 Technical assumptions on the sample designs
4.4.2 A proposal for choosing a matching distribution
4.4.3 Reliability of the matching distribution
4.4.4 Evaluation of the matching reliability as a hypothesis problem
4.5 Conclusions and pending issues: relationship between the statistical matching problem and ecological inference
Bibliography
5.1 Introduction
5.2 Choice of the matching variables
5.2.1 Traditional methods based on association
5.2.2 Choosing the matching variables by uncertainty reduction
5.2.3 An illustrative example
5.2.4 The penalised uncertainty measure
5.3 Simulations with European Social Survey data
Bibliography
6.1 Introduction
6.2 Corroboration
6.3 Maximum corroboration set
6.4 High assurance estimation of Θ0
6.5 A corroboration test
6.6 Application: missing OCBGT data
Bibliography
7. Dual- and multiple-system estimation with fully and partially observed covariates
7.1 Introduction
7.2.1 Terminology and properties
7.2.2 Example
7.2.3 Graphical representation of log-linear models
7.2.4 Three registers
7.3.1 Modelling strategies with active and passive covariates
7.3.2 Working with invariant population-size estimates
7.4.1 Framework for population-size estimation with partially observed covariates
7.4.2 Example
7.4.4 Results of model fitting
7.5.1 Precision
7.5.2 Sensitivity
7.6 An application when the same variable is measured differently in both registers
7.6.1 Example: Injuries in road accidents in the Netherlands
7.6.2 More detailed breakdown of transport mode in accidents
7.7.1 Alternative approaches
7.7.2 Quality issues
Bibliography
8.1 Introduction
8.2 A latent class model for capture–recapture
8.2.1 Decomposable models
8.2.3 EM algorithm
8.2.5 A mixture of different components
8.2.6 Model selection
8.3.1 Use of covariates
8.3.2 Incomplete lists
8.4 Evaluating the interpretation of the latent classes
8.5 A Bayesian approach
8.5.1 MCMC algorithm
8.5.2 Simulations results
Bibliography
9.1 Introduction
9.2 Log-linear models of incomplete contingency tables
9.3.1 The models
9.3.2 Maximum likelihood estimation
9.3.3 Estimation based on list-survey data
9.4.1 Latent likelihood ratio criterion
9.4.2 Illustration
9.5.1 Data and previous study
9.5.2 Analysis allowing for erroneous enumeration
Bibliography
10.1 Introduction
10.2 Geo-referenced data and potential locational errors
10.3 A brief review of spatially balanced sampling methods
10.3.1 Local pivotal methods
10.3.2 Spatially correlated Poisson sampling
10.3.4 Local cube method
10.4 Spatial sampling for estimation of under-coverage rate
10.5 Business surveys in the presence of locational errors
10.6 Conclusions
Bibliography
Index