Based around eleven international real-life case studies and including contributions from leading experts in the field, this groundbreaking book explores the need for grid-enabling data mining applications and provides a comprehensive study of the technology, techniques and management skills necessary to create them. It serves at once as a design blueprint, user guide and research agenda for current and future developments, and will appeal to a broad audience, from developers and users of data mining and grid technology to advanced undergraduate and postgraduate students interested in this field.
Author(s): Werner Dubitzky
Publisher: Wiley
Year: 2009
Language: English
Pages: 288
Data Mining Techniques in Grid Computing Environments
Contents
Preface
List of Contributors
1 Data mining meets grid computing: Time to dance?
1.1 Introduction
1.2.1 Complex data mining problems
1.2.2 Data mining challenges
1.3 Grid computing
1.4.1 Data mining grid: a grid facilitating large-scale data mining
1.4.2 Mining grid data: analyzing grid systems with data mining techniques
1.5 Conclusions
1.6 Summary of Chapters in this Volume
2.1 Introduction
2.2 Approach
2.3 Knowledge Grid services
2.3.1 The Knowledge Grid architecture
2.3.2 Implementation
2.4 Data analysis services
2.5.1 The VEGA visual language
2.5.2 UML application modelling
2.5.3 Applications and experiments
2.6 Conclusions
3.1 Introduction
3.2 Rationale behind the design and development of GridMiner
3.3 Use Case
3.4 Knowledge discovery process and its support by the GridMiner
3.4.1 Phases of knowledge discovery
3.4.2 Workflow management
3.4.3 Data management
3.4.4 Data mining services and OLAP
3.4.5 Security
3.5 Graphical user interface
3.6.3 Distributed mining of data streams
3.7 Conclusions
4 ADaM services: Scientific data mining in the service-oriented architecture paradigm
4.2 ADaM system overview
4.3 ADaM toolkit overview
4.4 Mining in a service-oriented architecture
4.5 Mining web services
4.5.1 Implementation architecture
4.5.3 Implementation issues
4.6 Mining grid services
4.6.1 Architecture components
4.6.2 Workflow example
4.7 Summary
5.2 Preliminaries and related work
5.2.1 System misconfiguration detection
5.2.2 Outlier detection
5.3.2 Pre-processing
5.3.3 Data organization
5.4.1 General approach
5.4.3 Algorithm
5.5 The GMS
5.6.1 Qualitative results
5.6.2 Quantitative results
5.6.3 Interoperability
5.7 Conclusions and future work
6.1 Introduction
6.2.1 Category 1: knowledge discovery specific requirements
6.3 Workflow-based knowledge discovery
6.4 Data mining toolkit
6.5 Data mining service framework
6.6 Distributed data mining services
6.7 Data manipulation tools
6.9 Empirical experiments
6.9.1 Evaluating the framework accuracy
6.9.2 Evaluating the running time of the framework
6.10 Conclusions
7.1 Introduction
7.2 A service-oriented solution
7.3.1 Types of distributed data analysis
7.3.3 Data mining services and data analysis management systems
7.4.1 Hierarchical local data abstractions
7.4.2 Learning global models from local abstractions
7.5.1 DDM processes in BPEL4WS
7.6.1 Performance of running distributed data analysis on BPEL
7.6.2 Issues specific to service-oriented distributed data analysis
7.7.1 Optimizing BPEL4WS process execution
7.7.3 Improved support of data privacy preservation
7.8 Conclusions
8.1 Introduction
8.1.1 Workflows on the grid
8.2.1 System overview
8.2.2 Workflow representation in DPML
8.2.5 Multiple execution models
8.2.7 Streaming and batch transfer of data elements
8.2.9 Embedding
8.3.1 Motivation for a new server architecture
8.3.5 Architecture overview
8.3.6 Activity service definition layer
8.3.10 Prototyping and production clients
8.4 Data management
8.5.2 Analysis overview
8.5.4 Service for defining exclusions
8.5.7 Validation service
8.6 Future directions
9.1 Introduction
9.3 The bioinformatics experiment landscape
9.4 Taverna for bioinformatics experiments
9.4.1 Three-tiered enactment in Taverna
9.4.2 The open-typing data models
9.5 Building workflows in Taverna
9.5.1 Designing a SCUFL workflow
9.6 Workflow case study
9.6.1 The bioinformatics task
9.6.2 Current approaches and issues
9.6.3 Constructing workflows
9.6.4 Candidate genes involved in trypanosomiasis resistance
9.6.5 Workflows and the systematic approach
9.7 Discussion
10.1 Introduction
10.2.3 Scalability
10.2.4 Workflow environment
10.3.2 Looping
10.3.5 Shipping data
10.4 Extensibility
10.5.2 Partitioning data
10.6 Discussion and related work
10.8 Conclusions
11.1 Introduction
11.2 The architecture
11.3 Runtime framework
11.3.2 Global persistent storage
11.3.3 Termination detection
11.3.4 Application of the model
11.4.1 Decision trees
11.4.2 Clustering
11.5 Visual metaphors
11.6 Case studies
11.7 Future developments
11.8 Conclusions and future work
12.1 Introduction
12.2 DMGA overview
12.3 Horizontal composition
12.4 Vertical composition
12.5 The need for brokering
12.6 Brokering-based data mining grid architecture
12.7.1 Horizontal composition use case: Apriori
12.7.2 Vertical composition use cases: ID3 and J4.8
12.8 Related work
12.9 Conclusions
13 Grid-based data mining with the Environmental Scenario Search Engine (ESSE)
13.1 Environmental data source: NCEP/NCAR reanalysis data set
13.2 Fuzzy search engine
13.2.1 Operators of fuzzy logic
13.2.2 Fuzzy logic predicates
13.2.3 Fuzzy states in time
13.2.5 Fuzzy search optimization
13.3.1 Database schema optimization
13.3.2 Data grid layer
13.3.4 ESSE data processor
13.4 Applications
13.4.1 Global air temperature trends
13.4.3 Atmospheric fronts
13.5 Conclusions
14.1 Introduction
14.3 Using OGSA-DAI to support data mining applications
14.3.1 OGSA-DAI’s activity framework
14.3.2 OGSA-DAI workflows for data management and pre-processing
14.4.1 Calculating a data summary
14.4.2 Discovering association rules in protein unfolding simulations
14.4.3 Mining distributed medical databases
14.5 State-of-the-art solutions for grid data management
14.7 Open Issues
14.8 Conclusions
Index