Bioinformatics and Computational Biology Solutions Using R and Bioconductor

Bioconductor is a widely used open source and open development software project for the analysis and comprehension of data arising from high-throughput experimentation in genomics and molecular biology. Bioconductor is rooted in the open source statistical computing environment R.

This volume's coverage is broad and ranges across most of the key capabilities of the Bioconductor project, including:

Importation and preprocessing of high-throughput data from microarray, proteomic, and flow cytometry platforms

Curation and delivery of biological metadata for use in statistical modeling and interpretation

Statistical analysis of high-throughput data, including machine learning and visualization

Modeling and visualization of graphs and networks

The developers of the software, who are in many cases leading academic researchers, jointly authored chapters. All methods are illustrated with publicly available data, and a major section of the book is devoted to exposition of fully worked case studies.

This book is more than a static collection of descriptive text, figures, and code examples that were run by the authors to produce the text; it is a dynamic document. Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.

Author(s): Robert Gentleman, Vincent J. Carey, Wolfgang Huber, Rafael A. Irizarry, Sandrine Dudoit (eds.)
Series: Statistics for Biology and Health
Edition: 1
Publisher: Springer
Year: 2005

Language: English
Pages: 492

Cover Page......Page 1
Title Page......Page 3
ISBN 0387251464......Page 4
Acknowledgments......Page 5
I Preprocessing data from genomic experiments......Page 7
II Meta-data: biological annotation and visualization......Page 9
III Statistical analysis for genomic experiments......Page 11
IV Graphs and networks......Page 14
V Case studies......Page 15
List of Contributors......Page 17
Part I Preprocessing data from genomic experiments......Page 20
1.1 Introduction......Page 22
1.2 Tasks......Page 23
1.2.2 Stepwise and integrated approaches......Page 24
1.3.1 Data sources......Page 25
1.3.2 Facilities in R and Bioconductor......Page 26
1.4 Statistical background......Page 27
1.4.1 An error model......Page 28
1.4.3 Sensitivity and specificity of probes......Page 30
1.5 Conclusion......Page 31
2.1 Introduction......Page 32
2.2.2 Examining probe-level data......Page 34
2.3.1 Background adjustment......Page 37
2.3.2 Normalization......Page 39
2.3.3 vsn......Page 43
2.4.1 expresso......Page 44
2.4.2 threestep......Page 45
2.4.4 GCRMA......Page 46
2.4.5 affypdnn......Page 47
2.5 Assessing preprocessing methods......Page 48
2.5.1 Carrying out the assessment......Page 49
2.6 Conclusion......Page 51
3.1 Introduction......Page 52
3.2 Exploratory data analysis......Page 53
3.2.1 Multi-array approaches......Page 54
3.3 Affymetrix quality assessment metrics......Page 56
3.4 RNA degradation......Page 57
3.5 Probe level models......Page 60
3.5.1 Quality diagnostics using PLM......Page 61
3.6 Conclusion......Page 66
4.1 Introduction......Page 68
4.2.1 Illustrative data......Page 69
4.3.1 Importing......Page 70
4.3.2 Reading target information......Page 71
4.3.3 Reading probe-related information......Page 72
4.3.5 Data structure: the marrayRaw class......Page 73
4.3.7 Subsetting......Page 75
4.4.1 Diagnostic plots......Page 76
4.4.2 Spatial plots of spot statistics......Page 78
4.4.3 Boxplots of spot statistics......Page 79
4.4.4 Scatter-plots of spot statistics......Page 80
4.5 Normalization......Page 81
4.5.1 Two-channel normalization......Page 82
4.5.2 Separate-channel normalization......Page 83
4.6 Case study......Page 86
5.2 Experimental technologies......Page 90
5.2.3 Monitoring the response......Page 91
5.3 Reading data......Page 92
5.3.1 Plate reader data......Page 93
5.3.2 Further directions in normalization......Page 95
5.3.3 FCS format......Page 96
5.4.1 Visualization at the level of individual cells......Page 98
5.4.2 Visualization at the level of microtiter plates......Page 101
5.4.3 Brushing with Rggobi......Page 102
5.5.1 Discrete Response......Page 104
5.5.2 Continuous response......Page 107
Acknowledgement......Page 109
6.1 Introduction......Page 110
6.2 Baseline subtraction......Page 112
6.3 Peak detection......Page 114
6.4 Processing a set of calibration spectra......Page 115
6.4.1 Apply baseline subtraction to a set of spectra......Page 117
6.4.2 Normalize spectra......Page 118
6.4.3 Cutoff selection......Page 119
6.4.5 Quality assessment......Page 120
6.4.6 Get proto-biomarkers......Page 121
6.5 An example......Page 124
6.6 Conclusion......Page 127
Part II Meta-data: biological annotation and visualization......Page 130
7.1 Introduction......Page 132
7.2 External annotation resources......Page 134
7.3 Bioconductor annotation concepts: curated persistent packages and Web services......Page 135
7.3.1 Annotating a platform: HG-U95Av2......Page 136
7.3.2 An Example......Page 137
7.4 The annotate package......Page 138
7.5 Software tools for working with Gene Ontology (GO)......Page 139
7.5.1 Basics of working with the GO package......Page 140
7.5.3 Searching for terms......Page 141
7.5.4 Annotation of GO terms to LocusLink sequences: evidence codes......Page 142
7.6 Pathway annotation packages: KEGG and cMAP......Page 144
7.6.1 KEGG......Page 145
7.6.2 cMAP......Page 146
7.6.3 A Case Study......Page 148
7.7 Cross-organism annotation: the homology packages......Page 149
7.8 Annotation from other sources......Page 151
7.9 Discussion......Page 152
8.1 The Tools......Page 154
8.1.2 Entrez examples......Page 156
8.2 PubMed......Page 157
8.2.1 Accessing PubMed information......Page 158
8.2.2 Generating HTML output for your abstracts......Page 160
8.3 KEGG via SOAP......Page 161
8.4 Getting gene sequence information......Page 163
8.5 Conclusion......Page 164
9.1 Introduction......Page 166
9.2 A simple approach......Page 167
9.3 Using the annaffy package......Page 168
9.4 Linking to On-line Databases......Page 171
9.5.1 Limiting the results......Page 172
9.5.2 Annotating the probes......Page 173
9.5.3 Adding other data......Page 174
9.6 Graphical displays with drill-down functionality......Page 175
9.6.1 HTML image maps......Page 176
9.6.2 Scalable Vector Graphics (SVG)......Page 177
9.7.1 Text searching......Page 178
9.8 Concluding Remarks......Page 179
10.1 Introduction......Page 180
10.2 Practicalities......Page 181
10.3 High-volume scatterplots......Page 182
10.3.1 A note on performance......Page 183
10.4 Heatmaps......Page 185
10.4.1 Heatmaps of residuals......Page 187
10.5 Visualizing distances......Page 189
10.5.1 Multidimensional scaling......Page 192
10.6 Plotting along genomic coordinates......Page 193
10.6.1 Cumulative Expression......Page 197
10.7 Conclusion......Page 198
Part III Statistical analysis for genomic experiments......Page 200
11.1 Introduction and road map......Page 202
11.1.4 Machine learning......Page 203
11.2 Absolute and relative expression measures......Page 204
12.1 Introduction......Page 208
12.2.1 Definitions......Page 210
12.2.2 Distances between points......Page 211
12.2.3 Distances between distributions......Page 214
12.2.4 Experiment-specific distances between genes......Page 217
12.3.1 Distances and standardization......Page 218
12.4 Examples......Page 220
12.4.1 A co-citation example......Page 222
12.4.2 Adjacency......Page 226
12.5 Discussion......Page 227
13.1 Introduction......Page 228
13.2.1 Overview of clustering algorithms......Page 229
13.2.3 Building sequences of clustering results......Page 230
13.2.4 Visualizing clustering results......Page 233
13.2.5 Statistical issues in clustering......Page 234
13.2.6 Bootstrapping a cluster analysis......Page 235
13.2.7 Number of clusters......Page 236
13.3.1 Gene selection......Page 241
13.3.2 HOPACH clustering of genes......Page 242
13.3.5 HOPACH clustering of arrays......Page 243
13.3.6 Output files......Page 245
13.4 Conclusion......Page 247
14.1 Introduction......Page 248
14.2 Differential expression analysis......Page 249
14.2.1 Example: ALL data......Page 251
14.2.2 Example: Kidney cancer data......Page 255
14.3 Multifactor experiments......Page 258
14.3.1 Example: Estrogen data......Page 260
14.4 Conclusion......Page 267
15.1 Introduction......Page 268
15.2.1 Multiple hypothesis testing framework......Page 269
15.2.2 Test statistics null distribution......Page 274
15.2.3 Single-step procedures for controlling general Type I error rates θ(FVn)......Page 275
15.2.4 Step-down procedures for controlling the family-wise error rate......Page 276
15.2.5 Augmentation multiple testing procedures for controlling tail probability error rates......Page 277
15.3 Software implementation: R multtest package......Page 278
15.3.1 Resampling-based multiple testing procedures: MTP function......Page 279
15.4.1 ALL data package and initial gene filtering......Page 281
15.4.2 Association of expression measures and tumor cellular subtype: Two-sample t-statistics......Page 282
15.4.3 Augmentation procedures......Page 284
15.4.4 Association of expression measures and tumor molecular subtype: Multi-sample F-statistics......Page 285
15.4.5 Association of expression measures and time to relapse: Cox t-statistics......Page 287
15.5 Discussion......Page 289
16.1 Introduction......Page 292
16.2 Illustration: Two continuous features; decision regions......Page 293
16.3.1 Families of learning methods......Page 295
16.3.2 Model assessment......Page 300
16.3.3 Metatheorems on learner and feature selection......Page 302
16.3.4 Computing interfaces......Page 303
16.4.1 Exploring and comparing classifiers with the ALL data......Page 304
16.4.3 Other methods......Page 306
16.4.4 Structured cross-validation support......Page 307
16.4.6 Expression density diagnostics......Page 308
16.5 Conclusions......Page 310
17.1 Introduction......Page 312
17.2 Bagging and random forests......Page 314
17.3 Boosting......Page 315
17.5 Evaluation......Page 317
17.6.1 Acute lymphoblastic leukemia......Page 319
17.6.2 Renal cell cancer......Page 322
17.7 Applications: Survival analysis......Page 326
17.8 Conclusion......Page 329
18.1 Introduction......Page 332
18.1.1 Key user interface features......Page 333
18.2.2 Installation......Page 334
18.2.3 Configuration......Page 335
18.3.1 Data Preprocessing......Page 336
18.3.2 Differential expression multiple testing......Page 337
18.3.3 Linked annotation meta-data......Page 339
18.3.4 Retrieving results......Page 340
18.4.1 Architectural overview......Page 341
18.4.2 Creating a new module......Page 343
18.5 Conclusion......Page 345
Part IV Graphs and networks......Page 346
19.1 Introduction......Page 348
19.2.2 Algorithms......Page 349
19.3.1 Biomolecular Pathways......Page 350
19.3.2 Gene ontology: A graph of concept-terms......Page 352
19.3.3 Graphs induced by literature references and citations......Page 353
19.4 Discussion......Page 355
20.1 Overview......Page 356
20.2 Definitions......Page 357
20.2.1 Special types of graphs......Page 360
20.2.2 Random graphs......Page 362
20.3 Cohesive subgroups......Page 363
20.4 Distances......Page 365
21.1 Introduction......Page 366
21.2 The graph package......Page 367
21.2.1 Getting started......Page 368
21.3 The RBGL package......Page 371
21.3.1 Connected graphs......Page 374
21.3.2 Paths and related concepts......Page 376
21.4 Drawing graphs......Page 379
21.4.2 Node and edge attributes......Page 382
21.4.3 The function agopen and the Ragraph class......Page 384
21.4.4 User-defined drawing functions......Page 385
21.4.5 Image maps on graphs......Page 387
22.1 Introduction......Page 388
22.2 Comparing the transcriptome and the interactome......Page 389
22.2.1 Testing associations......Page 390
22.2.2 Data analysis......Page 392
22.3 Using GO......Page 393
22.3.1 Finding interesting GO terms......Page 394
22.4 Literature co-citation......Page 397
22.4.1 Statistical development......Page 399
22.4.3 Examples......Page 401
22.5 Pathways......Page 406
22.5.1 The graph structure of pathways......Page 407
22.5.2 Relating expression data to pathways......Page 409
22.6 Concluding remarks......Page 412
Part V Case studies......Page 414
23.1 Introduction......Page 416
23.2 Data representations......Page 417
23.3 Linear models......Page 418
23.4 Simple comparisons......Page 419
23.5 Technical Replication......Page 422
23.6 Within-array replicate spots......Page 425
23.7 Two groups......Page 426
23.8 Several groups......Page 428
23.9 Direct two-color designs......Page 430
23.10 Factorial designs......Page 431
23.11 Time course experiments......Page 433
23.12 Statistics for differential expression......Page 434
23.13 Fitted model objects......Page 436
23.14 Preprocessing considerations......Page 437
23.15 Conclusion......Page 439
24.1 Introduction......Page 440
24.2 Reading and customizing the data......Page 441
24.3 Training and validating classifiers......Page 442
24.4 Multiple random divisions......Page 445
24.5 Classification of test data......Page 447
24.6 Conclusion......Page 448
25.1 Introduction......Page 450
25.3 Preprocessing......Page 451
25.4 Ranking and filtering genes......Page 452
25.4.1 Summary statistics and tests for ranking......Page 453
25.4.3 Comparison......Page 456
25.5 Annotation......Page 457
25.5.1 PubMed abstracts......Page 458
25.5.2 Generating reports......Page 460
25.6 Conclusion......Page 461
A.1.3 Estrogen receptor stimulation......Page 462
A.2 URLs for projects mentioned......Page 463
References......Page 464
Index......Page 484