Bioinformatics, Biocomputing and Perl: An Introduction

Bioinformatics, Biocomputing and Perl presents a modern introduction to bioinformatics computing skills and practice. Structuring its presentation around four main areas of study, this book covers the skills vital to the day-to-day activities of today’s bioinformatician. Each chapter contains a series of maxims designed to highlight key points and there are exercises to supplement and cement the introduced material. 

Working with Perl presents an extended tutorial introduction to programming through Perl, the premier programming technology of the bioinformatics community. Even though no previous programming experience is assumed, completing the tutorial equips the reader with the ability to produce powerful custom programs with ease.

Working with Data applies the programming skills acquired to processing a variety of bioinformatics data. In addition to advice on working with important data stores such as the Protein Data Bank, SWISS-PROT, EMBL and GenBank, considerable discussion is devoted to using bioinformatics data to populate relational database systems. The popular MySQL database is used in all examples.

Working with the Web presents a discussion of the Web-based technologies that allow the bioinformatics researcher to publish both data and applications on the Internet.

Working with Applications shifts gear from creating custom programs to using them. The tools described include ClustalW, EMBOSS, STRIDE, BLAST and Xmgrace. An introduction to the important Bioperl Project concludes this part and rounds off the book.

Author(s): Michael Moorhouse, Paul Barry
Edition: 1
Publisher: Wiley
Year: 2004

Language: English
Pages: 508

Bioinformatics Biocomputing and Perl
Contents
Preface
1.1 Introducing Biological Sequence Analysis
1.2 Protein and Polypeptides
1.3 Generalised Models and their Use
1.4.1 Transcription
1.4.2 Translation
1.5 Genome Sequencing
1.5.1 Sequence assembly
1.6 The Example DNA-gene-protein system we will use
Where to from Here
2.1 The Layers of Technology
2.1.1 From passive user to active developer
2.2.1 Checking for perl
Where to from Here
I Working with Perl
3.1 Let’s Get Started!
3.1.1 Running Perl programs
3.1.2 Syntax and semantics
3.1.3 Program: run thyself!
3.2.1 Using the Perl while construct
3.3 More Iterations
3.3.1 Introducing variable containers
3.3.2 Variable containers and loops
3.4 Selection
3.4.1 Using the Perl if construct
3.5 There Really is MTOWTDI
3.6 Processing Data Files
3.6.1 Asking getlines to do more
3.7 Introducing Patterns
The Maxims Repeated
4.2 Arrays: Associating Data with Numbers
4.2.2 How big is the array?
4.2.3 Adding elements to an array
4.2.5 Slicing arrays
4.2.6 Pushing, popping, shifting and unshifting
4.2.7 Processing every element in an array
4.2.8 Making lists easier to work with
4.3 Hashes: Associating Data with Words
4.3.2 How big is the hash?
4.3.4 Removing entries from a hash
4.3.5 Slicing hashes
4.3.6 Working with hash entries: a complete example
4.3.7 Processing every entry in a hash
The Maxims Repeated
5.1 Named Blocks
5.2.1 Calling subroutines
5.3 Creating Subroutines
5.3.1 Processing parameters
5.3.2 Better processing of parameters
5.3.3 Even better processing of parameters
5.3.4 A more flexible drawline subroutine
5.3.5 Returning results
5.4 Visibility and Scope
5.4.1 Using private variables
5.4.2 Using global variables properly
5.4.3 The final version of drawline
5.5 In-built Subroutines
5.6 Grouping and Reusing Subroutines
5.6.1 Modules
5.8 CPAN: The Module Repository
5.8.1 Searching CPAN
5.8.2 Installing a CPAN module manually
5.8.4 A final word on CPAN modules
The Maxims Repeated
6.1.1 The standard streams: STDIN, STDOUT and STDERR
6.2 Reading Files
6.2.1 Determining the disk-file names
6.2.2 Opening the named disk-files
6.2.4 Putting it all together
6.2.5 Slurping
6.3 Writing Files
6.3.2 Variable interpolation
6.4 Chopping and Chomping
The Maxims Repeated
7.1 Pattern Basics
7.1.2 What makes regular expressions so special?
7.2.1 The + repetition metacharacter
7.2.2 The | alternation metacharacter
7.2.3 Metacharacter shorthand and character classes
7.2.4 More metacharacter shorthand
7.2.6 The ? and * optional metacharacters
7.2.7 The any character metacharacter
7.3.1 The \b word boundary metacharacter
7.3.3 The $ end-of-line metacharacter
7.4 The Binding Operators
7.5 Remembering What Was Matched
7.6 Greedy by Default
7.7 Alternative Pattern Delimiters
7.8 Another Useful Utility
7.9 Substitutions: Search and Replace
7.9.1 Substituting for whitespace
7.10 Finding a Sequence
The Maxims Repeated
8.2 Strictness
8.3 Perl One-liners
8.4 Running Other Programs from perl
8.5 Recovering from Errors
8.6 Sorting
8.7 HERE Documents
Where to from Here
The Maxims Repeated
II Working with Data
9.2 Downloading from the Web
9.2.1 Using wget to download PDB data-files
9.2.3 Smarter mirroring
9.2.4 Downloading a subset of a dataset
The Maxims Repeated
10.1 Introduction
10.2.1 X-Ray Crystallography
10.2.2 Nuclear magnetic resonance
10.3 The Protein Databank
10.4 The PDB Data-file Formats
10.4.1 Example structures
10.4.2 Downloading PDB data-files
10.5 Accessing Data in PDB Entries
10.6 Accessing PDB Annotation Data
10.6.1 Free R and resolution
10.6.2 Database cross references
10.6.3 Coordinates section
10.6.4 Extracting 3D coordinate data
10.7 Contact Maps
10.8 STRIDE: Secondary Structure Assignment
10.9 Assigning Secondary Structures
10.9.1 Using STRIDE and parsing the output
10.9.2 Extracting amino acid sequences using STRIDE
10.10 Introducing the mmCIF Protein Format
10.10.2 Converting mmCIFs to PDB with CIFTr
10.10.5 Automated conversion of mmCIF to PDB
The Maxims Repeated
11.1.1 Reasons for redundancy
11.1.3 Non-redundancy and non-representative
11.2 Non-redundant Protein Structures
The Maxims Repeated
12.1 Introducing Databases
12.1.1 Relating tables
12.1.3 Solving the one-table problem
12.2 Available Database Systems
12.2.3 Open source database systems
12.3.1 Defining data with SQL
12.4 A Database Case Study: MER
12.4.1 The requirement for the MER database
12.4.2 Installing a database system
12.4.3 Creating the MER database
12.4.4 Adding tables to the MER database
12.4.5 Preparing SWISS-PROT data for importation
12.4.6 Importing tab-delimited data into proteins
12.4.7 Working with the data in proteins
12.4.8 Adding another table to the MER database
12.4.9 Preparing EMBL data for importation
12.4.11 Working with the data in dnas
12.4.12 Relating data in one table to that in another
12.4.13 Adding the crossrefs table to the MER database
12.4.14 Preparing cross references for importation
12.4.16 Working with the data in crossrefs
12.4.17 Adding the citations table to the MER database
12.4.18 Preparing citation information for importation
12.4.20 Working with the data in citations
The Maxims Repeated
13.1 Why Program Databases?
13.2 Perl Database Technologies
13.3.1 Checking the DBI installation
13.4 Programming Databases with DBI
13.4.1 Developing a database utility module
13.4.2 Improving upon dump_results
13.5 Customising Output
13.6 Customising Input
13.7 Extending SQL
The Maxims Repeated
III Working with the Web
14.1 An Example of What’s Possible
14.3 Using SRS
The Maxims Repeated
15.1 The Web Development Infrastructure
15.2 Creating Content for the WWW
15.2.2 The dynamic creation of WWW content
15.3 Preparing Apache for Perl
15.3.1 Testing the execution of server-side programs
15.4 Sending Data to a Web Server
15.5 Web Databases
The Maxims Repeated
16.1 Why Automate Surfing?
16.2 Automated Surfing with Perl
Where to from Here
The Maxims Repeated
IV Working with Applications
17.1 Introduction
17.2 Sequence Databases
17.2.1 Understanding EMBL entries
17.2.2 Understanding SWISS-PROT entries
17.3 General Concepts and Methods
17.3.2 True/False/Negative/Positive
17.3.3 Balancing the errors
17.3.4 Using multiple algorithms to improve performance
17.3.5 tRNA-ScanSE, a case study
17.4 Introducing Bioinformatics Tools
17.4.1 ClustalW
17.4.2 Algorithms and methods
17.4.3 Installation and use
17.4.4 Substitution/scoring matrices
17.5 BLAST
17.5.1 Installing NCBI-BLAST
17.5.2 Preparation of database files for faster searching
17.5.3 The different types of BLAST search
The Maxims Repeated
18.1 Introduction
18.2.2 Genetic structure and regulation
18.2.3 Mobility of the Mer Operon
18.3 Downloading the Raw DNA Sequence
18.4 Initial BLAST Sequence Similarity Search
18.5 GeneMark
18.5.1 Using BLAST to identify specific sequences
18.5.2 Dealing with false negatives and missing proteins
18.5.3 Over-predicted genes and false positives
18.6 Structural Prediction with SWISS-MODEL
18.6.2 Modelling with SWISS-MODEL
18.7 DeepView as a Structural Alignment Tool
18.8 PROSITE and Sequence Motifs
18.8.1 Using PROSITE patterns and matrices
18.8.2 Downloading PROSITE and its search tools
18.9.1 A look at the HMA domain of MerA and MerP
Where to from Here?
The Maxims Repeated
19.1 Introducing Visualisation
19.2 Displaying Tabular Data Using HTML
19.2.1 Displaying SWISS-PROT identifiers
19.3 Creating High-quality Graphics with GD
19.3.1 Using the GD module
19.3.2 Displaying genes in EMBL entries
19.3.3 Introducing mogrify
19.4 Plotting Graphs
19.4.1 Graph-plotting using the GD::Graph modules
19.4.2 Graph-plotting using Grace
The Maxims Repeated
20.1 What is Bioperl?
20.3 Installing Bioperl
20.4 Using Bioperl: Fetching Sequences
20.4.1 Fetching multiple sequences
20.4.2 Extracting sub-sequences
20.5 Remote BLAST Searches
20.5.1 A quick aside: the blastcl3 NetBlast client
20.5.2 Parsing BLAST outputs
Where to from Here
The Maxims Repeated
A Appendix A
B Appendix B
C Appendix C
D Appendix D
E Appendix E
F Appendix F
Index