This book broadly covers the spectrum of disciplines in Computational Life Sciences, making it a strong helping hand for teachers, students, practitioners and researchers. In the Life Sciences, problem-solving and data analysis often depend on biological expertise combined with the technical skills needed to generate, manage and efficiently analyse big data. These technical skills can be enhanced by good theoretical foundations, developed from well-chosen practical examples and inspiring new strategies. This is the innovative approach of Computational Life Sciences: Data Engineering and Data Mining for Life Sciences. We present basic concepts, advanced topics and emerging technologies, introduce algorithm design and programming principles, and address data mining and knowledge discovery as well as applications arising from real projects. Chapters are largely independent and often accompanied by illustrative examples and practical advice.
Author(s): Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, Alexander Apke
Series: Studies in Big Data, 112
Publisher: Springer
Year: 2023
Language: English
Pages: 592
City: Cham
Preface
Contents
Contributors
Solving Problems in Life Sciences: On Programming Languages and Basic Concepts
Interesting Programming Languages Used in Life Sciences
1 Introduction
2 Julia
2.1 The Theoretical Appeal
2.2 A Real World Example
3 Perl/Raku
3.1 The Theoretical Appeal
3.2 A Real World Example
4 Spreadsheets
4.1 The Theoretical Appeal
4.2 A Tale of Algorithm Distribution
4.3 How Vlookup Solved a Data Protection Issue
5 (Not Your Grandparent's) SQL
5.1 The Theoretical Appeal
5.2 Two Small Epiphanies
6 R
6.1 The Theoretical Appeal
6.2 An Ad-hoc Overview for a Messy Dataset
7 Python
7.1 The Theoretical Appeal
7.2 Tying it All Together in One Small Application
8 Java
Introduction to Java
1 Getting Ready
1.1 Installation of JDK
1.2 Running Java from Command Line
1.3 Set Up Your Environment: Eclipse
1.4 Installing Version Control Software: SVN and/or Git
1.5 Installing and Using Maven
2 Your First Java Project
2.1 Creating a New Project
2.2 Creating a New Java Class
2.3 Sharing a Project via SVN
2.4 Sharing a Project via Git
3 Java Basics
3.1 Variable Declaration
3.2 Comparison
3.3 Arrays
3.4 Loops
3.5 Adding Extensions
3.6 Exceptions
4 Adding External Libraries
4.1 Adding an External Jar-File
4.2 Building External Libraries
Basic Data Processing
1 Data Architecture and Data Modeling
1.1 A Primer on Object-Oriented Programming
1.2 How Objects Are Represented in Java and Can Be Used to Store Data
1.3 Classes: How to Create an Object from a Class
1.4 More Information on Class Inheritance
2 Using Lists and Other Data Structures
2.1 Using ArrayList
2.2 Using LinkedList
2.3 Using Collections and Stack
2.4 Sorting a Collection
3 Handling Parameters
4 Reading and Writing Files and Data
4.1 Text Files
4.2 Tables
4.3 Pictures and Other Binary Data
5 Basic Mathematics and Statistics
Algorithm Design
1 A Simple Algorithm
2 Modeling Real World Problems
3 Running Times of Algorithms
3.1 The Big O Notation
3.2 Calculation Rules for the Big O Notation
3.3 Determining the Asymptotic Running Time of an Algorithm
4 A Faster Search Algorithm
5 Introduction to Complexity Theory
5.1 NP-complete Problems
6 Basic Concepts of Algorithm Design
6.1 Divide and Conquer
6.2 Dynamic Programming
6.3 Recursion
6.4 Greedy Heuristics
References
Data Mining and Knowledge Discovery
Data and Knowledge Management
1 Data, Information, Knowledge, Wisdom
1.1 Data Processing and Workflows
1.2 Scientific Data and Data Life Cycle
2 Data Engineering Techniques
2.1 Data Collection
2.2 Data Processing
2.3 Data Analyses
2.4 Data Storage
2.5 Data Re-use
3 Technical, Ethical and Social Issues
3.1 Technical Issues
3.2 Ethical and Social Issues
References
Databases and Knowledge Graphs
1 Introduction
1.1 Relational Database Concepts
2 Java: JDBC
2.1 SQLite
2.2 H2
3 Knowledge Graphs and noSQL Databases
3.1 A Primer on Knowledge Graphs
3.2 A Primer on noSQL Databases
3.3 Python: Neo4J
3.4 Link Prediction on Large Scale Knowledge Graphs
3.5 Machine Learning
Knowledge Discovery and AI Approaches for the Life Sciences
1 Knowledge Representation: Describing Complex Objects and Data
1.1 Structured Data
1.2 Unstructured Data
1.3 Problems
1.4 Using XML
1.5 Using JSON
1.6 Ontology Engineering for the Semantic Web: RDF and OWL
2 Knowledge Discovery: Methods for Efficient Data Processing on the Web
3 Basic Descriptive Statistics
3.1 Scales of Measurement
3.2 Frequencies and Statistical Value
3.3 Bivariate Statistics
3.4 Basic Methods of Inferential Statistics: t-Test, Analysis of Variance, Regression
4 AI Approaches for Life Sciences
4.1 Classification and Clustering
4.2 Binning
4.3 Hashing
4.4 Machine Learning Approaches for Classification
5 Personalized Medicine
5.1 Unmet Medical and Patient Needs
5.2 Properties of Biomedical Data and Challenges
5.3 Standardization and Harmonization
5.4 Perspectives on Personalized Medicine
References
Longitudinal Data
1 Sparse Data
1.1 Are we Longitudinal Yet?
1.2 Removing Mean Effects and Accessing Variability
2 Smoothing and Modelling Data
2.1 Locally Smoothing Data
2.2 Modelling Data and Adjusting for Phase Variation
2.3 Maximum Likelihood and Bayesian Approaches
2.4 Obtaining Your Model and Further Analysis
3 A Playful Dataset
Distributed Computing and Clouds
Computational Grids
1 Early Beginnings
2 Grid Computing
2.1 Grid Middleware
2.2 Site Autonomy
2.3 Using Resources of More Than One Grid: Grid Federation
3 Using Grid Technology in Life Sciences
3.1 Text Mining in Grids
3.2 Drug Discovery in Grids
4 How to Use Grid Resources as of 2021
5 Summary
References
Cloud Computing
1 Cloud Computing in a Nutshell
1.1 Commercial Cloud Providers
1.2 Cloud Access Patterns
1.3 Open Cloud Middleware
1.4 Service Level Agreements
2 Using Cloud Technology in Life Sciences
2.1 Life Sciences Applications in EGI
2.2 Life Sciences Applications in Helix Nebula Science Cloud
2.3 Life Sciences Applications in EOSC
2.4 Life Sciences Applications in ELIXIR
3 Summary
References
Standards
1 Grid Standards
1.1 Working Groups
1.2 Recommendations
2 Cloud Standards
2.1 Institute of Electrical and Electronics Engineers
2.2 International Organization for Standardization/International Electrotechnical Commission ISO/IEC
2.3 International Telecommunication Union—Telecommunication Standardization Sector
2.4 National Institute of Standards and Technology
2.5 Open Grid Forum
2.6 OASIS Open
2.7 European Telecommunications Standards Institute
2.8 DMTF
2.9 ATIS
2.10 Global Inter-Cloud Technology Forum
2.11 SNIA
2.12 TIA
3 Summary
References
Advanced Topics in Computational Life Sciences
Network Analysis: Bringing Graphs to Java
1 Directed Graphs
1.1 Food Chains
1.2 Social Relation and Between-Species Interaction
2 Undirected Graphs
2.1 Protein Interaction Network
2.2 Similarity Graph
3 Some More Examples
3.1 Substructure and Maximal Common Substructure Searching
3.2 Random Graphs
3.3 Social Networks
3.4 Directed Protein Interaction Networks
References
Optimization
1 Linear Optimization
1.1 Formulation of an LP
1.2 Solving an LP with lpsolve
1.3 Possible States of an LP
1.4 Geometrical Approach
1.5 Algorithmic Aspects
2 Combinatorial Optimization
2.1 Integer Programs
2.2 Dynamic Programming
2.3 Branch-and-Bound
2.4 Local Search Heuristics
2.5 Hill Climbing
2.6 Simulated Annealing
2.7 Concluding Remarks
References
Image Processing and Manipulation
1 Using ImageJ as a Library
1.1 Reading and Writing Pictures
1.2 Using the ImageProcessor
1.3 Creating New Images and Destroying Images
1.4 Basic Image Manipulations
1.5 Particle Analyses
1.6 Classifying Objects
1.7 Colour Analysis
2 Other Libraries
3 Building an Analysis Pipeline
3.1 Bash Scripts
3.2 Parallel Environments
References
Sequence Analysis
1 Basics in Sequence Analysis
1.1 Of Molecules and Codes
1.2 From Subsequences to Functions
1.3 Molecular Genetics and Beyond
1.4 Computing on Biological Sequences
2 Introduction to BioJava
3 Reading and Writing FASTA
4 Database Search
5 NGS Sequences in Java
6 Sequence Alignment
6.1 Multiple Sequence Alignment
6.2 BLAST
7 Summary
References
Applications and Emerging Technologies
NGS Data Analysis with Apache Spark
1 Next-Generation Sequencing
1.1 Definition
1.2 Illumina Sequencing
1.3 NGS File Formats
2 FASTQC Software
2.1 Introduction
2.2 Interpretation of the FastQC Report
3 Introduction to Apache Spark
3.1 Apache Spark—Main Concepts
3.2 Apache Spark Main Features
3.3 Using the Best of Apache Spark
3.4 Spark Versus Hadoop
3.5 Spark Installation in Standalone Mode in Ubuntu
4 Implementation
5 Results
6 Conclusion
References
Plant Image Analysis
1 Introduction
2 Materials and Methods
2.1 Data
2.2 Segmentation
2.3 Object Recognition
2.4 Object Analysis
2.5 Explorative Data Analysis
3 Results
3.1 Comparison of Leaf Count to Ground Truth for the A1 and A2 Datasets
3.2 Overview of the A2 Dataset
3.3 Analyses of Correlations
4 Discussion
4.1 Challenges in the Identification of Plants
4.2 Object Analysis
4.3 Quality of the Analysis of the A1 Dataset
4.4 Correlations in Leaf Characteristics
5 Conclusion
References
Anonymization of Electronic Health Care Records: The EHR Anonymizer
1 Introduction: Electronic Health Care Data
1.1 EHR's and the Problem of Data Privacy
1.2 Anonymization
1.3 The EHR Anonymizer
2 Methods
2.1 File Handling
2.2 EHR and Annotations
2.3 BRAT Rapid Annotation Tool
2.4 Design of the GUI
2.5 Feedback Loop
3 Results
3.1 Annotation Performance
3.2 Statistical Metrics
4 Discussion
4.1 Aim of the Project
4.2 Results Interpretation
4.3 Future Work
References
Metadata-Enriched Image Database: A Tool for Generating and Interacting with an Image Database Using Metadata
1 Introduction
2 User Understanding
2.1 Determine User Objectives
2.2 Situation Assessment
2.3 Application Goal
2.4 Project Plan
3 Data Understanding
3.1 Data Exploration
3.2 Project Plan Extension
4 Background
4.1 SQLite Database Engine
4.2 RESTful API with Spring
5 Implementation
5.1 Command Line Interface
5.2 SQLite Databases
5.3 Documentation About Tables and Data Structures
5.4 Web Service
6 Evaluation
6.1 Result Evaluation
6.2 Process Review
6.3 Future Steps
7 Deployment
References
Biomedical Knowledge Graphs: Context, Queries and Complexity
1 Background
1.1 Preliminaries
1.2 Method
2 Results
2.1 Real World Usecases for Testing
2.2 Storing the Knowledge Graph
2.3 Polyglot Persistence Systems
2.4 Graph Queries
3 Discussion
3.1 Knowledge Discovery on Custom Layers
3.2 Missing Data
3.3 Performance
3.4 Context Based NLP
3.5 Answering Semantic Questions and FAIRification of Data
3.6 Perspectives for Personalised Medicine
4 Conclusion
References
Classification of Images from Biomedical Literature
1 Introduction
2 Background
2.1 Pre-processing of the Data
2.2 Logistic Regression
2.3 RESTful API
3 Workflow
3.1 Data Acquisition
3.2 Data Storage
3.3 Pre-processing of the Data
3.4 Machine Learning
3.5 RESTful API
3.6 Command-Line Application
4 Results
4.1 Pre-processing of the Data
4.2 Machine Learning
4.3 Command-Line Application
4.4 Web Application
5 Conclusion and Outlook
5.1 Pre-processing of the Data
5.2 Machine Learning
5.3 RESTful API
References
Index