Performance, Reliability, and Availability Evaluation of Computational Systems, Volume 2: Reliability, Availability Modeling

This textbook is intended as a comprehensive and substantially self-contained two-volume work on performance, reliability, and availability evaluation. The volumes focus on computing systems, although the methods may also be applied to other systems. The text helps computer performance professionals plan, design, configure, and tune the performance, reliability, and availability of computing systems. Such professionals may also use these volumes to get acquainted with specific subjects by consulting the relevant chapters.

Volume II comprises the last two parts. Part III examines reliability and availability modeling. It covers a set of fundamental notions, definitions, redundancy procedures, and modeling methods such as Reliability Block Diagrams (RBD) and Fault Trees (FT) with their respective evaluation methods, and then adopts Markov chains, stochastic Petri nets, and hierarchical and heterogeneous modeling to represent more complex systems. Part IV discusses performance measurement and reliability data analysis. It first describes some basic measuring mechanisms applied in computer systems and then discusses workload generation. It next examines failure monitoring and fault injection, and finally discusses a set of techniques for reliability and maintainability data analysis.

A software failure is an out-of-specification result produced by a software system for a given specified input value. This definition is consistent with the definition of system failure. Hence, the reader might ask: why should we pay particular attention to software dependability issues? The reader should bear in mind that the scientific communities involved in software dependability have specific backgrounds, many of which are not rooted in system reliability. These communities have developed a specific vocabulary that does not necessarily match the common system dependability terminology.

Software research communities have long pursued dependable software systems. The correction of coding problems and software testing date back to the very origin of software development. Since then, the term software bug has been broadly applied to refer to mistakes, failures, and faults in a software system, whereas debugging refers to the methodical process of finding bugs.

Author(s): Paulo Romero Martins Maciel
Publisher: CRC Press
Year: 2023

Language: English
Pages: 748

Cover
Half Title
Title Page
Copyright Page
Dedication
Contents
Preface
Acknowledgement
Chapter 15: Introduction
PART III: Reliability and Availability Modeling
Chapter 16: Fundamentals of Dependability
16.1. A Brief History
16.2. Fundamental Concepts
16.3. Some Important Probability Distributions
Chapter 17: Redundancy
17.1. Hardware Redundancy
17.2. Software Redundancy
Chapter 18: Reliability Block Diagram
18.1. Models Classification
18.2. Basic Components
18.3. Logical and Structure Functions
18.4. Coherent System
18.5. Compositions
18.6. System Redundancy and Component Redundancy
18.7. Common Cause Failure
18.8. Paths and Cuts
18.9. Importance Indices
Chapter 19: Fault Tree
19.1. Components of a Fault Tree
19.2. Basic Compositions
19.3. Compositions
19.4. Common Cause Failure
Chapter 20: Combinatorial Model Analysis
20.1. Structure Function Method
20.2. Enumeration Method
20.3. Factoring Method
20.4. Reductions
20.5. Inclusion-Exclusion Method
20.6. Sum of Disjoint Products Method
20.7. Methods for Estimating Bounds
20.7.1. Method Based on Inclusion and Exclusion
20.7.2. Method Based on the Sum of Disjoint Products
20.7.3. Min-Max Bound Method
20.7.4. Esary-Proschan Method
20.7.5. Decomposition
Chapter 21: Modeling Availability, Reliability, and Capacity with CTMC
21.1. Single Component
21.2. Hot-Standby Redundancy
21.3. Hot-Standby with Non-Zero Delay Switching
21.4. Imperfect Coverage
21.5. Cold-Standby Redundancy
21.6. Warm-Standby Redundancy
21.7. Active-Active Redundancy
21.8. Many Similar Machines with Repair Facilities
21.9. Many Similar Machines with Shared Repair Facility
21.10. Phase-Type Distribution and Preventive Maintenance
21.11. Two-States Availability Equivalent Model
21.12. Common Cause Failure
Chapter 22: Modeling Availability, Reliability, and Capacity with SPN
22.1. Single Component
22.2. Modeling TTF and TTR with Phase-Type Distribution
22.3. Hot-Standby Redundancy
22.4. Imperfect Coverage
22.5. Cold-Standby Redundancy
22.6. Warm-Standby Redundancy
22.7. Active-Active Redundancy
22.8. KooN Redundancy
22.8.1. Modeling Multiple Resources on Multiple Servers
22.9. Corrective Maintenance
22.10. Preventive Maintenance
22.11. Common Cause Failure
22.12. Some Additional Models
22.12.1. Data Center Disaster Recovery
22.12.2. Disaster Tolerant Cloud Systems
22.12.3. MHealth System Infrastructure
PART IV: Measuring and Data Analysis
Chapter 23: Performance Measuring
23.1. Basic Concepts
23.2. Measurement Strategies
23.3. Basic Performance Metrics
23.4. Counters and Timers
23.5. Measuring Short Time Intervals
23.6. Profiling
23.6.1. Deterministic Profiling
23.6.2. Statistical Profiling
23.7. Counters and Basic Performance Tools in Linux
23.7.1. System Information
23.7.2. Process Information
23.8. Final Comments
Chapter 24: Workload Characterization
24.1. Types of Workloads
24.2. Workload Generation
24.2.1. Benchmarks
24.2.2. Synthetic Operational Workload Generation
24.3. Workload Modeling
24.3.1. Modeling Workload Impact
24.3.2. Modeling Intended Workload
Chapter 25: Lifetime Data Analysis
25.1. Introduction
25.1.1. Reliability Data Sources
25.1.2. Censoring
25.2. Non-Parametric Methods
25.2.1. Ungrouped Complete Data Method
25.2.2. Grouped Complete Data Method
25.2.3. Ungrouped Multiply Censored Data Method
25.2.4. Kaplan-Meier Method
25.3. Parametric Methods
25.3.1. Graphical Methods
25.3.2. Method of Moments
25.3.3. Maximum Likelihood Estimation
25.3.4. Confidence Intervals
Chapter 26: Fault Injection and Failure Monitoring
26.1. Fault Acceleration
26.2. Some Notable Fault Injection Tools
26.3. Software-Based Fault Injection
Bibliography
Appendix A: MTTF 2oo5
Appendix B: Whetstone
Appendix C: Linpack_Bench
Appendix D: Livermore Loops
Appendix E: MMP - CTMC Trace Generator