This textbook provides an introduction to the free software Python and its use for statistical data analysis. It covers common statistical tests for continuous, discrete and categorical data, as well as linear regression analysis and topics from survival analysis and Bayesian statistics. Working code and data for Python solutions for each test, together with easy-to-follow Python examples, can be reproduced by the reader and reinforce their immediate understanding of the topic. With recent advances in the Python ecosystem, Python has become a popular language for scientific computing, offering a powerful environment for statistical data analysis and an interesting alternative to R. The book is intended for master's and PhD students, mainly from the life and medical sciences, with a basic knowledge of statistics. As it also provides some statistics background, the book can be used by anyone who wants to perform a statistical data analysis.
Author(s): Thomas Haslwanter
Publisher: Springer
Year: 2016
Language: English
Pages: 278
Preface
For Whom This Book Is
Additional Material
Acknowledgments
Contents
Acronyms
Part I Python and Statistics
1 Why Statistics?
2 Python
2.1 Getting Started
2.1.1 Conventions
2.1.2 Distributions and Packages
a) Python Packages for Statistics
b) PyPI: The Python Package Index
2.1.3 Installation of Python
a) Under Windows
b) Under Linux
c) Under Mac OS X
2.1.4 Installation of R and rpy2
a) Under Windows
b) Under Linux
2.1.5 Personalizing IPython/Jupyter
a) In Windows
b) In Linux
c) In Mac OS X
2.1.6 Python Resources
2.1.7 First Python Programs
a) Hello World
b) SquareMe
2.2 Python Data Structures
2.2.1 Python Datatypes
2.2.2 Indexing and Slicing
2.2.3 Vectors and Arrays
2.3 IPython/Jupyter: An Interactive Programming Environment
2.3.1 First Session with the Qt Console
2.3.2 Notebook and rpy2
a) The Notebook
b) rpy2
2.3.3 IPython Tips
2.4 Developing Python Programs
2.4.1 Converting Interactive Commands into a Python Program
2.4.2 Functions, Modules, and Packages
a) Functions
b) Modules
2.4.3 Python Tips
2.4.4 Code Versioning
2.5 Pandas: Data Structures for Statistics
2.5.1 Data Handling
a) Common Procedures
b) Notes on Data Selection
2.5.2 Grouping
2.6 Statsmodels: Tools for Statistical Modeling
2.7 Seaborn: Data Visualization
2.8 General Routines
2.9 Exercises
3 Data Input
3.1 Input from Text Files
3.1.1 Visual Inspection
3.1.2 Reading ASCII-Data into Python
a) Simple Text-Files
b) More Complex Text-Files
c) Regular Expressions
3.2 Input from MS Excel
3.3 Input from Other Formats
3.3.1 Matlab
4 Display of Statistical Data
4.1 Datatypes
4.1.1 Categorical
a) Boolean
b) Nominal
c) Ordinal
4.1.2 Numerical
a) Numerical Continuous
b) Numerical Discrete
4.2 Plotting in Python
4.2.1 Functional and Object-Oriented Approaches to Plotting
4.2.2 Interactive Plots
4.3 Displaying Statistical Datasets
4.3.1 Univariate Data
a) Scatter Plots
b) Histograms
c) Kernel-Density-Estimation (KDE) Plots
d) Cumulative Frequencies
e) Error-Bars
f) Box Plots
g) Grouped Bar Charts
h) Pie Charts
i) Programs: Data Display
4.3.2 Bivariate and Multivariate Plots
a) Bivariate Scatter Plots
b) 3D Plots
4.4 Exercises
Part II Distributions and Hypothesis Tests
5 Background
5.1 Populations and Samples
5.2 Probability Distributions
5.2.1 Discrete Distributions
5.2.2 Continuous Distributions
5.2.3 Expected Value and Variance
a) Expected Value
b) Variance
5.3 Degrees of Freedom
5.4 Study Design
5.4.1 Terminology
5.4.2 Overview
5.4.3 Types of Studies
a) Observational or Experimental
b) Prospective or Retrospective
c) Longitudinal or Cross-Sectional
d) Case–Control and Cohort Studies
e) Randomized Controlled Trial
f) Crossover Studies
5.4.4 Design of Experiments
a) Sample Selection
b) Sample Size
c) Bias
d) Randomization
e) Blinding
f) Factorial Design
5.4.5 Personal Advice
1) Preliminary Investigations and Murphy's Law
2) Calibration Runs
3) Documentation
4) Data Storage
5.4.6 Clinical Investigation Plan
6 Distributions of One Variable
6.1 Characterizing a Distribution
6.1.1 Distribution Center
a) Mean
b) Median
c) Mode
d) Geometric Mean
6.1.2 Quantifying Variability
a) Range
b) Percentiles
c) Standard Deviation and Variance
d) Standard Error
e) Confidence Intervals
6.1.3 Parameters Describing the Form of a Distribution
a) Location
b) Scale
c) Shape Parameters
6.1.4 Important Presentations of Probability Densities
6.2 Discrete Distributions
6.2.1 Bernoulli Distribution
6.2.2 Binomial Distribution
b) Example: Binomial Test
6.2.3 Poisson Distribution
6.3 Normal Distribution
6.3.1 Examples of Normal Distributions
6.3.2 Central Limit Theorem
6.3.3 Distributions and Hypothesis Tests
6.4 Continuous Distributions Derived from the Normal Distribution
6.4.1 t-Distribution
6.4.2 Chi-Square Distribution
a) Definition
b) Application Example
6.4.3 F-Distribution
a) Definition
b) Application Example
6.5 Other Continuous Distributions
6.5.1 Lognormal Distribution
6.5.2 Weibull Distribution
6.5.3 Exponential Distribution
6.5.4 Uniform Distribution
6.6 Exercises
7 Hypothesis Tests
7.1 Typical Analysis Procedure
7.1.1 Data Screening and Outliers
7.1.2 Normality Check
a) Probability-Plots
b) Tests for Normality
7.1.3 Transformation
7.2 Hypothesis Concept, Errors, p-Value, and Sample Size
7.2.1 An Example
7.2.2 Generalization and Applications
a) Generalization
b) Additional Examples
7.2.3 The Interpretation of the p-Value
7.2.4 Types of Error
a) Type I Errors
b) Type II Errors and Test Power
c) Pitfalls in the Interpretation of p-Values
7.2.5 Sample Size
a) Examples
b) Python Solution
c) Programs: Sample Size
7.3 Sensitivity and Specificity
7.3.1 Related Calculations
7.4 Receiver-Operating-Characteristic (ROC) Curve
8 Tests of Means of Numerical Data
8.1 Distribution of a Sample Mean
8.1.1 One Sample t-Test for a Mean Value
a) Example
8.1.2 Wilcoxon Signed Rank Sum Test
8.2 Comparison of Two Groups
8.2.1 Paired t-Test
8.2.2 t-Test between Independent Groups
8.2.3 Nonparametric Comparison of Two Groups: Mann–Whitney Test
8.2.4 Statistical Hypothesis Tests vs Statistical Modeling
a) Classical t-Test
b) Statistical Modeling
8.3 Comparison of Multiple Groups
8.3.1 Analysis of Variance (ANOVA)
a) Principle
b) Example: One-Way ANOVA
8.3.2 Multiple Comparisons
a) Tukey's Test
b) Bonferroni Correction
c) Holm Correction
8.3.3 Kruskal–Wallis Test
8.3.4 Two-Way ANOVA
8.3.5 Three-Way ANOVA
8.4 Summary: Selecting the Right Test for Comparing Groups
8.4.1 Typical Tests
8.4.2 Hypothetical Examples
8.5 Exercises
9 Tests on Categorical Data
9.1 One Proportion
9.1.1 Confidence Intervals
9.1.2 Explanation
9.1.3 Example
9.2 Frequency Tables
9.2.1 One-Way Chi-Square Test
9.2.2 Chi-Square Contingency Test
a) Assumptions
b) Degrees of Freedom
c) Example 1
d) Example 2
e) Comments
9.2.3 Fisher's Exact Test
a) Example: "A Lady Tasting Tea"
9.2.4 McNemar's Test
a) Example
9.2.5 Cochran's Q Test
a) Example
9.3 Exercises
10 Analysis of Survival Times
10.1 Survival Distributions
10.2 Survival Probabilities
10.2.1 Censorship
10.2.2 Kaplan–Meier Survival Curve
10.3 Comparing Survival Curves in Two Groups
Part III Statistical Modeling
11 Linear Regression Models
11.1 Linear Correlation
11.1.1 Correlation Coefficient
11.1.2 Rank Correlation
11.2 General Linear Regression Model
11.2.1 Example 1: Simple Linear Regression
11.2.2 Example 2: Quadratic Fit
11.2.3 Coefficient of Determination
a) Relation to Unexplained Variance
b) "Good" Fits
11.3 Patsy: The Formula Language
11.3.1 Design Matrix
a) Definition
b) Examples
11.4 Linear Regression Analysis with Python
11.4.1 Example 1: Line Fit with Confidence Intervals
11.4.2 Example 2: Noisy Quadratic Polynomial
11.5 Model Results of Linear Regression Models
11.5.1 Example: Tobacco and Alcohol in the UK
11.5.2 Definitions for Regression with Intercept
11.5.3 The R² Value
11.5.4 R̄²: The Adjusted R² Value
a) The F-Test
b) Log-Likelihood Function
c) Information Content of Statistical Models: AIC and BIC
11.5.5 Model Coefficients and Their Interpretation
a) Coefficients
b) Standard Error
c) t-Statistic
d) Confidence Interval
11.5.6 Analysis of Residuals
a) Skewness and Kurtosis
b) Omnibus Test
c) Durbin–Watson
d) Jarque–Bera Test
e) Condition Number
11.5.7 Outliers
11.5.8 Regression Using Sklearn
11.5.9 Conclusion
11.6 Assumptions of Linear Regression Models
11.7 Interpreting the Results of Linear Regression Models
11.8 Bootstrapping
11.9 Exercises
12 Multivariate Data Analysis
12.1 Visualizing Multivariate Correlations
12.1.1 Scatterplot Matrix
12.1.2 Correlation Matrix
12.2 Multilinear Regression
13 Tests on Discrete Data
13.1 Comparing Groups of Ranked Data
13.2 Logistic Regression
13.2.1 Example: The Challenger Disaster
13.3 Generalized Linear Models
13.3.1 Exponential Family of Distributions
13.3.2 Linear Predictor and Link Function
13.4 Ordinal Logistic Regression
13.4.1 Problem Definition
13.4.2 Optimization
13.4.3 Code
13.4.4 Performance
14 Bayesian Statistics
14.1 Bayesian vs. Frequentist Interpretation
14.1.1 Bayesian Example
14.2 The Bayesian Approach in the Age of Computers
14.3 Example: Analysis of the Challenger Disaster with a Markov-Chain–Monte-Carlo Simulation
14.4 Summing Up
Solutions
Problems of Chap. 2
Problems of Chap. 4
Problems of Chap. 6
Problems of Chap. 8
Problems of Chap. 9
Problems of Chap. 11
Glossary
References
Index