Fast Python: High performance techniques for large datasets

Master Python techniques and libraries to reduce run times, efficiently handle huge datasets, and optimize execution for complex machine learning applications.

Fast Python is a toolbox of techniques for high-performance Python, including:
• Writing efficient pure-Python code
• Optimizing the NumPy and pandas libraries
• Rewriting critical code in Cython
• Designing persistent data structures
• Tailoring code for different architectures
• Implementing Python GPU computing

Fast Python is your guide to optimizing every part of your Python-based data analysis process, from the pure Python code you write to managing the resources of modern hardware and GPUs. You'll learn to rewrite inefficient data structures, improve underperforming code with multithreading, and simplify your datasets without sacrificing accuracy. Written for experienced practitioners, this book dives right into practical solutions for improving computation and storage efficiency. You'll experiment with fun and interesting examples such as rewriting games in Cython and implementing a MapReduce framework from scratch. Finally, you'll go deep into Python GPU computing and learn how modern hardware has rehabilitated some former antipatterns and made counterintuitive ideas the most efficient way of working.

About the Technology
Face it: slow code will kill a big data project. Fast pure-Python code, optimized libraries, and fully utilized multiprocessor hardware are the price of entry for machine learning and large-scale data analysis. What you need are reliable solutions that respond faster to computing requirements while using fewer resources and saving money.

About the Book
Fast Python is a toolbox of techniques for speeding up Python, with an emphasis on big data applications. Following the clear examples and precisely articulated details, you'll learn how to use common libraries like NumPy and pandas in more performant ways and how to transform data for efficient storage and I/O. More importantly, Fast Python takes a holistic approach to performance, so you'll see how to optimize the whole system, from code to architecture.

What's Inside
• Rewriting critical code in Cython
• Designing persistent data structures
• Tailoring code for different architectures
• Implementing Python GPU computing

About the Reader
For intermediate Python programmers familiar with the basics of concurrency.

About the Author
Tiago Antão is one of the co-authors of Biopython, a major bioinformatics package written in Python.
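To give a flavor of the kind of data-structure rewrite the blurb mentions, here is a minimal illustrative sketch (not code from the book) in the spirit of section 2.3, which compares searching a list against searching a set:

    import timeit

    haystack_list = list(range(1_000_000))
    haystack_set = set(haystack_list)

    # Membership testing on a list scans elements one by one: O(n).
    print(timeit.timeit(lambda: 999_999 in haystack_list, number=100))

    # Membership testing on a set is a hash lookup: O(1) on average.
    print(timeit.timeit(lambda: 999_999 in haystack_set, number=100))

On a typical machine the set lookup is orders of magnitude faster for a worst-case element, which is exactly the sort of low-effort, high-payoff change chapter 2 is about.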

Author(s): Tiago Rodrigues Antao
Edition: 1
Publisher: Manning
Year: 2023

Language: English
Commentary: Publisher's PDF
Pages: 304
City: Shelter Island, NY
Tags: Python; Concurrency; Asynchronous Programming; Parallel Programming; MapReduce; Apache Parquet; Profiling; High Performance; Laziness; NumPy; pandas; GPU Programming; Generators; Cython; Dask; Data Processing; Apache Arrow; Zarr

Fast Python
contents
preface
acknowledgments
about this book
Who should read this book?
How this book is organized: A road map
About the code
liveBook discussion forum
Hardware and software
about the author
about the cover illustration
Part 1: Foundational Approaches
Chapter 1: An urgent need for efficiency in data processing
1.1 How bad is the data deluge?
1.2 Modern computing architectures and high-performance computing
1.2.1 Changes inside the computer
1.2.2 Changes in the network
1.2.3 The cloud
1.3 Working with Python’s limitations
1.3.1 The Global Interpreter Lock
1.4 A summary of the solutions
Chapter 2: Extracting maximum performance from built-in features
2.1 Profiling applications with both IO and computing workloads
2.1.1 Downloading data and computing minimum temperatures
2.1.2 Python’s built-in profiling module
2.1.3 Using local caches to reduce network usage
2.2 Profiling code to detect performance bottlenecks
2.2.1 Visualizing profiling information
2.2.2 Line profiling
2.2.3 The takeaway: Profiling code
2.3 Optimizing basic data structures for speed: Lists, sets, and dictionaries
2.3.1 Performance of list searches
2.3.2 Searching using sets
2.3.3 List, set, and dictionary complexity in Python
2.4 Finding excessive memory allocation
2.4.1 Navigating the minefield of Python memory estimation
2.4.2 The memory footprint of some alternative representations
2.4.3 Using arrays as a compact representation alternative to lists
2.4.4 Systematizing what we have learned: Estimating memory usage of Python objects
2.4.5 The takeaway: Estimating memory usage of Python objects
2.5 Using laziness and generators for big-data pipelining
2.5.1 Using generators instead of standard functions
Chapter 3: Concurrency, parallelism, and asynchronous processing
3.1 Writing the scaffold of an asynchronous server
3.1.1 Implementing the scaffold for communicating with clients
3.1.2 Programming with coroutines
3.1.3 Sending complex data from a simple synchronous client
3.1.4 Alternative approaches to interprocess communication
3.1.5 The takeaway: Asynchronous programming
3.2 Implementing a basic MapReduce engine
3.2.1 Understanding MapReduce frameworks
3.2.2 Developing a very simple test scenario
3.2.3 A first attempt at implementing a MapReduce framework
3.3 Implementing a concurrent version of a MapReduce engine
3.3.1 Using concurrent.futures to implement a threaded server
3.3.2 Asynchronous execution with futures
3.3.3 The GIL and multithreading
3.4 Using multiprocessing to implement MapReduce
3.4.1 A solution based on concurrent.futures
3.4.2 A solution based on the multiprocessing module
3.4.3 Monitoring the progress of the multiprocessing solution
3.4.4 Transferring data in chunks
3.5 Tying it all together: An asynchronous multithreaded and multiprocessing MapReduce server
3.5.1 Architecting a complete high-performance solution
3.5.2 Creating a robust version of the server
Chapter 4: High-performance NumPy
4.1 Understanding NumPy from a performance perspective
4.1.1 Copies vs. views of existing arrays
4.1.2 Understanding NumPy’s view machinery
4.1.3 Making use of views for efficiency
4.2 Using array programming
4.2.1 The takeaway
4.2.2 Broadcasting in NumPy
4.2.3 Applying array programming
4.2.4 Developing a vectorized mentality
4.3 Tuning NumPy’s internal architecture for performance
4.3.1 An overview of NumPy dependencies
4.3.2 How to tune NumPy in your Python distribution
4.3.3 Threads in NumPy
Part 2: Hardware
Chapter 5: Re-implementing critical code with Cython
5.1 Overview of techniques for efficient code re-implementation
5.2 A whirlwind tour of Cython
5.2.1 A naive implementation in Cython
5.2.2 Using Cython annotations to increase performance
5.2.3 Why annotations are fundamental to performance
5.2.4 Adding typing to function returns
5.3 Profiling Cython code
5.3.1 Using Python’s built-in profiling infrastructure
5.3.2 Using line_profiler
5.4 Optimizing array access with Cython memoryviews
5.4.1 The takeaway
5.4.2 Cleaning up all internal interactions with Python
5.5 Writing NumPy generalized universal functions in Cython
5.5.1 The takeaway
5.6 Advanced array access in Cython
5.6.1 Bypassing the GIL’s limitation on running multiple threads at a time
5.6.2 Basic performance analysis
5.6.3 A spacewar example using Quadlife
5.7 Parallelism with Cython
Chapter 6: Memory hierarchy, storage, and networking
6.1 How modern hardware architectures affect Python performance
6.1.1 The counterintuitive effect of modern architectures on performance
6.1.2 How CPU caching affects algorithm efficiency
6.1.3 Modern persistent storage
6.2 Efficient data storage with Blosc
6.2.1 Compress data; save time
6.2.2 Read speeds (and memory buffers)
6.2.3 The effect of different compression algorithms on storage performance
6.2.4 Using insights about data representation to increase compression
6.3 Accelerating NumPy with NumExpr
6.3.1 Fast expression processing
6.3.2 How hardware architecture affects our results
6.3.3 When NumExpr is not appropriate
6.4 The performance implications of using the local network
6.4.1 The sources of inefficiency with REST calls
6.4.2 A naive client based on UDP and msgpack
6.4.3 A UDP-based server
6.4.4 Dealing with basic recovery on the client side
6.4.5 Other suggestions for optimizing network computing
Part 3: Applications and Libraries for Modern Data Processing
Chapter 7: High-performance pandas and Apache Arrow
7.1 Optimizing memory and time when loading data
7.1.1 Compressed vs. uncompressed data
7.1.2 Type inference of columns
7.1.3 The effect of data type precision
7.1.4 Recoding and reducing data
7.2 Techniques to increase data analysis speed
7.2.1 Using indexing to accelerate access
7.2.2 Row iteration strategies
7.3 pandas on top of NumPy, Cython, and NumExpr
7.3.1 Explicit use of NumPy
7.3.2 pandas on top of NumExpr
7.3.3 Cython and pandas
7.4 Reading data into pandas with Arrow
7.4.1 The relationship between pandas and Apache Arrow
7.4.2 Reading a CSV file
7.4.3 Analyzing with Arrow
7.5 Using Arrow interop to delegate work to more efficient languages and systems
7.5.1 Implications of Arrow’s language interop architecture
7.5.2 Zero-copy operations on data with Arrow’s Plasma server
Chapter 8: Storing big data
8.1 A unified interface for file access: fsspec
8.1.1 Using fsspec to search for files in a GitHub repo
8.1.2 Using fsspec to inspect zip files
8.1.3 Accessing files using fsspec
8.1.4 Using URL chaining to traverse different filesystems transparently
8.1.5 Replacing filesystem backends
8.1.6 Interfacing with PyArrow
8.2 Parquet: An efficient format to store columnar data
8.2.1 Inspecting Parquet metadata
8.2.2 Column encoding with Parquet
8.2.3 Partitioning with datasets
8.3 Dealing with larger-than-memory datasets the old-fashioned way
8.3.1 Memory mapping files with NumPy
8.3.2 Chunk reading and writing of data frames
8.4 Zarr for large-array persistence
8.4.1 Understanding Zarr’s internal structure
8.4.2 Storage of arrays in Zarr
8.4.3 Creating a new array
8.4.4 Parallel reading and writing of Zarr arrays
Part 4: Advanced Topics
Chapter 9: Data analysis using GPU computing
9.1 Making sense of GPU computing power
9.1.1 Understanding the advantages of GPUs
9.1.2 The relationship between CPUs and GPUs
9.1.3 The internal architecture of GPUs
9.1.4 Software architecture considerations
9.2 Using Numba to generate GPU code
9.2.1 Installation of GPU software for Python
9.2.2 The basics of GPU programming with Numba
9.2.3 Revisiting the Mandelbrot example using GPUs
9.2.4 A NumPy version of the Mandelbrot code
9.3 Performance analysis of GPU code: The case of a CuPy application
9.3.1 GPU-based data analysis libraries
9.3.2 Using CuPy: A GPU-based version of NumPy
9.3.3 A basic interaction with CuPy
9.3.4 Writing a Mandelbrot generator using Numba
9.3.5 Writing a Mandelbrot generator using CUDA C
9.3.6 Profiling tools for GPU code
Chapter 10: Analyzing big data with Dask
10.1 Understanding Dask’s execution model
10.1.1 A pandas baseline for comparison
10.1.2 Developing a Dask-based data frame solution
10.2 The computational cost of Dask operations
10.2.1 Partitioning data for processing
10.2.2 Persisting intermediate computations
10.2.3 Algorithm implementations over distributed data frames
10.2.4 Repartitioning the data
10.2.5 Persisting distributed data frames
10.3 Using Dask’s distributed scheduler
10.3.1 The dask.distributed architecture
10.3.2 Running code using dask.distributed
10.3.3 Dealing with datasets larger than memory
appendix A: Setting up the environment
A.1 Setting up Anaconda Python
A.2 Installing your own Python distribution
A.3 Using Docker
A.4 Hardware considerations
appendix B: Using Numba to generate efficient low-level code
B.1 Generating optimized code with Numba
B.2 Writing explicitly parallel functions in Numba
B.3 Writing NumPy-aware code in Numba
index