Hacks, Leaks, and Revelations: The Art of Analyzing Hacked and Leaked Data

Unlock the internet’s treasure trove of public interest data with Hacks, Leaks, and Revelations by Micah Lee, an investigative reporter and security engineer. This hands-on guide blends real-world techniques for researching large datasets with lessons on coding, data authentication, and digital security. All of this is spiced up with gripping stories from the front lines of investigative journalism. Dive into exposed datasets from a wide array of sources: the FBI, the DHS, police intelligence agencies, extremist groups like the Oath Keepers, and even a Russian ransomware gang. Lee’s own in-depth case studies on disinformation-peddling pandemic profiteers and neo-Nazi chatrooms serve as blueprints for your research.

Gain practical skills in searching massive troves of data for keywords like “antifa” and pinpointing documents with newsworthy revelations. Get a crash course in Python to automate the analysis of millions of files.

Using Python or other programming languages, you can give your computer precise instructions for performing tasks that existing tools or shell scripts don’t allow. For example, you could write a Python script that scours a million pieces of video metadata to determine where the videos were filmed. In my experience, Python is also simpler, easier to understand, and less error-prone than shell scripts. This chapter provides a crash course on the fundamentals of Python programming. You’ll learn to write and execute Python scripts and use the interactive Python interpreter. You’ll also use Python to do math, define variables, work with strings and Boolean logic, loop through lists of items, and use functions. Future chapters rely on your understanding of these basic skills.

You will also learn how to:
Master encrypted messaging to safely communicate with whistleblowers.
Secure datasets over encrypted channels using Signal, Tor Browser, OnionShare, and SecureDrop.
Harvest data from the BlueLeaks collection of internal memos, financial records, and more from over 200 state, local, and federal agencies.
Probe leaked email archives about offshore detention centers and the Heritage Foundation.
Analyze metadata from videos of the January 6 attack on the US Capitol, sourced from the Parler social network.
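
The video-metadata example mentioned in the description can be sketched roughly as follows. This is a minimal illustration and not code from the book: it assumes each video's metadata is stored as its own JSON file, and the function name `find_geotagged` and the key names `latitude` and `longitude` are hypothetical placeholders that would need to match the real dataset's fields.

```python
import json
from pathlib import Path


def find_geotagged(metadata_dir, lat_key="latitude", lon_key="longitude"):
    """Return (filename, lat, lon) for every JSON file that has numeric coordinates.

    The key names are assumptions for illustration; real metadata may use
    different field names or coordinate formats.
    """
    results = []
    # Walk the folder recursively so nested subfolders are included
    for path in sorted(Path(metadata_dir).rglob("*.json")):
        try:
            with open(path, encoding="utf-8") as f:
                record = json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Skip files that are not valid JSON instead of aborting the scan
            continue
        if not isinstance(record, dict):
            continue
        lat = record.get(lat_key)
        lon = record.get(lon_key)
        # Keep only records where both coordinates are present and numeric
        if isinstance(lat, (int, float)) and isinstance(lon, (int, float)):
            results.append((path.name, float(lat), float(lon)))
    return results
```

Calling `find_geotagged("path/to/metadata")` returns a list of tuples that could then be filtered by date or distance. Skipping unreadable files rather than raising is a design choice that suits very large, messy datasets, where a single malformed file should not stop the whole run.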

Author(s): Micah Lee
Publisher: No Starch Press
Year: 2023

Language: English
Pages: 544

Cover
Praise for Hacks, Leaks, and Revelations
Title Page
Copyright
Dedication
About the Author and Technical Reviewer
Acknowledgments
Introduction
Why I Wrote This Book
What You’ll Learn
What You’ll Need
Part I: Sources and Datasets
1. Protecting Sources and Yourself
Safely Communicating with Sources
Working with Public Data
Protecting Sensitive Information
Minimizing the Digital Trail
Working with Hackers and Whistleblowers
Secure Storage for Datasets
Low-Sensitivity Datasets
Medium-Sensitivity Datasets
High-Sensitivity Datasets
Authenticating Datasets
The AFLDS Dataset
The WikiLeaks Twitter Group Chat
Redaction
What Data to Publish
What to Redact
Making Requests for Comment
Password Managers
Disk Encryption
Exercise 1-1: Encrypt Your Internal Disk
Windows
macOS
Linux
Exercise 1-2: Encrypt a USB Disk
Windows
macOS
Linux
Protecting Yourself from Malicious Documents
Exercise 1-3: Install and Use Dangerzone
Summary
2. Acquiring Datasets
The End of WikiLeaks
Distributed Denial of Secrets
Downloading Datasets with BitTorrent
The Origins of BlueLeaks
Exercise 2-1: Download the BlueLeaks Dataset
Communicating with Encrypted Messaging Apps
Exercise 2-2: Install and Practice Using Signal
Encrypting Messages with PGP
Staying Anonymous Online with Tor and OnionShare
Exercise 2-3: Play with Tor and OnionShare
Communicating with My Tea Party Patriots Source
Other Options for Acquiring Datasets from Sources
Encrypted USB Drives
Virtual Private Servers
Whistleblower Submission Systems
Summary
Part II: Tools of the Trade
3. The Command Line Interface
Introducing the Command Line
The Shell
Users and Paths
User Privileges
Exercise 3-1: Install Ubuntu in Windows
Basic Command Line Usage
Opening a Terminal
Clearing Your Screen and Exiting the Shell
Exploring Files and Directories
Navigating Relative and Absolute Paths
Changing Directories
Using the help Argument
Accessing Man Pages
Tips for Navigating the Terminal
Entering Commands with Tab Completion
Editing Commands
Dealing with Spaces in Filenames
Using Single Quotes Around Double Quotes
Installing and Uninstalling Software with Package Managers
Exercise 3-2: Manage Packages with Homebrew on macOS
Exercise 3-3: Manage Packages with apt on Windows or Linux
Exercise 3-4: Practice Using the Command Line with cURL
Download a Web Page with cURL
Save a Web Page to a File
Text Files vs. Binary Files
Exercise 3-5: Install the VS Code Text Editor
Exercise 3-6: Write Your First Shell Script
Navigate to Your USB Disk
Create an Exercises Folder
Open a VS Code Workspace
Write the Shell Script
Run the Shell Script
Exercise 3-7: Clone the Book’s GitHub Repository
Summary
4. Exploring Datasets in the Terminal
Introducing for Loops
Exercise 4-1: Unzip the BlueLeaks Dataset
Unzip Files on macOS or Linux
Unzip Files on Windows
Organize Your Files
How the Hacker Obtained the BlueLeaks Data
Exercise 4-2: Explore BlueLeaks on the Command Line
Calculate How Much Disk Space Folders Use
Use Pipes and Sort Output
Create an Inventory of Filenames in a Dataset
Count the Files in a Dataset
Exercise 4-3: Find Revelations in BlueLeaks with grep
Filter for Documents Mentioning Antifa
Filter for Certain Types of Files
Use grep with Regular Expressions
Search Files in Bulk with grep
Encrypted Data in the BlueLeaks Dataset
Data Analysis with Servers in the Cloud
Exercise 4-4: Set Up a VPS
Generate an SSH Key
Add Your Public Key to the Cloud Provider
Create a VPS
SSH into Your Server
Start a Byobu Session
Install Updates
Exercise 4-5: Explore the Oath Keepers Dataset Remotely
Summary
5. Docker, Aleph, and Making Datasets Searchable
Introducing Docker and Linux Containers
Exercise 5-1: Initialize Docker Desktop on Windows and macOS
Exercise 5-2: Initialize Docker Engine on Linux
Running Containers with Docker
Running an Ubuntu Container
Listing and Killing Containers
Mounting and Removing Volumes
Passing Environment Variables
Running Server Software
Freeing Up Disk Space
Exercise 5-3: Run a WordPress Site with Docker Compose
Make a docker-compose.yaml File
Start Your WordPress Site
Introducing Aleph
Exercise 5-4: Run Aleph Locally in Linux Containers
Using Aleph’s Web and Command Line Interfaces
Indexing Data in Aleph
Exercise 5-5: Index a BlueLeaks Folder in Aleph
Mount Your Datasets into the Aleph Shell
Index the icefishx Folder
Check Indexing Status
Explore BlueLeaks with Aleph
Additional Aleph Features
Dedicated Aleph Servers
Summary
6. Reading Other People’s Email
The Email Protocol and Message Structure
File Formats for Email Dumps
EML Files
MBOX Files
PST Outlook Data Files
Exercise 6-1: Download Email Dumps from Three Datasets
The Nauru Police Force Dataset
The Oath Keepers Dataset
The Heritage Foundation Dataset
Researching Email Dumps with Thunderbird
Exercise 6-2: Configure Thunderbird for Email Dumps
Reading Individual EML Files with Thunderbird
Exercise 6-3: Import the Nauru Police Force EML Email Dump
Searching Email in Thunderbird
Quick Filter Searches
The Search Messages Dialog
Exercise 6-4: Import the Oath Keepers MBOX Email Dump
Exercise 6-5: Import the Heritage Foundation PST Email Dump
Other Tools for Researching Email Dumps
Microsoft Outlook
Aleph
Summary
Part III: Python Programming
7. An Introduction to Python
Exercise 7-1: Install Python
Windows
Linux
macOS
Exercise 7-2: Write Your First Python Script
Python Basics
The Interactive Python Interpreter
Comments
Math with Python
Strings
Exercise 7-3: Write a Python Script with Variables, Math, and Strings
Lists and Loops
Defining and Printing Lists
Running for Loops
Control Flow
Comparison Operators
if Statements
Nested Code Blocks
Searching Lists
Logical Operators
Exception Handling
Exercise 7-4: Practice Loops and Control Flow
Functions
The def Keyword
Default Arguments
Return Values
Docstrings
Exercise 7-5: Practice Writing Functions
Summary
8. Working with Data in Python
Modules
Python Script Template
Exercise 8-1: Traverse the Files in BlueLeaks
List the Filenames in a Folder
Count the Files and Folders in a Folder
Traverse Folders with os.walk()
Exercise 8-2: Find the Largest Files in BlueLeaks
Third-Party Modules
Exercise 8-3: Practice Command Line Arguments with Click
Avoiding Hardcoding with Command Line Arguments
Exercise 8-4: Find the Largest Files in Any Dataset
Dictionaries
Defining Dictionaries
Getting and Setting Values
Navigating Dictionaries and Lists in the Conti Chat Logs
Exploring Dictionaries and Lists Full of Data in Python
Selecting Values in Dictionaries and Lists
Analyzing Data Stored in Dictionaries and Lists
Exercise 8-5: Map Out the CSVs in BlueLeaks
Accept a Command Line Argument
Loop Through the BlueLeaks Folders
Fill Up the Dictionary
Display the Output
Reading and Writing Files
Opening Files
Writing Lines to a File
Reading Lines from a File
Exercise 8-6: Practice Reading and Writing Files
Summary
Part IV: Structured Data
9. BlueLeaks, Black Lives Matter, and the CSV File Format
Installing Spreadsheet Software
Introducing the CSV File Format
Exploring CSV Files with Spreadsheet Software and Text Editors
My BlueLeaks Investigation
Focusing on a Fusion Center
Introducing NCRIC
Investigating a SAR
Reading and Writing CSV Files in Python
Exercise 9-1: Make BlueLeaks CSVs More Readable
Accept the CSV Path as an Argument
Loop Through the CSV Rows
Display CSV Fields on Separate Lines
How to Read Bulk Email from Fusion Centers
Lists of Black Lives Matter Demonstrations
“Intelligence” Memos from the FBI and DHS
A Brief HTML Primer
Exercise 9-2: Make Bulk Email Readable
Accept the Command Line Arguments
Create the Output Folder
Define the Filename for Each Row
Write the HTML Version of Each Bulk Email
Discovering the Names and URLs of BlueLeaks Sites
Exercise 9-3: Make a CSV of BlueLeaks Sites
Open a CSV for Writing
Find All the Company.csv Files
Add BlueLeaks Sites to the CSV
Summary
10. BlueLeaks Explorer
Undiscovered Revelations in BlueLeaks
Exercise 10-1: Install BlueLeaks Explorer
Create the Docker Compose Configuration File
Bring Up the Containers
Initialize the Databases
The Structure of NCRIC
Exploring Tables and Relationships
Searching for Keywords
Building Your Own BlueLeaks Structure
Defining the JRIC Structure
Showing Useful Fields
Changing Field Types
Adding JRIC’s Leads Table
Building a Relationship
Verifying BlueLeaks Data
Exercise 10-2: Finish Building the Structure for JRIC
The Technology Behind BlueLeaks Explorer
The Backend
The Frontend
Summary
11. Parler, the January 6 Insurrection, and the JSON File Format
The Origins of the Parler Dataset
How the Parler Videos Were Archived
The Dataset’s Impact on Trump’s Second Impeachment
Exercise 11-1: Download and Extract Parler Video Metadata
Download the Metadata
Uncompress and Download Individual Parler Videos
Extract Parler Metadata
The JSON File Format
Understanding JSON Syntax
Parsing JSON with Python
Handling Exceptions with JSON
Tools for Exploring JSON Data
Counting Videos with GPS Coordinates Using grep
Formatting and Searching Data with the jq Command
Exercise 11-2: Write a Script to Filter for Videos with GPS from January 6, 2021
Accept the Parler Metadata Path as an Argument
Loop Through Parler Metadata Files
Filter for Videos with GPS Coordinates
Filter for Videos from January 6, 2021
Working with GPS Coordinates
Searching by Latitude and Longitude
Converting Between GPS Coordinate Formats
Calculating GPS Distance in Python
Finding the Center of Washington, DC
Exercise 11-3: Update the Script to Filter for Insurrection Videos
Plotting GPS Coordinates on a Map with simplekml
Exercise 11-4: Create KML Files to Visualize Location Data
Create a KML File for All Videos with GPS Coordinates
Create KML Files for Videos from January 6, 2021
Visualizing Location Data with Google Earth
Viewing Metadata with ExifTool
Summary
12. Epik Fail, Extremism Research, and SQL Databases
The Structure of SQL Databases
Relational Databases
Clients and Servers
Tables, Columns, and Types
Exercise 12-1: Create and Test a MySQL Server Using Docker and Adminer
Run the Server
Connect to the Database with Adminer
Create a Test Database
Exercise 12-2: Query Your SQL Database
INSERT Statements
SELECT Statements
JOIN Clauses
UPDATE Statements
DELETE Statements
Introducing the MySQL Command Line Client
Exercise 12-3: Install and Test the Command Line MySQL Client
MySQL-Specific Queries
The History of Epik
The Epik Hack
Epik’s WHOIS Data
Exercise 12-4: Download and Extract Part of the Epik Dataset
Exercise 12-5: Import Epik Data into MySQL
Create a Database for api_system
Import api_system Data
Exploring Epik’s SQL Database
The domain Table
The privacy Table
The hosting and hosting_server Tables
Working with Epik Data in the Cloud
Summary
Part V: Case Studies
13. Pandemic Profiteers and Covid-19 Disinformation
The Origins of AFLDS
The Cadence Health and Ravkoo Datasets
Extracting the Data into an Encrypted File Container
Analyzing the Data with Command Line Tools
Creating a Single Spreadsheet of Patients
Calculating Revenue from Prescriptions Filled by Ravkoo
Finding the Price and Quantity of Drugs Sold
Categorizing Prescription Data by Drug
A Deeper Look at the Cadence Health Patient Data
Finding Cadence’s Partners
Searching for Patients by City
Searching for Patients by Age
Authenticating the Data
The Aftermath
HIPAA’s Breach Notification Rule
Congressional Investigation
Simone Gold’s New Business Venture
Scandal and Infighting at AFLDS
Summary
14. Neo-Nazis and Their Chatrooms
How Antifascists Infiltrated Neo-Nazi Discord Servers
Analyzing Leaked Chat Logs
Making JSON Files Readable
Exploring Objects, Keys, and Values with jq
Converting Timestamps
Finding Usernames
The Discord History Tracker
A Script to Search the JSON Files
My Discord Analysis Code
Designing the SQL Database
Importing Chat Logs into the SQL Database
Building the Web Interface
Using Discord Analysis to Find Revelations
The Pony Power Discord Server
The Launch of DiscordLeaks
The Aftermath
The Lawsuit Against Unite the Right
The Patriot Front Chat Logs
Summary
Afterword
A. Solutions to Common WSL Problems
Understanding WSL’s Linux Filesystem
The Disk Performance Problem
Solving the Disk Performance Problem
Storing Only Active Datasets in Linux
Storing Your Linux Filesystem on a USB Disk
Next Steps
B. Scraping the Web
Legal Considerations
HTTP Requests
Scraping Techniques
Loading Pages with HTTPX
Parsing HTML with Beautiful Soup
Automating Web Browsers with Selenium
Next Steps
Index