Data Wrangling: Concepts, Applications and Tools

DATA WRANGLING

Written and edited by some of the world’s top experts in the field, this exciting new volume provides state-of-the-art research and the latest technological breakthroughs in data wrangling, covering its theoretical concepts, practical applications, and tools for solving everyday problems.

Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. This process typically includes manually converting and mapping data from one raw form into another format to allow for more convenient consumption and organization of the data. Data wrangling is increasingly ubiquitous at today’s top firms.

Data cleaning focuses on removing inaccurate data from a data set, whereas data wrangling focuses on transforming the data’s format, typically by converting “raw” data into another format more suitable for use. Data wrangling is a necessary component of any business. Data wrangling solutions, including applications such as Datameer, Infogix, Paxata, Talend, Tamr, TMMData, and Trifacta, are specifically designed and architected to handle diverse, complex data at any scale.

This book synthesizes the processes of data wrangling into a comprehensive overview, with a strong focus on recent and rapidly evolving agile analytic processes in data-driven enterprises, which businesses and other organizations can use to solve everyday problems in practical applications. Whether for the veteran engineer, scientist, or other industry professional, this book is a must-have for any library.

Author(s): M. Niranjanamurthy, Kavita Sheoran, Geetika Dhand, Prabhjot Kaur
Publisher: Wiley-Scrivener
Year: 2023

Language: English
Pages: 355
City: Beverly

Cover
Title Page
Copyright Page
Contents
Chapter 1 Basic Principles of Data Wrangling
1.1 Introduction
1.2 Data Workflow Structure
1.3 Raw Data Stage
1.3.1 Data Input
1.3.2 Output Actions at Raw Data Stage
1.3.3 Structure
1.3.4 Granularity
1.3.5 Accuracy
1.3.6 Temporality
1.3.7 Scope
1.4 Refined Stage
1.4.1 Data Design and Preparation
1.4.2 Structure Issues
1.4.3 Granularity Issues
1.4.4 Accuracy Issues
1.4.5 Scope Issues
1.4.6 Output Actions at Refined Stage
1.5 Produced Stage
1.5.1 Data Optimization
1.5.2 Output Actions at Produced Stage
1.6 Steps of Data Wrangling
1.7 Do’s for Data Wrangling
1.8 Tools for Data Wrangling
References
Chapter 2 Skills and Responsibilities of Data Wrangler
2.1 Introduction
2.2 Role as an Administrator (Data and Database)
2.3 Skills Required
2.3.1 Technical Skills
2.3.1.1 Python
2.3.1.2 R Programming Language
2.3.1.3 SQL
2.3.1.4 MATLAB
2.3.1.5 Scala
2.3.1.6 Excel
2.3.1.7 Tableau
2.3.1.8 Power BI
2.3.2 Soft Skills
2.3.2.1 Presentation Skills
2.3.2.2 Storytelling
2.3.2.3 Business Insights
2.3.2.4 Writing/Publishing Skills
2.3.2.5 Listening
2.3.2.6 Stop and Think
2.3.2.7 Soft Issues
2.4 Responsibilities as Database Administrator
2.4.1 Software Installation and Maintenance
2.4.2 Data Extraction, Transformation, and Loading
2.4.3 Data Handling
2.4.4 Data Security
2.4.5 Data Authentication
2.4.6 Data Backup and Recovery
2.4.7 Security and Performance Monitoring
2.4.8 Effective Use of Human Resource
2.4.9 Capacity Planning
2.4.10 Troubleshooting
2.4.11 Database Tuning
2.5 Concerns for a DBA
2.6 Data Mishandling and Its Consequences
2.6.1 Phases of Data Breaching
2.6.2 Data Breach Laws
2.6.3 Best Practices for Enterprises
2.7 The Long-Term Consequences: Loss of Trust and Diminished Reputation
2.8 Solution to the Problem
2.9 Case Studies
2.9.1 Uber Case Study
2.9.1.1 Role of Analytics and Business Intelligence in Optimization
2.9.1.2 Mapping Applications for City Ops Teams
2.9.1.3 Marketplace Forecasting
2.9.1.4 Learnings from Data
2.9.2 PepsiCo Case Study
2.9.2.1 Searching for a Single Source of Truth
2.9.2.2 Finding the Right Solution for Better Data
2.9.2.3 Enabling Powerful Results with Self-Service Analytics
2.10 Conclusion
References
Chapter 3 Data Wrangling Dynamics
3.1 Introduction
3.2 Related Work
3.3 Challenges: Data Wrangling
3.4 Data Wrangling Architecture
3.4.1 Data Sources
3.4.2 Auxiliary Data
3.4.3 Data Extraction
3.4.4 Data Wrangling
3.4.4.1 Data Accessing
3.4.4.2 Data Structuring
3.4.4.3 Data Cleaning
3.4.4.4 Data Enriching
3.4.4.5 Data Validation
3.4.4.6 Data Publication
3.5 Data Wrangling Tools
3.5.1 Excel
3.5.2 Altair Monarch
3.5.3 Anzo
3.5.4 Tabula
3.5.5 Trifacta
3.5.6 Datameer
3.5.7 Paxata
3.5.8 Talend
3.6 Data Wrangling Application Areas
3.7 Future Directions and Conclusion
References
Chapter 4 Essentials of Data Wrangling
4.1 Introduction
4.2 Holistic Workflow Framework for Data Projects
4.2.1 Raw Stage
4.2.2 Refined Stage
4.2.3 Production Stage
4.3 The Actions in Holistic Workflow Framework
4.3.1 Raw Data Stage Actions
4.3.1.1 Data Ingestion
4.3.1.2 Creating Metadata
4.3.2 Refined Data Stage Actions
4.3.3 Production Data Stage Actions
4.4 Transformation Tasks Involved in Data Wrangling
4.4.1 Structuring
4.4.2 Enriching
4.4.3 Cleansing
4.5 Description of Two Types of Core Profiling
4.5.1 Individual Values Profiling
4.5.1.1 Syntactic
4.5.1.2 Semantic
4.5.2 Set-Based Profiling
4.6 Case Study
4.6.1 Importing Required Libraries
4.6.2 Changing the Order of the Columns in the Dataset
4.6.3 To Display the DataFrame (Top 10 Rows) and Verify that the Columns Are in Order
4.6.4 To Display the DataFrame (Bottom 10 Rows) and Verify that the Columns Are in Order
4.6.5 Generate the Statistical Summary of the DataFrame for All the Columns
4.7 Quantitative Analysis
4.7.1 Maximum Number of Fires on Any Given Day
4.7.2 Total Number of Fires for the Entire Duration for Every State
4.7.3 Summary Statistics
4.8 Graphical Representation
4.8.1 Line Graph
4.8.2 Pie Chart
4.8.3 Bar Graph
4.9 Conclusion
References
Chapter 5 Data Leakage and Data Wrangling in Machine Learning for Medical Treatment
5.1 Introduction
5.2 Data Wrangling and Data Leakage
5.3 Data Wrangling Stages
5.3.1 Discovery
5.3.2 Structuring
5.3.3 Cleaning
5.3.4 Improving
5.3.5 Validating
5.3.6 Publishing
5.4 Significance of Data Wrangling
5.5 Data Wrangling Examples
5.6 Data Wrangling Tools for Python
5.7 Data Wrangling Tools and Methods
5.8 Use of Data Preprocessing
5.9 Use of Data Wrangling
5.10 Data Wrangling in Machine Learning
5.11 Enhancement of Express Analytics Using Data Wrangling Process
5.12 Conclusion
References
Chapter 6 Importance of Data Wrangling in Industry 4.0
6.1 Introduction
6.1.1 Data Wrangling Entails
6.2 Steps in Data Wrangling
6.2.1 Obstacles Surrounding Data Wrangling
6.3 Data Wrangling Goals
6.4 Tools and Techniques of Data Wrangling
6.4.1 Basic Data Munging Tools
6.4.2 Data Wrangling in Python
6.4.3 Data Wrangling in R
6.5 Ways for Effective Data Wrangling
6.5.1 Ways to Enhance Data Wrangling Pace
6.6 Future Directions
References
Chapter 7 Managing Data Structure in R
7.1 Introduction to Data Structure
7.2 Homogeneous Data Structures
7.2.1 Vector
7.2.2 Factor
7.2.3 Matrix
7.2.4 Array
7.3 Heterogeneous Data Structures
7.3.1 List
7.3.2 Dataframe
References
Chapter 8 Dimension Reduction Techniques in Distributional Semantics: An Application Specific Review
8.1 Introduction
8.2 Application Based Literature Review
8.3 Dimensionality Reduction Techniques
8.3.1 Principal Component Analysis
8.3.2 Linear Discriminant Analysis
8.3.2.1 Two-Class LDA
8.3.2.2 Three-Class LDA
8.3.3 Kernel Principal Component Analysis
8.3.4 Locally Linear Embedding
8.3.5 Independent Component Analysis
8.3.6 Isometric Mapping (Isomap)
8.3.7 Self-Organising Maps
8.3.8 Singular Value Decomposition
8.3.9 Factor Analysis
8.3.10 Auto-Encoders
8.4 Experimental Analysis
8.4.1 Datasets Used
8.4.2 Techniques Used
8.4.3 Classifiers Used
8.4.4 Observations
8.4.5 Results Analysis Red-Wine Quality Dataset
8.5 Conclusion
References
Chapter 9 Big Data Analytics in Real Time for Enterprise Applications to Produce Useful Intelligence
9.1 Introduction
9.2 The Internet of Things and Big Data Correlation
9.3 Design, Structure, and Techniques for Big Data Technology
9.4 Aspiration for Meaningful Analyses and Big Data Visualization Tools
9.4.1 From Information to Guidance
9.4.2 The Transition from Information Management to Valuation Offerings
9.5 Big Data Applications in the Commercial Surroundings
9.5.1 IoT and Data Science Applications in the Production Industry
9.5.1.1 Devices That Are Interlinked
9.5.1.2 Data Transformation
9.5.2 Predictive Analysis for Corporate Enterprise Applications in the Industrial Sector
9.6 Big Data Insights’ Constraints
9.6.1 Technological Developments
9.6.2 Representation of Data
9.6.3 Data That Is Fragmented and Imprecise
9.6.4 Extensibility
9.6.5 Implementation in Real Time Scenarios
9.7 Conclusion
References
Chapter 10 Generative Adversarial Networks: A Comprehensive Review
List of Abbreviations
10.1 Introduction
10.2 Background
10.2.1 Supervised vs Unsupervised Learning
10.2.2 Generative Modeling vs Discriminative Modeling
10.3 Anatomy of a GAN
10.4 Types of GANs
10.4.1 Conditional GAN (CGAN)
10.4.2 Deep Convolutional GAN (DCGAN)
10.4.3 Wasserstein GAN (WGAN)
10.4.4 Stack GAN
10.4.5 Least Squares GAN (LSGAN)
10.4.6 Information Maximizing GAN (InfoGAN)
10.5 Shortcomings of GANs
10.6 Areas of Application
10.6.1 Image
10.6.2 Video
10.6.3 Artwork
10.6.4 Music
10.6.5 Medicine
10.6.6 Security
10.7 Conclusion
References
Chapter 11 Analysis of Machine Learning Frameworks Used in Image Processing: A Review
11.1 Introduction
11.2 Types of ML Algorithms
11.2.1 Supervised Learning
11.2.2 Unsupervised Learning
11.2.3 Reinforcement Learning
11.3 Applications of Machine Learning Techniques
11.3.1 Personal Assistants
11.3.2 Predictions
11.3.3 Social Media
11.3.4 Fraud Detection
11.3.5 Google Translator
11.3.6 Product Recommendations
11.3.7 Video Surveillance
11.4 Solution to a Problem Using ML
11.4.1 Classification Algorithms
11.4.2 Anomaly Detection Algorithm
11.4.3 Regression Algorithm
11.4.4 Clustering Algorithms
11.4.5 Reinforcement Algorithms
11.5 ML in Image Processing
11.5.1 Frameworks and Libraries Used for ML Image Processing
11.6 Conclusion
References
Chapter 12 Use and Application of Artificial Intelligence in Accounting and Finance: Benefits and Challenges
12.1 Introduction
12.1.1 Artificial Intelligence in Accounting and Finance Sector
12.2 Uses of AI in Accounting & Finance Sector
12.2.1 Pay and Receive Processing
12.2.2 Supplier Onboarding and Procurement
12.2.3 Audits
12.2.4 Monthly, Quarterly Cash Flows, and Expense Management
12.2.5 AI Chatbots
12.3 Applications of AI in Accounting and Finance Sector
12.3.1 AI in Personal Finance
12.3.2 AI in Consumer Finance
12.3.3 AI in Corporate Finance
12.4 Benefits and Advantages of AI in Accounting and Finance
12.4.1 Changing the Human Mindset
12.4.2 Machines Imitate the Human Brain
12.4.3 Fighting Misrepresentation
12.4.4 AI Machines Make Accounting Tasks Easier
12.4.5 Invisible Accounting
12.4.6 Build Trust through Better Financial Protection and Control
12.4.7 Active Insights Help Drive Better Decisions
12.4.8 Fraud Protection, Auditing, and Compliance
12.4.9 Machines as Financial Guardians
12.4.10 Intelligent Investments
12.4.11 Consider the “Runaway Effect”
12.4.12 Artificial Control and Effective Fiduciaries
12.4.13 Accounting Automation Avenues and Investment Management
12.5 Challenges of AI Application in Accounting and Finance
12.5.1 Data Quality and Management
12.5.2 Cyber and Data Privacy
12.5.3 Legal Risks, Liability, and Culture Transformation
12.5.4 Practical Challenges
12.5.5 Limits of Machine Learning and AI
12.5.6 Roles and Skills
12.5.7 Institutional Issues
12.6 Suggestions and Recommendation
12.7 Conclusion and Future Scope of the Study
References
Chapter 13 Obstacle Avoidance Simulation and Real-Time Lane Detection for AI-Based Self-Driving Car
13.1 Introduction
13.1.1 Environment Overview
13.1.1.1 Simulation Overview
13.1.1.2 Agent Overview
13.1.1.3 Brain Overview
13.1.2 Algorithm Used
13.1.2.1 Markov Decision Process (MDP)
13.1.2.2 Adding a Living Penalty
13.1.2.3 Implementing a Neural Network
13.2 Simulations and Results
13.2.1 Self-Driving Car Simulation
13.2.2 Real-Time Lane Detection and Obstacle Avoidance
13.2.3 About the Model
13.2.4 Preprocessing the Image/Frame
13.3 Conclusion
References
Chapter 14 Impact of Suppliers Network on SCM of Indian Auto Industry: A Case of Maruti Suzuki India Limited
14.1 Introduction
14.2 Literature Review
14.2.1 Prior Pandemic Automobile Industry/COVID-19 Thump on the Automobile Sector
14.2.2 Maruti Suzuki India Limited (MSIL) During COVID-19 and Other Players in the Automobile Industry and How MSIL Prevailed
14.3 Methodology
14.4 Findings
14.4.1 Worldwide Economic Impact of the Epidemic
14.4.2 Effect on Global Automobile Industry
14.4.3 Effect on Indian Automobile Industry
14.4.4 Automobile Industry Scenario That Can Be Expected Post COVID-19 Recovery
14.5 Discussion
14.5.1 Competitive Dimensions
14.5.2 MSIL Strategies
14.5.3 MSIL Operations and Supply Chain Management
14.5.4 MSIL Suppliers Network
14.5.5 MSIL Manufacturing
14.5.6 MSIL Distributors Network
14.5.7 MSIL Logistics Management
14.6 Conclusion
References
About the Editors
Index
EULA