Data Engineering and Data Science: Concepts and Applications

The field of data science is broad, encompassing everything from cleaning data to deploying predictive models. It is rare, however, for a single data scientist to work across that whole spectrum day to day. Data scientists usually focus on a few areas and are complemented by a team of other scientists and analysts. Data engineering is also a broad field, and an individual data engineer likewise does not need the full range of skills. Data engineering is the aspect of data science that focuses on practical applications of data collection and analysis: for all the work data scientists do to answer questions using large sets of information, there have to be mechanisms for collecting and validating that information.

The contributors use the R programming language, along with some Python libraries, to perform exploratory data analysis on the datasets studied. They explore a range of the packages and libraries available in R and Python, and carry out data pre-processing with Python libraries. In this volume, the team of editors and contributors sketch the broad outlines of data engineering, then walk through more specific descriptions that illustrate particular data engineering roles.

Data-driven discovery is revolutionizing the modeling, prediction, and control of complex systems. This book brings together machine learning, engineering mathematics, and mathematical physics to integrate the modeling and control of dynamical systems with modern methods in data science. It highlights many of the recent advances in scientific computing that enable data-driven methods to be applied to a diverse range of complex systems, such as turbulence, the brain, climate, epidemiology, finance, robotics, and autonomy. It is intended for veteran engineers and scientists working in the field or laboratory as well as for students and academics.

Author(s): Kukatlapalli Pradeep Kumar, Aynur Unal, Vinay Jha Pillai, Hari Murthy, M. Niranjanamurthy
Series: ADVANCES IN DATA ENGINEERING
Publisher: Wiley-Scrivener
Year: 2023

Language: English
Pages: 467

Cover
Title Page
Copyright Page
Contents
Preface
Chapter 1 Quality Assurance in Data Science: Need, Challenges and Focus
1.1 Introduction
1.1.1 Quality Assurance and Testing
1.1.2 Data Science and Quality Assurance
1.1.3 Background
1.2 Testing and Quality Assurance
1.2.1 Key Terminologies Associated With Testing
1.3 Product Quality and Test Efforts
1.3.1 Testing Metrics
1.3.2 How to Improve the Business Value to Products Using Test Automation
1.3.3 Data Analysis and Management in Test Automation
1.3.4 Data Models in Data Science
1.4 Data Masking in Data Model and Associated Risks
1.5 Prediction in Data Science
Case Study
1.6 Role of Metrics in Evaluation
1.7 Quantity of Data in Quality Assurance
1.8 Identifying the Right Data Sources
1.8.1 Need to Gather Up-to-Date Data
1.8.2 Synthesising Existing Advanced Technologies for Continuous Business Improvements
1.9 Conclusion
References
Chapter 2 Design and Implementation of Social Media Mining – Knowledge Discovery Methods for Effective Digital Marketing Strategies
2.1 Introduction
2.1.1 Objectives of the Study
2.2 Literature Review
2.3 Novel Framework for Social Media Data Mining and Knowledge Discovery
2.4 Classification for Comparison Analysis
2.5 Clustering Methodology to Provide Digital Marketing Strategies
2.5.1 Status (Text Form)
2.5.2 Images (Photos)
2.5.3 Video Post
2.5.4 Link Post
2.6 Experimental Results
2.7 Conclusion
References
Chapter 3 A Study on Big Data Engineering Using Cloud Data Warehouse
3.1 Introduction
3.2 Comparison Study of Different Cloud Data Warehouses
3.2.1 Amazon Redshift
3.2.2 High-Level Architecture of Amazon Redshift
3.2.3 Features of Amazon Redshift Cloud Data Warehouse
3.2.4 Pricing of Amazon Redshift Cloud Data Warehouse
3.3 Snowflake Cloud Data Warehouse
3.3.1 High-Level Architecture of Snowflake Cloud Data Warehouse
3.3.2 Features of Snowflake Cloud Data Warehouse
3.3.3 Snowflake Cloud Data Warehouse Pricing
3.4 Google BigQuery Cloud Data Warehouse
3.4.1 High-Level Architecture of Google BigQuery Cloud Data Warehouse
3.4.2 Features of Google BigQuery Cloud Data Warehouse
3.4.3 Google BigQuery Cloud Data Warehouse Pricing
3.5 Microsoft Azure Synapse Cloud Data Warehouse
3.5.1 Microsoft Azure Synapse Cloud Data Warehouse Architecture
3.5.2 Features of Microsoft Azure Synapse Cloud Data Warehouse
3.5.3 Pricing of Microsoft Azure Synapse Cloud Data Warehouse
3.6 Informatica Intelligent Cloud Services (IICS)
3.6.1 Informatica Intelligent Cloud Services Architecture
3.6.2 Salient Features of Informatica Intelligent Cloud Services
3.6.3 Informatica Intelligent Cloud Services Pricing Model
3.7 Conclusion
Acknowledgements
References
Chapter 4 Data Mining with Cluster Analysis Through Partitioning Approach of Huge Transaction Data
4.1 Introduction
4.2 Methodology Used in Proposed Cluster Analysis System
4.2.1 Design of Algorithms
4.3 Literature Survey on Existing Systems
4.3.1 Experimental Results
4.4 Conclusion
References
Chapter 5 Application of Data Science in Macromodeling of Nonlinear Dynamical Systems
5.1 Introduction
5.2 Nonlinear Autonomous Dynamical System
5.3 Nonlinear System - MOR
5.3.1 Proper Orthogonal Decomposition
5.4 Data Science Life Cycle
5.4.1 Problem Identification
5.4.2 Identifying Available Data Sources and Data Collection
5.4.3 Data Processing
5.4.4 Data Exploration
5.4.5 Feature Extraction
5.4.6 Modeling
5.4.7 Model Performance Evaluation
5.5 Artificial Neural Network in Modeling
5.5.1 Machine Learning
5.5.2 Biological Neuron Model
5.5.3 Artificial Neural Networks
5.5.4 Network Topologies
5.5.4.1 NARX Neural Network
5.5.5 ANN Modeling Using Mathematical Models
5.6 Neuron Spiking Model Using FitzHugh-Nagumo (F-N) System
5.6.1 Linearization of F-N System
5.6.2 Reduced Order Model of Linear System
5.6.3 Finite Difference Discretization of F-N System
5.6.4 MOR of F-N System Using POD-Galerkin Method
5.7 Ring Oscillator Model
5.7.1 Model Order Reduction of Ring Oscillator Circuit
5.7.2 Ring Oscillator Circuit Approximation Using Linear System MOR
5.7.3 POD-ANN Macromodel of Ring Oscillator Circuit
5.8 Nonlinear VLSI Interconnect Model Using Telegraph Equation
5.8.1 Macromodeling of VLSI Interconnect
5.8.2 Discretisation of Interconnect Model
5.8.3 Linearization of VLSI Interconnect Model
5.8.4 Reduced Order Linear Model of VLSI Interconnect
5.9 Macromodel Using Machine Learning
5.9.1 Activation Function
5.9.2 Bayesian Regularization
5.9.3 Optimization
5.10 MOR of Dynamical Systems Using POD-ANN
5.10.1 Accuracy and Performance Index
5.11 Numerical Results
5.11.1 F-N System
5.11.2 Ring Oscillator Model
5.11.3 Reduced Order POD Approximation of Ring Oscillator
5.11.3.1 Study of POD-ANN Approximation of Ring Oscillator for Variation in Amplitude of Input Signal and for Different Input Signals
5.11.3.2 POD-ANN Approximation of Ring Oscillator for Variation in Frequency
5.11.4 POD-ANN Approximation of VLSI Interconnect
5.12 Conclusion
References
Chapter 6 Comparative Analysis of Various Ensemble Approaches for Web Page Classification
6.1 Introduction
6.2 Literature Survey
6.3 Material and Methods
6.4 Ensemble Classifiers
6.4.1 Bagging
6.4.1.1 Bagging Meta Estimator
6.4.1.2 Random Forest
6.4.2 Boosting
6.4.2.1 AdaBoost
6.4.2.2 Gradient Tree Boosting
6.4.2.3 XGBoost
6.4.3 Stacking
6.5 Results
6.5.1 Bagging Meta Estimator
6.5.2 Random Forest
6.5.3 AdaBoost
6.5.4 Gradient Tree Boosting
6.5.5 XGBoost
6.5.6 Stacking
6.5.7 Comparison with Single Classifiers
6.6 Conclusion
Acknowledgement
References
Chapter 7 Feature Engineering and Selection Approach Over Malicious Image
7.1 Introduction
7.2 Feature Engineering Techniques
7.2.1 Methodologies in Feature Engineering
7.2.2 Strides in Feature Engineering
7.2.3 Feature Extraction
7.2.4 Feature Selection
7.2.5 Feature Engineering in Image Processing
7.2.6 Importance of Feature Engineering in Image Processing
7.3 Malicious Feature Engineering
7.4 Image Processing Technique
7.4.1 Steps Involved in Image Processing Technique
7.4.2 Image Processing Task
7.4.2.1 Image Enhancement
7.4.2.2 Image Restoration
7.4.2.3 Coloring Image Processing
7.4.2.4 Wavelets Processing and Multiple Solutions
7.4.2.5 Image Compression
7.4.2.6 Character Recognition
7.4.2.7 Characteristics of Image Processing
7.5 Image Processing Techniques for Analysis on Malicious Images
7.6 Conclusion
References
Chapter 8 Cubic-Regression and Likelihood Based Boosting GAM to Model Drug Sensitivity for Glioblastoma
8.1 Introduction
8.1.1 Glioblastoma
8.2 Literature Survey
8.3 Materials and Methods
8.3.1 Methodology
8.3.1.1 Generalized Additive Models (GAMs)
8.3.1.2 Model-Based Boosting – Boosted GAM
8.3.2 Datasets Description
8.4 Evaluations, Results and Discussions
8.4.1 Akaike Information Criterion (AIC)
8.4.2 Adjusted R-Squared
8.4.3 Discussion
Conclusion
References
Chapter 9 Unobtrusive Engagement Detection through Semantic Pose Estimation and Lightweight ResNet for an Online Class Environment
9.1 Introduction
9.2 Related Work
9.2.1 Analysis for a Classroom Environment
9.2.2 Pose Estimation
9.2.3 Face Alignment and Landmark Estimation
9.2.4 Deep Networks for Emotional Analysis
9.3 Proposed Methodology
9.3.1 Data Description
9.3.2 Facial Detection and Recognition
9.3.2.1 Face Detection
9.3.2.2 Facial Landmark Detection
9.3.3 Emotion Quantification
9.3.4 Pose Estimation
9.3.4.1 Facial Pose Estimation
9.4 Experimentation
9.5 Results and Discussions
Conclusion
References
Chapter 10 Building Rule Base for Decision Making – A Fuzzy-Rough Approach
10.1 Introduction
10.2 Literature Review
10.3 Discretization of the Dataset Using Fuzzy Set Theory
10.4 Description of the Dataset
10.5 Process Involved in Proposed Work
10.6 Experiment
10.7 Evaluation Result
10.8 Discussion
Conclusion
References
Chapter 11 An Effective Machine Learning Approach to Model Healthcare Data
11.1 Introduction
11.2 Types of Data in Healthcare
11.3 Big Data in Healthcare
11.4 Different V’s of Big Data
11.5 About COPD
11.6 Methodology Implemented
Conclusion
References
Chapter 12 Recommendation Engine for Retail Domain Using Machine Learning Techniques
12.1 Introduction
12.2 Proposed System
12.2.1 Classification of Suppliers
12.2.2 Recommendation for Buyer
12.2.3 Forecasting Using ARIMA Model
12.3 Results
12.3.1 ARIMA Forecasting
12.4 Conclusion
References
Chapter 13 Mining Heterogeneous Lung Cancer from Computer Tomography (CT) Scan with the Confusion Matrix
13.1 Introduction
13.2 Literature Review
13.3 Methodology
13.3.1 Description of the Data
13.3.2 Image Preprocessing
13.3.3 Image Segmentation
13.3.4 Image Processing
13.3.5 Zero Component Analysis (ZCA) Whitening
13.3.6 Local Binary Pattern (LBP Feature)
13.3.7 LESH Vector
13.3.8 Local Energy Map and Orientation Map
13.3.9 Training with Deep Learning Methods
13.4 Result
13.4.1 Lorenz Curve
13.4.2 Confusion Matrix
13.4.3 Gini Coefficient
13.5 Conclusion and Future Scope
References
Chapter 14 ML Algorithms and Their Approach on COVID-19 Data Analysis
14.1 Introduction
14.2 DataSet
14.2.1 Labeled Datasets
14.2.2 Unlabelled Datasets
14.2.3 COVID-19 Data
14.3 Types of Machine Learning Algorithms
14.3.1 Supervised Learning
14.3.2 Unsupervised Learning
14.3.3 Semi-Supervised Learning
14.3.4 Reinforcement Learning
14.4 Conclusion
References
Chapter 15 Analysis and Design for the Early Stage Detection of Lung Diseases Using Machine Learning Algorithms
15.1 Introduction
15.2 Machine Learning Algorithms
15.2.1 Linear Regression
15.2.2 Logistic Regression
15.2.3 Decision Tree
15.2.4 Random Forest
15.2.5 Naïve Bayes
15.2.6 Support Vector Machine (SVM)
15.3 Evaluation Metrics and Comparative Results for Early Detection of Lung Diseases
15.3.1 Accuracy (A)
15.3.2 Precision (P) and Recall (R)
15.3.3 Mean Squared Error (MSE)
15.3.4 Matthews Correlation Coefficient (MCC)
15.4 Conclusion
References
Chapter 16 Estimation of Cancer Risk through Artificial Neural Network
16.1 Introduction
16.2 Case Studies Related to Cancer Risk Estimation Using ANN
16.2.1 ANN Technique for Early LC Detection
16.2.2 ANNs in Image Processing for Early Diagnosis of BC
16.3 Datasets Used in Cancer Risk Estimation
16.3.1 Datasets Related to Breast Cancer
16.3.2 Dataset Related to Lung Cancer
16.3.3 BC Coimbra Data Set and BC Wisconsin (Diagnostic) Data
16.3.4 Comparison of ANN Techniques with Other Methods for Cancer Risk Estimation
16.4 Discussion
16.5 Future Scope
16.6 Conclusion
References
Chapter 17 Applications and Advancements in Data Science and Analytics
17.1 Data Science and Analytics in Software Testing
17.2 Applications of Data Science and Analytics
17.3 Selenium Testing Tool in Data Science
17.3.1 Basic Techniques for Testing Voice-Based Applications
17.3.2 Image Web Scraping Using Selenium
17.4 Challenges and Advancements in Data Science
17.5 Data Science and Analytics Tools
17.6 Conclusion
References
About the Editors
Index
EULA