This book constitutes the refereed proceedings of the First International Conference on Speech and Language Technologies for Low-resource Languages, SPELLL 2022, held in Kalavakkam, India, in November 2022.
The 25 papers presented were thoroughly reviewed and selected from 70 submissions. The papers are organised in the following topical sections: language resources; language technologies; speech technologies; multimodal data analysis; fake news detection in low-resource languages (Regional-Fake); and low-resource cross-domain, cross-lingual and cross-modal offensive content analysis (LC4).
Author(s): Anand Kumar M, Bharathi Raja Chakravarthi, Bharathi B, Colm O’Riordan, Hema Murthy, Thenmozhi Durairaj, Thomas Mandl
Edition: 1
Publisher: Springer Cham
Year: 2023
Language: English
Pages: XIII, 356
Preface
Organization
Contents
Language Resources
KanSan: Kannada-Sanskrit Parallel Corpus Construction for Machine Translation
1 Introduction
2 Related Work
3 Kannada-Sanskrit Parallel Corpus Construction
4 Baseline Machine Translation Models
4.1 Preprocessing
4.2 Subword Tokenization
4.3 Statistical Machine Translation
4.4 Neural Machine Translation
4.5 Back-Translation
5 Experimental Setup and Results
6 Conclusion and Future Work
References
A Parsing Tool for Short Linguistic Constructions
1 Introduction
2 Related Work
3 Proposed IL TAG Parser
4 Customized TAG Grammar
5 Parsing Example for Short Linguistic Notation
6 Performance Analysis
7 Conclusion
References
TamilEmo: Fine-grained Emotion Detection Dataset for Tamil
1 Introduction
2 Related Work
2.1 Datasets for Emotion Detection
2.2 Emotion Classification
3 Tamil Emotion Dataset
3.1 Scraping Raw Data
3.2 Annotator Statistics
3.3 Inter-Annotator Agreement
3.4 Selecting and Curating YouTube Comments
4 Data Analysis
4.1 Keywords
5 Modeling
5.1 Data Preparation
5.2 Baseline Experiments
5.3 Experiment Settings
5.4 Results and Discussion
6 Conclusion
A Emotion Definitions
References
Context Sensitive Tamil Language Spellchecker Using RoBERTa
1 Introduction
2 Related Works
3 Model
3.1 Dictionary Creation
3.2 Test Dataset Creation
3.3 XLM-RoBERTa-base Model
4 Error Analysis
5 Experiments
5.1 Experiment: XLM-RoBERTa-base Model on Wikipedia Article
6 Comparison with Tamil Spellcheckers
7 Conclusion
References
Correlating Copula Constructions in Tamil and English for Machine Translation
1 Introduction
2 Zero Copula Construction
3 Copula Construction with 'aaku' as the Copula Verb
4 Copula Construction with 'iru' as the Copula Verb
5 Copula Construction with 'aaka-Iru' as the Copula Verb
6 Copula Construction as Embedded Sentences
7 Problematic Cases
8 Conclusion
References
Language Technologies
Tamil NLP Technologies: Challenges, State of the Art, Trends and Future Scope
1 Introduction
2 Risks and Challenges in the Technological Development of Tamil
3 Language Resources: Data, Knowledge Base, and Resources
3.1 Text Corpora
3.2 Corpora for Speech
3.3 Parallel Corpora
3.4 Lexical Resources
3.5 Grammars
4 Language Technology: Tools, Grammatical Technologies and Applications
4.1 Word Segmentation or Tokenization
4.2 Stemming and Lemmatization
4.3 Morphological Analysis
4.4 Morphological Generation
4.5 Part-of-speech Tagging
4.6 Chunking
4.7 Named Entity Recognition (NER)
4.8 Shallow Parsing
4.9 Syntactic Parsing
4.10 Identification of the Clause Boundary
4.11 Speech Recognition
4.12 Speech Synthesis
4.13 Optical Character Recognition
5 Semantic Analysis
5.1 Word Sense Disambiguation
5.2 Question Answering
5.3 Relationship Extraction
5.4 Paraphrase Identification
5.5 Automatic Text Summarization
5.6 Co-reference Resolution
5.7 Text Generation
5.8 Machine Translation
6 Social Media Text Analysis
6.1 Sentiment Analysis
6.2 Offensive Content Identification
7 Conclusion and Future Scope
References
Contextualized Embeddings from Transformers for Sentiment Analysis on Code-Mixed Hinglish Data: An Expanded Approach with Explainable Artificial Intelligence
1 Introduction
2 Motivation
3 Related Work
3.1 Pre-trained Language Models
3.2 Sentiment Analysis on Hinglish
3.3 LIME
4 Methodology
4.1 Dataset
4.2 Experiment
4.3 Evaluation Metrics
5 Results and Analysis
5.1 Empirical Results
5.2 Statistical Testing
5.3 Explainability
6 Discussion
7 Conclusion and Future Work
References
Transformer Based Hope Speech Comment Classification in Code-Mixed Text
1 Introduction
2 Related Work
3 Dataset Description
4 Methodology
4.1 Feature Extraction
4.2 Machine Learning Model
4.3 Deep Learning Models
4.4 Approaches
5 Results and Evaluation
5.1 Results
5.2 Evaluations
6 Conclusion
References
Paraphrase Detection in Indian Languages Using Deep Learning
1 Introduction
2 Literature Survey
3 Methodology
3.1 System Architecture
3.2 System Description
3.3 Dataset
4 Implementation
4.1 BERT
4.2 Seq2Seq
4.3 USE
4.4 Ensembled Model
5 Experimental Results
5.1 Prediction Results
5.2 Task 1
5.3 Task 2
5.4 Comparison of Algorithms
5.5 Performance Comparison
5.6 Error Analysis
6 Conclusion
References
Opinion Classification on Code-mixed Tamil Language
1 Introduction
1.1 Sentiment Analysis
1.2 Sentiment Analysis on Code-mixed Language
2 Literature Review
2.1 Sentiment Analysis on Mono Lingual Data
2.2 Sentiment Analysis on Code-Mixed Data
3 Methodology
3.1 Data Description
3.2 System Design
3.3 Data Pre-processing
3.4 Word Embedding
3.5 Machine Learning Approach
4 Performance Evaluation
4.1 Evaluation Metrics
5 Results and Discussions
6 Conclusion
References
Analyzing Tamil News Tweets in the Context of Topic Identification
1 Introduction
2 Related Work
3 Corpus Creation
3.1 Extracting Tweets
3.2 Generating Labels Through Keyword-Based Distant Supervision
4 Corpus Analysis
5 Experiments and Results
5.1 Experiments
5.2 Results
6 Limitations and Future Work
7 Conclusion
References
Textual Entailment Recognition with Semantic Features from Empirical Text Representation
1 Introduction
2 Related Work
3 Proposed Approach
3.1 Empirical Text Representation
3.2 Feature Extraction of Text-Hypothesis Pair
4 Experiments Results
4.1 Dataset
4.2 Experimental Settings
4.3 Performance Analysis of Entailment Recognition
4.4 Comparative Analysis
5 Conclusion with Future Direction
References
Impact of Transformers on Multilingual Fake News Detection for Tamil and Malayalam
1 Introduction
1.1 Motivation and Contribution
2 Multilingual Fake News Dataset Description
3 Methodology
3.1 Experimental Setup
4 Results and Discussions
5 Conclusion and Future Works
References
Development of Multi-lingual Models for Detecting Hope Speech Texts from Social Media Comments
1 Introduction
2 Literature Survey
3 Proposed Methodology
3.1 Preprocessing
3.2 Model Construction
4 Experimental Settings, Results and Findings
4.1 Experimental Results
4.2 Findings and Discussions
5 Conclusion and Future Work
References
Transfer Learning Based YouTube Toxic Comments Identification
1 Introduction
2 Related Work
3 Proposed System
3.1 Models
3.2 Classifiers
4 Performance Evaluation
5 Conclusion
References
Contextual Analysis of Tamil Proverbs for Automatic Meaning Extraction
1 Introduction
2 Background
3 Related Works
3.1 Works Related to Tamil Language and Literature
3.2 Works Related to Sentence Scoring Approach
4 Proposed Work
4.1 Dataset Creation
4.2 Meaning Extraction for Tamil Proverbs
5 Result and Discussion
6 Conclusions
References
Question Answering System for Tamil Using Deep Learning
1 Introduction
2 Related Works
3 System Architecture
3.1 Modules
4 Experiment and Results
4.1 Datasets
4.2 Models Used
4.3 BERT
4.4 XLM-RoBERTa
4.5 Results
5 Conclusion
References
Exploring the Opportunities and Challenges in Contributing to Tamil Wikimedia
1 Wikimedia Project: An Introduction
2 Tamil Wikimedia Project
2.1 Veteran Tamil Wikimedia Contributors
2.2 Being Part of Tamil Wikimedia
3 Opportunities in Tamil Wikimedia
4 Challenges in Tamil Wikimedia
5 Conclusion and Future Scope
References
Speech Technologies
Early Alzheimer Detection Through Speech Analysis and Vision Transformer Approach
1 Introduction
2 Related Work
3 Proposed Vision Transformer Approach for Alzheimer Detection
3.1 Log Mel Spectrogram
3.2 Vision Transformer Deep Learning Model
4 MFCC and Random Forest Approach for Alzheimer Detection
4.1 MFCC
5 Experimental Analysis
5.1 Experimental Setup
5.2 Experimental Analysis
5.3 Metrics
5.4 Experimental Results
6 Conclusion
References
Multimodal Data Analysis
Active Contour Segmentation and Deep Learning Based Hand Gesture Recognition System for Deaf and Dumb People
1 Introduction
2 Related Works
3 Proposed System
3.1 Image Segmentation
3.2 Deep Learning Based Hand Gesture Recognition Model
4 Implementation
5 Result and Discussions
6 Conclusion
Appendix 1
References
Multimodal Hate Speech Detection from Bengali Memes and Texts
1 Introduction
2 Related Work
3 Methods
3.1 Data Preprocessing
3.2 Neural Word Embeddings
3.3 Training of DNN Baseline Models
3.4 Training of Transformer-Based Models
3.5 Multimodal Fusion and Classification
4 Experiment Results
4.1 Datasets
4.2 Experiment Setup
4.3 Analysis of Hate Speech Detection
5 Conclusion
References
Workshop 1: Fake News Detection in Low-Resource Languages (Regional-Fake)
A Novel Dataset for Fake News Detection in Tamil Regional Language
1 Introduction
2 Related Work
3 Proposed Work
3.1 Data Scraping
3.2 Real News Data Collection
3.3 Fake News Data Collection
3.4 Challenges in Data Collection
3.5 Data Cleansing
3.6 Exploratory Data Analysis (EDA)
3.7 Corpus Statistics
4 Benchmark Models
4.1 Data Representation
4.2 Classifiers
4.3 Results
5 Conclusion
References
Fake News Detection in Low-Resource Languages
1 Introduction
2 Related Work
3 Fake News Dataset
4 Methodologies Used
4.1 Logistic Regression
4.2 BERT-Base Model
5 Implementation
6 Result and Analysis
7 Conclusion
References
Workshop 2: Low Resource Cross-Domain, Cross-Lingual and Cross-Modal Offensive Content Analysis (LC4)
MMOD-MEME: A Dataset for Multimodal Face Emotion Recognition on Code-Mixed Tamil Memes
1 Introduction
2 Related Works
3 Dataset Collection
4 Details of Dataset Construction
5 Dataset Analysis
6 Conclusion
References
End-to-End Unified Accented Acoustic Model for Malayalam - A Low Resourced Language
1 Introduction
2 Related Work
3 Proposed Methodology and Design
3.1 Dataset Construction
3.2 Feature Engineering
3.3 Building the Accented ASR System
4 Experimental Results
5 Conclusion and Future Scope
References
Author Index