Abstract:
Sentiment classification in low-resource settings often suffers from limited training data, which restricts the generalization ability of models. This study evaluates and compares the effectiveness of three data augmentation strategies:
- EDA (Easy Data Augmentation)
- Back-translation
- Contextual token substitution (nlpaug-style)
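To make the first strategy concrete, here is a minimal sketch of two EDA operations, random swap and random deletion, in pure Python. This is an illustrative simplification, not the paper's implementation: full EDA also includes synonym replacement and random insertion, which require a thesaurus (e.g. WordNet) and are omitted here.

```python
import random

def eda_augment(text, p_delete=0.1, n_swaps=1, seed=0):
    """Simplified EDA sketch: random swap and random deletion only.

    Full EDA additionally performs synonym replacement and random
    insertion via a thesaurus; those steps are omitted in this sketch.
    """
    rng = random.Random(seed)
    words = text.split()
    # Random swap: exchange two randomly chosen positions, n_swaps times.
    for _ in range(n_swaps):
        if len(words) >= 2:
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
    # Random deletion: drop each word with probability p_delete,
    # but always return at least one word.
    kept = [w for w in words if rng.random() > p_delete]
    return " ".join(kept) if kept else rng.choice(words)

print(eda_augment("the movie was surprisingly good", seed=42))
```

Each call yields a perturbed copy of the input sentence; generating several such copies per training example is how EDA expands a small labeled dataset.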
Methods: Each strategy was evaluated with both traditional ML classifiers (SVM, Random Forest) and a transformer-based model (BERT) on low-resource sentiment datasets.
Results:
- All three augmentation methods improved performance over the unaugmented baseline
- Contextual augmentation gave the most consistent gains for BERT
- EDA and back-translation were more effective for traditional classifiers
Published in: TechRxiv, 2025