Optimizing Wav2Vec2 for Low-Resource Languages
Fine-tuning speech recognition on limited Nepali datasets while maintaining acoustic robustness across dialects.
Challenge: Low-resource languages like Nepali lack large labeled audio corpora. Fine-tuning directly on a small labeled set overfits and fails to generalize across dialects.
Solution: a three-stage approach: self-supervised pretraining on unlabeled Nepali audio, supervised fine-tuning with SpecAugment regularization, and accent-aware rebalancing that oversamples under-represented dialects with SMOTE.
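The SpecAugment step masks random frequency bands and time spans of the input spectrogram so the model cannot rely on any single region, which is the main overfitting guard during supervised fine-tuning. Below is a minimal NumPy sketch of that masking (the function name and mask widths are illustrative, not taken from the project's code):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """Apply SpecAugment-style masking to a (freq_bins, time_steps)
    log-mel spectrogram. Random frequency bands and time spans are
    zeroed out, regularizing fine-tuning on small labeled sets."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    # Mask `num_freq_masks` random horizontal bands (frequency axis).
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0
    # Mask `num_time_masks` random vertical spans (time axis).
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0
    return out
```

In practice the same effect is available out of the box (e.g. torchaudio's `FrequencyMasking`/`TimeMasking` transforms); the sketch just makes the mechanics explicit.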
$L = \alpha L_{\mathrm{ctc}} + (1-\alpha) L_{\mathrm{accent}}$ with $\alpha = 0.5$, weighting the CTC transcription loss and the auxiliary accent-classification loss equally.
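The combined objective is a straightforward convex mix of the two losses; a minimal sketch (function name is illustrative, and the same one-liner applies unchanged to framework tensors):

```python
def combined_loss(ctc_loss, accent_loss, alpha=0.5):
    """Blend the CTC transcription loss with the auxiliary
    accent-classification loss: L = a*L_ctc + (1-a)*L_accent."""
    return alpha * ctc_loss + (1.0 - alpha) * accent_loss
```

With the paper's $\alpha = 0.5$, a batch with `ctc_loss=2.0` and `accent_loss=4.0` yields a combined loss of 3.0.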
Result: 18% WER (vs. a 32% baseline), with 76% generalization across 5 dialects.
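For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A self-contained sketch of the standard dynamic-programming computation (libraries like `jiwer` provide the same metric):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference words,
    computed via a rolling-array Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the first i reference words
    # and the first j hypothesis words; row 0 is pure insertion cost.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (free if equal)
            prev = cur
    return d[-1] / max(1, len(ref))
```

For example, `word_error_rate("ram ghar gayo", "ram ghar gayo")` is 0.0, while one substitution and one deletion against a four-word reference gives 0.5.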