The written report for this project is available here: report.pdf. To see the code used for this project, visit the GitHub repository.
A modular pipeline for fine-tuning Small Language Models (SLMs) on multilingual summarization tasks using parallel Wikipedia articles across English, French, German, and Japanese.
- Comparative analysis of three distinct model architectures and fine-tuning strategies
- Cross-lingual knowledge transfer with ROUGE score improvements of up to 167%
- Surprising finding: the smallest model, with full fine-tuning, outperformed larger alternatives
- Validated through both ROUGE metrics and LLM-based evaluation
- Built on Wikipedia "good" and "featured" articles for high-quality training data
- Supports four languages with language-specific prompt engineering (see the template sketch below)
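To make the prompt-engineering point concrete, here is a minimal sketch of per-language templates. The exact prompt wording used in the project is not reproduced here, so the template strings and the `build_prompt` helper are illustrative assumptions.

```python
# Minimal sketch of language-specific prompt templates (illustrative only;
# the project's actual prompt wording may differ).
PROMPT_TEMPLATES = {
    "en": "Summarize the following article in English:\n\n{article}\n\nSummary:",
    "fr": "Résumez l'article suivant en français :\n\n{article}\n\nRésumé :",
    "de": "Fassen Sie den folgenden Artikel auf Deutsch zusammen:\n\n{article}\n\nZusammenfassung:",
    "ja": "次の記事を日本語で要約してください。\n\n{article}\n\n要約:",
}

def build_prompt(article: str, lang: str) -> str:
    """Fill the template for one language so each example is prompted only
    in its own language, limiting cross-language contamination."""
    return PROMPT_TEMPLATES[lang].format(article=article)
```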
This research challenges conventional wisdom about parameter-efficient fine-tuning methods. While the field has moved toward techniques like QLoRA for training larger models with limited resources, our findings suggest that strategic full fine-tuning of compact models can be more effective for multilingual tasks.
The motivation came from a fundamental question in modern NLP: Is it better to fully fine-tune smaller models or use parameter-efficient methods on larger models? Most practitioners assume bigger is always better, especially with techniques like QLoRA making large model training accessible. But this assumption hasn't been thoroughly tested for multilingual summarization tasks.
We wanted a solution that could:
- Provide effective multilingual summarization with limited computational resources
- Demonstrate meaningful cross-lingual knowledge transfer
- Challenge existing assumptions about model scaling and fine-tuning efficiency
- Work reliably across diverse language families and writing systems
Let's examine our key finding through a concrete example. We trained three models: Qwen2.5-0.5B (494M parameters) with full fine-tuning, Phi-4-mini (3.8B parameters) with QLoRA, and mBART-50 (610M parameters) with traditional fine-tuning. Surprisingly, the smallest Qwen model significantly outperformed both alternatives. When the French-specific model was tested on novel German topics, it showed a 60.7% ROUGE-1 improvement. Even more remarkably, when summarizing the same topics across languages, improvements for French summaries reached 167.4%.
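For readers who want to see what the two causal-LM setups look like in practice, the sketch below contrasts full fine-tuning of Qwen2.5-0.5B with a 4-bit QLoRA setup for Phi-4-mini using Hugging Face `transformers` and `peft`. The checkpoint identifiers, LoRA rank, and target modules are assumptions for illustration, not the project's final configuration; mBART-50 follows a standard seq2seq fine-tuning path and is omitted here.

```python
# Sketch of the two causal-LM setups compared here: full fine-tuning of the
# small Qwen model vs. QLoRA on the larger Phi model. Hyperparameters and
# exact checkpoint names are illustrative, not the project's final config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1) Full fine-tuning: all 494M parameters of Qwen2.5-0.5B remain trainable.
qwen = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", torch_dtype=torch.bfloat16
)

# 2) QLoRA: load Phi-4-mini in 4-bit NF4 and train low-rank adapters only.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
phi = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",  # assumed checkpoint id
    quantization_config=bnb_config,
)
phi = prepare_model_for_kbit_training(phi)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
phi = get_peft_model(phi, lora)
phi.print_trainable_parameters()  # only a small fraction of 3.8B is updated
```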
This cross-lingual transfer effect means you can train one model on a single language and get substantial improvements across multiple languages. The model learns general summarization skills rather than just language-specific patterns. You can deploy a single fine-tuned model for multilingual applications instead of maintaining separate models for each language.
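As a rough sketch of that single-model deployment pattern, the snippet below serves all four languages from one fine-tuned checkpoint, reusing the `build_prompt` helper from the template sketch above. The checkpoint path and generation settings are placeholders, not the project's released artifacts.

```python
# Minimal sketch of serving one fine-tuned checkpoint for several languages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "qwen2.5-0.5b-summarizer"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

def summarize(article: str, lang: str) -> str:
    """Generate a summary in the requested language with the same checkpoint."""
    prompt = build_prompt(article, lang)  # language-specific template from above
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    # Strip the prompt tokens and decode only the newly generated summary.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# The same checkpoint handles English, French, German, and Japanese requests.
```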
Our approach also revealed interesting asymmetries in transfer learning. Training on Japanese produced the most generalizable improvements across all languages, while German-to-French transfer differed significantly from French-to-German transfer. This suggests that certain languages may serve as better "pivot" languages for multilingual training.
The technical implementation uses mixed-precision training, gradient checkpointing for memory efficiency, and language-specific prompting to avoid cross-contamination. We validated results using both traditional ROUGE metrics and modern LLM-based evaluation with Gemma 3 27B.
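The sketch below shows how those training and scoring pieces might be wired together with `transformers` `TrainingArguments` and the `evaluate` ROUGE metric; batch size, learning rate, and output paths are illustrative assumptions rather than the project's exact settings.

```python
# Sketch of the training configuration (mixed precision + gradient
# checkpointing) and ROUGE scoring described above. Values are illustrative.
import evaluate
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints/qwen-summarizer",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,                    # mixed-precision training
    gradient_checkpointing=True,  # trade extra compute for lower memory use
    logging_steps=50,
)

# ROUGE scoring of generated vs. reference summaries.
# Note: Japanese text would need language-aware tokenization before scoring.
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the generated summary"],
    references=["the reference summary"],
)
print(scores["rouge1"], scores["rougeL"])
```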