[P] QLoRA Fine-Tuning of Qwen2.5-1.5B for CEFR English Proficiency Classification (A1–C2)
Our take
The recent QLoRA fine-tune of Qwen2.5-1.5B for classifying English text into the six CEFR levels shows how a small amount of targeted training can yield a practical, education-focused model. Starting from a 4-bit NF4 quantized base and training only about 0.28% of the parameters, the author reports 84.9% accuracy on a held-out test set of 179 samples from a balanced synthetic dataset. The result fits a broader trend in AI-driven data tools: our own coverage of how AI agents will reshape data science in 2026 highlighted the growing demand for specialized models that can be deployed at scale without prohibitive compute costs. The CEFR classifier follows the same trajectory, offering an accessible model that can be embedded in adaptive learning platforms, placement exams, or readability engines.
From a technical standpoint, generating the training corpus with Llama-3.3-70B through the Groq API is a pragmatic use of a synthetic data pipeline. The author constrained vocabulary complexity, grammatical progression, and sentence-structure variation to preserve the linguistic signatures that separate A1 through C2 texts. The per-level recall (96.6% for A1, 90.0% for A2 and B1, 86.7% for B2 and C1) suggests that the model picks up these cues well, at least on synthetic text. The much lower 60.0% recall for C2 is understandable: the author attributes most errors to C1/C2 confusion, and the distinction between advanced and near-native usage often hinges on idiom and discourse conventions that are hard to capture with synthetic examples alone. This mirrors challenges we have observed in other educational NLP applications, where the line between high-level proficiency and expert fluency can blur without real-world validation.
What makes this work noteworthy for our readers is not only the headline accuracy but also the deployment strategy. By wrapping the model in a FastAPI service and providing a Docker setup, the author turns a research prototype into a usable API. This aligns with the action-oriented ethos we champion: users can try the classifier today, integrate it into existing learning management systems, and start gathering feedback on real learner data. The model is published on the Hugging Face Hub, which invites the community to refine the synthetic data generation process, experiment with alternative quantization schemes, or augment the training set with authentic learner essays. That kind of collaborative iteration is essential for moving from a promising proof of concept to a production-grade tool that educators trust.
Looking ahead, the most interesting opportunity lies in bridging the gap between synthetic and real data. If future work adds even a modest corpus of verified CEFR-rated texts, C2 recall and macro F1 could plausibly improve, though that remains to be tested. Pairing the classifier with the kind of AI-native spreadsheet technology we cover in our piece on adaptive data workflows could also let educators automate proficiency tagging inside lesson-plan spreadsheets and adjust curricula more quickly. As powerful language models become more widely accessible, the question becomes: how quickly can these specialized classifiers be turned into seamless, human-centered features that help learners and instructors alike?
I fine-tuned Qwen2.5-1.5B for multi-class CEFR English proficiency classification using QLoRA (4-bit NF4).
The goal was to classify English text into one of the 6 CEFR levels (A1 → C2), which can be useful for:
- adaptive language learning systems,
- placement testing,
- readability estimation,
- educational NLP applications.
Dataset
The dataset contains 1,785 English texts balanced across:
- 6 CEFR levels,
- 10 domains/topics.
The samples were synthetically generated using:
- Groq API
- Llama-3.3-70B
Generation constraints were designed to preserve:
- vocabulary complexity,
- grammatical progression,
- sentence structure variation,
- CEFR-specific linguistic patterns.
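One way to picture this setup is as a grid of (level, domain) pairs, each turned into a constrained prompt. The sketch below is illustrative only: the domain names, the per-level hints, and the prompt wording are hypothetical placeholders, not the prompts actually sent to Llama-3.3-70B, and no API call is made.

```python
from itertools import product

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical placeholders: the post states there are 10 domains but does not name them.
DOMAINS = [f"domain_{i}" for i in range(1, 11)]

# Hypothetical one-line constraints per level, standing in for the real generation rules.
LEVEL_HINTS = {
    "A1": "very common words, present tense, short simple sentences",
    "A2": "everyday vocabulary, simple past and future, basic connectors",
    "B1": "familiar topics, some subordinate clauses, opinions with reasons",
    "B2": "abstract vocabulary, varied tenses, clear argument structure",
    "C1": "precise low-frequency vocabulary, complex syntax, nuanced stance",
    "C2": "idiomatic and stylistically flexible language, dense discourse structure",
}


def build_prompt(level: str, domain: str) -> str:
    """Assemble one generation prompt for a given CEFR level and domain."""
    return (
        f"Write a short English text about {domain} at CEFR level {level}. "
        f"Constraints: {LEVEL_HINTS[level]}."
    )


# One entry per (level, domain) cell of the grid.
plan = list(product(LEVELS, DOMAINS))
prompts = [build_prompt(level, domain) for level, domain in plan]

print(len(plan))                   # 60 cells
print(round(1785 / len(plan), 2))  # 29.75 texts per cell on average
```

With 6 levels and 10 domains there are 60 cells, so 1,785 texts works out to just under 30 per cell on average.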
Training Setup
Base model:
- Qwen2.5-1.5B
Fine-tuning method:
- QLoRA
- 4-bit NF4 quantization
- LoRA adapters
Only ~0.28% of model parameters were trained.
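To get a feel for where a figure like ~0.28% can come from, the sketch below counts LoRA adapter weights for one hypothetical configuration. Everything in it is an assumption rather than the reported setup: the rank (16), the choice of the four attention projections as targets, the layer shapes (taken to match the public Qwen2.5-1.5B config, worth checking against its config.json), and the rounded 1.54B base parameter count. Libraries can also count quantized weights differently, so a reported ratio may not match this arithmetic exactly.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed model shapes (verify against the model's config.json).
HIDDEN = 1536        # hidden size
KV_DIM = 256         # 2 key/value heads x 128 dims per head
NUM_LAYERS = 28
BASE_PARAMS = 1.54e9  # rounded base model size

# Hypothetical LoRA settings, not confirmed by the post.
RANK = 16
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGETS.values())
adapter_total = per_layer * NUM_LAYERS

# Classification head for 6 CEFR classes, assumed trainable and bias-free.
head_params = HIDDEN * 6

trainable = adapter_total + head_params
pct = 100 * trainable / (BASE_PARAMS + trainable)

print(f"{trainable:,} trainable parameters, about {pct:.2f}% of the total")
```

Under these assumptions the trainable share lands close to the reported figure, but other combinations of rank and target modules could produce a similar ratio.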
Results
Held-out test set:
- 179 samples
Metrics:
- Accuracy: 84.9%
- Macro F1: 84.9%
Per-level recall:
| Level | Recall |
|---|---|
| A1 | 96.6% |
| A2 | 90.0% |
| B1 | 90.0% |
| B2 | 86.7% |
| C1 | 86.7% |
| C2 | 60.0% |
Most errors come from C1/C2 confusion, which is expected due to the subtle linguistic boundary between those levels.
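As a quick consistency check, the headline accuracy can be rebuilt from the per-level recall figures. The per-class test counts are not given in the post; the sketch below assumes 29 A1 texts and 30 texts for each other level, which is one split that sums to 179 and reproduces the rounded recalls.

```python
# Reported per-level recall, in percent.
recall_pct = {"A1": 96.6, "A2": 90.0, "B1": 90.0, "B2": 86.7, "C1": 86.7, "C2": 60.0}

# Assumed number of test texts per level (not stated in the post); sums to 179.
support = {"A1": 29, "A2": 30, "B1": 30, "B2": 30, "C1": 30, "C2": 30}

# Recover the number of correctly classified texts per level.
correct = {lvl: round(support[lvl] * recall_pct[lvl] / 100) for lvl in support}

total = sum(support.values())
accuracy = sum(correct.values()) / total
macro_recall = sum(correct[lvl] / support[lvl] for lvl in support) / len(support)

print(total, sum(correct.values()))
print(f"accuracy = {accuracy:.1%}")
print(f"macro recall = {macro_recall:.1%}")
```

Under that assumption, 152 of 179 texts are classified correctly, which matches the reported 84.9%. It also puts the C2 figure in perspective: 60.0% would correspond to 18 of 30 texts, so each additional correct example moves it by about 3.3 points.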
Deployment
I also built:
- a FastAPI inference API,
- Docker deployment setup.
Example Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
model_id = "yanou16/cefr-english-classifier"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Artificial intelligence is transforming many industries."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class
pred = outputs.logits.argmax(dim=-1).item()
print(pred)
```

Feedback is welcome, especially regarding:
- evaluation methodology,
- synthetic data quality,
- improving C2 classification performance,
- better benchmarking approaches.