
I trained an NER model on 33,000 Indian Supreme Court judgments (1950–2024). CASE_CITATION hits 97.76% F1, +17 points over the only prior baseline [P]

TL;DR: Released en_legal_ner_ind_trf v0.1 - InLegalBERT fine-tuned on ~34,700 silver-annotated chunks from 33k Indian SC judgments. 13 labels. 78.67% overall F1. CASE_CITATION at 97.76% already exceeds OpenNyAI's PRECEDENT score by +17 points. Free, Apache-2.0.

Why this exists

OpenNyAI is the only prior Indian legal NER model with any community presence. It's unmaintained and degrades on pre-1990 OCR-era text - the first 40 years of India's constitutional jurisprudence.

No replacement existed.

Results

Entity          F1        Support
CASE_CITATION   97.76%      3,821
PROVISION       96.35%     20,248
STATUTE         91.94%      8,187
LAWYER          74.67%      3,982
JUDGE           68.06%      1,978
DATE            55.15%      3,289
RESPONDENT      50.44%      1,731
COURT           50.34%      1,033
WITNESS         49.77%        762
OTHER_PERSON    47.11%      4,266
PETITIONER      44.71%      1,573
ORG             41.34%      2,128
GPE             36.56% ⚠    1,197
micro avg       78.67%     54,195

Evaluated on a held-out validation split (~500 documents, stride=512, non-overlapping). The 25-file locked test set is untouched - the head-to-head with OpenNyAI runs in v1.0.
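
For anyone reproducing the chunking: a fast tokenizer's overflow feature does this in one call. One wrinkle is that Hugging Face's stride argument means *overlap*, so "stride=512, non-overlapping" corresponds to stride=0 below. A sketch under my own assumptions (the release may chunk differently; tokenizer name assumed to be the public InLegalBERT checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")

judgment_text = "..."  # full text of one judgment

enc = tok(
    judgment_text,
    max_length=512,
    truncation=True,
    return_overflowing_tokens=True,  # yield every window, not just the first
    stride=0,                        # HF stride = overlap; 0 -> non-overlapping
    return_offsets_mapping=True,     # char offsets to map tags back to text
)
print(f"{len(enc['input_ids'])} chunks of ≤512 tokens")
```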

Comparison note: OpenNyAI (RoBERTa + transition-based parser, gold-annotated) reported 91.1% overall strict F1. The scores aren't directly comparable - different test sets, different annotation quality, different corpus scope. The +17 point gap on CASE_CITATION (against OpenNyAI's PRECEDENT label) is the closest thing to an apples-to-apples number worth flagging.

The annotation pipeline

Silver labels come from four automatic pipelines, merged per document (illustrative sketches follow the list):

  • Regex — 14-pattern citation extractor + statute/provision extractor → CASE_CITATION, STATUTE, PROVISION
  • Metadata projection — case metadata JSONs mapped to character offsets via RapidFuzz → JUDGE, PETITIONER, RESPONDENT
  • Transformer NER — OpenNyAI en_legal_ner_trf, offset-corrected → LAWYER, COURT, ORG, GPE, DATE, OTHER_PERSON, WITNESS
  • Gazetteer — 858 Central Acts with alias resolution → confirms and adds STATUTE spans
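
To make the regex leg concrete, here's a minimal sketch with two illustrative citation patterns. The actual extractor has 14; these two are my own approximations, not the project's:

```python
import re

# Two illustrative Indian citation patterns (approximations for demonstration,
# not the project's originals).
CITATION_PATTERNS = [
    re.compile(r"\bAIR\s+\d{4}\s+SC\s+\d+\b"),     # e.g. "AIR 1973 SC 1461"
    re.compile(r"\(\d{4}\)\s+\d+\s+SCC\s+\d+\b"),  # e.g. "(1973) 4 SCC 225"
]

def extract_citations(text):
    """Yield (start, end, label) character spans for CASE_CITATION candidates."""
    for pat in CITATION_PATTERNS:
        for m in pat.finditer(text):
            yield (m.start(), m.end(), "CASE_CITATION")

text = "The ratio in Kesavananda Bharati, AIR 1973 SC 1461, was affirmed..."
print(list(extract_citations(text)))
```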
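
The metadata projection can be sketched with RapidFuzz's partial_ratio_alignment, which returns where the best fuzzy match of one string sits inside another. The field names and threshold here are hypothetical:

```python
from rapidfuzz import fuzz

def project_field(value, text, label, min_score=85):
    """Map one metadata value (e.g. a judge's name) to character offsets.
    Lowercasing both sides keeps offsets valid while ignoring case."""
    aln = fuzz.partial_ratio_alignment(value.lower(), text.lower(),
                                       score_cutoff=min_score)
    if aln is None:
        return None  # no sufficiently similar span found
    return (aln.dest_start, aln.dest_end, label)

text = "BEFORE HON'BLE MR. JUSTICE Y.V. CHANDRACHUD AND ..."
print(project_field("Y. V. Chandrachud", text, "JUDGE"))
```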
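
The post doesn't spell out how overlaps between the four sources are resolved (the author offers to go deep on that below), so the merge rule here is purely my assumption: fixed source priority, longer span wins ties, output spans never overlap:

```python
# Hypothetical conflict resolution between the four silver-label sources.
SOURCE_PRIORITY = {"regex": 0, "metadata": 1, "gazetteer": 2, "transformer": 3}

def merge_spans(spans):
    """spans: list of (start, end, label, source); returns non-overlapping spans."""
    ranked = sorted(
        spans,
        key=lambda s: (SOURCE_PRIORITY[s[3]], -(s[1] - s[0])),  # priority, then length
    )
    kept = []
    for start, end, label, source in ranked:
        if all(end <= k[0] or start >= k[1] for k in kept):  # no overlap with winners
            kept.append((start, end, label, source))
    return sorted(kept)

silver = [
    (10, 30, "CASE_CITATION", "regex"),
    (12, 25, "OTHER_PERSON", "transformer"),  # loses: overlaps a regex span
]
print(merge_spans(silver))
```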

Trained with Focal Loss (γ=2.0) to handle label imbalance between STATUTE/CASE_CITATION and O tokens. Hardware: Kaggle T4 (free tier).
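
The focal-loss objective with γ=2.0 looks roughly like the sketch below; the Trainer hook is one common way to wire it in, and the exact integration in the release may differ:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

def focal_loss(logits, labels, gamma=2.0, ignore_index=-100):
    """(1 - p_t)^gamma * CE over tokens: easy (mostly O) tokens are
    down-weighted so rare entity tokens dominate the gradient."""
    logits = logits.view(-1, logits.size(-1))
    labels = labels.view(-1)
    ce = F.cross_entropy(logits, labels, reduction="none",
                         ignore_index=ignore_index)
    p_t = torch.exp(-ce)                  # model's probability of the true tag
    loss = (1.0 - p_t) ** gamma * ce
    return loss[labels != ignore_index].mean()

class FocalTrainer(Trainer):
    # Override the loss HF Trainer uses; **kwargs absorbs version-specific
    # extras like num_items_in_batch.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = focal_loss(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```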

Known weak spots - being honest

GPE (36.56%) and ORG (41.34%) are the problem labels. In Indian legal text, "State of Maharashtra" or "Union of India" appear as GPE, PETITIONER, RESPONDENT, or ORG depending on context. A linear token classification head can't resolve overlapping roles. CRF head is v1.0's job.
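
For what a v1.0 CRF head could look like, here is a sketch on top of the encoder using the pytorch-crf package. This is my guess at the architecture, not the project's code (InLegalBERT checkpoint name and 27 = 2×13+1 BIO tags are assumptions):

```python
import torch.nn as nn
from torchcrf import CRF            # pip install pytorch-crf
from transformers import AutoModel

class EncoderCrfTagger(nn.Module):
    """Encoder -> linear emissions -> CRF. The CRF's learned transition
    matrix can forbid invalid jumps such as O -> I-GPE that a plain
    linear head happily emits."""
    def __init__(self, encoder_name="law-ai/InLegalBERT", num_tags=27):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.emit = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # pytorch-crf needs valid tag ids wherever mask is on, so any
            # -100 padding must be replaced with a real tag (e.g. O) first.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # best tag sequences
```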

Positional bias - silver training data has repetitive header structures. Performance degrades when parties appear mid-document.

Pre-1990 OCR noise - judgments from 1950–1989 vary in quality. Recall drops the further back you go.

What's next

A 300-file gold annotation effort is in progress (3 volunteers on board). v1.0 will add a CRF head, run the locked test set, and publish the official head-to-head with OpenNyAI.

Model: huggingface.co/evolawyer/inlegalbert-sc-ner-silver

Dataset: huggingface.co/datasets/evolawyer/indian-sc-judgments-ner-silver

GitHub: github.com/evolawyer/inlegalbert-sc-ner-silver
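
If the Hugging Face repo is a standard token-classification checkpoint (an assumption on my part; it may instead ship as a spaCy package, given the en_legal_ner_ind_trf name), loading it would look like:

```python
from transformers import pipeline

# Assumes a standard HF token-classification checkpoint; label names and
# output format may differ in the actual release.
ner = pipeline(
    "token-classification",
    model="evolawyer/inlegalbert-sc-ner-silver",
    aggregation_strategy="simple",   # merge wordpieces into entity spans
)

text = ("In Kesavananda Bharati v. State of Kerala, AIR 1973 SC 1461, "
        "the Supreme Court laid down the basic structure doctrine.")
for ent in ner(text):
    print(ent["entity_group"], repr(ent["word"]), ent["start"], ent["end"])
```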

Happy to go deep on the annotation pipeline, conflict resolution between the four label sources, or the Focal Loss setup.

submitted by /u/gkv856

