I trained an NER model on 33,000 Indian Supreme Court judgments (1950–2024). CASE_CITATION hits 97.76% F1, +17 points over the only prior baseline [P]
TL;DR: Released en_legal_ner_ind_trf v0.1 - InLegalBERT fine-tuned on ~34,700 silver-annotated chunks from 33k Indian SC judgments. 13 labels. 78.67% overall F1. CASE_CITATION at 97.76% already exceeds OpenNyAI's PRECEDENT score by +17 points. Free, Apache-2.0.
Why this exists
OpenNyAI is the only prior Indian legal NER model with any community presence. It's unmaintained and degrades on pre-1990 OCR-era text - the first 40 years of India's constitutional jurisprudence.
No replacement existed.
Results
| Entity | F1 | Support |
|---|---|---|
| CASE_CITATION | 97.76% | 3,821 |
| PROVISION | 96.35% | 20,248 |
| STATUTE | 91.94% | 8,187 |
| LAWYER | 74.67% | 3,982 |
| JUDGE | 68.06% | 1,978 |
| DATE | 55.15% | 3,289 |
| RESPONDENT | 50.44% | 1,731 |
| COURT | 50.34% | 1,033 |
| WITNESS | 49.77% | 762 |
| OTHER_PERSON | 47.11% | 4,266 |
| PETITIONER | 44.71% | 1,573 |
| ORG | 41.34% | 2,128 |
| GPE | 36.56% ⚠ | 1,197 |
| micro avg | 78.67% | 54,195 |
Evaluated on a held-out validation split (~500 documents, stride=512, non-overlapping). The 25-file locked test set is untouched - head-to-head with OpenNyAI runs in v1.0.
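For anyone reproducing the eval, here's a minimal chunking sketch - my reconstruction, not the released code. One gotcha: HF tokenizers define stride as the *overlap* between windows, so a 512-token step with 512-token windows maps to stride=0 there.

```python
from transformers import AutoTokenizer

# Sketch of non-overlapping 512-token windowing (assumes the public
# law-ai/InLegalBERT tokenizer; not the repo's actual eval script).
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")

def chunk_document(text: str, max_length: int = 512):
    enc = tokenizer(
        text,
        max_length=max_length,
        truncation=True,
        stride=0,                        # HF stride = overlap, so 0 = non-overlapping
        return_overflowing_tokens=True,  # emit every window, not just the first
        return_offsets_mapping=True,     # char offsets, needed for span-level scoring
    )
    return [
        {"input_ids": ids, "offsets": offs}
        for ids, offs in zip(enc["input_ids"], enc["offset_mapping"])
    ]
```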
Comparison note: OpenNyAI (RoBERTa + transition-based parser, gold-annotated) achieved 91.1% overall strict F1. Not directly comparable - different test sets, different annotation quality, different corpus scope. The +17 point gap on CASE_CITATION is the one apples-to-apples number worth flagging.
The annotation pipeline
Silver labels come from four automatic pipelines, merged per document (a merge sketch follows the list):
- Regex: 14-pattern citation extractor plus a statute/provision extractor → CASE_CITATION, STATUTE, PROVISION
- Metadata projection: case metadata JSONs mapped to character offsets via RapidFuzz → JUDGE, PETITIONER, RESPONDENT
- Transformer NER: OpenNyAI's en_legal_ner_trf, offset-corrected → LAWYER, COURT, ORG, GPE, DATE, OTHER_PERSON, WITNESS
- Gazetteer: 858 Central Acts with alias resolution → confirms and adds STATUTE spans
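The post doesn't spell out how conflicts between the four sources get resolved (the author offers to discuss it at the end), so here's one plausible precedence-based merge as a sketch - the priority ordering below is my assumption, not the documented rule:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # character offset into the judgment text
    end: int
    label: str
    source: str  # which of the four pipelines produced it

# Assumed precedence: high-precision pattern sources beat fuzzy metadata
# projection, which beats raw model output. Not necessarily the actual rule.
PRIORITY = {"regex": 0, "gazetteer": 1, "metadata": 2, "transformer": 3}

def merge_spans(spans: list[Span]) -> list[Span]:
    """Greedy merge: prefer higher-priority sources, then longer spans;
    keep a span only if it doesn't overlap anything already accepted."""
    kept: list[Span] = []
    for s in sorted(spans, key=lambda s: (PRIORITY[s.source], -(s.end - s.start))):
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)
```

Under a rule like this, a regex CASE_CITATION would swallow a shorter transformer DATE nested inside it, which matches the intuition that the regex patterns are the highest-precision source.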
Trained with Focal Loss (γ=2.0) to handle label imbalance between STATUTE/CASE_CITATION and O tokens. Hardware: Kaggle T4 (free tier).
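For reference, a minimal focal-loss sketch for token classification - my reconstruction of the setup described (γ only, no class-weight α term since the post doesn't mention one):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0, ignore_index=-100):
    """Focal loss for token classification: down-weights easy, abundant
    tokens (mostly the O class) so the rare entity tokens dominate the
    gradient. logits: (batch, seq, num_labels); labels: (batch, seq)."""
    num_labels = logits.size(-1)
    ce = F.cross_entropy(
        logits.reshape(-1, num_labels), labels.reshape(-1),
        reduction="none", ignore_index=ignore_index,
    )
    pt = torch.exp(-ce)              # model's probability of the true class
    loss = (1.0 - pt) ** gamma * ce  # focusing term shrinks easy-token loss
    mask = labels.reshape(-1) != ignore_index
    return loss[mask].mean()
```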
Known weak spots - being honest
GPE (36.56%) and ORG (41.34%) are the problem labels. In Indian legal text, "State of Maharashtra" or "Union of India" appear as GPE, PETITIONER, RESPONDENT, or ORG depending on context. A linear token classification head can't resolve overlapping roles. CRF head is v1.0's job.
Positional bias - silver training data has repetitive header structures. Performance degrades when parties appear mid-document.
Pre-1990 OCR noise - judgments from 1950–1989 vary in quality. Recall drops the further back you go.
What's next
300-file gold annotation is in progress (3 volunteers on board). v1.0 will add a CRF head, run the locked test set, and publish the official head-to-head with OpenNyAI.
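For the curious, here's roughly what a CRF head on top of the encoder could look like, using the pytorch-crf package - an illustration under my own assumptions, not the v1.0 code:

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BertCrfTagger(nn.Module):
    """Sketch: encoder emissions feed a CRF so label transitions are scored
    jointly (e.g. I-GPE right after B-ORG becomes structurally unlikely)
    instead of each token being classified independently."""
    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder  # e.g. a loaded InLegalBERT AutoModel
        self.classifier = nn.Linear(encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # labels must hold valid tag indices (not -100) wherever mask is on
            return -self.crf(emissions, labels, mask=mask)  # negative log-likelihood
        return self.crf.decode(emissions, mask=mask)        # Viterbi-decoded tag paths
```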
Model: huggingface.co/evolawyer/inlegalbert-sc-ner-silver
Dataset: huggingface.co/datasets/evolawyer/indian-sc-judgments-ner-silver
GitHub: github.com/evolawyer/inlegalbert-sc-ner-silver
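Quick-start, assuming the checkpoint plays nicely with the stock HF token-classification pipeline (untested on my end):

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="evolawyer/inlegalbert-sc-ner-silver",
    aggregation_strategy="simple",  # merge wordpieces into whole entity spans
)
text = "In Kesavananda Bharati v. State of Kerala, AIR 1973 SC 1461, the Court held ..."
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))
```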
Happy to go deep on the annotation pipeline, conflict resolution between the four label sources, or the Focal Loss setup.