Need feedback on a two-stage ML approach for detecting and correcting mislabeled entity relationships (meters ↔ transformers)

Hey everyone,

I am working on a real-world data quality problem and would appreciate feedback on my modeling approach.

Context:

I have a dataset of meters and their associated transformers (utility infrastructure). Some of these associations are incorrect, and the goal is to both detect and correct them.

Training data:

I’m using ~20,000 manually reviewed meter–transformer associations:

- Correct association → label = 1

- Incorrect association → label = 0

For incorrect cases, I also augment the data with the correct transformer, e.g.:

Meter1 | Trans1 | 0 (incorrect)

Meter1 | Trans2 | 1 (corrected)

Meter2 | Trans3 | 1 (correct)
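A minimal sketch of how the augmented rows above could be built from the review results. The `Review` record and its field names are hypothetical; the only thing taken from the post is the labeling rule (1 = correct, 0 = incorrect, plus an extra positive row for the corrected transformer).

```python
from typing import NamedTuple, Optional

class Review(NamedTuple):
    meter: str
    assigned: str             # transformer currently on record
    corrected: Optional[str]  # reviewer's fix, or None if the link was correct

def to_training_rows(reviews):
    """Expand reviews into (meter, transformer, label) rows; label 1 = correct."""
    rows = []
    for r in reviews:
        if r.corrected is None or r.corrected == r.assigned:
            rows.append((r.meter, r.assigned, 1))
        else:
            rows.append((r.meter, r.assigned, 0))   # original, wrong link
            rows.append((r.meter, r.corrected, 1))  # augmented positive
    return rows

reviews = [
    Review("Meter1", "Trans1", "Trans2"),
    Review("Meter2", "Trans3", None),
]
training_rows = to_training_rows(reviews)
for row in training_rows:
    print(row)
```

Both Meter1 rows come from the same meter, which is worth keeping in mind when splitting into train and validation sets.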

Current baseline:

I started with a logistic regression model (class_weight="balanced" due to ~37% incorrect vs 63% correct).

Using a 0.20 threshold gives a high true negative rate (~98%), but only moderate recall.
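To make the threshold trade-off concrete, here is a small sketch that computes the true negative rate and recall at a given cutoff. It assumes the scores are predicted probabilities for class 1 (for example the class-1 column of scikit-learn's `predict_proba`) and that a score at or above the threshold is predicted as class 1. Which class counts as "positive" is my assumption, and the labels and scores below are toy values, not results from the real data.

```python
def rates_at_threshold(y_true, scores, threshold=0.20):
    """Return (true_negative_rate, recall), predicting class 1 when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return tnr, recall

# Toy labels and scores, made up purely for illustration.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.90, 0.15, 0.60, 0.05, 0.30, 0.10]

for t in (0.20, 0.50):
    tnr, recall = rates_at_threshold(y_true, scores, t)
    print(f"threshold={t:.2f}  TNR={tnr:.2f}  recall={recall:.2f}")
```

Sweeping the threshold like this on a held-out set shows where the two rates cross, instead of fixing one cutoff up front.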

Candidate generation:

For inference, I generate candidate transformers within a 550 ft radius for each meter (including the currently assigned one):

Meter1 | CandidateTrans1 | current

Meter1 | CandidateTrans2 | candidate

Meter1 | CandidateTrans3 | candidate
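The candidate table above could be produced with something like the following. It is only a sketch: it assumes meter and transformer coordinates are already in a projected coordinate system measured in feet, and it assumes the currently assigned transformer is kept even when it falls outside the radius. The function name and the toy coordinates are made up.

```python
import math

RADIUS_FT = 550.0  # search radius from the post

def candidates_for_meter(meter_xy, current_id, transformers, radius_ft=RADIUS_FT):
    """Return [(transformer_id, role, distance_ft)] sorted by distance.

    Assumption: the current transformer is always included, whatever its distance.
    """
    out = []
    for tid, (x, y) in transformers.items():
        d = math.hypot(x - meter_xy[0], y - meter_xy[1])
        if tid == current_id:
            out.append((tid, "current", d))
        elif d <= radius_ft:
            out.append((tid, "candidate", d))
    return sorted(out, key=lambda row: row[2])

# Toy coordinates in feet (hypothetical).
transformers = {
    "T1": (0.0, 300.0),
    "T2": (400.0, 0.0),
    "T3": (600.0, 100.0),   # current assignment, just outside the radius
    "T4": (100.0, 100.0),
    "T5": (1000.0, 0.0),    # too far, dropped
}
candidate_rows = candidates_for_meter((0.0, 0.0), "T3", transformers)
for tid, role, d in candidate_rows:
    print(f"Meter1 | {tid} | {role} | {d:.0f} ft")
```

For 20,000+ meters a spatial index would be the usual replacement for the inner loop, but the output shape stays the same.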

Current idea:

I’m considering splitting the problem into two stages:

Model 1 — Detection

Binary classification:

Is the current meter → transformer association incorrect?

Model 2 — Correction

For meters flagged as incorrect, rank candidate transformers to recommend the most likely correct one.

Pipeline:

Raw data → Detection model → Flag suspicious cases → Candidate generation → Ranking model → Recommendation
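A rough sketch of how the two stages in the pipeline above could be wired together. The two scoring callables stand in for the detection and ranking models and are hypothetical, as is the 0.5 flag threshold. Excluding the current transformer from the ranking step is also an assumption, not something decided in the post.

```python
from typing import Callable, List, Optional

def two_stage_recommend(
    meter: str,
    current: str,
    candidates: List[str],
    p_incorrect: Callable[[str, str], float],
    rank_score: Callable[[str, str], float],
    flag_threshold: float = 0.5,
) -> Optional[str]:
    """Return a recommended transformer, or None if nothing should change."""
    # Stage 1: detection looks only at the current association.
    if p_incorrect(meter, current) < flag_threshold:
        return None
    # Stage 2: rank the remaining candidates for flagged meters.
    others = [t for t in candidates if t != current]
    if not others:
        return None
    return max(others, key=lambda t: rank_score(meter, t))

# Stand-in scores (made up) in place of trained models.
detect = {("Meter1", "Trans1"): 0.9, ("Meter2", "Trans3"): 0.1}
rank = {("Meter1", "Trans2"): 0.8, ("Meter1", "Trans4"): 0.3}

rec1 = two_stage_recommend(
    "Meter1", "Trans1", ["Trans1", "Trans2", "Trans4"],
    lambda m, t: detect[(m, t)], lambda m, t: rank[(m, t)],
)
rec2 = two_stage_recommend(
    "Meter2", "Trans3", ["Trans3"],
    lambda m, t: detect[(m, t)], lambda m, t: rank.get((m, t), 0.0),
)
print(rec1, rec2)
```

Keeping the two models behind plain callables makes it easy to swap in a single end-to-end scorer later and compare both setups on the same candidates.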

Features:

- Distance-based metrics (meter-to-transformer, centroid distances, etc.)

- Voltage correlation within meter clusters

- FLOC / naming similarity

- Cluster-level stats (group size, intra-cluster correlation)

- Relative features (distance rank, ratios, etc.)
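As an illustration of the relative features in the list above, here is a sketch that turns one meter's candidate distances into a distance rank, a ratio to the nearest candidate, and a gap to the nearest candidate. The feature names are hypothetical and the distances are toy values.

```python
def add_relative_features(distances_ft):
    """Map {transformer_id: distance_ft} for one meter to per-candidate features."""
    if not distances_ft:
        return {}
    ordered = sorted(distances_ft.items(), key=lambda kv: kv[1])
    nearest = ordered[0][1]
    feats = {}
    for rank, (tid, d) in enumerate(ordered, start=1):
        if nearest > 0:
            ratio = d / nearest
        else:
            # Nearest candidate sits at distance 0: avoid dividing by zero.
            ratio = 1.0 if d == 0 else float("inf")
        feats[tid] = {
            "distance_ft": d,
            "distance_rank": rank,        # 1 = closest candidate
            "ratio_to_nearest": ratio,
            "gap_to_nearest_ft": d - nearest,
        }
    return feats

# Toy distances in feet for a single meter.
rel_feats = add_relative_features({"Trans1": 300.0, "Trans2": 120.0, "Trans3": 480.0})
for tid, f in rel_feats.items():
    print(tid, f)
```

Features like these are computed per meter across its candidate set, so they only exist once candidate generation has run.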

Questions:

  1. Does this 2-stage decomposition (detection → correction) make sense vs a single end-to-end model?

  2. For the correction step, would you frame this as classification or learning-to-rank?

  3. Any recommendations for handling dependency between samples (e.g., meters within the same cluster)?

  4. Given the feature interactions, would you prioritize tree-based models (e.g., XGBoost) over simpler models?

Goal:

Maximize the number of incorrect associations that can be correctly fixed in production.
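One way to turn the goal above into a number is an end-to-end fix rate: of all truly incorrect associations, the share where the pipeline's final output equals the true transformer. The sketch below is an assumption about how that could be measured, not an established metric from the post. It also counts correct links that the pipeline would have changed, which is an extra I added because those cost reviewer time. The cases are toy data.

```python
def end_to_end_fix_rate(cases):
    """cases: iterable of (true_transformer, current_transformer, recommendation_or_None).

    Returns (fix_rate, n_fixed, n_broken).
    """
    n_incorrect = n_fixed = n_broken = 0
    for true_t, current_t, rec in cases:
        final = current_t if rec is None else rec
        if current_t != true_t:
            n_incorrect += 1
            if final == true_t:
                n_fixed += 1
        elif final != true_t:
            n_broken += 1  # a correct link the pipeline would have changed
    rate = n_fixed / n_incorrect if n_incorrect else float("nan")
    return rate, n_fixed, n_broken

# Toy cases (made up): (true, current, recommended)
cases = [
    ("T2", "T1", "T2"),   # incorrect, flagged, fixed
    ("T5", "T4", None),   # incorrect, missed by detection
    ("T7", "T6", "T8"),   # incorrect, flagged, wrong recommendation
    ("T9", "T9", None),   # correct, left alone
    ("T3", "T3", "T1"),   # correct, wrongly changed
]
fix_rate, n_fixed, n_broken = end_to_end_fix_rate(cases)
print(f"fix rate={fix_rate:.2f}, fixed={n_fixed}, broken={n_broken}")
```

Scoring the two-stage pipeline and a single end-to-end model with the same function gives a like-for-like comparison on the stated goal.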

Open to hearing feedback!

submitted by /u/Zestyclose_Candy6313