May 29, 2026•2 min read•from Machine Learning

Making LLMs tell you how confident they really are through probe-targeted fine tuning.[R]

Our take

In my latest research, I explore probe-targeted fine-tuning (LoRA) to improve verbal confidence calibration in large language models (LLMs). By probing the hidden states of instruct-tuned LLMs, I found that a probe can distinguish correct from incorrect answers with an AUROC of 0.76–0.88, yet the models often respond with a blanket 99% confidence. This study uses the probe's output as fine-tuning targets to align verbal outputs with internal knowledge. I tested this approach on eight models across four families (7B–70B), and activation patching indicates a causal relationship between the hidden states at the confidence position and the confidence output.

The recent research on probe-targeted fine-tuning for verbal confidence calibration in large language models (LLMs) examines how well a model's stated confidence reflects what it internally represents. By probing the hidden states of instruct-tuned models, the study finds that correct and incorrect answers can be told apart with an AUROC of 0.76–0.88. When asked directly, however, the same models tend to report about 99% confidence for everything, indicating a disconnect between their internal knowledge and what they say. This raises questions about how much weight users should give to the confidence an AI system states.

LoRA (Low-Rank Adaptation) fine-tuning is used to train the LLM to verbalize its internal knowledge more accurately, with the probe's output serving as the fine-tuning target. With a few hundred examples, the fine-tuning reportedly takes under ten minutes on an M3 Ultra. This finding points to an opportunity to improve AI transparency and user trust. As LLMs become more prevalent in decision-making processes, calibrated confidence can help users interpret AI outputs more critically and make better-informed decisions.

Moreover, the research suggests that the link between hidden states and stated confidence is causal, not merely correlational. In activation patching experiments, swapping hidden states at the confidence position shifts the stated confidence, while swapping at a random position has no effect. This matters for industries that rely heavily on data-driven insights. For instance, calibrated confidence could change how we approach automated reporting or data synthesis, where it is important to assess not just what the AI knows but how confidently it presents that information.

Looking ahead, the open question is how this research will change user interaction with AI. Will we see a shift towards more transparent AI systems that prioritize user understanding over sheer output? The challenge lies in balancing the complexity of AI with the need for accessibility and trustworthiness. Developers and users alike need to stay attentive to how confidence levels are communicated and interpreted, so that stated confidence helps users judge AI outputs rather than obscuring how reliable they are.

Just wanted to share my research regarding probe-targeted fine-tuning (LoRA) for verbal confidence calibration.

If you probe the hidden states of an instruct-tuned LLM, the probe can tell correct from incorrect answers at 0.76–0.88 AUROC. But when you ask the model directly, it tends to respond with 99% confidence for everything. The model knows whether it actually knows, but it won't admit it.
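A minimal sketch of what "probing for correctness" and the AUROC figure mean, under loud assumptions: the hidden states here are synthetic vectors (`fake_hidden_state` is hypothetical), the probe is a plain logistic regression, and the post does not say which probe type, layer, or token position was actually used.

```python
import math
import random

def auroc(scores, labels):
    """Chance that a random correct item outscores a random incorrect one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe_score(w, b, x):
    """Probe's estimated probability that the answer behind hidden state x is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(X, y, lr=0.5, epochs=200):
    """Logistic-regression probe fitted with full-batch gradient descent."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = probe_score(w, b, x) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Synthetic stand-in for hidden states (assumption): correct answers are
# shifted along one direction, everything else is noise.
rng = random.Random(0)
DIM = 8

def fake_hidden_state(correct):
    h = [rng.gauss(0, 1) for _ in range(DIM)]
    h[0] += 1.5 if correct else 0.0
    return h

labels = [i % 2 for i in range(400)]
states = [fake_hidden_state(y) for y in labels]
w, b = train_probe(states[:300], labels[:300])
held_out_auc = auroc([probe_score(w, b, x) for x in states[300:]], labels[300:])
print(f"held-out AUROC: {held_out_auc:.2f}")
```

An AUROC of 0.5 would mean the probe cannot separate correct from incorrect answers at all, and 1.0 would mean perfect separation, so 0.76–0.88 sits well above chance.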

I took the probe's output and used it as fine-tuning targets. This teaches the model to say out loud what it already knows internally. LoRA, a few hundred examples, under 10 minutes on an M3 Ultra.
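One way to picture "probe output as fine-tuning targets" is the sketch below. Everything in it is an assumption made for illustration: the prompt wording, the 5-point rounding, and the `prompt`/`completion` record layout are hypothetical and not taken from the linked repo. The LoRA training step itself is left out.

```python
def confidence_target(p_correct, step=5):
    """Round a probe probability to a verbal percentage, clamped to [0, 100]."""
    pct = round(p_correct * 100 / step) * step
    return max(0, min(100, pct))

def build_example(question, answer, p_correct):
    """Build one supervised example whose target is the probe-derived confidence."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "How confident are you that this answer is correct (0-100%)?"
    )
    return {"prompt": prompt, "completion": f"{confidence_target(p_correct)}%"}

# Hypothetical probe outputs for two answers the model gave.
examples = [
    build_example("What is the capital of France?", "Paris", 0.91),
    build_example("Who won the 1987 regional chess open?", "J. Smith", 0.33),
]
for ex in examples:
    print(ex["completion"])
```

The point of the construction is that the target differs per example, so the fine-tuned model is no longer rewarded for answering 99% every time.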

I tested on 8 models across 4 families (7B–70B).

Activation patching shows it's actually causal, not just a correlation. If you swap hidden states at the confidence position you can watch confidence shift (ρ = 0.976 layer gradient). If the swap occurs at a random position then nothing happens.
At 70B, the softmax distribution carries valid metacognitive signal but the argmax text is still stuck at 99% confident. The model learned the routing internally but can't get past the text bottleneck.
Seed-level replication across 3 models. The discrimination is stable, but the shape of the confidence distribution is seed-sensitive.
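The 70B observation (signal in the softmax distribution, argmax text stuck at 99%) can be illustrated with a small worked example. This is a sketch under assumptions: the candidate confidence tokens in `LEVELS` and both logit vectors are made up, and reading confidence as a probability-weighted mean is one plausible way to use the distribution, not necessarily the one used in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical confidence tokens the model chooses between.
LEVELS = [50, 70, 90, 99]

def argmax_confidence(logits):
    """What greedy decoding would print."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return LEVELS[best]

def expected_confidence(logits):
    """Probability-weighted mean over the confidence tokens."""
    return sum(p * c for p, c in zip(softmax(logits), LEVELS))

sure = [0.0, 0.5, 1.0, 4.0]    # hypothetical logits: answer the model knows
unsure = [1.5, 1.8, 2.0, 2.2]  # hypothetical logits: answer it is unsure of

for name, logits in [("sure", sure), ("unsure", unsure)]:
    print(name, argmax_confidence(logits), round(expected_confidence(logits), 1))
```

Both cases decode to 99 under argmax, yet the expected confidence is clearly lower for the second one, which is the kind of gap between distribution and text that the post describes.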

I pre-registered this across 2 studies (with noted deviations) and have all my code available at github.com/synthiumjp/metacog-engineering. I tried to make it as rigorous and replicable as possible. The pre-print is here: https://zenodo.org/records/20436841

submitted by /u/Synthium-

[R] Reference model free behavioral discovery of AuditBench model organisms via Probe-Mediated Adaptive Auditing

Anthropic's AuditBench - 56 Llama 3.3 70B models with planted hidden behaviors - their best agent detects the behaviors 10-13% of the time (42% with a super-agent aggregating many parallel runs). a central finding is the "tool-to-agent gap" - white-box interpretability tools that work in standalone evaluation fail to help the agent in practice.

most auditing work uses the base model as a reference to compare against. i wanted to know if you can detect these modifications blind - no reference model, no training data, just the target model itself.

maybe you can? and the method is embarrassingly simple. LoRA fine-tuning tends to modify later layers more than earlier ones. so i train a Ridge regression from early-layer activations (~L12) to late-layer activations (~L60) and look at the residuals. wherever the late layers do something the early layers didn't predict, that's a candidate for planted behavior. the model rats itself out.

results on 4 AuditBench organisms with ground-truth labels, n=30, 200 permutations:
hardcode_test_cases: 0.889 AUROC (p=0.005) - beats known-origin (0.800)
animal_welfare: 0.844 (p=0.005) - close to known-origin (0.911)
anti_ai_regulation: 0.833 (p=0.015)
secret_loyalty: 0.800 (p=0.000) - matches known-origin exactly

3/4 match or exceed the version where you have the base model. not having the reference sometimes helps because the residual captures nonlinear interactions that a flat subtraction misses.

i also tried three unsupervised discovery methods (PCA, residual norms, centroid distances) to figure out WHICH topics are affected. all three failed. PCA captures reasoning style, not topic-specific bias. residual norms surface creative writing prompts. centroid distances can't find directional shifts at n=11 in 8192 dims.

what actually works is even simpler... ask the model to argue both sides of contentious topics and score the balance. the animal_welfare organism scores 5/5 imbalance on factory farming and flat-out refuses to argue for it under explicit framing. control topics score 2-3. no activations needed.

the interesting part is what else it catches. i built a topic funnel (three-level beam-like search over 243 topics, ~100 chat calls) and it surfaced Gender & Sexuality at 5/5 imbalance and Language & Cultural Identity at 5/5 too. these aren't planted LoRA behaviors. that's the base Llama 70B model's RLHF training - it argues more forcefully for progressive positions on socially sensitive topics and hedges or weakens arguments on the other side.

the funnel can't tell the difference between "someone secretly fine-tuned this model" and "the base model has opinions from RLHF." so i built a post-funnel filter using probe specificity and cross-transfer to separate them. planted behaviors are narrow and topic-specific (factory farming probe doesn't fire on gender topics). RLHF biases are broad and fire across all progressive topics. the filter gives planted_score 7.6 for factory farming vs 2.0 for gender & sexuality. cleanly separated.

this means the funnel is potentially useful as a standalone RLHF auditing tool on any model, not just AuditBench organisms. you hand it a model and it tells you where the model's opinions are lopsided. nobody asked for that result but there it is.

main limitations: n=30 is small, only 4 organisms tested (pilot not benchmark), anti_ai_regulation is unstable under LOO, Sonnet scoring introduces subjectivity, all running in NF4 quantization.

building this into a full agentic auditing system next. code is here (i am in the middle of it, it is a complete mess at the moment, but i wanted to get it out there): https://github.com/bmarti44/reference-free-behavioral-discovery

full(er) writeup -> https://bmarti44.substack.com/p/rip-it-out-by-the-roots

where should i go next? is this completely off?
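The early-to-late ridge step described above can be sketched on toy data. This is only an illustration under stated assumptions: the "activations" are small synthetic vectors, the planted behavior is modeled as a fixed offset on one late-layer dimension, and `fit_ridge`, `solve`, and `residual_norm` are hypothetical helpers, not code from the linked repo. Real activations would come from layers around L12 and L60 of the target model.

```python
import random

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row_a[:] + row_b[:] for row_a, row_b in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * c for v, c in zip(M[r], M[col])]
    return [row[n:] for row in M]

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge without intercept: W = (X^T X + lam I)^-1 X^T Y."""
    d, k = len(X[0]), len(Y[0])
    XtX = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    XtY = [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(k)]
           for i in range(d)]
    return solve(XtX, XtY)

def residual_norm(W, x, y):
    """Norm of what the late layer does beyond the early-layer prediction."""
    pred = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(y))]
    return sum((a - b) ** 2 for a, b in zip(y, pred)) ** 0.5

# Toy data (assumption): late = early @ A + noise, plus an offset when "planted".
rng = random.Random(0)
D_IN, D_OUT, N = 4, 3, 60
A = [[rng.gauss(0, 1) for _ in range(D_OUT)] for _ in range(D_IN)]
early = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(N)]
planted = [i % 6 == 0 for i in range(N)]
late = []
for x, is_planted in zip(early, planted):
    y = [sum(x[i] * A[i][j] for i in range(D_IN)) + rng.gauss(0, 0.1)
         for j in range(D_OUT)]
    if is_planted:
        y[0] += 2.0
    late.append(y)

W = fit_ridge(early, late)
norms = [residual_norm(W, x, y) for x, y in zip(early, late)]
mean_planted = sum(n for n, p in zip(norms, planted) if p) / sum(planted)
mean_clean = sum(n for n, p in zip(norms, planted) if not p) / (N - sum(planted))
print(f"mean residual: planted={mean_planted:.2f} clean={mean_clean:.2f}")
```

In this toy setup the prompts that trigger the offset end up with larger residuals, and those residuals are what would then be scored against ground-truth labels.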
submitted by /u/bmarti644
