From Machine Learning

Making LLMs tell you how confident they really are through probe-targeted fine-tuning [R]

Our take

In my latest research, I explore probe-targeted fine-tuning (LoRA) to improve verbal confidence calibration in large language models (LLMs). By probing the hidden states of instruct-tuned LLMs, I found that a probe can distinguish correct from incorrect answers with an AUROC of 0.76–0.88, yet the models often respond with a blanket 99% confidence. This study uses targeted fine-tuning to align verbal outputs with internal knowledge. I tested this approach on eight models and used activation patching to show a causal relationship between hidden states and confidence output.

The recent research on probe-targeted fine-tuning for verbal confidence calibration in large language models (LLMs) addresses a gap between what these models know and what they say. By probing the hidden states of instruct-tuned models, the study finds that a probe can separate correct from incorrect answers at 0.76–0.88 AUROC. When queried directly, however, the same models report roughly 99% confidence almost regardless of correctness, which indicates a disconnect between internal knowledge and external expression. This raises questions about how much weight users should give to the confidence an AI system states about its own output.

Fine-tuning with LoRA (Low-Rank Adaptation) lets researchers train the LLM to verbalize its internal knowledge more accurately. The study reports that a few hundred examples and under ten minutes of training on an M3 Ultra are enough to improve how the model expresses its confidence. This finding points to a practical way to improve AI transparency and user trust. As LLMs become more prevalent in decision-making processes, understanding the nuances of their confidence can help users interpret AI outputs more critically and make better-informed decisions.

Moreover, the research suggests that the link between hidden states and verbal confidence is causal, not merely correlational. Swapping hidden states at the confidence position shifts the stated confidence, while swapping at a random position has no effect. This matters for industries that rely heavily on data-driven insights. For instance, calibrated confidence could change how we approach automated reporting or data synthesis, where it is important to assess not just what the AI knows but how confidently it presents that information. As AI is integrated into everyday tools, findings like these can inform how such systems are built.

Looking ahead, one must ponder the ramifications of this research on user interaction with AI. As the field progresses, will we see a shift towards more transparent AI systems that prioritize user understanding over sheer output? The challenge lies in balancing the complexity of AI with the need for accessibility and trustworthiness. As we navigate this evolving landscape, it is essential for developers and users alike to remain vigilant about how confidence levels are communicated and interpreted. The future of data management must prioritize not only innovative capabilities but also the human experience of engaging with AI, ensuring that technology serves to enhance our understanding rather than obscure it. This development invites us to explore how we can empower users to discern AI's confidence accurately, paving the way for more productive and informed interactions with technology.

Just wanted to share my research regarding probe-targeted fine-tuning (LoRA) for verbal confidence calibration.

If you probe the hidden states of an instruct-tuned LLM, the probe can tell correct from incorrect answers at 0.76–0.88 AUROC. But when you ask the model directly, it tends to respond with 99% confidence for everything. The model knows whether it actually knows, but it won't admit it.
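To make the AUROC comparison concrete, here is a minimal sketch in plain Python. The `auroc` helper and all the numbers are made up for illustration; they are not the study's code or data. It shows why a model that says 99% for everything has no discriminative value (AUROC 0.5) even when a probe on the same answers separates them well.

```python
def auroc(scores, labels):
    """Probability that a random correct answer outscores a random incorrect one.

    Ties count as half a win, so a constant score yields exactly 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical illustration data (NOT results from the study).
labels = [1, 1, 1, 0, 1, 0, 0, 1]                      # 1 = answer was correct
probe_scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.6, 0.2, 0.85]  # probe's P(correct)
verbal_conf = [0.99] * len(labels)                      # model says 99% every time

probe_auroc = auroc(probe_scores, labels)
verbal_auroc = auroc(verbal_conf, labels)
print(f"probe AUROC:  {probe_auroc:.3f}")
print(f"verbal AUROC: {verbal_auroc:.3f}")
```

The pairwise definition is quadratic in the number of answers, which is fine for a few hundred examples; a rank-based implementation would be the usual choice for larger sets.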

I took the probe's output and used it as fine-tuning targets. This teaches the model to say out loud what it already knows internally. LoRA, a few hundred examples, under 10 minutes on an M3 Ultra.
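As a rough sketch of what "probe output as fine-tuning targets" could look like, the snippet below turns a probe probability into a prompt/completion pair. The prompt wording, the `make_target` name, the percentage format, and the example rows are all assumptions for illustration; the actual format is in the linked repo and may differ.

```python
def make_target(question, answer, probe_prob):
    """Build one supervised example whose completion is the probe's confidence.

    Hypothetical format: the probe's P(correct) is clamped to [0, 1] and
    verbalized as a whole-number percentage.
    """
    clamped = min(max(probe_prob, 0.0), 1.0)
    pct = round(clamped * 100)
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "How confident are you that this answer is correct (0-100%)?"
    )
    return {"prompt": prompt, "completion": f"{pct}%"}


# Made-up rows: (question, model's answer, probe's P(correct)).
rows = [
    ("What is the capital of France?", "Paris", 0.91),
    ("Who wrote the novel in question?", "Unknown author", 0.35),
]
dataset = [make_target(q, a, p) for q, a, p in rows]
for example in dataset:
    print(example["completion"])
```

A dataset in this shape could then be handed to whatever LoRA training setup is in use; that step is omitted here because it depends on the framework.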

I tested on 8 models across 4 families (7B–70B).

  • Activation patching shows it's actually causal, not just a correlation. If you swap hidden states at the confidence position, you can watch confidence shift (ρ = 0.976 layer gradient). If the swap occurs at a random position, nothing happens.
  • At 70B, the softmax distribution carries valid metacognitive signal, but the argmax text is still stuck at 99% confident. The model learned the routing internally but can't get past the text bottleneck.
  • Seed-level replication across 3 models. The discrimination is stable, but the shape of the confidence distribution is seed-sensitive.
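The softmax-versus-argmax point from the 70B finding can be illustrated with a toy example. The confidence tokens and logits below are invented, not taken from any model: in both cases the most likely token is "99", so the decoded text is identical, but the probability-weighted confidence differs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(conf_values, logits):
    """Return (argmax confidence, expected confidence under the softmax)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    expected = sum(c * p for c, p in zip(conf_values, probs))
    return conf_values[best], expected


# Hypothetical confidence tokens and logits (illustration only).
conf_values = [10, 50, 90, 99]
logits_known = [-2.0, -1.0, 1.0, 3.0]   # mass concentrated on "99"
logits_unsure = [0.5, 1.0, 1.2, 1.5]    # "99" still wins, but only barely

argmax_known, expected_known = summarize(conf_values, logits_known)
argmax_unsure, expected_unsure = summarize(conf_values, logits_unsure)
print(argmax_known, round(expected_known, 1))
print(argmax_unsure, round(expected_unsure, 1))
```

Greedy decoding only ever surfaces the argmax, which is one way a usable signal in the distribution can fail to reach the text.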

I pre-registered this across 2 studies (with noted deviations) and have all my code available (Code: github.com/synthiumjp/metacog-engineering). I tried to make it as rigorous and replicable as possible. The pre-print is here: https://zenodo.org/records/20436841

submitted by /u/Synthium-