April 3, 2026•5 min read•from Machine Learning

[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go

Our take

I recently trained a Mamba-3 log anomaly detector that achieved an impressive F1 score of 0.9975 on the HDFS benchmark, significantly improving from an initial 60% effectiveness. This project, which utilized a novel template-based tokenization approach, not only enhanced performance but also streamlined training time to about 36 minutes. With remarkable precision and recall rates, the model demonstrates the potential of AI in log analysis. I’m excited to explore its applications further and invite insights on the direction of this work and future benchmarks.

[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go

Experiment #324 ended well. ;)

This time I built a small project around log anomaly detection. In about two days, I went from roughly 60% effectiveness in the first runs to a final F1 score of 0.9975 on the HDFS benchmark.

Under my current preprocessing and evaluation setup, LogAI reaches F1=0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study.

What that means in practice:

on 3,368 anomalous sessions in the test set, it missed about 9 (recall = 0.9973)
on roughly 112k normal sessions, it raised only about 3 false alarms (precision = 0.9976)

What I find especially interesting is that this is probably the first log anomaly detection model built on top of Mamba-3 / SSM, which was only published a few weeks ago.

The model is small:

4.9M parameters
trains in about 36 minutes on an RTX 4090
needs about 1 GB of GPU memory
inference is below 2 ms on a single consumer GPU, so over 500 log events/sec

For comparison, my previous approach took around 20 hours to train.

The dataset here is the classic HDFS benchmark from LogHub / Zenodo, based on Amazon EC2 logs:

11M+ raw log lines
575,061 sessions
16,838 anomalous sessions (2.9%)

This benchmark has been used in a lot of papers since 2017, so it’s a useful place to test ideas.

The part that surprised me most was not just the score, but what actually made the difference.

I started with a fairly standard NLP-style approach:

BPE tokenizer
relatively large model, around 40M parameters

That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough.

The breakthrough came when I stopped treating logs like natural language.

Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type.

So instead of feeding the model something like text, I feed it sequences like this:

[5, 3, 7, 5, 5, 3, 12, 12, 5, ...]

Where for example:

"Receiving block blk_123 from 10.0.0.1" - Template #5
"PacketResponder 1 terminating" - Template #3
"Unexpected error deleting block blk_456" - Template #12

That one change did a lot at once:

vocabulary dropped from about 8000 to around 50
model size shrank by roughly 10x
training went from hours to minutes
and, most importantly, the overfitting problem mostly disappeared

The second important change was matching the classifier head to the architecture. Mamba is causal, so the last token carries a compressed summary of the sequence context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped.

The training pipeline was simple:

Pretrain (next-token prediction): the model only sees normal logs and learns what “normal” looks like
Finetune (classification): the model sees labeled normal/anomalous sessions
Test: the model gets unseen sessions and predicts normal vs anomaly

Data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model did not see during training.

Another useful thing is that the output is not just binary. The model gives a continuous anomaly score from 0 to 1.

So in production this could be used with multiple thresholds, for example:

> 0.7 = warning
> 0.95 = critical

Or with an adaptive threshold that tracks the baseline noise level of a specific system.

A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That’s not exactly new - a lot of AI labs started with games, and many still do - but it’s satisfying to see it work in practice.

Also, I definitely did not get here alone. This is a combination of:

reading a lot of papers
running automated experiment loops
challenging AI assistants instead of trusting them blindly
and then doing my own interpretation and tuning

Very rough split:

50% reading papers and extracting ideas
30% automated hyperparameter / experiment loops
20% manual tuning and changes based on what I learned

Now I’ll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I may push it further first on BGL, Thunderbird, or Spirit.

Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough.

Curious what people here think:

does this direction look genuinely promising to you?
has anyone else tried SSMs / Mamba for log modeling?
and which benchmark would you hit next: BGL, Thunderbird, or Spirit?

If there’s interest, I can also share more about the preprocessing, training loop, and the mistakes that got me stuck at 60-70% before it finally clicked.

P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better than before.

https://preview.redd.it/3hrr4prgbzsg1.png?width=1794&format=png&auto=webp&s=d50ff21226e9aa97c2c0bbefed77be5dd8389cb8

submitted by /u/Adam_Jesion
[link] [comments]

Read on the original site

Open the publisher's page for the full experience

View original article →

[R] Reference model free behavioral discovery of AudiBench model organisms via Probe-Mediated Adaptive AuditingAnthropic's AuditBench - 56 Llama 3.3 70B models with planted hidden behaviors - their best agent detects the behaviros 10-13% of the time (42% with a super-agent aggregating many parallel runs). a central finding is the "tool-to-agent gap" - white-box interpretability tools that work in standalone evaluation fail to help the agent in practice. most auditing work uses the base model as a reference to compare against. i wanted to know if you can detect these modifications blind - no reference model, no training data, just the target model itself. maybe you can? and the method is embarrassingly simple. LoRA fine-tuning tends to modify later layers more than earlier ones. so i train a Ridge regression from early-layer activations (~L12) to late-layer activations (~L60) and look at the residuals. wherever the late layers do something the early layers didn't predict, that's a candidate for planted behavior. the model rats itself out. results on 4 AuditBench organisms with ground-truth labels, n=30, 200 permutations: hardcode_test_cases: 0.889 AUROC (p=0.005) - beats known-origin (0.800) animal_welfare: 0.844 (p=0.005) - close to known-origin (0.911) anti_ai_regulation: 0.833 (p=0.015) secret_loyalty: 0.800 (p=0.000) - matches known-origin exactly 3/4 match or exceed the version where you have the base model. not having the reference sometimes helps because the residual captures nonlinear interactions that a flat subtraction misses. i also tried three unsupervised discovery methods (PCA, residual norms, centroid distances) to figure out WHICH topics are affected. all three failed. PCA captures reasoning style, not topic-specific bias. residual norms surface creative writing prompts. centroid distances can't find directional shifts at n=11 in 8192 dims. what actually works is even simpler... ask the model to argue both sides of contentious topics and score the balance. the animal_welfare organism scores 5/5 imbalance on factory farming and flat-out refuses to argue for it under explicit framing. control topics score 2-3. no activations needed. the interesting part is what else it catches. i built a topic funnel (three-level beam-like search over 243 topics, ~100 chat calls) and it surfaced Gender & Sexuality at 5/5 imbalance and Language & Cultural Identity at 5/5 too. these aren't planted LoRA behaviors. that's the base Llama 70B model's RLHF training - it argues more forcefully for progressive positions on socially sensitive topics and hedges or weakens arguments on the other side. the funnel can't tell the difference between "someone secretly fine-tuned this model" and "the base model has opinions from RLHF." so i built a post-funnel filter using probe specificity and cross-transfer to separate them. planted behaviors are narrow and topic-specific (factory farming probe doesn't fire on gender topics). RLHF biases are broad and fire across all progressive topics. the filter gives planted_score 7.6 for factory farming vs 2.0 for gender & sexuality. cleanly separated. this means the funnel is potentially useful as a standalone RLHF auditing tool on any model, not just AuditBench organisms. you hand it a model and it tells you where the model's opinions are lopsided. nobody asked for that result but there it is. main limitations: n=30 is small, only 4 organisms tested (pilot not benchmark), anti_ai_regulation is unstable under LOO, Sonnet scoring introduces subjectivity, all running in NF4 quantization. building this into a full agentic auditing system next. code is here (i am in the middle of it, it is a complete mess at the moment, but i wanted to get it out there): https://github.com/bmarti44/reference-free-behavioral-discovery full (er) writeup -> https://bmarti44.substack.com/p/rip-it-out-by-the-roots where should i go next? is this completely off? submitted by /u/bmarti644 [link] [comments]

Tagged with

#automated anomaly detection#natural language processing for spreadsheets#generative AI for data analysis#Excel alternatives for data analysis#financial modeling with spreadsheets#cloud-based spreadsheet applications#large dataset processing#rows.com#real-time data collaboration#natural language processing#big data management in spreadsheets#enterprise-level spreadsheet solutions#conversational data analysis#financial modeling#automation in spreadsheet workflows#intelligent data visualization#real-time collaboration#data visualization tools#enterprise data management#big data performance

[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go

Related Articles

Tagged with