5 min readfrom Machine Learning

[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go

Our take

I recently trained a Mamba-3 log anomaly detector that achieved an impressive F1 score of 0.9975 on the HDFS benchmark, significantly improving from an initial 60% effectiveness. This project, which utilized a novel template-based tokenization approach, not only enhanced performance but also streamlined training time to about 36 minutes. With remarkable precision and recall rates, the model demonstrates the potential of AI in log analysis. I’m excited to explore its applications further and invite insights on the direction of this work and future benchmarks.
[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go
[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go

Experiment #324 ended well. ;)

This time I built a small project around log anomaly detection. In about two days, I went from roughly 60% effectiveness in the first runs to a final F1 score of 0.9975 on the HDFS benchmark.

Under my current preprocessing and evaluation setup, LogAI reaches F1=0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study.

What that means in practice:

  • on 3,368 anomalous sessions in the test set, it missed about 9 (recall = 0.9973)
  • on roughly 112k normal sessions, it raised only about 3 false alarms (precision = 0.9976)

What I find especially interesting is that this is probably the first log anomaly detection model built on top of Mamba-3 / SSM, which was only published a few weeks ago.

The model is small:

  • 4.9M parameters
  • trains in about 36 minutes on an RTX 4090
  • needs about 1 GB of GPU memory
  • inference is below 2 ms on a single consumer GPU, so over 500 log events/sec

For comparison, my previous approach took around 20 hours to train.

The dataset here is the classic HDFS benchmark from LogHub / Zenodo, based on Amazon EC2 logs:

  • 11M+ raw log lines
  • 575,061 sessions
  • 16,838 anomalous sessions (2.9%)

This benchmark has been used in a lot of papers since 2017, so it’s a useful place to test ideas.

The part that surprised me most was not just the score, but what actually made the difference.

I started with a fairly standard NLP-style approach:

  • BPE tokenizer
  • relatively large model, around 40M parameters

That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough.

The breakthrough came when I stopped treating logs like natural language.

Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type.

So instead of feeding the model something like text, I feed it sequences like this:

[5, 3, 7, 5, 5, 3, 12, 12, 5, ...]

Where for example:

  • "Receiving block blk_123 from 10.0.0.1" - Template #5
  • "PacketResponder 1 terminating" - Template #3
  • "Unexpected error deleting block blk_456" - Template #12

That one change did a lot at once:

  • vocabulary dropped from about 8000 to around 50
  • model size shrank by roughly 10x
  • training went from hours to minutes
  • and, most importantly, the overfitting problem mostly disappeared

The second important change was matching the classifier head to the architecture. Mamba is causal, so the last token carries a compressed summary of the sequence context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped.

The training pipeline was simple:

  • Pretrain (next-token prediction): the model only sees normal logs and learns what “normal” looks like
  • Finetune (classification): the model sees labeled normal/anomalous sessions
  • Test: the model gets unseen sessions and predicts normal vs anomaly

Data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model did not see during training.

Another useful thing is that the output is not just binary. The model gives a continuous anomaly score from 0 to 1.

So in production this could be used with multiple thresholds, for example:

  • > 0.7 = warning
  • > 0.95 = critical

Or with an adaptive threshold that tracks the baseline noise level of a specific system.

A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That’s not exactly new - a lot of AI labs started with games, and many still do - but it’s satisfying to see it work in practice.

Also, I definitely did not get here alone. This is a combination of:

  • reading a lot of papers
  • running automated experiment loops
  • challenging AI assistants instead of trusting them blindly
  • and then doing my own interpretation and tuning

Very rough split:

  • 50% reading papers and extracting ideas
  • 30% automated hyperparameter / experiment loops
  • 20% manual tuning and changes based on what I learned

Now I’ll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I may push it further first on BGL, Thunderbird, or Spirit.

Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough.

Curious what people here think:

  • does this direction look genuinely promising to you?
  • has anyone else tried SSMs / Mamba for log modeling?
  • and which benchmark would you hit next: BGL, Thunderbird, or Spirit?

If there’s interest, I can also share more about the preprocessing, training loop, and the mistakes that got me stuck at 60-70% before it finally clicked.

P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better than before.

https://preview.redd.it/3hrr4prgbzsg1.png?width=1794&format=png&auto=webp&s=d50ff21226e9aa97c2c0bbefed77be5dd8389cb8

submitted by /u/Adam_Jesion
[link] [comments]

Read on the original site

Open the publisher's page for the full experience

View original article

Related Articles

Tagged with

#automated anomaly detection#natural language processing for spreadsheets#generative AI for data analysis#Excel alternatives for data analysis#financial modeling with spreadsheets#cloud-based spreadsheet applications#large dataset processing#rows.com#real-time data collaboration#natural language processing#big data management in spreadsheets#enterprise-level spreadsheet solutions#conversational data analysis#financial modeling#automation in spreadsheet workflows#intelligent data visualization#real-time collaboration#data visualization tools#enterprise data management#big data performance