From Machine Learning

Isolation Forest + eBPF events to create a Linux-based endpoint detection system [P]

Hey everyone. I’ve been working on a machine learning project called guardd and wanted to get some feedback on the ML side of it.

It’s basically a host-based anomaly detection system for Linux using Isolation Forest. I’m collecting exec and network events, grouping them into 60-second windows, then turning each window into a feature vector that gets scored by the model.
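To make the windowing step concrete, here is a minimal sketch of bucketing events into fixed 60-second windows. The event shape (a dict with `ts`, `kind`, and `comm` keys) and the `bucket_events` helper are assumptions for illustration, not guardd's actual data model.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # window length described in the post


def bucket_events(events, window_seconds=WINDOW_SECONDS):
    """Group events into fixed, non-overlapping windows keyed by window start time."""
    windows = defaultdict(list)
    for event in events:
        # Floor the timestamp to the start of its window.
        start = int(event["ts"] // window_seconds) * window_seconds
        windows[start].append(event)
    return dict(sorted(windows.items()))


# Hypothetical events; "ts" is seconds since some reference point.
events = [
    {"ts": 5.0, "kind": "exec", "comm": "bash"},
    {"ts": 42.5, "kind": "net", "comm": "curl"},
    {"ts": 61.0, "kind": "exec", "comm": "ls"},
]
windows = bucket_events(events)
```

Fixed windows are the simplest option. Sliding or overlapping windows would give smoother scores at the cost of correlated samples.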

Right now the features are things like counts of exec and network events, how many unique processes, files, IPs and ports show up in a window, some parent-child relationship patterns, a few simple ratios between features, and also some “new vs baseline” tracking like processes or relationships that weren’t seen during training.
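As a rough sketch of how one window could become a feature vector along these lines: the field names (`kind`, `comm`, `parent`, `dst_ip`, `dst_port`), the `window_features` helper, and the specific ratio are all hypothetical, and file features are left out to keep it short.

```python
def window_features(events, baseline_procs, baseline_edges):
    """Turn one window of events into a flat feature dict."""
    execs = [e for e in events if e["kind"] == "exec"]
    nets = [e for e in events if e["kind"] == "net"]

    procs = {e["comm"] for e in events}
    edges = {(e["parent"], e["comm"]) for e in execs}  # parent -> child pairs
    ips = {e["dst_ip"] for e in nets}
    ports = {e["dst_port"] for e in nets}

    return {
        "exec_count": len(execs),
        "net_count": len(nets),
        "unique_procs": len(procs),
        "unique_ips": len(ips),
        "unique_ports": len(ports),
        "unique_edges": len(edges),
        # +1 in the denominator avoids division by zero in exec-free windows.
        "net_per_exec": len(nets) / (len(execs) + 1),
        # "New vs baseline": things never seen while training.
        "new_procs": len(procs - baseline_procs),
        "new_edges": len(edges - baseline_edges),
    }


# Hypothetical window and baseline sets.
window = [
    {"kind": "exec", "comm": "bash", "parent": "sshd"},
    {"kind": "exec", "comm": "curl", "parent": "bash"},
    {"kind": "net", "comm": "curl", "dst_ip": "10.0.0.5", "dst_port": 443},
    {"kind": "net", "comm": "curl", "dst_ip": "10.0.0.5", "dst_port": 80},
]
feats = window_features(
    window,
    baseline_procs={"bash", "sshd"},
    baseline_edges={("sshd", "bash")},
)
```

Here `curl` and the `bash -> curl` edge were not in the baseline, so both "new" counters come out as 1 for this window.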

Training is fully unsupervised. It collects baseline data, trains an Isolation Forest, then uses score_samples during detection. The threshold is just based on a percentile from the training score distribution.
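For reference, a minimal sketch of that train-then-threshold flow with scikit-learn. The synthetic baseline, the 1st-percentile cutoff, and the `is_anomalous` helper are assumptions for illustration; the post does not say which percentile guardd uses.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for baseline feature vectors (500 windows, 4 features).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))

model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# score_samples: lower scores mean more anomalous.
train_scores = model.score_samples(X_train)

# Assumed cutoff: flag anything scoring below the 1st percentile of training scores.
threshold = np.percentile(train_scores, 1.0)


def is_anomalous(X):
    """Return a boolean array, True where a window scores below the threshold."""
    return model.score_samples(X) < threshold


# One window near the center of the baseline, one far outside it.
X_new = np.vstack([np.zeros(4), np.full(4, 8.0)])
new_scores = model.score_samples(X_new)
flags = is_anomalous(X_new)
```

With a percentile threshold, roughly that fraction of the baseline itself falls below the cutoff by construction, so the percentile effectively sets the expected false positive rate on data that looks like training.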

The main issue right now is false positives, especially from stuff like browsers. Anything with a lot of variance can end up looking anomalous depending on what ended up in the baseline, so the model is pretty sensitive to training data.

Next I’m looking at adding some time-based features like time of day or activity patterns, improving normalization a bit, and trying to handle bursty behavior better.
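Two common options for these, sketched under assumptions: a cyclic sine/cosine encoding for time of day, and a log transform to dampen bursty counts. The helper names are hypothetical, and `ts` is assumed to be a Unix timestamp interpreted in UTC.

```python
import math

SECONDS_PER_DAY = 86400


def time_of_day_features(ts):
    """Encode time of day on a circle so 23:59 and 00:00 end up close together."""
    angle = 2 * math.pi * (ts % SECONDS_PER_DAY) / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)


def squash_count(n):
    """Compress heavy-tailed counts: log1p(0) is 0, and large bursts grow slowly."""
    return math.log1p(n)


midnight = time_of_day_features(0)
just_before_midnight = time_of_day_features(SECONDS_PER_DAY - 60)
six_am = time_of_day_features(6 * 3600)

quiet = squash_count(3)
burst = squash_count(1000)
```

A raw hour-of-day number would put 23 and 0 at opposite ends of the scale. The sine/cosine pair keeps them adjacent. The log transform means a browser opening 1000 connections in a window is no longer hundreds of times "further out" than one opening 3.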

Curious what people think about feature design for this kind of data, how to make Isolation Forest less sensitive to noisy but normal behavior, and whether staying fully unsupervised makes sense here or if moving toward something more hybrid would be better.

Would appreciate any thoughts on the approach.

Repo is here: https://github.com/benny-e/guardd.git

submitted by /u/No-Insurance-4417