5 min read · from Machine Learning

Backcasting forecast errors: model collapsing to mean [P]

Our take

Backcasting forecast errors in time series analysis can be complex, especially when dealing with multiple horizons. In this project, I'm tackling daily forecasts spanning from 2020 to 2026, aiming to predict the difference between forecasted and actual values prior to 2020. Despite employing various features and a Random Forest model, my predictions are collapsing toward the mean, resulting in low variance and poor tail estimation. I'm looking for insights or strategies to enhance model performance and capture more meaningful signal in the data.

Hey everyone,

I am kind of desperate for help right now on my current project. I'll try and be as clear as possible.

I'm working on a time series backcasting problem. The values I want to backcast are forecasts (not ML forecasts — think weather forecasts) at different horizons (1 to 14). So to be clear: at a date D, I have 14 forecasts (for D+1, ..., D+14). I have such forecasts from 2020 to 2026. Each date therefore appears as a block of 14 rows, one per horizon, and each (date, horizon) key is unique and maps to its own target_date. I hope this is clear enough.

So the goal is to backcast those forecasts before 2020 (say 2019-2020 for simplicity). Besides the forecast values and the horizon column, I have "actuals", which are the true measured values for a particular variable (say temperature), and "normals", which is a smooth curve representing the climatology norm for a given date. This "normals" column captures the seasonality, trend, and other repetitive, predictable patterns.

So, to be clear, I have:

* dates (of forecast emission) | actuals | normals | horizon | forecasts *

And to really emphasise this point: dates, actuals and normals are the same for 14 consecutive rows (one row per horizon).

The target I want to predict is the following: forecast - actual_at_target_date

So I want to predict the true error observed (say the forecast was 20 and the measured actual is 18, then my target is +2).
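To make the target construction concrete, here is a minimal pandas sketch under my assumed column names (forecast_date, horizon, forecast, actual): the actual to subtract is the one observed on the target date, which can be looked up by merging the per-date actuals back onto target_date.

```python
import pandas as pd

# Hypothetical long-format frame: one row per (forecast_date, horizon);
# `actual` is the value measured on forecast_date, repeated across horizons.
df = pd.DataFrame({
    "forecast_date": pd.to_datetime(["2023-01-01"] * 2 + ["2023-01-02"] * 2),
    "horizon":       [1, 2, 1, 2],
    "forecast":      [20.0, 21.0, 19.0, 22.0],
    "actual":        [18.0, 18.0, 17.0, 17.0],
})

# Each (date, horizon) pair maps to its own target_date.
df["target_date"] = df["forecast_date"] + pd.to_timedelta(df["horizon"], unit="D")

# Look up the actual observed ON the target date by self-merging the
# one-row-per-date actuals onto target_date.
actuals = df[["forecast_date", "actual"]].drop_duplicates().rename(
    columns={"forecast_date": "target_date", "actual": "actual_at_target"}
)
df = df.merge(actuals, on="target_date", how="left")

# target = forecast - actual at the target date (NaN where the actual
# for that target date is not yet in the frame).
df["target"] = df["forecast"] - df["actual_at_target"]
```

Rows whose target date falls outside the available actuals come out as NaN, which also makes missing-actual leakage easy to spot.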

So far, I've done the following :

- Transform target to remove annual seasonality, long-term trend and level-scaling

- Engineered classic features such as anomaly (actual-normal), lagged anomalies, rolling stats (std, mean, median, quantiles)

- Engineered target encoding features such as target_encoding_horizon_x_month

- RandomForest with max_depth 10-15, min_samples_leaf 10, max_features "sqrt", n_estimators 300
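The steps above can be sketched end to end. This is a toy version with invented column names and random data, not my actual pipeline; note also that in the real frame the rolling stats should be computed on the one-row-per-emission-date series and then broadcast to the 14 horizon rows, whereas this toy simply rolls over rows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Toy stand-in for the real data (all names are assumptions).
df = pd.DataFrame({
    "horizon": rng.integers(1, 15, size=n),
    "actual":  rng.normal(15, 5, size=n),
    "normal":  rng.normal(15, 1, size=n),
})

# Classic engineered features: anomaly, lagged anomaly, rolling stats.
df["anomaly"] = df["actual"] - df["normal"]
df["anomaly_lag1"]  = df["anomaly"].shift(1)
df["anomaly_roll7"] = df["anomaly"].rolling(7).mean()
df["anomaly_std7"]  = df["anomaly"].rolling(7).std()
df["target"] = rng.normal(0, 2, size=n)  # placeholder forecast error

features = ["horizon", "anomaly", "anomaly_lag1", "anomaly_roll7", "anomaly_std7"]
train = df.dropna()  # drop warm-up rows from shift/rolling

# Hyperparameters as described in the post.
model = RandomForestRegressor(
    n_estimators=300, max_depth=12, min_samples_leaf=10,
    max_features="sqrt", random_state=0,
).fit(train[features], train["target"])
pred = model.predict(train[features])
```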

My train/validation folds are reversed in time so that evaluation mirrors the backcasting setup. I made sure there is no leakage.

FINALLY:

My main problem is that, even after trying a LOT of feature combinations and a LOT of tuning, my predictions are very shallow and shrink toward the mean (the std and the q10/q90 quantiles are off by a lot). Given that I'm predicting forecast_error, which is centered on 0, I'm starting to think I'm only capturing noise, because my predictions don't fit anything. MAE gets worse at higher horizons, which is only natural, but even at horizon 1 my prediction is about as good, MAE-wise, as predicting all 0s. If anyone has ideas I can explore on my own, I would be so grateful. I know you don't have all the details here, but if you have experience with backcasting and have recommendations, I'd really appreciate them.
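The two symptoms above (zero-baseline MAE, shrunken spread) can be checked with a couple of lines. A sketch on synthetic arrays standing in for my validation targets and predictions:

```python
import numpy as np

# Synthetic stand-ins: targets centered on 0, predictions shrunk toward 0.
rng = np.random.default_rng(1)
y_true = rng.normal(0.0, 2.0, size=1000)
y_pred = 0.4 * y_true + rng.normal(0.0, 0.3, size=1000)

mae_model = np.mean(np.abs(y_true - y_pred))
mae_zero  = np.mean(np.abs(y_true))       # baseline: always predict 0
dispersion = y_pred.std() / y_true.std()  # under-dispersion ratio

# If mae_model is not clearly below mae_zero, the model adds little
# beyond the unconditional mean of the target.
print(f"MAE model={mae_model:.2f}  MAE zero={mae_zero:.2f}  "
      f"std ratio={dispersion:.2f}")
```

Tracking both numbers per fold makes "as good as predicting 0s" a measurable claim rather than an impression.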

Hey everyone,

I'm working on a time series backcasting problem and I'm running into a fairly stubborn issue. I'd really appreciate any insights from people who have worked on similar setups.

Problem setup

I have daily-issued forecasts with multiple horizons:

  • At each date D, I have forecasts for D+1, ..., D+14
  • Data spans 2020–2026
  • Each row is a unique (forecast_date, horizon) pair

Toy example:

forecast_date  horizon  target_date  forecast  actual  normal
2023-01-01        1     2023-01-02        20      18      19
2023-01-01        2     2023-01-03        21      20      19
...             ...     ...              ...     ...     ...
2023-01-01       14     2023-01-15        25      23      20

Important:

  • forecast_date, actual, and normal are identical across the 14 horizons
  • Only horizon, target_date, and forecast vary

Objective

I want to backcast forecast errors before 2020.

Target:

target = forecast − actual(target_date) 

So if forecast = 20 and actual = 18 → target = +2.

Features

  • forecast, horizon
  • actual, normal
  • anomaly = actual − normal
  • lagged anomalies
  • rolling stats (mean, std, quantiles)
  • target encoding (e.g. horizon × month)
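The last feature, the horizon × month target encoding, looks like the sketch below (toy values, assumed column names). One caveat worth stating: the encoding must be fit on the training period only, or out-of-fold, otherwise it leaks the target it is supposed to help predict.

```python
import pandas as pd

# Toy frame: target-encode the mean error per (horizon, month) cell.
df = pd.DataFrame({
    "horizon": [1, 1, 2, 2, 1],
    "month":   [1, 1, 1, 1, 2],
    "target":  [2.0, 4.0, -1.0, 1.0, 0.0],
})

# Mean target per (horizon, month), fit on training rows only in practice.
enc = (
    df.groupby(["horizon", "month"])["target"]
      .mean()
      .rename("te_horizon_month")
)

# Broadcast the encoding back onto the rows via the MultiIndex.
df = df.join(enc, on=["horizon", "month"])
```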

Model

Random Forest:

  • max_depth: 10–15
  • min_samples_leaf: 10
  • max_features: sqrt
  • n_estimators: 300

Validation

  • Time-based splits adapted for backcasting
  • No leakage (checked carefully)
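The reversed split scheme can be made concrete as below. This is a minimal sketch of what I mean by "adapted for backcasting", with assumed yearly fold boundaries: each fold validates on an early year and trains strictly on later data, mirroring prediction backward in time.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", "2025-12-31", freq="D")

def reversed_time_folds(dates, n_folds=3):
    """Yield (train_idx, val_idx): validate on an early year,
    train only on strictly later years."""
    years = np.sort(dates.year.unique())
    for i in range(n_folds):
        val_year = years[i]                 # earliest years become validation
        train_mask = dates.year > val_year  # train strictly on later data
        val_mask = dates.year == val_year
        yield np.where(train_mask)[0], np.where(val_mask)[0]

for train_idx, val_idx in reversed_time_folds(dates):
    # Every training date is later than every validation date: no leakage
    # from the past into the "future" training set of this reversed setup.
    assert dates[train_idx].min() > dates[val_idx].max()
```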

Main issue

Predictions are very shallow and collapse toward 0:

  • Very low variance
  • Poor estimation of tails (q10 / q90)
  • Even for horizon = 1, performance is close to predicting constant 0 (in MAE)

MAE increases with horizon (expected), but overall performance remains weak.

Diagnostics

  • std(predictions) / std(target) ≈ 0.4 at best
  • This ratio decreases with horizon

So the model is clearly under-dispersed.
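The per-horizon version of that diagnostic is a one-liner. A sketch on synthetic data (column names assumed) built so that, as in my results, the dispersion ratio shrinks with horizon:

```python
import numpy as np
import pandas as pd

# Synthetic validation set: predictions hedge harder at longer horizons.
rng = np.random.default_rng(2)
n = 1400
horizon = rng.integers(1, 15, size=n)
y_true = rng.normal(0.0, 1.0 + 0.1 * horizon)
y_pred = (0.5 - 0.03 * horizon) * y_true  # shrinkage grows with horizon

val = pd.DataFrame({"horizon": horizon, "y_true": y_true, "y_pred": y_pred})

# std(prediction) / std(target) per horizon: a ratio well below 1 that
# keeps falling with horizon means the model collapses toward the mean
# exactly where uncertainty grows.
ratio = (
    val.groupby("horizon")[["y_true", "y_pred"]]
       .apply(lambda g: g["y_pred"].std() / g["y_true"].std())
)
```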

Interpretation

At this point I suspect:

  • either the signal is very weak
  • or the model is too conservative and fails to capture amplitude

Any help, feedback, or ideas to explore would be greatly appreciated.

Thanks a lot.

submitted by /u/Ambitious-Log-5255

