From: Jérôme Benoit
Date: Sun, 8 Feb 2026 23:55:14 +0000 (+0100)
Subject: feat(quickadapter): add TimeSeriesSplit as alternative data splitting method (#48)
X-Git-Url: https://git.piment-noir.org/?a=commitdiff_plain;h=c927facc96d966e766924267bf2d13cb2e84e302;p=freqai-strategies.git

feat(quickadapter): add TimeSeriesSplit as alternative data splitting method (#48)

* feat(model): add TimeSeriesSplit support via train() override

* docs(readme): document TimeSeriesSplit configuration options

* fix: address PR review comments

  - Use test_size parameter in TimeSeriesSplit
  - Remove unused dk parameter from _make_timeseries_split_datasets()
  - Assign dk.data_dictionary = dd before logging
  - Fix typo: train_test_test -> train_test_split in README

* docs: integrate data_split_parameters into tunables table

  Remove standalone section and add parameters to existing table with
  freqai. prefix for consistency.

* refactor: use FreqAI APIs for weight calculation and data dictionary

  - Use dk.set_weights_higher_recent() instead of duplicating weight formula
  - Use dk.build_data_dictionary() for consistent data structure
  - Respects feature_parameters.weight_factor configuration
  - Fix bug: was using data_kitchen_thread_count instead of weight_factor

* refactor: extract _apply_pipelines() to reduce code duplication

  - Move pipeline definition and application logic to helper method
  - Reduces train() override complexity while keeping same behavior
  - Helper can be reused by future custom split implementations

* style: harmonize namespace and remove inline comments

  - Rename DATA_SPLIT_METHODS to _DATA_SPLIT_METHODS (private tuple pattern)
  - Reference DATA_SPLIT_METHOD_DEFAULT from _DATA_SPLIT_METHODS[0]
  - Remove 22 inline comments to match self-documenting codebase style

* fix: align TimeSeriesSplit weight calculation with FreqAI semantics

  Calculate weights on combined train+test set before splitting to maintain
  temporal weight continuity, matching FreqAI's make_train_test_datasets
  behavior.
* feat: add gap=0 warning and improve TimeSeriesSplit validation

  - Warn when gap=0 about look-ahead bias risk (reference label_period_candles)
  - Add _compute_timeseries_min_samples() for accurate minimum sample calculation
  - Account for gap and test_size in minimum sample validation
  - Improve error message with all relevant parameters

* style: harmonize error messages with codebase conventions

  - Use 'Invalid {param} value {value!r}: {constraint}' pattern
  - Align with existing validation error format (lines 718, 1145)

* style: add cached set accessor for data split methods

  - Add _data_split_methods_set() with @staticmethod @lru_cache
  - Use QuickAdapterRegressorV3 prefix for class attribute access
  - Use cached set for O(1) membership check in validation

* fix: address PR review comments for TimeSeriesSplit

  - Use dd consistently in training logs instead of dk.data_dictionary
  - Use self.data_split_parameters consistently in _apply_pipelines
  - Add explicit type coercion for n_splits, gap, max_train_size
  - Add validation for gap >= 0 and max_train_size >= 1
  - Improve test_size validation: float in (0,1) as fraction, int >= 1 as count
  - Fix _compute_timeseries_min_samples formula: (n_splits+1)*test_size + n_splits*gap
  - Optimize tscv.split() iteration to avoid unnecessary list materialization

* fix: correct min_samples formula to match sklearn validation

  sklearn validates: n_samples - gap - (test_size * n_splits) > 0
  Correct formula: test_size * n_splits + gap + 1

* feat: auto-calculate TimeSeriesSplit gap from label_period_candles

  When gap=0 is configured, automatically set gap to label_period_candles
  to prevent look-ahead bias from overlapping label windows. This ensures
  temporal separation between train and test sets without requiring manual
  configuration.

* fix: remove redundant time import shadowing module

* fix: correct min_samples formula for dynamic test_size and document test_size param

* refactor: remove redundant TimeSeriesSplit min_samples validation

* docs: clarify test_size default per split method

* refactor: move DependencyException import to file header

* style: use class name for class constant access

* Apply suggestion from @Copilot

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* docs: use Python None instead of null in README

* docs: fix train_test_split description (sequential, not random)

* fix: use explicit None check for max_train_size validation

* docs: clarify timeseries_split as chronological split, not cross-validation

* refactor(quickadapter): shorten log prefixes and tailor empty test set error by split method

* refactor(quickadapter): use index pattern for timeseries_split method constant

  Replace string literals with index access pattern following existing
  codebase convention for _DATA_SPLIT_METHODS. Also renames variables for
  semantic clarity:

  - test_size_param -> test_size
  - feat_dict -> feature_parameters

* refactor(quickadapter): use _TEST_SIZE constant instead of hardcoded 0.1

* chore(quickadapter): bump version to 3.11.2

* fix(quickadapter): restore test_size parameter in TimeSeriesSplit

  The test_size variable from data_split_parameters was being immediately
  overwritten by a type annotation line, making it always None regardless
  of user configuration.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
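As a cross-check of the corrected formula: with an explicit test_size,
sklearn's TimeSeriesSplit requires n_samples - gap - test_size * n_splits > 0,
so the smallest admissible sample count is test_size * n_splits + gap + 1.
A minimal sanity-check sketch with illustrative parameter values:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    n_splits, test_size, gap = 5, 10, 3
    min_samples = test_size * n_splits + gap + 1  # 54 here

    tscv = TimeSeriesSplit(n_splits=n_splits, test_size=test_size, gap=gap)
    # Exactly at the minimum: all n_splits folds are produced.
    assert len(list(tscv.split(np.zeros(min_samples)))) == n_splits
    # One sample short: sklearn itself raises ("Too many splits ...").
    try:
        list(tscv.split(np.zeros(min_samples - 1)))
    except ValueError as exc:
        print(exc)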
---
diff --git a/README.md b/README.md
index 09b35b4..38942e8 100644
--- a/README.md
+++ b/README.md
@@ -59,6 +59,12 @@ docker compose up -d --build
 | reversal_confirmation.max_natr_multiplier_fraction | 0.075 | float [0,1] | Upper bound fraction (>= lower bound) for volatility adjusted reversal threshold. |
 | _Regressor model_ | | | |
 | freqai.regressor | `xgboost` | enum {`xgboost`,`lightgbm`,`histgradientboostingregressor`,`ngboost`,`catboost`} | Machine learning regressor algorithm. |
+| _Data split parameters_ | | | |
+| freqai.data_split_parameters.method | `train_test_split` | enum {`train_test_split`,`timeseries_split`} | Data splitting strategy. `train_test_split` for sequential split, `timeseries_split` for chronological split with configurable gap. |
+| freqai.data_split_parameters.test_size | 0.1 / None | float (0,1) \| int >= 1 \| None | Test set size. Float for fraction, int for count. Default: 0.1 for `train_test_split`, None for `timeseries_split` (sklearn dynamic sizing). |
+| freqai.data_split_parameters.n_splits | 5 | int >= 2 | Controls train/test proportions for `timeseries_split` (higher = larger train set). |
+| freqai.data_split_parameters.gap | 0 | int >= 0 | Samples to exclude between train/test for `timeseries_split`. When 0, auto-calculated from `label_period_candles` to prevent look-ahead bias. |
+| freqai.data_split_parameters.max_train_size | None | int >= 1 \| None | Maximum training set size for `timeseries_split`. When set, creates a sliding window instead of expanding train set. None = no limit. |
 | _Label smoothing_ | | | |
 | freqai.label_smoothing.method | `gaussian` | enum {`none`,`gaussian`,`kaiser`,`triang`,`smm`,`sma`,`savgol`,`gaussian_filter1d`} | Label smoothing method (`smm`=median, `sma`=mean, `savgol`=Savitzky–Golay). |
 | freqai.label_smoothing.window_candles | 5 | int >= 3 | Smoothing window length (candles). |
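For reference, the tunables above live under the freqai section of the
strategy configuration. A minimal illustrative sketch (values are examples
only; in config.json the Python None values are spelled null):

    # Hypothetical freqai config excerpt exercising timeseries_split.
    freqai = {
        "data_split_parameters": {
            "method": "timeseries_split",  # default: "train_test_split"
            "test_size": None,  # None -> sklearn dynamic sizing
            "n_splits": 5,  # higher -> larger train set
            "gap": 0,  # 0 -> auto-set to label_period_candles
            "max_train_size": None,  # None -> expanding train window
        }
    }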
diff --git a/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py b/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
index 832ea2a..d4718dc 100644
--- a/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
+++ b/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
@@ -17,10 +17,12 @@ import skimage
 import sklearn
 from datasieve.pipeline import Pipeline
 from datasieve.transforms import SKLearnWrapper
+from freqtrade.exceptions import DependencyException
 from freqtrade.freqai.base_models.BaseRegressionModel import BaseRegressionModel
 from freqtrade.freqai.data_kitchen import FreqaiDataKitchen
 from numpy.typing import NDArray
 from optuna.study.study import ObjectiveFuncType
+from sklearn.model_selection import TimeSeriesSplit
 from sklearn.preprocessing import (
     MaxAbsScaler,
     MinMaxScaler,
@@ -94,7 +96,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
     https://github.com/sponsors/robcaulk
     """
 
-    version = "3.11.1"
+    version = "3.11.2"
 
     _TEST_SIZE: Final[float] = 0.1
@@ -229,6 +231,15 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
     OPTUNA_SPACE_FRACTION_DEFAULT: Final[float] = 0.4
     OPTUNA_SEED_DEFAULT: Final[int] = 1
 
+    _DATA_SPLIT_METHODS: Final[tuple[str, ...]] = (
+        "train_test_split",
+        "timeseries_split",
+    )
+    DATA_SPLIT_METHOD_DEFAULT: Final[str] = _DATA_SPLIT_METHODS[0]
+    TIMESERIES_N_SPLITS_DEFAULT: Final[int] = 5
+    TIMESERIES_GAP_DEFAULT: Final[int] = 0
+    TIMESERIES_MAX_TRAIN_SIZE_DEFAULT: Final[int | None] = None
+
     @staticmethod
     @lru_cache(maxsize=None)
     def _extrema_selection_methods_set() -> set[ExtremaSelectionMethod]:
@@ -304,6 +315,11 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
     def _power_mean_metrics_set() -> set[str]:
         return set(QuickAdapterRegressorV3._POWER_MEAN_MAP.keys())
 
+    @staticmethod
+    @lru_cache(maxsize=None)
+    def _data_split_methods_set() -> set[str]:
+        return set(QuickAdapterRegressorV3._DATA_SPLIT_METHODS)
+
     @staticmethod
     def _get_selection_category(method: str) -> Optional[str]:
         for (
@@ -964,7 +980,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
         logger.info(f"Model Version: {self.version}")
         logger.info(f"Regressor: {self.regressor}")
 
-        logger.info("Optuna Hyperopt Configuration:")
+        logger.info("Optuna Hyperopt:")
         optuna_config = self._optuna_config
         logger.info(f"  enabled: {optuna_config.get('enabled')}")
         if optuna_config.get("enabled"):
@@ -1027,7 +1043,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
         label_pipeline = self.label_pipeline
         label_prediction = self.label_prediction
         for label_col in LABEL_COLUMNS:
-            logger.info(f"Label Configuration [{label_col}]:")
+            logger.info(f"Label [{label_col}]:")
             col_pipeline = get_label_column_config(
                 label_col, label_pipeline["default"], label_pipeline["columns"]
@@ -1116,7 +1132,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
         feature_range = self.ft_params.get(
             "range", QuickAdapterRegressorV3.RANGE_DEFAULT
         )
-        logger.info("Feature Parameters Configuration:")
+        logger.info("Feature Parameters:")
         logger.info(f"  scaler: {scaler}")
         logger.info(
             f"  range: ({format_number(feature_range[0])}, {format_number(feature_range[1])})"
@@ -1313,6 +1329,275 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
             ]
         )
 
+    def train(
+        self, unfiltered_df: pd.DataFrame, pair: str, dk: FreqaiDataKitchen, **kwargs
+    ) -> Any:
+        """
+        Filter the training data and train a model to it.
+
+        Supports two data split methods:
+        - 'train_test_split' (default): Delegates to BaseRegressionModel.train()
+        - 'timeseries_split': Chronological split with configurable gap. Uses the final
+          fold from sklearn's TimeSeriesSplit.
+
+        :param unfiltered_df: Full dataframe for the current training period
+        :param pair: Trading pair being trained
+        :param dk: FreqaiDataKitchen object containing configuration
+        :return: Trained model
+        """
+        method = self.data_split_parameters.get(
+            "method", QuickAdapterRegressorV3.DATA_SPLIT_METHOD_DEFAULT
+        )
+
+        if method not in QuickAdapterRegressorV3._data_split_methods_set():
+            raise ValueError(
+                f"Invalid data_split_parameters.method value {method!r}: "
+                f"supported values are {', '.join(QuickAdapterRegressorV3._DATA_SPLIT_METHODS)}"
+            )
+
+        logger.info(f"Using data split method: {method}")
+
+        if method == QuickAdapterRegressorV3.DATA_SPLIT_METHOD_DEFAULT:
+            return super().train(unfiltered_df, pair, dk, **kwargs)
+
+        elif (
+            method == QuickAdapterRegressorV3._DATA_SPLIT_METHODS[1]
+        ):  # timeseries_split
+            logger.info(
+                f"-------------------- Starting training {pair} --------------------"
+            )
+
+            start_time = time.time()
+
+            features_filtered, labels_filtered = dk.filter_features(
+                unfiltered_df,
+                dk.training_features_list,
+                dk.label_list,
+                training_filter=True,
+            )
+
+            start_date = unfiltered_df["date"].iloc[0].strftime("%Y-%m-%d")
+            end_date = unfiltered_df["date"].iloc[-1].strftime("%Y-%m-%d")
+            logger.info(
+                f"-------------------- Training on data from {start_date} to "
+                f"{end_date} --------------------"
+            )
+
+            dd = self._make_timeseries_split_datasets(
+                features_filtered, labels_filtered, dk
+            )
+
+            if (
+                not self.freqai_info.get("fit_live_predictions_candles", 0)
+                or not self.live
+            ):
+                dk.fit_labels()
+
+            dd = self._apply_pipelines(dd, dk, pair)
+
+            logger.info(
+                f"Training model on {len(dd['train_features'].columns)} features"
+            )
+            logger.info(f"Training model on {len(dd['train_features'])} data points")
+
+            model = self.fit(dd, dk, **kwargs)
+
+            end_time = time.time()
+
+            logger.info(
+                f"-------------------- Done training {pair} "
+                f"({end_time - start_time:.2f} secs) --------------------"
+            )
+
+            return model
+
+    def _apply_pipelines(
+        self,
+        dd: dict,
+        dk: FreqaiDataKitchen,
+        pair: str,
+    ) -> dict:
+        """
+        Apply feature and label pipelines to train/test data.
+
+        This helper reduces code duplication between train() methods that need
+        custom data splitting but share the same pipeline application logic.
+
+        :param dd: data_dictionary with train/test features/labels/weights
+        :param dk: FreqaiDataKitchen instance
+        :param pair: Trading pair (for error messages)
+        :return: data_dictionary with transformed features/labels
+        """
+        dk.feature_pipeline = self.define_data_pipeline(threads=dk.thread_count)
+        dk.label_pipeline = self.define_label_pipeline(threads=dk.thread_count)
+
+        (dd["train_features"], dd["train_labels"], dd["train_weights"]) = (
+            dk.feature_pipeline.fit_transform(
+                dd["train_features"], dd["train_labels"], dd["train_weights"]
+            )
+        )
+
+        dd["train_labels"], _, _ = dk.label_pipeline.fit_transform(dd["train_labels"])
+
+        if (
+            self.data_split_parameters.get(
+                "test_size", QuickAdapterRegressorV3._TEST_SIZE
+            )
+            != 0
+        ):
+            if dd["test_labels"].shape[0] == 0:
+                method = self.data_split_parameters.get(
+                    "method", QuickAdapterRegressorV3.DATA_SPLIT_METHOD_DEFAULT
+                )
+                if (
+                    method == QuickAdapterRegressorV3._DATA_SPLIT_METHODS[1]
+                ):  # timeseries_split
+                    n_splits = self.data_split_parameters.get(
+                        "n_splits", QuickAdapterRegressorV3.TIMESERIES_N_SPLITS_DEFAULT
+                    )
+                    gap = self.data_split_parameters.get(
+                        "gap", QuickAdapterRegressorV3.TIMESERIES_GAP_DEFAULT
+                    )
+                    max_train_size = self.data_split_parameters.get("max_train_size")
+                    test_size = self.data_split_parameters.get("test_size")
+                    error_msg = (
+                        f"{pair}: test set is empty after filtering. "
+                        f"Possible causes: n_splits too high, gap too large, "
+                        f"max_train_size too restrictive, or insufficient data. "
+                        f"Current parameters: n_splits={n_splits}, gap={gap}, "
+                        f"max_train_size={max_train_size}, test_size={test_size}. "
+                        f"Try reducing n_splits/gap or increasing data period."
+                    )
+                else:
+                    test_size = self.data_split_parameters.get(
+                        "test_size", QuickAdapterRegressorV3._TEST_SIZE
+                    )
+                    error_msg = (
+                        f"{pair}: test set is empty after filtering. "
+                        f"Possible causes: overly strict SVM thresholds or insufficient data. "
+                        f"Current test_size={test_size}. "
+                        f"Try reducing test_size or relaxing SVM conditions."
+                    )
+                raise DependencyException(error_msg)
+            else:
+                (dd["test_features"], dd["test_labels"], dd["test_weights"]) = (
+                    dk.feature_pipeline.transform(
+                        dd["test_features"], dd["test_labels"], dd["test_weights"]
+                    )
+                )
+                dd["test_labels"], _, _ = dk.label_pipeline.transform(dd["test_labels"])
+
+        dk.data_dictionary = dd
+
+        return dd
+
+    def _make_timeseries_split_datasets(
+        self,
+        filtered_dataframe: pd.DataFrame,
+        labels: pd.DataFrame,
+        dk: FreqaiDataKitchen,
+    ) -> dict:
+        """
+        Chronological train/test split using the final fold from sklearn's TimeSeriesSplit.
+
+        n_splits controls train/test proportions (higher = larger train set).
+        gap excludes samples between train/test; when 0, auto-calculated from
+        label_period_candles. max_train_size enables sliding window mode.
+
+        :param filtered_dataframe: Feature data to split
+        :param labels: Label data to split
+        :param dk: FreqaiDataKitchen instance for weight calculation and data building
+        :return: data_dictionary with train/test features/labels/weights
+        """
+        n_splits = int(
+            self.data_split_parameters.get(
+                "n_splits", QuickAdapterRegressorV3.TIMESERIES_N_SPLITS_DEFAULT
+            )
+        )
+        gap = int(
+            self.data_split_parameters.get(
+                "gap", QuickAdapterRegressorV3.TIMESERIES_GAP_DEFAULT
+            )
+        )
+        max_train_size = self.data_split_parameters.get(
+            "max_train_size", QuickAdapterRegressorV3.TIMESERIES_MAX_TRAIN_SIZE_DEFAULT
+        )
+        max_train_size = int(max_train_size) if max_train_size is not None else None
+
+        if n_splits < 2:
+            raise ValueError(
+                f"Invalid data_split_parameters.n_splits value {n_splits!r}: must be >= 2"
+            )
+        if gap < 0:
+            raise ValueError(
+                f"Invalid data_split_parameters.gap value {gap!r}: must be >= 0"
+            )
+        if max_train_size is not None and max_train_size < 1:
+            raise ValueError(
+                f"Invalid data_split_parameters.max_train_size value {max_train_size!r}: "
+                f"must be >= 1 or None"
+            )
+
+        test_size = self.data_split_parameters.get("test_size", None)
+        if test_size is not None:
+            if isinstance(test_size, float) and 0 < test_size < 1:
+                test_size = int(len(filtered_dataframe) * test_size)
+            elif isinstance(test_size, int) and test_size >= 1:
+                pass
+            else:
+                raise ValueError(
+                    f"Invalid data_split_parameters.test_size value {test_size!r}: "
+                    f"must be float in (0, 1) as fraction, int >= 1 as count, or None"
+                )
+            if test_size < 1:
+                raise ValueError(
+                    f"Computed test_size ({test_size}) is too small. "
+                    f"Increase test_size or provide more data."
+                )
+
+        if gap == 0:
+            label_period_candles = int(
+                self.ft_params.get("label_period_candles", self._label_defaults[0])
+            )
+            gap = label_period_candles
+            logger.info(
+                f"TimeSeriesSplit gap auto-calculated from label_period_candles: {gap}"
+            )
+
+        tscv = TimeSeriesSplit(
+            n_splits=n_splits,
+            gap=gap,
+            max_train_size=max_train_size,
+            test_size=test_size,
+        )
+        train_idx: np.ndarray = np.array([])
+        test_idx: np.ndarray = np.array([])
+        for train_idx, test_idx in tscv.split(filtered_dataframe):
+            pass
+
+        train_features = filtered_dataframe.iloc[train_idx]
+        test_features = filtered_dataframe.iloc[test_idx]
+        train_labels = labels.iloc[train_idx]
+        test_labels = labels.iloc[test_idx]
+
+        feature_parameters = self.freqai_info.get("feature_parameters", {})
+        if feature_parameters.get("weight_factor", 0) > 0:
+            total_weights = dk.set_weights_higher_recent(len(train_idx) + len(test_idx))
+            train_weights = total_weights[: len(train_idx)]
+            test_weights = total_weights[len(train_idx) :]
+        else:
+            train_weights = np.ones(len(train_idx))
+            test_weights = np.ones(len(test_idx))
+
+        return dk.build_data_dictionary(
+            train_features,
+            test_features,
+            train_labels,
+            test_labels,
+            train_weights,
+            test_weights,
+        )
+
     def fit(
         self, data_dictionary: dict[str, Any], dk: FreqaiDataKitchen, **kwargs
     ) -> Any:
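Stripped of the FreqAI plumbing, the split performed by
_make_timeseries_split_datasets() keeps the last (largest-train) fold
yielded by sklearn's TimeSeriesSplit. A self-contained sketch with
illustrative data sizes:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import TimeSeriesSplit

    df = pd.DataFrame({"feature": np.arange(120)})
    tscv = TimeSeriesSplit(n_splits=5, gap=10)

    # Exhaust the generator; the loop variables retain the final fold,
    # avoiding materializing all folds in a list.
    train_idx = test_idx = np.array([], dtype=int)
    for train_idx, test_idx in tscv.split(df):
        pass

    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    print(len(train_df), len(test_df))  # 90 20 (10-sample gap excluded)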
diff --git a/quickadapter/user_data/strategies/QuickAdapterV3.py b/quickadapter/user_data/strategies/QuickAdapterV3.py
index 2c28d24..789bd9c 100644
--- a/quickadapter/user_data/strategies/QuickAdapterV3.py
+++ b/quickadapter/user_data/strategies/QuickAdapterV3.py
@@ -109,7 +109,7 @@ class QuickAdapterV3(IStrategy):
     _PLOT_EXTREMA_MIN_EPS: Final[float] = 0.01
 
     def version(self) -> str:
-        return "3.11.1"
+        return "3.11.2"
 
     timeframe = "5m"
     timeframe_minutes = timeframe_to_minutes(timeframe)
@@ -470,7 +470,7 @@ class QuickAdapterV3(IStrategy):
         label_weighting = self.label_weighting
         label_smoothing = self.label_smoothing
         for label_col in LABEL_COLUMNS:
-            logger.info(f"Label Configuration [{label_col}]:")
+            logger.info(f"Label [{label_col}]:")
            col_weighting = get_label_column_config(
                label_col, label_weighting["default"], label_weighting["columns"]
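The weight-continuity fix described in the commit message computes recency
weights over the combined train+test range and slices afterwards. A sketch
of that idea; the exponential decay below is an assumption standing in for
dk.set_weights_higher_recent(), which the committed code actually calls:

    import numpy as np

    def recency_weights(num_weights: int, weight_factor: float) -> np.ndarray:
        # Exponential decay, reversed so the newest sample weighs the most.
        return np.exp(-np.arange(num_weights) / (weight_factor * num_weights))[::-1]

    n_train, n_test = 90, 20
    total = recency_weights(n_train + n_test, weight_factor=0.9)
    train_weights, test_weights = total[:n_train], total[n_train:]

    # Weighting the combined range first keeps the curve continuous across
    # the train/test boundary, matching make_train_test_datasets semantics.
    assert train_weights[-1] < test_weights[0] <= 1.0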