From: Jérôme Benoit
Date: Sun, 8 Feb 2026 23:55:14 +0000 (+0100)
Subject: feat(quickadapter): add TimeSeriesSplit as alternative data splitting method (#48)
X-Git-Url: https://git.piment-noir.org/?a=commitdiff_plain;h=c927facc96d966e766924267bf2d13cb2e84e302;p=freqai-strategies.git

feat(quickadapter): add TimeSeriesSplit as alternative data splitting method (#48)

* feat(model): add TimeSeriesSplit support via train() override

* docs(readme): document TimeSeriesSplit configuration options

* fix: address PR review comments

  - Use test_size parameter in TimeSeriesSplit
  - Remove unused dk parameter from _make_timeseries_split_datasets()
  - Assign dk.data_dictionary = dd before logging
  - Fix typo: train_test_test -> train_test_split in README

* docs: integrate data_split_parameters into tunables table

  Remove standalone section and add parameters to existing table with
  freqai. prefix for consistency.

* refactor: use FreqAI APIs for weight calculation and data dictionary

  - Use dk.set_weights_higher_recent() instead of duplicating weight formula
  - Use dk.build_data_dictionary() for consistent data structure
  - Respects feature_parameters.weight_factor configuration
  - Fix bug: was using data_kitchen_thread_count instead of weight_factor

* refactor: extract _apply_pipelines() to reduce code duplication

  - Move pipeline definition and application logic to helper method
  - Reduces train() override complexity while keeping same behavior
  - Helper can be reused by future custom split implementations

* style: harmonize namespace and remove inline comments

  - Rename DATA_SPLIT_METHODS to _DATA_SPLIT_METHODS (private tuple pattern)
  - Reference DATA_SPLIT_METHOD_DEFAULT from _DATA_SPLIT_METHODS[0]
  - Remove 22 inline comments to match self-documenting codebase style

* fix: align TimeSeriesSplit weight calculation with FreqAI semantics

  Calculate weights on combined train+test set before splitting to maintain
  temporal weight continuity, matching FreqAI's make_train_test_datasets
  behavior.
* feat: add gap=0 warning and improve TimeSeriesSplit validation

  - Warn when gap=0 about look-ahead bias risk (reference label_period_candles)
  - Add _compute_timeseries_min_samples() for accurate minimum sample calculation
  - Account for gap and test_size in minimum sample validation
  - Improve error message with all relevant parameters

* style: harmonize error messages with codebase conventions

  - Use 'Invalid {param} value {value!r}: {constraint}' pattern
  - Align with existing validation error format (lines 718, 1145)

* style: add cached set accessor for data split methods

  - Add _data_split_methods_set() with @staticmethod @lru_cache
  - Use QuickAdapterRegressorV3 prefix for class attribute access
  - Use cached set for O(1) membership check in validation

* fix: address PR review comments for TimeSeriesSplit

  - Use dd consistently in training logs instead of dk.data_dictionary
  - Use self.data_split_parameters consistently in _apply_pipelines
  - Add explicit type coercion for n_splits, gap, max_train_size
  - Add validation for gap >= 0 and max_train_size >= 1
  - Improve test_size validation: float in (0,1) as fraction, int >= 1 as count
  - Fix _compute_timeseries_min_samples formula: (n_splits+1)*test_size + n_splits*gap
  - Optimize tscv.split() iteration to avoid unnecessary list materialization

* fix: correct min_samples formula to match sklearn validation

  sklearn validates: n_samples - gap - (test_size * n_splits) > 0
  Correct formula: test_size * n_splits + gap + 1

* feat: auto-calculate TimeSeriesSplit gap from label_period_candles

  When gap=0 is configured, automatically set gap to label_period_candles
  to prevent look-ahead bias from overlapping label windows. This ensures
  temporal separation between train and test sets without requiring manual
  configuration.

* fix: remove redundant time import shadowing module

* fix: correct min_samples formula for dynamic test_size and document test_size param

* refactor: remove redundant TimeSeriesSplit min_samples validation

* docs: clarify test_size default per split method

* refactor: move DependencyException import to file header

* style: use class name for class constant access

* Apply suggestion from @Copilot

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* docs: use Python None instead of null in README

* docs: fix train_test_split description (sequential, not random)

* fix: use explicit None check for max_train_size validation

* docs: clarify timeseries_split as chronological split, not cross-validation

* refactor(quickadapter): shorten log prefixes and tailor empty test set error by split method

* refactor(quickadapter): use index pattern for timeseries_split method constant

  Replace string literals with index access pattern following existing
  codebase convention for _DATA_SPLIT_METHODS. Also renames variables for
  semantic clarity:

  - test_size_param -> test_size
  - feat_dict -> feature_parameters

* refactor(quickadapter): use _TEST_SIZE constant instead of hardcoded 0.1

* chore(quickadapter): bump version to 3.11.2

* fix(quickadapter): restore test_size parameter in TimeSeriesSplit

  The test_size variable from data_split_parameters was being immediately
  overwritten by a type annotation line, making it always None regardless
  of user configuration.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
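As a cross-check of the corrected formula: with an explicit test_size,
sklearn's TimeSeriesSplit requires n_samples - gap - test_size * n_splits > 0,
so the smallest admissible sample count is test_size * n_splits + gap + 1.
A minimal sanity-check sketch with illustrative parameter values:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    n_splits, test_size, gap = 5, 10, 3
    min_samples = test_size * n_splits + gap + 1  # 54 here

    tscv = TimeSeriesSplit(n_splits=n_splits, test_size=test_size, gap=gap)
    # Exactly at the minimum: all n_splits folds are produced.
    assert len(list(tscv.split(np.zeros(min_samples)))) == n_splits
    # One sample short: sklearn itself raises ("Too many splits ...").
    try:
        list(tscv.split(np.zeros(min_samples - 1)))
    except ValueError as exc:
        print(exc)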
---
diff --git a/README.md b/README.md
index 09b35b4..38942e8 100644
--- a/README.md
+++ b/README.md
@@ -59,6 +59,12 @@ docker compose up -d --build
 | reversal_confirmation.max_natr_multiplier_fraction | 0.075 | float [0,1] | Upper bound fraction (>= lower bound) for volatility adjusted reversal threshold. |
 | _Regressor model_ | | | |
 | freqai.regressor | `xgboost` | enum {`xgboost`,`lightgbm`,`histgradientboostingregressor`,`ngboost`,`catboost`} | Machine learning regressor algorithm. |
+| _Data split parameters_ | | | |
+| freqai.data_split_parameters.method | `train_test_split` | enum {`train_test_split`,`timeseries_split`} | Data splitting strategy. `train_test_split` for sequential split, `timeseries_split` for chronological split with configurable gap. |
+| freqai.data_split_parameters.test_size | 0.1 / None | float (0,1) \| int >= 1 \| None | Test set size. Float for fraction, int for count. Default: 0.1 for `train_test_split`, None for `timeseries_split` (sklearn dynamic sizing). |
+| freqai.data_split_parameters.n_splits | 5 | int >= 2 | Controls train/test proportions for `timeseries_split` (higher = larger train set). |
+| freqai.data_split_parameters.gap | 0 | int >= 0 | Samples to exclude between train/test for `timeseries_split`. When 0, auto-calculated from `label_period_candles` to prevent look-ahead bias. |
+| freqai.data_split_parameters.max_train_size | None | int >= 1 \| None | Maximum training set size for `timeseries_split`. When set, creates a sliding window instead of expanding train set. None = no limit. |
 | _Label smoothing_ | | | |
 | freqai.label_smoothing.method | `gaussian` | enum {`none`,`gaussian`,`kaiser`,`triang`,`smm`,`sma`,`savgol`,`gaussian_filter1d`} | Label smoothing method (`smm`=median, `sma`=mean, `savgol`=Savitzky–Golay). |
 | freqai.label_smoothing.window_candles | 5 | int >= 3 | Smoothing window length (candles). |
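For reference, the tunables above live under the freqai section of the
strategy configuration. A minimal illustrative sketch (values are examples
only; in config.json the Python None values are spelled null):

    # Hypothetical freqai config excerpt exercising timeseries_split.
    freqai = {
        "data_split_parameters": {
            "method": "timeseries_split",  # default: "train_test_split"
            "test_size": None,  # None -> sklearn dynamic sizing
            "n_splits": 5,  # higher -> larger train set
            "gap": 0,  # 0 -> auto-set to label_period_candles
            "max_train_size": None,  # None -> expanding train window
        }
    }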
diff --git a/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py b/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
index 832ea2a..d4718dc 100644
--- a/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
+++ b/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
@@ -17,10 +17,12 @@ import skimage
 import sklearn
 from datasieve.pipeline import Pipeline
 from datasieve.transforms import SKLearnWrapper
+from freqtrade.exceptions import DependencyException
 from freqtrade.freqai.base_models.BaseRegressionModel import BaseRegressionModel
 from freqtrade.freqai.data_kitchen import FreqaiDataKitchen
 from numpy.typing import NDArray
 from optuna.study.study import ObjectiveFuncType
+from sklearn.model_selection import TimeSeriesSplit
 from sklearn.preprocessing import (
     MaxAbsScaler,
     MinMaxScaler,
@@ -94,7 +96,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
     https://github.com/sponsors/robcaulk
     """
 
-    version = "3.11.1"
+    version = "3.11.2"
 
     _TEST_SIZE: Final[float] = 0.1
@@ -229,6 +231,15 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
     OPTUNA_SPACE_FRACTION_DEFAULT: Final[float] = 0.4
     OPTUNA_SEED_DEFAULT: Final[int] = 1
 
+    _DATA_SPLIT_METHODS: Final[tuple[str, ...]] = (
+        "train_test_split",
+        "timeseries_split",
+    )
+    DATA_SPLIT_METHOD_DEFAULT: Final[str] = _DATA_SPLIT_METHODS[0]
+    TIMESERIES_N_SPLITS_DEFAULT: Final[int] = 5
+    TIMESERIES_GAP_DEFAULT: Final[int] = 0
+    TIMESERIES_MAX_TRAIN_SIZE_DEFAULT: Final[int | None] = None
+
     @staticmethod
     @lru_cache(maxsize=None)
     def _extrema_selection_methods_set() -> set[ExtremaSelectionMethod]:
@@ -304,6 +315,11 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
     def _power_mean_metrics_set() -> set[str]:
         return set(QuickAdapterRegressorV3._POWER_MEAN_MAP.keys())
 
+    @staticmethod
+    @lru_cache(maxsize=None)
+    def _data_split_methods_set() -> set[str]:
+        return set(QuickAdapterRegressorV3._DATA_SPLIT_METHODS)
+
     @staticmethod
     def _get_selection_category(method: str) -> Optional[str]:
         for (
@@ -964,7 +980,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
         logger.info(f"Model Version: {self.version}")
         logger.info(f"Regressor: {self.regressor}")
 
-        logger.info("Optuna Hyperopt Configuration:")
+        logger.info("Optuna Hyperopt:")
         optuna_config = self._optuna_config
         logger.info(f"  enabled: {optuna_config.get('enabled')}")
         if optuna_config.get("enabled"):
@@ -1027,7 +1043,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
         label_pipeline = self.label_pipeline
         label_prediction = self.label_prediction
         for label_col in LABEL_COLUMNS:
-            logger.info(f"Label Configuration [{label_col}]:")
+            logger.info(f"Label [{label_col}]:")
             col_pipeline = get_label_column_config(
                 label_col, label_pipeline["default"], label_pipeline["columns"]
@@ -1116,7 +1132,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
         feature_range = self.ft_params.get(
             "range", QuickAdapterRegressorV3.RANGE_DEFAULT
         )
-        logger.info("Feature Parameters Configuration:")
+        logger.info("Feature Parameters:")
         logger.info(f"  scaler: {scaler}")
         logger.info(
             f"  range: ({format_number(feature_range[0])}, {format_number(feature_range[1])})"
@@ -1313,6 +1329,275 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
             ]
         )
 
+    def train(
+        self, unfiltered_df: pd.DataFrame, pair: str, dk: FreqaiDataKitchen, **kwargs
+    ) -> Any:
+        """
+        Filter the training data and train a model to it.
+
+        Supports two data split methods:
+        - 'train_test_split' (default): Delegates to BaseRegressionModel.train()
+        - 'timeseries_split': Chronological split with configurable gap. Uses the final
+          fold from sklearn's TimeSeriesSplit.
+
+        :param unfiltered_df: Full dataframe for the current training period
+        :param pair: Trading pair being trained
+        :param dk: FreqaiDataKitchen object containing configuration
+        :return: Trained model
+        """
+        method = self.data_split_parameters.get(
+            "method", QuickAdapterRegressorV3.DATA_SPLIT_METHOD_DEFAULT
+        )
+
+        if method not in QuickAdapterRegressorV3._data_split_methods_set():
+            raise ValueError(
+                f"Invalid data_split_parameters.method value {method!r}: "
+                f"supported values are {', '.join(QuickAdapterRegressorV3._DATA_SPLIT_METHODS)}"
+            )
+
+        logger.info(f"Using data split method: {method}")
+
+        if method == QuickAdapterRegressorV3.DATA_SPLIT_METHOD_DEFAULT:
+            return super().train(unfiltered_df, pair, dk, **kwargs)
+
+        elif (
+            method == QuickAdapterRegressorV3._DATA_SPLIT_METHODS[1]
+        ):  # timeseries_split
+            logger.info(
+                f"-------------------- Starting training {pair} --------------------"
+            )
+
+            start_time = time.time()
+
+            features_filtered, labels_filtered = dk.filter_features(
+                unfiltered_df,
+                dk.training_features_list,
+                dk.label_list,
+                training_filter=True,
+            )
+
+            start_date = unfiltered_df["date"].iloc[0].strftime("%Y-%m-%d")
+            end_date = unfiltered_df["date"].iloc[-1].strftime("%Y-%m-%d")
+            logger.info(
+                f"-------------------- Training on data from {start_date} to "
+                f"{end_date} --------------------"
+            )
+
+            dd = self._make_timeseries_split_datasets(
+                features_filtered, labels_filtered, dk
+            )
+
+            if (
+                not self.freqai_info.get("fit_live_predictions_candles", 0)
+                or not self.live
+            ):
+                dk.fit_labels()
+
+            dd = self._apply_pipelines(dd, dk, pair)
+
+            logger.info(
+                f"Training model on {len(dd['train_features'].columns)} features"
+            )
+            logger.info(f"Training model on {len(dd['train_features'])} data points")
+
+            model = self.fit(dd, dk, **kwargs)
+
+            end_time = time.time()
+
+            logger.info(
+                f"-------------------- Done training {pair} "
+                f"({end_time - start_time:.2f} secs) --------------------"
+            )
+
+            return model
+
+    def _apply_pipelines(
+        self,
+        dd: dict,
+        dk: FreqaiDataKitchen,
+        pair: str,
+    ) -> dict:
+        """
+        Apply feature and label pipelines to train/test data.
+
+        This helper reduces code duplication between train() methods that need
+        custom data splitting but share the same pipeline application logic.
+
+        :param dd: data_dictionary with train/test features/labels/weights
+        :param dk: FreqaiDataKitchen instance
+        :param pair: Trading pair (for error messages)
+        :return: data_dictionary with transformed features/labels
+        """
+        dk.feature_pipeline = self.define_data_pipeline(threads=dk.thread_count)
+        dk.label_pipeline = self.define_label_pipeline(threads=dk.thread_count)
+
+        (dd["train_features"], dd["train_labels"], dd["train_weights"]) = (
+            dk.feature_pipeline.fit_transform(
+                dd["train_features"], dd["train_labels"], dd["train_weights"]
+            )
+        )
+
+        dd["train_labels"], _, _ = dk.label_pipeline.fit_transform(dd["train_labels"])
+
+        if (
+            self.data_split_parameters.get(
+                "test_size", QuickAdapterRegressorV3._TEST_SIZE
+            )
+            != 0
+        ):
+            if dd["test_labels"].shape[0] == 0:
+                method = self.data_split_parameters.get(
+                    "method", QuickAdapterRegressorV3.DATA_SPLIT_METHOD_DEFAULT
+                )
+                if (
+                    method == QuickAdapterRegressorV3._DATA_SPLIT_METHODS[1]
+                ):  # timeseries_split
+                    n_splits = self.data_split_parameters.get(
+                        "n_splits", QuickAdapterRegressorV3.TIMESERIES_N_SPLITS_DEFAULT
+                    )
+                    gap = self.data_split_parameters.get(
+                        "gap", QuickAdapterRegressorV3.TIMESERIES_GAP_DEFAULT
+                    )
+                    max_train_size = self.data_split_parameters.get("max_train_size")
+                    test_size = self.data_split_parameters.get("test_size")
+                    error_msg = (
+                        f"{pair}: test set is empty after filtering. "
+                        f"Possible causes: n_splits too high, gap too large, "
+                        f"max_train_size too restrictive, or insufficient data. "
+                        f"Current parameters: n_splits={n_splits}, gap={gap}, "
+                        f"max_train_size={max_train_size}, test_size={test_size}. "
+                        f"Try reducing n_splits/gap or increasing data period."
+                    )
+                else:
+                    test_size = self.data_split_parameters.get(
+                        "test_size", QuickAdapterRegressorV3._TEST_SIZE
+                    )
+                    error_msg = (
+                        f"{pair}: test set is empty after filtering. "
+                        f"Possible causes: overly strict SVM thresholds or insufficient data. "
+                        f"Current test_size={test_size}. "
+                        f"Try reducing test_size or relaxing SVM conditions."
+                    )
+                raise DependencyException(error_msg)
+            else:
+                (dd["test_features"], dd["test_labels"], dd["test_weights"]) = (
+                    dk.feature_pipeline.transform(
+                        dd["test_features"], dd["test_labels"], dd["test_weights"]
+                    )
+                )
+                dd["test_labels"], _, _ = dk.label_pipeline.transform(dd["test_labels"])
+
+        dk.data_dictionary = dd
+
+        return dd
+
+    def _make_timeseries_split_datasets(
+        self,
+        filtered_dataframe: pd.DataFrame,
+        labels: pd.DataFrame,
+        dk: FreqaiDataKitchen,
+    ) -> dict:
+        """
+        Chronological train/test split using the final fold from sklearn's TimeSeriesSplit.
+
+        n_splits controls train/test proportions (higher = larger train set).
+        gap excludes samples between train/test; when 0, auto-calculated from
+        label_period_candles. max_train_size enables sliding window mode.
+
+        :param filtered_dataframe: Feature data to split
+        :param labels: Label data to split
+        :param dk: FreqaiDataKitchen instance for weight calculation and data building
+        :return: data_dictionary with train/test features/labels/weights
+        """
+        n_splits = int(
+            self.data_split_parameters.get(
+                "n_splits", QuickAdapterRegressorV3.TIMESERIES_N_SPLITS_DEFAULT
+            )
+        )
+        gap = int(
+            self.data_split_parameters.get(
+                "gap", QuickAdapterRegressorV3.TIMESERIES_GAP_DEFAULT
+            )
+        )
+        max_train_size = self.data_split_parameters.get(
+            "max_train_size", QuickAdapterRegressorV3.TIMESERIES_MAX_TRAIN_SIZE_DEFAULT
+        )
+        max_train_size = int(max_train_size) if max_train_size is not None else None
+
+        if n_splits < 2:
+            raise ValueError(
+                f"Invalid data_split_parameters.n_splits value {n_splits!r}: must be >= 2"
+            )
+        if gap < 0:
+            raise ValueError(
+                f"Invalid data_split_parameters.gap value {gap!r}: must be >= 0"
+            )
+        if max_train_size is not None and max_train_size < 1:
+            raise ValueError(
+                f"Invalid data_split_parameters.max_train_size value {max_train_size!r}: "
+                f"must be >= 1 or None"
+            )
+
+        test_size = self.data_split_parameters.get("test_size", None)
+        if test_size is not None:
+            if isinstance(test_size, float) and 0 < test_size < 1:
+                test_size = int(len(filtered_dataframe) * test_size)
+            elif isinstance(test_size, int) and test_size >= 1:
+                pass
+            else:
+                raise ValueError(
+                    f"Invalid data_split_parameters.test_size value {test_size!r}: "
+                    f"must be float in (0, 1) as fraction, int >= 1 as count, or None"
+                )
+            if test_size < 1:
+                raise ValueError(
+                    f"Computed test_size ({test_size}) is too small. "
+                    f"Increase test_size or provide more data."
+                )
+
+        if gap == 0:
+            label_period_candles = int(
+                self.ft_params.get("label_period_candles", self._label_defaults[0])
+            )
+            gap = label_period_candles
+            logger.info(
+                f"TimeSeriesSplit gap auto-calculated from label_period_candles: {gap}"
+            )
+
+        tscv = TimeSeriesSplit(
+            n_splits=n_splits,
+            gap=gap,
+            max_train_size=max_train_size,
+            test_size=test_size,
+        )
+        train_idx: np.ndarray = np.array([])
+        test_idx: np.ndarray = np.array([])
+        for train_idx, test_idx in tscv.split(filtered_dataframe):
+            pass
+
+        train_features = filtered_dataframe.iloc[train_idx]
+        test_features = filtered_dataframe.iloc[test_idx]
+        train_labels = labels.iloc[train_idx]
+        test_labels = labels.iloc[test_idx]
+
+        feature_parameters = self.freqai_info.get("feature_parameters", {})
+        if feature_parameters.get("weight_factor", 0) > 0:
+            total_weights = dk.set_weights_higher_recent(len(train_idx) + len(test_idx))
+            train_weights = total_weights[: len(train_idx)]
+            test_weights = total_weights[len(train_idx) :]
+        else:
+            train_weights = np.ones(len(train_idx))
+            test_weights = np.ones(len(test_idx))
+
+        return dk.build_data_dictionary(
+            train_features,
+            test_features,
+            train_labels,
+            test_labels,
+            train_weights,
+            test_weights,
+        )
+
     def fit(
         self, data_dictionary: dict[str, Any], dk: FreqaiDataKitchen, **kwargs
     ) -> Any:
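Stripped of the FreqAI plumbing, the split performed by
_make_timeseries_split_datasets() keeps the last (largest-train) fold
yielded by sklearn's TimeSeriesSplit. A self-contained sketch with
illustrative data sizes:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import TimeSeriesSplit

    df = pd.DataFrame({"feature": np.arange(120)})
    tscv = TimeSeriesSplit(n_splits=5, gap=10)

    # Exhaust the generator; the loop variables retain the final fold,
    # avoiding materializing all folds in a list.
    train_idx = test_idx = np.array([], dtype=int)
    for train_idx, test_idx in tscv.split(df):
        pass

    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    print(len(train_df), len(test_df))  # 90 20 (10-sample gap excluded)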
diff --git a/quickadapter/user_data/strategies/QuickAdapterV3.py b/quickadapter/user_data/strategies/QuickAdapterV3.py
index 2c28d24..789bd9c 100644
--- a/quickadapter/user_data/strategies/QuickAdapterV3.py
+++ b/quickadapter/user_data/strategies/QuickAdapterV3.py
@@ -109,7 +109,7 @@ class QuickAdapterV3(IStrategy):
     _PLOT_EXTREMA_MIN_EPS: Final[float] = 0.01
 
     def version(self) -> str:
-        return "3.11.1"
+        return "3.11.2"
 
     timeframe = "5m"
     timeframe_minutes = timeframe_to_minutes(timeframe)
@@ -470,7 +470,7 @@ class QuickAdapterV3(IStrategy):
         label_weighting = self.label_weighting
         label_smoothing = self.label_smoothing
         for label_col in LABEL_COLUMNS:
-            logger.info(f"Label Configuration [{label_col}]:")
+            logger.info(f"Label [{label_col}]:")
            col_weighting = get_label_column_config(
                label_col, label_weighting["default"], label_weighting["columns"]
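The weight-continuity fix described in the commit message computes recency
weights over the combined train+test range and slices afterwards. A sketch
of that idea; the exponential decay below is an assumption standing in for
dk.set_weights_higher_recent(), which the committed code actually calls:

    import numpy as np

    def recency_weights(num_weights: int, weight_factor: float) -> np.ndarray:
        # Exponential decay, reversed so the newest sample weighs the most.
        return np.exp(-np.arange(num_weights) / (weight_factor * num_weights))[::-1]

    n_train, n_test = 90, 20
    total = recency_weights(n_train + n_test, weight_factor=0.9)
    train_weights, test_weights = total[:n_train], total[n_train:]

    # Weighting the combined range first keeps the curve continuous across
    # the train/test boundary, matching make_train_test_datasets semantics.
    assert train_weights[-1] < test_weights[0] <= 1.0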