From: Jérôme Benoit
Date: Thu, 4 Jun 2026 22:29:58 +0000 (+0200)
Subject: feat(label_weighting): adaptive k-NN bandwidth for gaussian off-pivot fill (#77)
X-Git-Url: https://git.piment-noir.org/?a=commitdiff_plain;h=d54e9e5608f8b8e0bb6f1bc891b6f7eff45381f8;p=freqai-strategies.git

feat(label_weighting): adaptive k-NN bandwidth for gaussian off-pivot fill (#77)

* feat(label_weighting): adaptive k-NN bandwidth for gaussian off-pivot fill

Address the crushing of weaker pivots by stronger neighbors when pivots
fall within ~sigma_candles of each other in fill_method='gaussian'. The
per-row max aggregator preserves the upper bound Out[i] <= max_p w_p but
a wide constant sigma lets a strong neighbor's Gaussian dominate a weak
pivot's tail.

Add a k-nearest-neighbor bandwidth selector (Loftsgaarden & Quesenberry
1965; Silverman 1986, paragraph 5.2) that adapts each pivot's sigma to
local pivot density:

    sigma_p = clip(alpha * d_k(p), sigma_min, sigma_max)

where d_k(p) is the index distance to the k-th pivot neighbor. The upper
bound on Out[i] is preserved (no over-amplification) and dense clusters
automatically contract their Gaussians to stop overlapping.

Implementation:
- Pivots are emitted chronologically by zigzag, so the 1D k-NN reduces to
  a sliding k-window over sorted indices, O(M) without a spatial index.
- _gaussian_fill_weights accepts a per-pivot sigma vector via NumPy
  broadcasting; the existing chunked exp/multiply/max kernel is unchanged.
- Default fill_bandwidth='fixed' preserves byte-for-byte the previous
  algorithm.

Tunables (added to DEFAULTS_LABEL_WEIGHTING, validated via _WEIGHTING_SPECS):
- fill_bandwidth: 'fixed' | 'knn' (default 'fixed')
- fill_bandwidth_neighbors: int >= 1 (default 1)
- fill_bandwidth_alpha: float > 0 (default 1.0)
- fill_sigma_min_candles: float >= 0.5 (default 0.5)

README updated.

* fix(label_weighting): correct gaussian kNN bandwidth

* chore(quickadapter): bump strategy and regressor version 3.11.12 -> 3.11.13
---

diff --git a/README.md b/README.md
index 5947dc3..8f898fa 100644
--- a/README.md
+++ b/README.md
@@ -82,7 +82,11 @@ docker compose up -d --build
 | freqai.label_weighting.fill_method | `zero` | enum {`zero`,`epsilon`,`gaussian`} | Off-pivot weighting scheme. `zero` hard-zeros off-pivot rows; `epsilon` applies a flat baseline `fill_epsilon * (pivot_weights)`; `gaussian` applies heatmap-style decay around each pivot. Switching away from `zero` may require retuning tree-leaf regularization (`min_child_weight`, `lambda`) and resetting any prior Optuna study. Changing this parameter requires deleting trained models. |
 | freqai.label_weighting.fill_epsilon | 0.001 | float [0,1] | Off-pivot fraction of the pivot baseline. Ignored when `fill_method != "epsilon"`. |
 | freqai.label_weighting.fill_epsilon_baseline | `mean` | enum {`mean`,`median`} | Pivot baseline statistic. `mean` tracks central tendency; `median` is robust against pivot-weight skew. Ignored when `fill_method != "epsilon"`. |
-| freqai.label_weighting.fill_sigma_candles | 3.0 | float >= 0.5 | Gaussian standard deviation in candles for `fill_method == "gaussian"`. Lower bound 0.5 prevents underflow that silently degrades to `zero` mode. Ignored when `fill_method != "gaussian"`. |
+| freqai.label_weighting.fill_sigma_candles | 3.0 | float >= 0.5 | Gaussian standard deviation in candles for `fill_method == "gaussian"`. Acts as the upper bound on per-pivot sigma when `fill_bandwidth == "knn"`. Lower bound 0.5 prevents severe underflow in the Gaussian tail. Ignored when `fill_method != "gaussian"`. |
+| freqai.label_weighting.fill_sigma_min_candles | 0.5 | float >= 0.5 | Lower bound on per-pivot sigma in candles when `fill_bandwidth == "knn"`. Clipped to `fill_sigma_candles` when larger. Ignored when `fill_method != "gaussian"` or `fill_bandwidth != "knn"`. |
+| freqai.label_weighting.fill_bandwidth | `fixed` | enum {`fixed`,`knn`} | Per-pivot Gaussian bandwidth selector. `fixed` applies a constant `fill_sigma_candles` to every pivot (legacy behavior). `knn` adapts each pivot's sigma to local pivot density via `sigma_p = clip(fill_bandwidth_alpha * d_k(p), fill_sigma_min_candles, fill_sigma_candles)` where `d_k(p)` is the index distance to the `k`-th nearest pivot neighbor (Loftsgaarden & Quesenberry 1965; Silverman 1986, §5.2). Mitigates the crushing of weaker pivots by stronger neighbors in dense clusters. Ignored when `fill_method != "gaussian"`. |
+| freqai.label_weighting.fill_bandwidth_neighbors | 1 | int >= 1 | `k` for the k-nearest-neighbor bandwidth selector. Ignored when `fill_method != "gaussian"` or `fill_bandwidth != "knn"`. |
+| freqai.label_weighting.fill_bandwidth_alpha | 1.0 | float > 0 | Multiplicative factor on the k-th neighbor distance. Smaller values produce sharper, more separated Gaussians; larger values approach the `fixed` behavior. Ignored when `fill_method != "gaussian"` or `fill_bandwidth != "knn"`. |
 | _Label pipeline_ | | | |
 | freqai.label_pipeline.standardization | `none` | enum {`none`,`zscore`,`robust`,`mmad`,`power_yj`} | Standardization method applied to labels before normalization. `none`=w, `zscore`=(w-μ)/σ, `robust`=(w-median)/(Q₃-Q₁), `mmad`=(w-median)/(MAD·k), `power_yj`=YJ(w). |
 | freqai.label_pipeline.robust_quantiles | [0.25, 0.75] | list[float] where 0 <= Q1 < Q3 <= 1 | Quantile range for robust standardization, Q1 and Q3. |
diff --git a/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py b/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
index 70a6424..3a55d2e 100644
--- a/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
+++ b/quickadapter/user_data/freqaimodels/QuickAdapterRegressorV3.py
@@ -102,7 +102,7 @@ class QuickAdapterRegressorV3(BaseRegressionModel):
     https://github.com/sponsors/robcaulk
     """
 
-    version = "3.11.12"
+    version = "3.11.13"
 
     _TEST_SIZE: Final[float] = 0.1
 
diff --git a/quickadapter/user_data/strategies/LabelTransformer.py b/quickadapter/user_data/strategies/LabelTransformer.py
index 16fb833..73481d1 100644
--- a/quickadapter/user_data/strategies/LabelTransformer.py
+++ b/quickadapter/user_data/strategies/LabelTransformer.py
@@ -77,6 +77,12 @@ FILL_EPSILON_BASELINES: Final[tuple[FillEpsilonBaseline, ...]] = (
     "median",  # 1 - robust against pivot-weight skew
 )
 
+FillBandwidth = Literal["fixed", "knn"]
+FILL_BANDWIDTHS: Final[tuple[FillBandwidth, ...]] = (
+    "fixed",  # 0 - constant sigma = fill_sigma_candles
+    "knn",  # 1 - per-pivot sigma from k-nearest-neighbor index distance
+)
+
 StandardizationType = Literal["none", "zscore", "robust", "mmad", "power_yj"]
 STANDARDIZATION_TYPES: Final[tuple[StandardizationType, ...]] = (
     "none",  # 0 - w
@@ -103,6 +109,10 @@ DEFAULTS_LABEL_WEIGHTING: Final[dict[str, Any]] = {
     "fill_epsilon": 1e-3,
     "fill_epsilon_baseline": FILL_EPSILON_BASELINES[0],  # "mean"
     "fill_sigma_candles": 3.0,
+    "fill_sigma_min_candles": 0.5,
+    "fill_bandwidth": FILL_BANDWIDTHS[0],  # "fixed"
+    "fill_bandwidth_neighbors": 1,
+    "fill_bandwidth_alpha": 1.0,
 }
 
 DEFAULTS_LABEL_PIPELINE: Final[dict[str, Any]] = {
diff --git a/quickadapter/user_data/strategies/QuickAdapterV3.py b/quickadapter/user_data/strategies/QuickAdapterV3.py
index c5a5e62..d2c7ee8 100644
--- a/quickadapter/user_data/strategies/QuickAdapterV3.py
+++ b/quickadapter/user_data/strategies/QuickAdapterV3.py
@@ -115,7 +115,7 @@ class QuickAdapterV3(IStrategy):
     _ANNOTATION_LINE_OFFSET_CANDLES: Final[int] = 10
 
     def version(self) -> str:
-        return "3.11.12"
+        return "3.11.13"
 
     timeframe = "5m"
     timeframe_minutes = timeframe_to_minutes(timeframe)
@@ -512,6 +512,16 @@ class QuickAdapterV3(IStrategy):
                 logger.info(
                     f" fill_sigma_candles: {format_number(col_weighting['fill_sigma_candles'])}"
                 )
+                logger.info(
+                    f" fill_sigma_min_candles: {format_number(col_weighting['fill_sigma_min_candles'])}"
+                )
+                logger.info(f" fill_bandwidth: {col_weighting['fill_bandwidth']}")
+                logger.info(
+                    f" fill_bandwidth_neighbors: {col_weighting['fill_bandwidth_neighbors']}"
+                )
+                logger.info(
+                    f" fill_bandwidth_alpha: {format_number(col_weighting['fill_bandwidth_alpha'])}"
+                )
 
             col_smoothing = get_label_column_config(
                 label_col, label_smoothing["default"], label_smoothing["columns"]
diff --git a/quickadapter/user_data/strategies/Utils.py b/quickadapter/user_data/strategies/Utils.py
index aa28d7c..a489c90 100644
--- a/quickadapter/user_data/strategies/Utils.py
+++ b/quickadapter/user_data/strategies/Utils.py
@@ -33,6 +33,7 @@ from LabelTransformer import (
     DEFAULTS_LABEL_SMOOTHING,
     DEFAULTS_LABEL_WEIGHTING,
     EXTREMA_SELECTION_METHODS,
+    FILL_BANDWIDTHS,
     FILL_EPSILON_BASELINES,
     FILL_METHODS,
     NORMALIZATION_TYPES,
@@ -205,6 +206,16 @@ _WEIGHTING_SPECS: Final[dict[str, _ParamSpec]] = {
     "fill_sigma_candles": _ParamSpec(
         _NumericValidator(min_value=0.5), output_type=float
     ),
+    "fill_sigma_min_candles": _ParamSpec(
+        _NumericValidator(min_value=0.5), output_type=float
+    ),
+    "fill_bandwidth": _ParamSpec(_EnumValidator(FILL_BANDWIDTHS)),
+    "fill_bandwidth_neighbors": _ParamSpec(
+        _NumericValidator(min_value=1, require_int=True), output_type=int
+    ),
+    "fill_bandwidth_alpha": _ParamSpec(
+        _NumericValidator(min_value=0, min_exclusive=True), output_type=float
+    ),
 }
 
 _PIPELINE_SPECS: Final[dict[str, _ParamSpec]] = {
@@ -1079,21 +1090,88 @@ _GAUSSIAN_FILL_CHUNK_BUDGET: Final[int] = 50_000_000
 _GAUSSIAN_FILL_DENSITY_WARN: Final[float] = 0.1
 
 
+def _compute_pivot_sigmas(
+    pivot_indices: NDArray[np.floating],
+    sigma_candles: float,
+    bandwidth: str,
+    neighbors: int,
+    alpha: float,
+    sigma_min_candles: float,
+) -> NDArray[np.floating]:
+    """Per-pivot Gaussian standard deviation in candles.
+
+    For ``bandwidth == "fixed"`` returns a scalar broadcast (constant ``sigma_candles``).
+    For ``bandwidth == "knn"`` applies a k-nearest-neighbor bandwidth selector
+    (Loftsgaarden & Quesenberry 1965; Silverman 1986, §5.2):
+
+        sigma_p = clip( alpha * d_k(p), sigma_min_candles, sigma_candles )
+
+    where ``d_k(p)`` is the index distance from pivot ``p`` to its ``k``-th
+    nearest pivot neighbor. Only the ``k`` candidates on either side can contain
+    the ``k``-th nearest neighbor on the 1D candle index.
+    """
+    M = pivot_indices.size
+    if bandwidth == FILL_BANDWIDTHS[0] or M <= 1:  # "fixed" or trivial
+        return np.full(M, float(sigma_candles), dtype=float)
+    if bandwidth != FILL_BANDWIDTHS[1]:  # "knn"
+        raise ValueError(
+            f"Invalid fill_bandwidth value {bandwidth!r}: "
+            f"supported values are {', '.join(FILL_BANDWIDTHS)}"
+        )
+
+    sorted_idx = np.argsort(pivot_indices, kind="stable")
+    sorted_positions = pivot_indices[sorted_idx]
+    k = min(int(neighbors), M - 1)
+
+    d_k_sorted = np.empty(M, dtype=float)
+    for i, position in enumerate(sorted_positions):
+        left = max(0, i - k)
+        right = min(M, i + k + 1)
+        candidate_distances = np.abs(
+            np.concatenate(
+                (
+                    sorted_positions[left:i] - position,
+                    sorted_positions[i + 1 : right] - position,
+                )
+            )
+        )
+        d_k_sorted[i] = np.partition(candidate_distances, k - 1)[k - 1]
+    d_k = np.empty(M, dtype=float)
+    d_k[sorted_idx] = d_k_sorted
+
+    sigmas = float(alpha) * d_k
+    sigma_max = float(sigma_candles)
+    sigma_min = float(sigma_min_candles)
+    if sigma_min > sigma_max:
+        sigma_min = sigma_max
+    return np.clip(sigmas, sigma_min, sigma_max)
+
+
 def _gaussian_fill_weights(
     n_values: int,
     pivot_indices: NDArray[np.integer],
     pivot_weights: NDArray[np.floating],
     sigma_candles: float,
     *,
+    bandwidth: str = FILL_BANDWIDTHS[0],
+    bandwidth_neighbors: int = 1,
+    bandwidth_alpha: float = 1.0,
+    sigma_min_candles: float = 0.5,
     logger: Logger | None = None,
 ) -> NDArray[np.floating]:
     """Per-row max of Gaussian-decayed pivot weights.
 
-    Out[i] = max over p of ``w_p * exp(-(i - p)**2 / (2 * sigma**2))``.
-    With clustered pivots within ``~sigma_candles``, the per-row max
-    lets a stronger neighbor dominate weaker ones; pick
-    ``sigma_candles <= label_period_candles / 2`` to preserve pivot
-    identity.
+    Out[i] = max over p of ``w_p * exp(-(i - p)**2 / (2 * sigma_p**2))``.
+
+    With ``bandwidth == "fixed"``, ``sigma_p == sigma_candles`` for every
+    pivot. Clustered pivots within ``~sigma_candles`` then let the strongest
+    neighbor dominate weaker ones in the per-row max ("crushing" effect):
+    pick ``sigma_candles <= label_period_candles / 2`` to mitigate.
+
+    With ``bandwidth == "knn"``, ``sigma_p`` contracts to ``alpha * d_k(p)``
+    (clipped to ``[sigma_min_candles, sigma_candles]``) so neighboring
+    Gaussians overlap less in dense regions, mitigating the crushing effect
+    while preserving the upper bound ``Out[i] <= max_p w_p``.
     """
     if sigma_candles < 0.5:
         raise ValueError(
@@ -1107,7 +1185,15 @@ def _gaussian_fill_weights(
         )
     pivot_indices_array = pivot_indices.astype(float)
     pivot_weights_row = pivot_weights.astype(float)[np.newaxis, :]
-    inv_two_sigma_sq = 0.5 / (sigma_candles * sigma_candles)
+    pivot_sigmas = _compute_pivot_sigmas(
+        pivot_indices=pivot_indices_array,
+        sigma_candles=sigma_candles,
+        bandwidth=bandwidth,
+        neighbors=bandwidth_neighbors,
+        alpha=bandwidth_alpha,
+        sigma_min_candles=sigma_min_candles,
+    )
+    inv_two_sigma_sq_row = (0.5 / (pivot_sigmas * pivot_sigmas))[np.newaxis, :]
     M = pivot_indices_array.size
     if (
         logger is not None
@@ -1125,11 +1211,15 @@ def _gaussian_fill_weights(
     chunk = max(1, _GAUSSIAN_FILL_CHUNK_BUDGET // max(M, 1))
     if logger is not None and chunk < n_values:
         logger.debug(
-            "gaussian_fill: N=%d, M=%d, chunk=%d, ~%.0f MB peak buffer",
+            "gaussian_fill: N=%d, M=%d, chunk=%d, ~%.0f MB peak buffer, "
+            "bandwidth=%s, sigma=[%.2f, %.2f]",
             n_values,
             M,
             chunk,
             chunk * M * 8 / 1e6,
+            bandwidth,
+            float(pivot_sigmas.min()),
+            float(pivot_sigmas.max()),
         )
     out = np.zeros(n_values, dtype=float)
     for start in range(0, n_values, chunk):
@@ -1137,7 +1227,7 @@ def _gaussian_fill_weights(
         positions = np.arange(start, stop, dtype=float)
         buf = positions[:, np.newaxis] - pivot_indices_array[np.newaxis, :]
         np.multiply(buf, buf, out=buf)
-        np.multiply(buf, -inv_two_sigma_sq, out=buf)
+        np.multiply(buf, -inv_two_sigma_sq_row, out=buf)
         np.exp(buf, out=buf)
         np.multiply(buf, pivot_weights_row, out=buf)
         np.max(buf, axis=1, out=out[start:stop])
@@ -1362,6 +1452,10 @@ def compute_label_weights(
             pivot_indices=indices_array[valid_mask],
             pivot_weights=weights[valid_mask],
             sigma_candles=label_weighting["fill_sigma_candles"],
+            bandwidth=label_weighting["fill_bandwidth"],
+            bandwidth_neighbors=label_weighting["fill_bandwidth_neighbors"],
+            bandwidth_alpha=label_weighting["fill_bandwidth_alpha"],
+            sigma_min_candles=label_weighting["fill_sigma_min_candles"],
             logger=logger,
         )
     else: