- [Reward & Shaping](#reward--shaping)
- [Diagnostics & Validation](#diagnostics--validation)
- [Overrides](#overrides)
- - [Reward Parameter Cheat Sheet](#reward-parameter-cheat-sheet)
+ - [Reward Tunables Reference](#reward-tunables-reference)
- [Exit Attenuation Kernels](#exit-attenuation-kernels)
- [Transform Functions](#transform-functions)
- [Skipping Feature Analysis](#skipping-feature-analysis)
scalars (`profit_aim`, `risk_reward_ratio`, `action_masking`). Conflicts: when an
individual flag and `--params` both set the same key, `--params` wins.
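
As a sketch of that precedence (the `resolve_params` helper below is hypothetical, not part of the codebase), the merge is a plain per-key override:

```python
# Hypothetical sketch of the precedence rule: keys given via --params
# override the same keys given as individual flags.
def resolve_params(flag_values: dict, params_json: dict) -> dict:
    resolved = dict(flag_values)   # start from individual flag values
    resolved.update(params_json)   # --params wins on conflicts
    return resolved

# e.g. --profit_aim 0.02 together with --params '{"profit_aim": 0.03}'
resolved = resolve_params({"profit_aim": 0.02}, {"profit_aim": 0.03})
assert resolved["profit_aim"] == 0.03
```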
-### Reward Parameter Cheat Sheet
+### Reward Tunables Reference
#### Core
##### PnL Target
-| Parameter | Default | Description |
-| ------------------- | ------- | ----------------------------- |
-| `profit_aim` | 0.03 | Profit target threshold |
-| `risk_reward_ratio` | 2.0 | Risk/reward multiplier |
-| `win_reward_factor` | 2.0 | Profit target bonus factor |
-| `pnl_factor_beta` | 0.5 | PnL amplification sensitivity |
+| Parameter | Default | Description |
+| ------------------------------- | ------- | ----------------------------- |
+| `profit_aim` | 0.03 | Profit target threshold |
+| `risk_reward_ratio` | 2.0 | Risk/reward multiplier |
+| `win_reward_factor` | 2.0 | Profit target bonus factor |
+| `pnl_amplification_sensitivity` | 0.5 | PnL amplification sensitivity |
**Note:** In ReforceXY, `risk_reward_ratio` maps to `rr`.
- If `pnl_target ≤ 0`: `pnl_target_coefficient = 1.0`
- If `pnl_ratio > 1.0`:
- `pnl_target_coefficient = 1.0 + win_reward_factor × tanh(pnl_factor_beta × (pnl_ratio − 1.0))`
+ `pnl_target_coefficient = 1.0 + win_reward_factor × tanh(pnl_amplification_sensitivity × (pnl_ratio − 1.0))`
- If `|pnl_ratio| > 1.0` and `pnl_ratio < −(1.0 / risk_reward_ratio)`:
- `pnl_target_coefficient = 1.0 + (win_reward_factor × risk_reward_ratio) × tanh(pnl_factor_beta × (|pnl_ratio| − 1.0))`
+ `pnl_target_coefficient = 1.0 + (win_reward_factor × risk_reward_ratio) × tanh(pnl_amplification_sensitivity × (|pnl_ratio| − 1.0))`
- Else: `pnl_target_coefficient = 1.0`
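
A minimal, self-contained sketch of the piecewise definition above. The standalone helper is illustrative (its name is borrowed from the `pnl_target_coefficient` variable in the implementation); the `|pnl_ratio| > 1.0` gate mirrors the code shown later in this diff:

```python
import math

def pnl_target_coefficient(
    pnl: float,
    pnl_target: float,
    risk_reward_ratio: float = 2.0,
    win_reward_factor: float = 2.0,
    pnl_amplification_sensitivity: float = 0.5,
) -> float:
    # Degenerate target: no amplification either way.
    if pnl_target <= 0.0:
        return 1.0
    pnl_ratio = pnl / pnl_target
    if abs(pnl_ratio) > 1.0:
        base = math.tanh(pnl_amplification_sensitivity * (abs(pnl_ratio) - 1.0))
        if pnl_ratio > 1.0:
            # Bonus for overshooting the profit target.
            return 1.0 + win_reward_factor * base
        if pnl_ratio < -(1.0 / risk_reward_ratio):
            # Loss amplification, scaled by the risk/reward ratio.
            return 1.0 + (win_reward_factor * risk_reward_ratio) * base
    return 1.0

# Matches the worked test setup below: pnl_target = 0.03 * 1.5 = 0.045.
assert pnl_target_coefficient(0.0675, 0.045, risk_reward_ratio=1.5) > 1.0
assert pnl_target_coefficient(-0.06, 0.045, risk_reward_ratio=1.5) > 1.0
```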
##### Efficiency
`risk_reward_ratio`, `action_masking`.
**Reward tunables** (settable via either a direct flag or `--params`) correspond to
-those listed under Reward Parameter Cheat Sheet: Core, Duration Penalties, Exit
+those listed under Reward Tunables Reference: Core, Duration Penalties, Exit
Attenuation, Efficiency, Validation, PBRS, Hold/Entry/Exit Potential Transforms.
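
Purely for illustration (the payload below is an assumed example, though every key in it appears in the tables above), a `--params` value bundling tunables from several of those groups might be built like this:

```python
import json

# Hypothetical --params payload touching several tunable groups;
# keys come straight from the Reward Tunables Reference tables.
params_json = json.dumps(
    {
        "profit_aim": 0.03,                    # Core / PnL Target
        "pnl_amplification_sensitivity": 0.5,  # Core / PnL Target
        "efficiency_weight": 1.0,              # Efficiency
        "potential_gamma": 0.95,               # PBRS
    }
)
print(params_json)  # paste as the value of --params
```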
## Examples
"efficiency_center": 0.5,
# Profit factor defaults
"win_reward_factor": 2.0,
- "pnl_factor_beta": 0.5,
+ "pnl_amplification_sensitivity": 0.5,
# Invariant / safety defaults
"check_invariants": True,
"exit_factor_threshold": 1000.0,
"efficiency_weight": "Efficiency weight",
"efficiency_center": "Efficiency pivot in [0,1]",
"win_reward_factor": "Profit overshoot bonus factor",
- "pnl_factor_beta": "PnL amplification sensitivity",
+ "pnl_amplification_sensitivity": "PnL amplification sensitivity",
"check_invariants": "Enable runtime invariant checks",
"exit_factor_threshold": "Warn if |exit_factor| exceeds",
# PBRS parameters
"efficiency_weight": {"min": 0.0, "max": 2.0},
"efficiency_center": {"min": 0.0, "max": 1.0},
"win_reward_factor": {"min": 0.0},
- "pnl_factor_beta": {"min": 1e-6},
+ "pnl_amplification_sensitivity": {"min": 1e-6},
# PBRS parameter bounds
"potential_gamma": {"min": 0.0, "max": 1.0},
"exit_potential_decay": {"min": 0.0, "max": 1.0},
if pnl_target > 0.0:
win_reward_factor = _get_float_param(params, "win_reward_factor")
- pnl_factor_beta = _get_float_param(params, "pnl_factor_beta")
+ pnl_amplification_sensitivity = _get_float_param(params, "pnl_amplification_sensitivity")
rr = risk_reward_ratio if risk_reward_ratio > 0 else RISK_REWARD_RATIO_DEFAULT
pnl_ratio = pnl / pnl_target
if abs(pnl_ratio) > 1.0:
- base_pnl_target_coefficient = math.tanh(pnl_factor_beta * (abs(pnl_ratio) - 1.0))
+ base_pnl_target_coefficient = math.tanh(
+ pnl_amplification_sensitivity * (abs(pnl_ratio) - 1.0)
+ )
if pnl_ratio > 1.0:
pnl_target_coefficient = 1.0 + win_reward_factor * base_pnl_target_coefficient
elif pnl_ratio < -(1.0 / rr):
center_unrealized = 0.5 * (
context.max_unrealized_profit + context.min_unrealized_profit
)
- beta = _get_float_param(params, "pnl_factor_beta")
+ beta = _get_float_param(params, "pnl_amplification_sensitivity")
next_pnl = float(center_unrealized * math.tanh(beta * next_duration_ratio))
else:
next_pnl = current_pnl
**Setup:**
- PnL: 150% of pnl_target (exceeds target by 50%)
- pnl_target: 0.045 (profit_aim=0.03 * risk_reward_ratio=1.5)
- - Parameters: win_reward_factor=2.0, pnl_factor_beta=0.5
+ - Parameters: win_reward_factor=2.0, pnl_amplification_sensitivity=0.5
**Assertions:**
- Coefficient is finite
- Coefficient > 1.0 (rewards exceeding target)
"""
- params = self.base_params(win_reward_factor=2.0, pnl_factor_beta=0.5)
+ params = self.base_params(win_reward_factor=2.0, pnl_amplification_sensitivity=0.5)
profit_aim = 0.03
risk_reward_ratio = 1.5
pnl_target = profit_aim * risk_reward_ratio
- PnL: -0.06 (exceeds pnl_target magnitude)
- pnl_target: 0.045 (profit_aim=0.03 * risk_reward_ratio=1.5)
- Penalty threshold: pnl < -pnl_target = -0.045
- - Parameters: win_reward_factor=2.0, pnl_factor_beta=0.5
+ - Parameters: win_reward_factor=2.0, pnl_amplification_sensitivity=0.5
**Assertions:**
- Coefficient is finite
- Coefficient > 1.0 (amplifies loss penalty)
"""
- params = self.base_params(win_reward_factor=2.0, pnl_factor_beta=0.5)
+ params = self.base_params(win_reward_factor=2.0, pnl_amplification_sensitivity=0.5)
profit_aim = 0.03
risk_reward_ratio = 1.5
pnl_target = profit_aim * risk_reward_ratio # 0.045
pnl_target = profit_aim * risk_reward_ratio
params = self.base_params(
win_reward_factor=win_reward_factor,
- pnl_factor_beta=beta,
+ pnl_amplification_sensitivity=beta,
efficiency_weight=0.0,
exit_attenuation_mode="linear",
exit_plateau=False,
params = {
"hold_potential_enabled": True,
"unrealized_pnl": True,
- "pnl_factor_beta": 0.5,
+ "pnl_amplification_sensitivity": 0.5,
}
breakdown = calculate_reward_with_defaults(
context,
)
gamma = _get_float_param(
- params, "potential_gamma", DEFAULT_MODEL_REWARD_PARAMETERS.get("potential_gamma", 0.95)
+ params,
+ "potential_gamma",
+ DEFAULT_MODEL_REWARD_PARAMETERS.get("potential_gamma", 0.95),
)
expected_next_potential = (
prev_potential / gamma if gamma not in (0.0, None) else prev_potential
potential_gamma=0.9,
)
gamma = _get_float_param(
- params, "potential_gamma", DEFAULT_MODEL_REWARD_PARAMETERS.get("potential_gamma", 0.95)
+ params,
+ "potential_gamma",
+ DEFAULT_MODEL_REWARD_PARAMETERS.get("potential_gamma", 0.95),
)
rng = np.random.default_rng(555)
potentials = rng.uniform(0.05, 0.85, size=220)
exit_potential_mode="canonical",
)
gamma = _get_float_param(
- params, "potential_gamma", DEFAULT_MODEL_REWARD_PARAMETERS.get("potential_gamma", 0.95)
+ params,
+ "potential_gamma",
+ DEFAULT_MODEL_REWARD_PARAMETERS.get("potential_gamma", 0.95),
)
rng = np.random.default_rng(321)
prev_potential = 0.0
DEFAULT_EXIT_LINEAR_SLOPE: Final[float] = 1.0
DEFAULT_EXIT_HALF_LIFE: Final[float] = 0.5
- DEFAULT_PNL_FACTOR_BETA: Final[float] = 0.5
+ DEFAULT_PNL_AMPLIFICATION_SENSITIVITY: Final[float] = 0.5
DEFAULT_WIN_REWARD_FACTOR: Final[float] = 2.0
DEFAULT_EFFICIENCY_WEIGHT: Final[float] = 1.0
DEFAULT_EFFICIENCY_CENTER: Final[float] = 0.5
pnl_target_coefficient = 1.0
if pnl_target > 0.0:
- pnl_factor_beta = float(
+ pnl_amplification_sensitivity = float(
model_reward_parameters.get(
- "pnl_factor_beta", ReforceXY.DEFAULT_PNL_FACTOR_BETA
+ "pnl_amplification_sensitivity",
+ ReforceXY.DEFAULT_PNL_AMPLIFICATION_SENSITIVITY,
)
)
pnl_ratio = pnl / pnl_target
if abs(pnl_ratio) > 1.0:
base_pnl_target_coefficient = math.tanh(
- pnl_factor_beta * (abs(pnl_ratio) - 1.0)
+ pnl_amplification_sensitivity * (abs(pnl_ratio) - 1.0)
)
win_reward_factor = float(
model_reward_parameters.get(