From 1f5bbdebccb36d7b16ced6f42e9442cbcd203689 Mon Sep 17 00:00:00 2001
From: =?utf8?q?J=C3=A9r=C3=B4me=20Benoit?=
Date: Sat, 4 Oct 2025 13:12:31 +0200
Subject: [PATCH] docs(reforcexy): refine reward space analysis documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=utf8
Content-Transfer-Encoding: 8bit

Signed-off-by: Jérôme Benoit
---
 ReforceXY/reward_space_analysis/README.md | 83 +++++++++++++++++------
 1 file changed, 62 insertions(+), 21 deletions(-)

diff --git a/ReforceXY/reward_space_analysis/README.md b/ReforceXY/reward_space_analysis/README.md
index cee0b83..3a20a26 100644
--- a/ReforceXY/reward_space_analysis/README.md
+++ b/ReforceXY/reward_space_analysis/README.md
@@ -9,6 +9,7 @@
 This tool helps you understand and validate how the ReforceXY reinforcement learning environment calculates rewards. It generates synthetic trading scenarios to analyze reward behavior across different market conditions.

 ### Key Features
+
 - ✅ Generate thousands of trading scenarios instantly
 - ✅ Analyze reward distribution and patterns
 - ✅ Validate reward logic against expected behavior
@@ -24,11 +25,13 @@ This tool helps you understand and validate how the ReforceXY reinforcement lear
 ## 📦 Prerequisites

 ### System Requirements
+
 - Python 3.8+
 - 4GB RAM minimum (8GB recommended for large analyses)
 - No GPU required

 ### Virtual environment setup
+
 Keep the tooling self-contained by creating a virtual environment directly inside `ReforceXY/reward_space_analysis` and installing packages against it:

 ```shell
@@ -49,13 +52,14 @@ python test_reward_alignment.py

 > Deactivate the environment with `deactivate` when you're done.

-Unless otherwise noted, the command examples below assume your current working directory is `ReforceXY/reward_space_analysis` (and the optional virtual environment is activated).
+Unless otherwise noted, the command examples below assume your current working directory is `ReforceXY/reward_space_analysis` (and the virtual environment is activated).

 ---

 ## 💡 Common Use Cases

 ### 1. Validate Reward Logic
+
 **Goal:** Ensure rewards behave as expected in different scenarios

 ```shell
@@ -63,11 +67,13 @@ python reward_space_analysis.py --num_samples 20000 --output validation
 ```

 **Check in `statistical_analysis.md`:**
+
 - Long/Short exits should have positive average rewards
 - Invalid actions should have negative penalties
 - Idle periods should reduce rewards

 ### 2. Analyze Parameter Sensitivity
+
 **Goal:** See how reward parameters affect trading behavior

 ```shell
@@ -86,26 +92,29 @@ python reward_space_analysis.py \
 **Compare:** Reward distributions between runs in `statistical_analysis.md`

 ### 3. Debug Reward Issues
+
 **Goal:** Identify why your RL agent behaves unexpectedly

 ```shell
-# Generate detailed analysis (statistical validation is now default)
+# Generate detailed analysis
 python reward_space_analysis.py \
     --num_samples 50000 \
     --output debug_analysis
 ```

 **Look at:**
+
 - `statistical_analysis.md` - Comprehensive report with:
   - Feature importance and model diagnostics
   - Statistical significance of relationships
   - Hypothesis tests and confidence intervals

 ### 4. Compare Real vs Synthetic Data
+
 **Goal:** Validate synthetic analysis against real trading

 ```shell
-# First, collect real episodes (see Advanced section)
+# First, collect real episodes (see Advanced Usage section)
 # Then compare:
 python reward_space_analysis.py \
     --num_samples 100000 \
@@ -124,89 +133,107 @@ None - all parameters have sensible defaults.
 ### Core Simulation Parameters

 **`--num_samples`** (int, default: 20000)
+
 - Number of synthetic trading scenarios to generate
 - More samples = more accurate statistics but slower analysis
 - Recommended: 10,000 (quick test), 50,000 (standard), 100,000+ (detailed)

 **`--seed`** (int, default: 42)
+
 - Random seed for reproducibility
 - Use same seed to get identical results across runs

 **`--max_trade_duration`** (int, default: 128)
+
 - Maximum trade duration in candles (from environment config)
 - Should match your actual trading environment setting

 ### Reward Configuration

 **`--base_factor`** (float, default: 100.0)
+
 - Base reward scaling factor (from environment config)
 - Should match your environment's base_factor

 **`--profit_target`** (float, default: 0.03)
+
 - Target profit threshold as decimal (e.g., 0.03 = 3%)
 - Used for efficiency calculations and holding penalties

 **`--risk_reward_ratio`** (float, default: 1.0)
+
 - Risk/reward ratio multiplier
 - Affects profit target adjustment in reward calculations

 **`--holding_max_ratio`** (float, default: 2.5)
+
 - Multiple of max_trade_duration used for sampling trade/idle durations
 - Higher = more variety in duration scenarios

 ### Trading Environment

 **`--trading_mode`** (choice: spot|margin|futures, default: spot)
+
 - **spot**: Disables short selling
 - **margin**: Enables short positions
 - **futures**: Enables short positions

 **`--action_masking`** (choice: true|false|1|0|yes|no, default: true)
+
 - Enable/disable action masking simulation
 - Should match your environment configuration

 ### Output Configuration

 **`--output`** (path, default: reward_space_outputs)
+
 - Output directory for all generated files
 - Will be created if it doesn't exist

 **`--params`** (key=value pairs)
+
 - Override any reward parameter from DEFAULT_MODEL_REWARD_PARAMETERS
 - Format: `--params key1=value1 key2=value2`
 - Example: `--params win_reward_factor=3.0 idle_penalty_scale=2.0`

 **All tunable parameters (override with --params):**

-*Invalid action penalty:*
+_Invalid action penalty:_
+
 - `invalid_action` (default: -2.0) - Penalty for invalid actions

-*Idle penalty configuration:*
+_Idle penalty configuration:_
+
 - `idle_penalty_scale` (default: 1.0) - Scale of idle penalty
 - `idle_penalty_power` (default: 1.0) - Power applied to idle penalty scaling

-*Holding penalty configuration:*
+_Holding penalty configuration:_
+
 - `holding_duration_ratio_grace` (default: 1.0) - Grace ratio (≤1) before holding penalty increases with duration ratio
 - `holding_penalty_scale` (default: 0.3) - Scale of holding penalty
 - `holding_penalty_power` (default: 1.0) - Power applied to holding penalty scaling

-*Exit factor configuration:*
+_Exit factor configuration:_
+
 - `exit_factor_mode` (default: piecewise) - Time attenuation mode for exit factor (legacy|sqrt|linear|power|piecewise|half_life)
 - `exit_linear_slope` (default: 1.0) - Slope for linear exit attenuation
 - `exit_piecewise_grace` (default: 1.0) - Grace region for piecewise exit attenuation
 - `exit_piecewise_slope` (default: 1.0) - Slope after grace for piecewise mode
 - `exit_power_tau` (default: 0.5) - Tau in (0,1] to derive alpha for power mode
-- `exit_half_life` (default: 0.5) - Half-life for exponential attenuation exit mode
+- `exit_half_life` (default: 0.5) - Half-life for exponential decay exit mode
+
+_Efficiency configuration:_

-*Efficiency configuration:*
 - `efficiency_weight` (default: 0.75) - Weight for efficiency factor in exit reward
 - `efficiency_center` (default: 0.75) - Center for efficiency factor sigmoid

-*Profit factor configuration:*
+_Profit factor configuration:_
+
 - `win_reward_factor` (default: 2.0) - Amplification for PnL above target
 - `pnl_factor_beta` (default: 0.5) - Sensitivity of amplification around target

 **`--real_episodes`** (path, optional)
+
 - Path to real episode rewards pickle file for distribution comparison
 - Enables distribution shift analysis (KL divergence, JS distance, Wasserstein distance)
 - Example: `../user_data/models/ReforceXY-PPO/sub_train_SYMBOL_DATE/episode_rewards.pkl`
@@ -248,6 +275,7 @@ The analysis generates the following output files:
 ### Main Report

 **`statistical_analysis.md`** - Comprehensive statistical analysis containing:
+
 - **Global Statistics** - Reward distributions and component activation rates
 - **Sample Representativity** - Coverage of critical market scenarios
 - **Component Analysis** - Relationships between rewards and conditions
@@ -257,11 +285,11 @@ The analysis generates the following output files:

 ### Data Exports

-**`reward_samples.csv`** - Raw synthetic samples for custom analysis
-
-**`feature_importance.csv`** - Feature importance rankings from random forest model
-
-**`partial_dependence_*.csv`** - Partial dependence data for key features
+| File                       | Description                                          |
+| -------------------------- | ---------------------------------------------------- |
+| `reward_samples.csv`       | Raw synthetic samples for custom analysis            |
+| `feature_importance.csv`   | Feature importance rankings from random forest model |
+| `partial_dependence_*.csv` | Partial dependence data for key features             |

 ---

@@ -286,6 +314,7 @@ python reward_space_analysis.py \
 ```

 ### Real Data Comparison
+
 For production validation, compare synthetic analysis with real trading episodes:

 1. **Enable logging** in your ReforceXY config
@@ -302,6 +331,7 @@ python reward_space_analysis.py \
 The report will include distribution shift metrics (KL divergence, JS distance, Wasserstein distance) showing how well synthetic samples represent real trading.

 ### Batch Analysis
+
 ```shell
 # Test multiple parameter combinations
 for factor in 1.5 2.0 2.5 3.0; do
@@ -317,11 +347,13 @@ done
 ## 🧪 Validation & Testing

 ### Run Regression Tests
+
 ```shell
 python test_reward_alignment.py
 ```

 **Expected output:**
+
 ```
 ✅ ENUMS_MATCH: True
 ✅ DEFAULT_PARAMS_MATCH: True
@@ -336,6 +368,7 @@ python test_stat_coherence.py
 ```

 ### When to Run Tests
+
 - After modifying reward logic
 - Before important analyses
 - When results seem unexpected
@@ -349,6 +382,7 @@ python test_stat_coherence.py
 **Symptom:** `ModuleNotFoundError` or import errors

 **Solution:**
+
 ```shell
 pip install pandas numpy scipy scikit-learn
 ```
@@ -358,6 +392,7 @@ pip install pandas numpy scipy scikit-learn
 **Symptom:** Script completes but no files in output directory

 **Solution:**
+
 - Check write permissions in output directory
 - Ensure sufficient disk space (min 100MB free)
 - Verify Python path is correct
@@ -367,6 +402,7 @@ pip install pandas numpy scipy scikit-learn
 **Symptom:** Rewards don't match expected behavior

 **Solution:**
+
 - Run `test_reward_alignment.py` to validate logic
 - Review parameter overrides with `--params`
 - Check trading mode settings (spot vs margin/futures)
@@ -377,6 +413,7 @@ pip install pandas numpy scipy scikit-learn
 **Symptom:** Analysis takes excessive time to complete

 **Solution:**
+
 - Reduce `--num_samples` (start with 10,000)
 - Use `--trading_mode spot` (fewer action combinations)
 - Close other memory-intensive applications
@@ -387,6 +424,7 @@ pip install pandas numpy scipy scikit-learn
 **Symptom:** `MemoryError` or system freeze

 **Solution:**
+
 - Reduce sample size to 10,000-20,000
 - Use 64-bit Python installation
 - Add more RAM or configure swap file
@@ -416,18 +454,21 @@ python test_stat_coherence.py
 ### Best Practices

 **For Beginners:**
+
 - Start with 10,000-20,000 samples for quick iteration
 - Use default parameters initially
 - Always run tests after modifying reward logic
 - Review `statistical_analysis.md` for insights

 **For Advanced Users:**
+
 - Use 50,000+ samples for statistical significance
 - Compare multiple parameter sets via batch analysis
 - Validate synthetic analysis against real trading data with `--real_episodes`
 - Export CSV files for custom statistical analysis

 **Performance Optimization:**
+
 - Use SSD storage for faster I/O
 - Parallelize parameter sweeps across multiple runs
 - Cache results for repeated analyses
@@ -437,10 +478,10 @@ python test_stat_coherence.py

 For detailed troubleshooting, see [Troubleshooting](#-troubleshooting) section.

-| Issue | Quick Solution |
-|-------|----------------|
-| Memory errors | Reduce `--num_samples` to 10,000-20,000 |
-| Slow execution | Use `--trading_mode spot` or reduce samples |
+| Issue              | Quick Solution                                                |
+| ------------------ | ------------------------------------------------------------- |
+| Memory errors      | Reduce `--num_samples` to 10,000-20,000                       |
+| Slow execution     | Use `--trading_mode spot` or reduce samples                   |
 | Unexpected rewards | Run `test_reward_alignment.py` and check `--params` overrides |
-| Import errors | Activate venv: `source .venv/bin/activate` |
-| No output files | Check write permissions and disk space |
+| Import errors      | Activate venv: `source .venv/bin/activate`                    |
+| No output files    | Check write permissions and disk space                        |
-- 
2.43.0
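
Note on the distribution shift metrics: the README text in this patch says that supplying `--real_episodes` adds KL divergence, JS distance, and Wasserstein distance to the report. As a rough illustration of what those three numbers measure, here is a minimal standard-library sketch. It is not taken from `reward_space_analysis.py`: the function names, the 20-bin histogram, the smoothing constant, and the Gaussian stand-in samples are all assumptions made for this example, and the tool's own computation may differ.

```python
import math
import random


def binned_probs(samples, lo, hi, bins=20, eps=1e-9):
    """Histogram samples on [lo, hi]; return smoothed probabilities summing to 1."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        # Clamp the index so values at the edges land in the first or last bin.
        idx = max(0, min(int((x - lo) / width), bins - 1))
        counts[idx] += 1
    probs = [c / len(samples) + eps for c in counts]  # eps avoids log(0)
    total = sum(probs)
    return [p / total for p in probs]


def kl_divergence(p, q):
    """KL(p || q) in nats; asymmetric and unbounded."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence; symmetric and bounded."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)))


def wasserstein_1d(a, b):
    """1-D Wasserstein distance for equal-sized samples: mean gap between sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-sized samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical stand-ins for synthetic and real episode rewards (not real data).
rng = random.Random(42)
synthetic = [rng.gauss(0.0, 1.0) for _ in range(5000)]
real_like = [rng.gauss(0.3, 1.2) for _ in range(5000)]

lo = min(min(synthetic), min(real_like))
hi = max(max(synthetic), max(real_like))
p = binned_probs(synthetic, lo, hi)
q = binned_probs(real_like, lo, hi)

print(f"KL(synthetic || real-like): {kl_divergence(p, q):.4f}")
print(f"JS distance:                {js_distance(p, q):.4f}")
print(f"Wasserstein distance:       {wasserstein_1d(synthetic, real_like):.4f}")
```

All three metrics are zero when the two samples have identical distributions and grow as they diverge. KL and JS compare binned probabilities, so they depend on the bin count, while the Wasserstein distance works on the raw values and is expressed in reward units.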