From 1f5bbdebccb36d7b16ced6f42e9442cbcd203689 Mon Sep 17 00:00:00 2001
From: =?utf8?q?J=C3=A9r=C3=B4me=20Benoit?= <jerome.benoit@piment-noir.org>
Date: Sat, 4 Oct 2025 13:12:31 +0200
Subject: [PATCH] docs(reforcexy): refine reward space analysis documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=utf8
Content-Transfer-Encoding: 8bit

Signed-off-by: Jérôme Benoit <jerome.benoit@piment-noir.org>
---
 ReforceXY/reward_space_analysis/README.md | 83 +++++++++++++++++------
 1 file changed, 62 insertions(+), 21 deletions(-)

diff --git a/ReforceXY/reward_space_analysis/README.md b/ReforceXY/reward_space_analysis/README.md
index cee0b83..3a20a26 100644
--- a/ReforceXY/reward_space_analysis/README.md
+++ b/ReforceXY/reward_space_analysis/README.md
@@ -9,6 +9,7 @@
 This tool helps you understand and validate how the ReforceXY reinforcement learning environment calculates rewards. It generates synthetic trading scenarios to analyze reward behavior across different market conditions.
 
 ### Key Features
+
 - ✅ Generate thousands of trading scenarios instantly
 - ✅ Analyze reward distribution and patterns
 - ✅ Validate reward logic against expected behavior
@@ -24,11 +25,13 @@ This tool helps you understand and validate how the ReforceXY reinforcement lear
 ## 📦 Prerequisites
 
 ### System Requirements
+
 - Python 3.8+
 - 4GB RAM minimum (8GB recommended for large analyses)
 - No GPU required
 
 ### Virtual environment setup
+
 Keep the tooling self-contained by creating a virtual environment directly inside `ReforceXY/reward_space_analysis` and installing packages against it:
 
 ```shell
@@ -49,13 +52,14 @@ python test_reward_alignment.py
 
 > Deactivate the environment with `deactivate` when you're done.
 
-Unless otherwise noted, the command examples below assume your current working directory is `ReforceXY/reward_space_analysis` (and the optional virtual environment is activated).
+Unless otherwise noted, the command examples below assume your current working directory is `ReforceXY/reward_space_analysis` (and the virtual environment is activated).
 
 ---
 
 ## 💡 Common Use Cases
 
 ### 1. Validate Reward Logic
+
 **Goal:** Ensure rewards behave as expected in different scenarios
 
 ```shell
@@ -63,11 +67,13 @@ python reward_space_analysis.py --num_samples 20000 --output validation
 ```
 
 **Check in `statistical_analysis.md`:**
+
 - Long/Short exits should have positive average rewards
 - Invalid actions should have negative penalties
 - Idle periods should reduce rewards
 
 ### 2. Analyze Parameter Sensitivity
+
 **Goal:** See how reward parameters affect trading behavior
 
 ```shell
@@ -86,26 +92,29 @@ python reward_space_analysis.py \
 **Compare:** Reward distributions between runs in `statistical_analysis.md`
 
 ### 3. Debug Reward Issues
+
 **Goal:** Identify why your RL agent behaves unexpectedly
 
 ```shell
-# Generate detailed analysis (statistical validation is now default)
+# Generate detailed analysis
 python reward_space_analysis.py \
     --num_samples 50000 \
     --output debug_analysis
 ```
 
 **Look at:**
+
 - `statistical_analysis.md` - Comprehensive report with:
   - Feature importance and model diagnostics
   - Statistical significance of relationships
   - Hypothesis tests and confidence intervals
 
 ### 4. Compare Real vs Synthetic Data
+
 **Goal:** Validate synthetic analysis against real trading
 
 ```shell
-# First, collect real episodes (see Advanced section)
+# First, collect real episodes (see Advanced Usage section)
 # Then compare:
 python reward_space_analysis.py \
     --num_samples 100000 \
@@ -124,89 +133,107 @@ None - all parameters have sensible defaults.
 ### Core Simulation Parameters
 
 **`--num_samples`** (int, default: 20000)
+
 - Number of synthetic trading scenarios to generate
 - More samples = more accurate statistics but slower analysis
 - Recommended: 10,000 (quick test), 50,000 (standard), 100,000+ (detailed)
 
 **`--seed`** (int, default: 42)
+
 - Random seed for reproducibility
 - Use same seed to get identical results across runs
 
 **`--max_trade_duration`** (int, default: 128)
+
 - Maximum trade duration in candles (from environment config)
 - Should match your actual trading environment setting
 
 ### Reward Configuration
 
 **`--base_factor`** (float, default: 100.0)
+
 - Base reward scaling factor (from environment config)
 - Should match your environment's base_factor
 
 **`--profit_target`** (float, default: 0.03)
+
 - Target profit threshold as decimal (e.g., 0.03 = 3%)
 - Used for efficiency calculations and holding penalties
 
 **`--risk_reward_ratio`** (float, default: 1.0)
+
 - Risk/reward ratio multiplier
 - Affects profit target adjustment in reward calculations
 
 **`--holding_max_ratio`** (float, default: 2.5)
+
 - Multiple of max_trade_duration used for sampling trade/idle durations
 - Higher = more variety in duration scenarios
 
 ### Trading Environment
 
 **`--trading_mode`** (choice: spot|margin|futures, default: spot)
+
 - **spot**: Disables short selling
 - **margin**: Enables short positions
 - **futures**: Enables short positions
 
 **`--action_masking`** (choice: true|false|1|0|yes|no, default: true)
+
 - Enable/disable action masking simulation
 - Should match your environment configuration
 
 ### Output Configuration
 
 **`--output`** (path, default: reward_space_outputs)
+
 - Output directory for all generated files
 - Will be created if it doesn't exist
 
 **`--params`** (key=value pairs)
+
 - Override any reward parameter from DEFAULT_MODEL_REWARD_PARAMETERS
 - Format: `--params key1=value1 key2=value2`
 - Example: `--params win_reward_factor=3.0 idle_penalty_scale=2.0`
 
 **All tunable parameters (override with --params):**
 
-*Invalid action penalty:*
+_Invalid action penalty:_
+
 - `invalid_action` (default: -2.0) - Penalty for invalid actions
 
-*Idle penalty configuration:*
+_Idle penalty configuration:_
+
 - `idle_penalty_scale` (default: 1.0) - Scale of idle penalty
 - `idle_penalty_power` (default: 1.0) - Power applied to idle penalty scaling
 
-*Holding penalty configuration:*
+_Holding penalty configuration:_
+
 - `holding_duration_ratio_grace` (default: 1.0) - Grace ratio (≤1) before holding penalty increases with duration ratio
 - `holding_penalty_scale` (default: 0.3) - Scale of holding penalty
 - `holding_penalty_power` (default: 1.0) - Power applied to holding penalty scaling
 
-*Exit factor configuration:*
+_Exit factor configuration:_
+
 - `exit_factor_mode` (default: piecewise) - Time attenuation mode for exit factor (legacy|sqrt|linear|power|piecewise|half_life)
 - `exit_linear_slope` (default: 1.0) - Slope for linear exit attenuation
 - `exit_piecewise_grace` (default: 1.0) - Grace region for piecewise exit attenuation
 - `exit_piecewise_slope` (default: 1.0) - Slope after grace for piecewise mode
 - `exit_power_tau` (default: 0.5) - Tau in (0,1] to derive alpha for power mode
-- `exit_half_life` (default: 0.5) - Half-life for exponential attenuation exit mode
+- `exit_half_life` (default: 0.5) - Half-life for exponential decay exit mode
+
+_Efficiency configuration:_
 
-*Efficiency configuration:*
 - `efficiency_weight` (default: 0.75) - Weight for efficiency factor in exit reward
 - `efficiency_center` (default: 0.75) - Center for efficiency factor sigmoid
 
-*Profit factor configuration:*
+_Profit factor configuration:_
+
 - `win_reward_factor` (default: 2.0) - Amplification for PnL above target
 - `pnl_factor_beta` (default: 0.5) - Sensitivity of amplification around target
 
 **`--real_episodes`** (path, optional)
+
 - Path to real episode rewards pickle file for distribution comparison
 - Enables distribution shift analysis (KL divergence, JS distance, Wasserstein distance)
 - Example: `../user_data/models/ReforceXY-PPO/sub_train_SYMBOL_DATE/episode_rewards.pkl`
@@ -248,6 +275,7 @@ The analysis generates the following output files:
 ### Main Report
 
 **`statistical_analysis.md`** - Comprehensive statistical analysis containing:
+
 - **Global Statistics** - Reward distributions and component activation rates
 - **Sample Representativity** - Coverage of critical market scenarios
 - **Component Analysis** - Relationships between rewards and conditions
@@ -257,11 +285,11 @@ The analysis generates the following output files:
 
 ### Data Exports
 
-**`reward_samples.csv`** - Raw synthetic samples for custom analysis
-
-**`feature_importance.csv`** - Feature importance rankings from random forest model
-
-**`partial_dependence_*.csv`** - Partial dependence data for key features
+| File                       | Description                                          |
+| -------------------------- | ---------------------------------------------------- |
+| `reward_samples.csv`       | Raw synthetic samples for custom analysis            |
+| `feature_importance.csv`   | Feature importance rankings from random forest model |
+| `partial_dependence_*.csv` | Partial dependence data for key features             |
 
 ---
 
@@ -286,6 +314,7 @@ python reward_space_analysis.py \
 ```
 
 ### Real Data Comparison
+
 For production validation, compare synthetic analysis with real trading episodes:
 
 1. **Enable logging** in your ReforceXY config
@@ -302,6 +331,7 @@ python reward_space_analysis.py \
 The report will include distribution shift metrics (KL divergence, JS distance, Wasserstein distance) showing how well synthetic samples represent real trading.
 
 ### Batch Analysis
+
 ```shell
 # Test multiple parameter combinations
 for factor in 1.5 2.0 2.5 3.0; do
@@ -317,11 +347,13 @@ done
 ## 🧪 Validation & Testing
 
 ### Run Regression Tests
+
 ```shell
 python test_reward_alignment.py
 ```
 
 **Expected output:**
+
 ```
 ✅ ENUMS_MATCH: True
 ✅ DEFAULT_PARAMS_MATCH: True
@@ -336,6 +368,7 @@ python test_stat_coherence.py
 ```
 
 ### When to Run Tests
+
 - After modifying reward logic
 - Before important analyses
 - When results seem unexpected
@@ -349,6 +382,7 @@ python test_stat_coherence.py
 **Symptom:** `ModuleNotFoundError` or import errors
 
 **Solution:**
+
 ```shell
 pip install pandas numpy scipy scikit-learn
 ```
@@ -358,6 +392,7 @@ pip install pandas numpy scipy scikit-learn
 **Symptom:** Script completes but no files in output directory
 
 **Solution:**
+
 - Check write permissions in output directory
 - Ensure sufficient disk space (min 100MB free)
 - Verify Python path is correct
@@ -367,6 +402,7 @@ pip install pandas numpy scipy scikit-learn
 **Symptom:** Rewards don't match expected behavior
 
 **Solution:**
+
 - Run `test_reward_alignment.py` to validate logic
 - Review parameter overrides with `--params`
 - Check trading mode settings (spot vs margin/futures)
@@ -377,6 +413,7 @@ pip install pandas numpy scipy scikit-learn
 **Symptom:** Analysis takes excessive time to complete
 
 **Solution:**
+
 - Reduce `--num_samples` (start with 10,000)
 - Use `--trading_mode spot` (fewer action combinations)
 - Close other memory-intensive applications
@@ -387,6 +424,7 @@ pip install pandas numpy scipy scikit-learn
 **Symptom:** `MemoryError` or system freeze
 
 **Solution:**
+
 - Reduce sample size to 10,000-20,000
 - Use 64-bit Python installation
 - Add more RAM or configure swap file
@@ -416,18 +454,21 @@ python test_stat_coherence.py
 ### Best Practices
 
 **For Beginners:**
+
 - Start with 10,000-20,000 samples for quick iteration
 - Use default parameters initially
 - Always run tests after modifying reward logic
 - Review `statistical_analysis.md` for insights
 
 **For Advanced Users:**
+
 - Use 50,000+ samples for statistical significance
 - Compare multiple parameter sets via batch analysis
 - Validate synthetic analysis against real trading data with `--real_episodes`
 - Export CSV files for custom statistical analysis
 
 **Performance Optimization:**
+
 - Use SSD storage for faster I/O
 - Parallelize parameter sweeps across multiple runs
 - Cache results for repeated analyses
@@ -437,10 +478,10 @@ python test_stat_coherence.py
 
 For detailed troubleshooting, see [Troubleshooting](#-troubleshooting) section.
 
-| Issue | Quick Solution |
-|-------|----------------|
-| Memory errors | Reduce `--num_samples` to 10,000-20,000 |
-| Slow execution | Use `--trading_mode spot` or reduce samples |
+| Issue              | Quick Solution                                                |
+| ------------------ | ------------------------------------------------------------- |
+| Memory errors      | Reduce `--num_samples` to 10,000-20,000                       |
+| Slow execution     | Use `--trading_mode spot` or reduce samples                   |
 | Unexpected rewards | Run `test_reward_alignment.py` and check `--params` overrides |
-| Import errors | Activate venv: `source .venv/bin/activate` |
-| No output files | Check write permissions and disk space |
+| Import errors      | Activate venv: `source .venv/bin/activate`                    |
+| No output files    | Check write permissions and disk space                        |
-- 
2.43.0