DeepSeekMath Meets Order Book: Group-Aware Policy Optimization for High-Frequency Directional Trading

2026-05-25 | Machine Learning

Machine Learning | Computational Engineering, Finance, and Science
AI summary

The authors study reinforcement learning (RL) for high-frequency trading decisions, using order-flow data from stock limit order books as the state. Rather than value-based RL methods such as tabular Q-learning, they use policy-gradient methods: vanilla PPO and two DeepSeekMath-inspired variants, GRPO and GSPO, which apply group-normalized updates and downside-aware shaping. In simplified backtests on AMZN, AAPL, and GOOG, these policies improved net average PnL, profitability, and drawdown relative to the Q-learning baseline. The results suggest that order-flow signals are an adequate state representation and that group-aware policy methods outperform value-based ones in this setting.

Reinforcement Learning, High-Frequency Trading, Limit Order Book, Order Flow, Policy Gradient, PPO, Q-learning, Backtesting, PnL, Drawdown
Authors
Sayak Charabarty, Souradip Pal
Abstract
This paper studies reinforcement learning for high-frequency trading on limit order books by pairing an Order-Flow-based state model with policy-gradient methods. Instead of value-based RL techniques such as tabular Q-learning, our approach deploys policy-based methods: vanilla PPO and the DeepSeekMath-inspired variants GRPO and GSPO, which use group-normalized updates and downside-aware shaping. In backtests on AMZN, AAPL, and GOOG under a simplified setup based on spread-scaled rewards, these policies improve net average PnL, profitability, and drawdown over the Q-learning baseline. Our results show that (1) Order-Flow signals are an adequate state for policy RL and (2) group-aware PPO surrogates are preferable to value-based baselines.
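To make "group-normalized updates" concrete, here is a minimal sketch of the group-relative advantage used by GRPO as introduced in DeepSeekMath: each sampled reward is standardized by the mean and standard deviation of its own group, which replaces a learned value baseline. How this paper forms groups in the trading setting is not stated in the abstract, so the grouping below (one row per group of sampled actions), the function name, and the reward values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within each group (one group per row).

    Assumption: `rewards` has shape (num_groups, group_size). The grouping
    scheme is hypothetical; the abstract does not specify it.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # eps guards against division by zero when all rewards in a group are equal.
    return (rewards - mean) / (std + eps)


# Hypothetical spread-scaled rewards: 2 groups of 4 sampled actions each.
rewards = [
    [1.0, -1.0, 1.0, -1.0],  # mixed outcomes: advantages are close to +/-1
    [0.5, 0.5, 0.5, 0.5],    # identical outcomes: advantages are all zero
]
advantages = group_normalized_advantages(rewards)
print(advantages)
```

A group whose rewards are all equal contributes zero advantage and therefore no gradient signal, which is one reason the choice of grouping matters in practice.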