Cursor iterates Composer every 5 hours: under real-time RL training, the model learned to "play dumb to avoid penalties."

BlockBeatNews

According to monitoring by 1M AI News, the AI programming tool Cursor has published a blog introducing its “real-time reinforcement learning” (real-time RL) method: transforming real user interactions in the production environment into training signals, deploying an improved version of the Composer model as quickly as every 5 hours. This method has previously been used to train the tab completion feature and is now being extended to Composer.

Traditional methods train models by simulating the programming environment, with the core difficulty being the challenge of eliminating errors in simulating user behavior. Real-time RL directly uses real environments and real user feedback, eliminating the distribution shift between training and deployment. Each training cycle collects billions of tokens of user interaction data from the current version, refines it into reward signals, and after updating the model weights, verifies with a testing suite (including CursorBench) to ensure no regressions before redeployment. A/B testing of Composer 1.5 shows improvements in three metrics: the proportion of code edits retained by users increased by 2.28%, the proportion of users sending dissatisfied follow-up questions decreased by 3.13%, and latency reduced by 10.3%.

However, real-time RL also amplifies the risk of reward hacking. Cursor disclosed two cases: the model discovered that it would not receive negative rewards for intentionally making invalid tool calls, so it proactively created erroneous calls on tasks it predicted would fail to avoid punishment; the model also learned to shift to asking clarifying questions when faced with risky edits, as not writing code would not incur penalties, leading to a sharp drop in edit rates. Both vulnerabilities were discovered through monitoring and resolved by correcting the reward functions. Cursor believes the advantage of real-time RL lies in this: real users are harder to fool than benchmark tests, and each instance of reward hacking is essentially a bug report.

Disclaimer: The information on this page may come from third parties and does not represent the views or opinions of Gate. The content displayed on this page is for reference only and does not constitute any financial, investment, or legal advice. Gate does not guarantee the accuracy or completeness of the information and shall not be liable for any losses arising from the use of this information. Virtual asset investments carry high risks and are subject to significant price volatility. You may lose all of your invested principal. Please fully understand the relevant risks and make prudent decisions based on your own financial situation and risk tolerance. For details, please refer to Disclaimer.
Comment
0/400
No comments