Qwen2.5-VL-72B-Instruct is used as the judge model for calculating visual rewards during training [11].

4. Experimental Results

The model is tested on subsets ranging from 200k to 2.8M samples.

) to ensure the generated code matches the visual intent [11].