This paper comes from Meta and introduces Thought Preference Optimization (TPO), a post-training process that encourages small models to think, similar to o1.
The results are impressive: Llama 3 8B performs almost on par with GPT-4o across a wide range of tasks, not just logic and math.
Interestingly, the post-training process significantly improves model performance even without “thoughts” (the “direct baseline” case in the paper).
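For intuition, the core loop (as I understand it from the paper) is: sample several thought-plus-response candidates per prompt, have a judge score only the visible responses, and turn the best/worst candidates into preference pairs for a DPO-style update. Below is a minimal sketch of that data-building step; the model, judge, and prompt template are toy placeholders, not the paper's actual code:

```python
import random

# Sketch of one TPO data-building step: the judge scores only the response,
# but the preference pair keeps the full thought + response text for training.

THOUGHT_PROMPT = (
    "Respond to the user query. First write your internal thoughts, "
    "then write your final response.\n"
)

def sample_candidates(model, prompt, n=4):
    """Sample n candidate (thought + response) completions for one prompt."""
    return [model(THOUGHT_PROMPT + prompt) for _ in range(n)]

def split_thought_response(text):
    """Separate the hidden thought from the user-visible response."""
    thought, _, response = text.partition("Response:")
    return thought.strip(), response.strip()

def build_preference_pair(judge, prompt, candidates):
    """Judge sees only responses; best/worst full outputs become a DPO pair."""
    scored = []
    for full in candidates:
        _, response = split_thought_response(full)
        scored.append((judge(prompt, response), full))
    scored.sort(key=lambda item: item[0], reverse=True)
    chosen, rejected = scored[0][1], scored[-1][1]
    return prompt, chosen, rejected

# --- toy stand-ins so the sketch runs end to end ---
def toy_model(prompt):
    return f"Thought: plan the answer.\nResponse: answer #{random.randint(0, 9)}"

def toy_judge(prompt, response):
    return random.random()  # real setup: a learned judge/reward model

if __name__ == "__main__":
    prompt = "Explain why the sky is blue."
    candidates = sample_candidates(toy_model, prompt)
    print(build_preference_pair(toy_judge, prompt, candidates))
    # (prompt, chosen_output, rejected_output) -> fed to a DPO update, iterated
```

Because the judge never sees the thoughts, the model is free to learn whatever internal reasoning style leads to better final responses, which is presumably why even the thought-free "direct" setting benefits from the training.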