Thinking LLMs: General Instruction Following with Thought Generation

by edon 10/20/24, 8:36 PM | 1 comment

This paper from Meta introduces Thought Preference Optimization (TPO), a post-training process that trains smaller models to generate thoughts before answering, similar to o1.
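
For intuition, here is a minimal sketch of one TPO round in Python. It assumes two hypothetical helpers, `sample_fn` (draws a thought + response from the model) and `judge_fn` (scores only the visible response); neither name comes from the paper's code, and the real pipeline trains with DPO on the resulting pairs.

```python
# Sketch of one Thought Preference Optimization (TPO) iteration.
# `sample_fn` and `judge_fn` are assumed callables, not APIs from the paper.

from dataclasses import dataclass


@dataclass
class Completion:
    thought: str    # hidden reasoning, stripped before judging
    response: str   # the part shown to the user and to the judge


def tpo_iteration(prompts, sample_fn, judge_fn, num_samples=4):
    """Build DPO-style preference pairs for one TPO round.

    For each prompt, sample several thought+response completions, score only
    the response with the judge, and keep the best/worst full completions
    (thoughts included) as the chosen/rejected pair.
    """
    pairs = []
    for prompt in prompts:
        completions = [sample_fn(prompt) for _ in range(num_samples)]
        scored = sorted(completions, key=lambda c: judge_fn(prompt, c.response))
        worst, best = scored[0], scored[-1]
        pairs.append({
            "prompt": prompt,
            # Thoughts are never judged directly; they improve only because
            # they lead to higher-scoring responses.
            "chosen": best.thought + best.response,
            "rejected": worst.thought + worst.response,
        })
    return pairs
```

The pairs would then go to a standard DPO trainer, and the updated model is used to sample the next round.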

The results are impressive - Llama 3 8B performs almost on par with GPT-4o across a wide range of tasks, not just logic and math.

Interestingly, the post-training process significantly improves performance even when the model answers without “thoughts” (the “direct” baseline in the paper).