Hacker News

by scoresmoke on 10/11/23, 8:08 PM with 2 comments

by maxrmk on 10/11/23, 9:15 PM

This is really cool, nice work. Did you try out any of the grading yourself to compare it to the contractors you used? One thing I've found, especially for coding questions, is that models can produce an answer that _looks_ great but turns out to use libraries or methods that don't exist. Human graders tend to rate these highly since they don't actually run the code.
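
A cheap pre-check along these lines, sketched as an assumption and not something Llmfao is known to do: parse each Python answer and flag imports that don't resolve in the grading environment. `missing_imports` is a hypothetical helper name. It only catches modules that aren't installed, not made-up methods on real libraries; those still need the code to actually be run.

```python
import ast
import importlib.util


def _module_exists(name: str) -> bool:
    """True if a top-level module called `name` can be located in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for oddly configured modules; treat those as missing.
        return False


def missing_imports(source: str) -> list[str]:
    """Return the top-level modules imported by `source` that cannot be found.

    Hypothetical helper: it inspects import statements only and never
    executes the answer being graded.
    """
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Skip relative imports (level > 0); they refer to the answer's own package.
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if not _module_exists(m))
```

An answer that imports a nonexistent package would come back with that package name in the list, which a grader could see before rating it.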

Show HN: Llmfao – Human-Ranked LLM Leaderboard with Sixty Models