Conceptually, LLM-as-a-judge doesn't feel like it should work: it's like asking a student to grade their own homework. It's very unintuitive to me that it actually seems to work pretty well.
If the self-evaluation makes it better, then why not do the self-evaluation as part of the normal RAG workflow?
Whose data are they training on? Are they storing and using all customer data?