Show HN: Tonic Validate Metrics – an open-source RAG evaluation metrics package

by Ephil012 on 10/25/23, 12:38 PM with 17 comments
by d4rkp4ttern on 10/26/23, 11:56 AM

Related — are there any good end-to-end benchmark datasets for RAG? End-to-end meaning not just (context, question, answer) tuples (which ignore retrieval) but (document, question, answer) triples. I know NQ (Natural Questions) is one such dataset:

https://ai.google.com/research/NaturalQuestions

But I don't see this dataset mentioned much in RAG discussions.

by elyase on 10/25/23, 4:21 PM

How does it compare to https://github.com/explodinggradients/ragas

by Ephil012 on 10/25/23, 12:45 PM

Hi all, if anyone has any questions about the open source library, Joe and I will be around today to answer them.

by rwojo on 10/25/23, 3:29 PM

This package suggests building a dataset and then using LLM-assisted evaluation via GPT-3.5/4 to evaluate your RAG pipeline on the dataset. It relies heavily on GPT-4 (or an equivalent model) to provide realistic scores. How safe is that approach?
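For readers unfamiliar with the approach being questioned here, the "LLM-as-judge" pattern roughly means prompting a strong model to grade each answer against a reference. The prompt template, scale, and parsing below are illustrative assumptions, not Tonic Validate's actual implementation; a real setup would send the prompt to GPT-4 (or an equivalent model) and parse its reply.

```python
# Minimal sketch of LLM-assisted ("LLM-as-judge") scoring for a RAG answer.
# The 0-5 scale and prompt wording are assumptions for illustration only.
import re

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Assemble a grading prompt asking the judge model for a 0-5 score."""
    return (
        "Score the candidate answer against the reference on a 0-5 scale.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with only the integer score."
    )

def parse_score(model_reply: str) -> int:
    """Pull the first integer in 0-5 out of the judge model's reply."""
    match = re.search(r"\b([0-5])\b", model_reply)
    if match is None:
        raise ValueError(f"no score found in reply: {model_reply!r}")
    return int(match.group(1))

# With a real client you would send build_judge_prompt(...) to the model;
# here we just show the parsing step on canned replies.
print(parse_score("4"))          # -> 4
print(parse_score("Score: 3"))   # -> 3, a chattier reply still parses
```

The safety concern raised above is real: because the judge model is itself stochastic and biased, teams often spot-check a sample of its scores against human grades before trusting it at scale.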

by capybara on 10/25/23, 5:42 PM

This is cool. What are your plans for supporting and building upon this going forward?

by HenryBemis on 10/25/23, 6:45 PM

So disappointed :( I saw Metric and RAG (I thought it would be Red-Amber-Green) and I was hoping for some cool metrics/heatmap thingie...

I wish you the best though!

by agautsc on 10/25/23, 3:30 PM

If you build a dataset of questions with responses to test your RAG app with this metrics package, how do you know whether the distribution of questions matches in any way the distribution of questions you'll get from the app in production? Using a hand-made dataset of questions and responses could introduce a lot of bias into your RAG app.
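One cheap way to sanity-check this concern is to compare a hand-made test set against questions logged from production. The Jaccard vocabulary overlap below is a crude stand-in for real distribution checks (embedding distances, topic models) and is purely an illustrative sketch, not anything from the package itself.

```python
# Rough proxy for question-distribution drift: vocabulary overlap between
# a hand-made test set and logged production questions. A low score hints
# the test set may not resemble what users actually ask.
def vocab(questions):
    """Lowercased word set across a list of questions."""
    return {w for q in questions for w in q.lower().split()}

def jaccard_overlap(test_qs, prod_qs):
    """Jaccard similarity between the two question sets' vocabularies."""
    a, b = vocab(test_qs), vocab(prod_qs)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

test_set = ["how do I reset my password?", "what plans do you offer?"]
prod_log = ["how can I reset my password", "why was my card declined?"]
print(round(jaccard_overlap(test_set, prod_log), 2))  # -> 0.25
```

In practice one would use semantic embeddings rather than raw token sets, since paraphrases share meaning without sharing vocabulary, but even this crude check can flag a test set that covers none of the topics users bring in production.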

by yukichi on 10/30/23, 3:00 AM

very cool! looking forward to trying it