There are plenty of explanations elsewhere; here I’d like to share some example questions and potential answers in an interview setting.
Question 1: In the setting of LLM applications, how do you measure whether your new prompt is better than the old one?
Question 2: Why does this matter?
Here are some tips for readers’ reference:
Introducing a new prompt often leads to varied outcomes across different scenarios. The conventional approach to evaluating a model’s success in traditional machine learning doesn’t carry over directly to generative models: metrics like accuracy (or the ones we talked about last week) may not apply cleanly, because determining correctness can be subjective and hard to quantify.
At a broader level, there are two key focal points to consider:
- Curate an Evaluation Dataset Incrementally: Develop a dataset tailored to your specific tasks, and build it out incrementally as you encounter new cases. This dataset will anchor the evaluation of prompt performance during development.
- Identify an Appropriate Metric or Framework for Evaluation: Select a suitable metric or framework to gauge performance. We touched on this last week in Question 2.
This question as a whole is a complex topic, and I’ll write a more detailed post about it next time!
Some key points about why this matters:
- LLMs make a lot of mistakes.
- Tweaking a prompt may yield improvements in some cases, but that doesn’t necessarily mean improvements in general.
- Building trust with your users is important; ultimately you need to maintain performance on their tasks.
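The middle point above is worth seeing in numbers: a new prompt can raise the average score while still regressing individual cases, which is precisely why per-example tracking matters. A minimal sketch, where the case names and scores are entirely made up for illustration:

```python
# Hypothetical per-case quality scores (0-1 scale) for two prompt versions.
old_scores = {"case_1": 0.9, "case_2": 0.4, "case_3": 0.8}
new_scores = {"case_1": 0.95, "case_2": 0.7, "case_3": 0.5}

wins = [c for c in old_scores if new_scores[c] > old_scores[c]]
regressions = [c for c in old_scores if new_scores[c] < old_scores[c]]

avg_old = sum(old_scores.values()) / len(old_scores)
avg_new = sum(new_scores.values()) / len(new_scores)

# The average improves (0.70 -> 0.72) even though case_3 regressed:
print(f"avg {avg_old:.2f} -> {avg_new:.2f}")
print(f"wins={wins}, regressions={regressions}")
```

Reporting wins and regressions alongside the aggregate keeps a prompt change from silently breaking tasks your users already rely on.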
You can find an explanation by Josh Tobin in my original post here!
Note: There are different angles from which to answer an interview question. The author of this newsletter does not try to find a reference that answers a question exhaustively. Rather, the author would like to share some quick insights and help readers think, practice, and do further research as necessary.