Hallucination Detector - a solution to the hallucination problem in large language models
Hallucination Detector is a process that runs on top of a large language model and estimates the probability that content generated by the model (such as GPT-3) is factually correct.
It is not a perfect solution: it has some limitations and may require a fair amount of engineering to make it work.
We can use this process to address the hallucination problem by forcing an AI assistant to say "I don't know" when it doesn't know the answer. It works as follows: while generating content, after each chunk of generated content (e.g. a sentence), we use the Hallucination Detector to estimate the probability that the chunk is incorrect. If that probability is high, we instruct the model to generate the chunk again, this time expressing low confidence that it is factually correct.
Limitations
1. It might limit creativity (this limitation can probably be solved; I explain how later).
2. It comes with a small additional cost (we need to generate one extra token for each chunk that we test).
Why do language models hallucinate?
Given that large language models are trained to predict the next word/token in a text, hallucination is completely expected. Say you have a text that contains some questions followed by correct answers, and the next question is "What color is the sky?". Suppose our model doesn't know the answer to this question (it wasn't in the training data), but it does know colors. It knows the answer might be "green", "blue", or "red", but it doesn't know which one. In that case, it has a better chance of predicting the next word correctly by naming some color than by saying "I don't know": if the text so far contained only correct answers, it's unlikely to contain the answer "I don't know". So there is a 33% chance that the next word is "green", 33% "blue", 33% "red", and 0% "I don't know". The model therefore generates the name of some color instead of "I don't know", because making a guess gives it a higher probability of correctly predicting the next tokens than admitting that it doesn't know.
So what we want is to know when the model has low confidence in its answer, so that we can force the large language model to say "I don't know" when it doesn't know.
Process
Quick summary
The process is simple: take the content whose factual correctness we want to estimate, together with the context in which it appears, put both into a prompt, and ask the large language model whether the content is factually correct (with a softmax temperature higher than 0), having it generate one token (e.g. "yes" or "no") or more. However, what we pay attention to is not the generated answer but the probabilities with which each token could have been generated. If the model assigns a high probability to a "yes" token ("yes" or something semantically similar), it has high confidence that the answer is yes. If the answer token is roughly 50% "yes" (or less) and 50% "no" (or more), the model is not confident about what is true, and we know the content might be incorrect.
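For illustration, the answer-token distribution in the two cases might look like this (the numbers are hypothetical):
confident = {"yes": 0.93, "Yes": 0.04, "no": 0.03}    # high confidence that the chunk is correct
unsure    = {"yes": 0.48, "no": 0.49, "maybe": 0.03}  # the model doesn't know; the chunk may be incorrect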
Steps
Step 1 - Construct the prompt
The prompt can look more or less like this:
Entire text:
{{ text }}
Chunk of text:
{{ chunk }}
Is the chunk of text factually correct (yes/no):
Where:
text - the entire text,
chunk - the chunk of the entire text whose factual correctness we want to estimate.
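For concreteness, a minimal sketch of constructing this prompt in Python (the wording is just the draft above and may well need adjustment):
PROMPT_TEMPLATE = (
    "Entire text:\n"
    "{text}\n\n"
    "Chunk of text:\n"
    "{chunk}\n\n"
    "Is the chunk of text factually correct (yes/no):"
)

def build_prompt(text: str, chunk: str) -> str:
    # Fill the draft template with the entire text and the chunk
    # whose factual correctness we want to estimate.
    return PROMPT_TEMPLATE.format(text=text, chunk=chunk)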
I haven't tested whether the above prompt gives the expected results. We might need a different prompt.
It might also be the case that we need to fine-tune the model on many examples so that it learns to answer this prompt as expected.
However, it is certainly possible to design a prompt and/or fine-tune the model to give the expected results, because all we want is for the model to give one answer when it thinks the chunk is factually correct and another when it thinks it is not. The model can't always tell whether the chunk is factually correct (it isn't omniscient), but for the process to work we only need the model to output what is true according to its world model, not what is actually true. And that is certainly doable.
Step 2 - Generate the probabilities
Using the above prompt, generate the probability of each token occurring as the next token.
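With a hosted API you would read the token log-probabilities it returns; with an open model you can compute the distribution directly. A minimal sketch in Python using Hugging Face transformers, with GPT-2 only as a small stand-in model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is only a small stand-in; any causal language model with
# accessible logits works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_probabilities(prompt: str) -> torch.Tensor:
    # One forward pass; take the logits at the position right after
    # the prompt and turn them into a probability distribution.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1, :]
    return torch.softmax(logits, dim=-1)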
Step 3 - Verdict
Given the probabilities, we can judge how confident the model is that the generated chunk is factually correct. If the probability of "yes" tokens (tokens like "yes", "Yes", or something semantically similar) is high, the model has high confidence that the chunk is factually correct, which implies a high probability that it is. If the probability of "yes" tokens is low, the model has low confidence, which means we don't know whether the chunk is factually correct.
If we use the process to address hallucinations in an AI assistant, we can then regenerate the chunk, this time expressing low confidence that the answer is correct.
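Continuing the sketch above, the verdict can be a simple threshold on the total probability of "yes"-like tokens (the 0.8 threshold is an arbitrary placeholder and should be tuned):
def yes_probability(probs: torch.Tensor) -> float:
    # Sum the probability mass of "yes"-like answer tokens; extend the
    # list with other semantically similar variants if needed.
    yes_ids = {tokenizer.encode(w)[0] for w in ("yes", "Yes", " yes", " Yes")}
    return sum(probs[i].item() for i in yes_ids)

def looks_factually_correct(text: str, chunk: str, threshold: float = 0.8) -> bool:
    probs = next_token_probabilities(build_prompt(text, chunk))
    return yes_probability(probs) >= threshold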
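Putting the pieces together, an assistant-side loop could look roughly like this; generate_chunk and regenerate_with_low_confidence are hypothetical stand-ins for the assistant's normal generation code:
def answer_with_detector(question: str, max_chunks: int = 20) -> str:
    answer = ""
    for _ in range(max_chunks):
        # generate_chunk (hypothetical) produces the next chunk, e.g. one sentence;
        # an empty chunk means the answer is finished.
        chunk = generate_chunk(question, answer)
        if not chunk:
            break
        if not looks_factually_correct(question + "\n" + answer, chunk):
            # regenerate_with_low_confidence (hypothetical) restates the chunk
            # while expressing low confidence, e.g. "I'm not sure, but ...".
            chunk = regenerate_with_low_confidence(question, answer, chunk)
        answer += chunk
    return answer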
Limiting creativity problem
The above procedure might limit the creativity of the AI assistant, since it might classify creative outputs as factually incorrect. To solve that problem, we can use another prompt that decides whether the given chunk (given the context / entire text) requires a creative answer or a factually correct one. If it requires a creative answer, the chunk is accepted without checking its factual correctness. Alternatively, we can change the prompt from Step 1 so that it answers "yes" not only when the chunk is correct (according to the model) but also when the context doesn't require a factually correct answer.
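One way to implement the first option, continuing the earlier sketch, is a second prompt whose answer token is read in the same way; the wording below is only a guess and would need testing:
CREATIVITY_PROMPT = (
    "Entire text:\n"
    "{text}\n\n"
    "Chunk of text:\n"
    "{chunk}\n\n"
    "Does the chunk of text call for a creative answer rather than a factual one (yes/no):"
)

def requires_creative_answer(text: str, chunk: str, threshold: float = 0.5) -> bool:
    probs = next_token_probabilities(CREATIVITY_PROMPT.format(text=text, chunk=chunk))
    return yes_probability(probs) >= threshold

def accept_chunk(text: str, chunk: str) -> bool:
    # Accept creative chunks without a factuality check; otherwise
    # fall back to the detector from the steps above.
    return requires_creative_answer(text, chunk) or looks_factually_correct(text, chunk)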