ExaRanker: Improving Information Retrieval Models with Natural Language Explanations

Artificial intelligence research is making great strides on information retrieval (IR) problems, with models such as BERT and T5 showing improved performance after fine-tuning on hundreds of thousands of labeled examples. However, the need for that much fine-tuning data limits how well these models work in real-world applications, where labels are scarce. A new study proposes a solution: using natural language explanations as an additional training signal for information retrieval models.

Rerankers built on models like BERT and T5 perform impressively when large amounts of fine-tuning data are available. For example, on the BEIR benchmark, a monoT5 reranker fine-tuned on 400k positive query-passage pairs from MS MARCO outperformed BM25 on 15 of 18 datasets. However, the performance of these models drops significantly when the number of labeled examples is limited.

Challenges with Traditional Information Retrieval Models

Traditional neural retrievers are fine-tuned with categorical labels (such as true/false), which requires large numbers of training samples. A bare label carries little information about the task being learned, so the model needs many examples to pick up its subtleties. On the MS MARCO passage ranking benchmark, a BERT reranker fine-tuned on 10k query-relevant passage pairs only slightly outperformed BM25.
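
To make the contrast concrete, here is a minimal sketch of that categorical-label setup, using the input template from the monoT5 paper ("Query: ... Document: ... Relevant:"). The helper name and dict layout are illustrative, not taken from the study's code.

```python
# Categorical-label fine-tuning example in the monoT5 input format.
def make_categorical_example(query: str, passage: str, relevant: bool) -> dict:
    return {
        "input": f"Query: {query} Document: {passage} Relevant:",
        "target": "true" if relevant else "false",  # a bare label, no explanation
    }
```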

ExaRanker: A New Solution to Information Retrieval Problems

The ExaRanker method proposed in the study addresses these limitations. Instead of relying on categorical labels alone, ExaRanker uses natural language explanations as additional labels when training information retrieval models. The process starts by prompting an LLM with in-context examples to generate an explanation for each query-passage-label triple. The explanations are appended to the training triples, and a sequence-to-sequence model is fine-tuned to produce the target label followed by the explanation.
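
The snippet below is a rough sketch of those two data-preparation steps. The few-shot prompt is my own design, and the `gpt-3.5-turbo` model name and the `"{label}. Explanation: {text}"` target layout are assumptions, not ExaRanker's exact strings.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One in-context example; the paper's actual prompt will differ.
FEW_SHOT = """Query: what is the boiling point of water
Passage: At sea level, water boils at 100 degrees Celsius.
Relevant: true
Explanation: The passage directly states the boiling point of water.
"""

def generate_explanation(query: str, passage: str, label: str) -> str:
    """Ask the LLM to justify an existing (query, passage, label) triple."""
    prompt = (FEW_SHOT +
              f"\nQuery: {query}\nPassage: {passage}\nRelevant: {label}\nExplanation:")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()

def make_exaranker_example(query: str, passage: str, label: str) -> dict:
    """Seq2seq training pair whose target is the label *then* the explanation."""
    explanation = generate_explanation(query, passage, label)
    return {
        "input": f"Query: {query} Document: {passage} Relevant:",
        "target": f"{label}. Explanation: {explanation}",
    }
```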

At inference time, the fine-tuned model scores the relevance of a query-passage pair based solely on the probability assigned to the label token, so the explanation never has to be generated. The study also shows that a few-shot LLM such as GPT-3.5 can add these justifications to training examples automatically, letting practitioners adapt the approach to new datasets without manual annotation.
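
Here is a minimal sketch of that scoring step with a T5-style reranker: only the first decoded position is read, which is why the explanation adds no inference cost. The `t5-base` checkpoint is a placeholder, not the model released with the paper.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

TRUE_ID = tok.encode("true")[0]    # id of the positive label token
FALSE_ID = tok.encode("false")[0]  # id of the negative label token

@torch.no_grad()
def relevance_score(query: str, passage: str) -> float:
    inputs = tok(f"Query: {query} Document: {passage} Relevant:",
                 return_tensors="pt")
    # Run one decoding step: feed the decoder start token, read its logits.
    start = torch.full((1, 1), model.config.decoder_start_token_id)
    logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    # Softmax over just the two label tokens; the mass on "true" is the score.
    probs = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)
    return probs[0].item()
```

Candidate passages can then be reranked for a query simply by sorting them on `relevance_score`.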

Results of the Study

The results suggest that the benefit of adding explanations diminishes as the number of training examples grows. The study also finds that performance is higher when the model is fine-tuned to generate the label before the explanation, which is convenient for inference (only the first decoded token is needed) but at odds with earlier chain-of-thought findings, where the reasoning comes before the answer. Finally, the study demonstrates that these explanations can be produced efficiently with large language models, opening the door to applying ExaRanker across a range of IR domains and tasks.

I will keep an eye on this one.

Hudson