arXiv:2502.03490v1 Announce Type: new
Abstract: Prior work has found that transformers have an inconsistent ability to learn to answer latent two-hop questions — questions of the form “Who is Bob’s mother’s boss?” We study why this is the case by examining how transformers’ capacity to learn datasets of two-hop questions and answers (two-hop QA) scales with their size, motivated by prior work on transformer knowledge capacity for simple factual memorization. We find that capacity scaling and generalization both support the hypothesis that latent two-hop QA requires transformers to learn each fact twice, while two-hop QA with chain of thought does not. We also show that with appropriate dataset parameters, it is possible to “trap” very small models in a regime where they memorize answers to two-hop questions independently, even though they would perform better if they could learn to answer them with function composition. Our findings show that measurement of capacity scaling can complement existing interpretability methods, though there are challenges in using it for this purpose.

Transformers, a popular deep learning architecture, have been found to answer latent two-hop questions like “Who is Bob’s mother’s boss?” only inconsistently. In this study, the researchers examine why by measuring how transformers’ capacity to learn two-hop question and answer (QA) datasets scales with model size, an approach motivated by previous research on transformer knowledge capacity for simple factual memorization.
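
To make the setup concrete, here is a minimal sketch of how a synthetic two-hop QA dataset of this kind could be built. The entity names, relations, and question templates below are illustrative assumptions, not the paper’s actual dataset.

```python
import random

# Hypothetical construction of a two-hop QA dataset: sample two one-hop
# relations over a fixed set of entities, then compose them into questions
# whose answers require both hops.
random.seed(0)
people = [f"person_{i}" for i in range(100)]
mother = {p: random.choice(people) for p in people}  # one-hop relation 1
boss = {p: random.choice(people) for p in people}    # one-hop relation 2

# Atomic one-hop facts the model could memorize directly.
one_hop = [(f"Who is {p}'s mother?", mother[p]) for p in people] \
        + [(f"Who is {p}'s boss?", boss[p]) for p in people]

# Latent two-hop questions: the intermediate entity (the mother) is never
# mentioned, so answering requires composing the two stored facts.
two_hop = [(f"Who is {p}'s mother's boss?", boss[mother[p]]) for p in people]

print(two_hop[0])  # ("Who is person_0's mother's boss?", 'person_XX')
```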

The first key finding is that both capacity scaling and generalization support the hypothesis that latent two-hop QA requires transformers to learn each fact twice. Two-hop QA with chain of thought, on the other hand, does not require this redundancy. This suggests that the difficulty is specific to answering two-hop questions latently, without writing out the intermediate reasoning step.
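
The difference between the two settings can be seen in the shape of the training targets. The formats below are assumptions for illustration, not the paper’s exact prompts.

```python
# Hypothetical target formats for the same two-hop question. In the latent
# format the model must produce the final answer directly; with chain of
# thought it first names the intermediate entity, so each hop can reuse the
# same one-hop fact that appears elsewhere in training.
mother = {"Bob": "Alice"}
boss = {"Alice": "Carol"}

def latent_target(person):
    # Only the final answer; the bridge entity never appears in the output.
    return f"Who is {person}'s mother's boss?", boss[mother[person]]

def cot_target(person):
    # The intermediate step is verbalized before the final answer.
    m = mother[person]
    return (f"Who is {person}'s mother's boss?",
            f"{person}'s mother is {m}. {m}'s boss is {boss[m]}.")

print(latent_target("Bob"))  # (..., 'Carol')
print(cot_target("Bob"))     # (..., "Bob's mother is Alice. Alice's boss is Carol.")
```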

Additionally, the researchers demonstrate that, with appropriate dataset parameters, even very small models can be trapped in a regime where they memorize answers to two-hop questions independently, even though they would perform better if they could learn to answer them by function composition. This underscores how much dataset design shapes whether a model learns to generalize rather than memorize.
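
A rough bit-counting comparison illustrates why such a trap can exist. The numbers and the accounting below are illustrative assumptions, not the paper’s capacity measurements.

```python
import math

# Compare the approximate storage cost of two strategies for a small model:
# (a) memorize each trained two-hop answer independently, versus
# (b) store the underlying one-hop facts (twice each, per the paper's
#     hypothesis) and answer by composition.
P = 1_000   # number of entities (hypothetical)
K = 2_000   # number of two-hop questions seen in training (hypothetical)
bits_per_answer = math.log2(P)   # naive cost of one memorized answer

memorize_bits = K * bits_per_answer          # one lookup entry per question
compose_bits = 2 * 2 * P * bits_per_answer   # 2 relations x 2 copies x P facts

print(f"memorize: ~{memorize_bits:,.0f} bits, compose: ~{compose_bits:,.0f} bits")
# With K small relative to P, rote memorization is the cheaper solution even
# though composition would generalize to held-out two-hop questions.
```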

Overall, the study shows that understanding the limitations and potential of transformers on complex QA tasks requires considering not only their architectural design and size but also the structure of the datasets they are trained on. It also shows that measuring capacity scaling can complement existing interpretability methods, although there are challenges in using it for this purpose that future research will need to address.
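
As a sketch of what a capacity-scaling measurement might look like in practice, one can estimate how many bits of a random QA dataset each model size memorizes and fit that against parameter count. The procedure and numbers below are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical capacity-scaling fit: estimate memorized bits from training
# accuracy on a fixed random QA dataset, then regress against parameter count.
def memorized_bits(num_facts, answer_space, accuracy):
    # Crude estimate: each correctly answered fact contributes
    # log2(answer_space) bits; partial credit from wrong answers is ignored.
    return accuracy * num_facts * np.log2(answer_space)

# Invented measurements: (parameter count, training accuracy) on 50k facts
# with 1,000 possible answers, chosen to stay below saturation.
params = np.array([1e5, 3e5, 1e6])
accs = np.array([0.10, 0.28, 0.80])
bits = memorized_bits(50_000, 1_000, accs)

slope, _ = np.polyfit(params, bits, 1)
print(f"~{slope:.2f} bits per parameter (illustrative fit only)")
```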

Read the original article