Hi, first author here.

Thanks for the comment, for adding examples, and for your nuanced remarks on the answer-overlap split.

My position is that these datasets are still useful for QA, but what was lacking was an analysis of how easy or hard the questions in them are, and what kind of modelling is needed to do well. These overlap phenomena are perhaps less like "bugs" and more like poorly understood features.

We need models that can accurately recall QA pairs they have seen before, so scoring well on "memorizable" QA pairs is still important, but we also want models that can do more than that. A single accuracy number on a leaderboard cannot capture all the behavioural information we need to properly understand the capabilities of these models.
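To make that concrete, the automatic part of the split is roughly the following (a minimal Python sketch, assuming train/test examples are dicts with "question" and "answers" fields; the question-overlap split in the paper additionally relies on manual annotation, which this doesn't capture):

    # Rough sketch of an answer-overlap split. Only approximates the
    # automatic answer-overlap portion of the analysis.
    import re
    import string

    def normalize(text):
        """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def answer_overlap_split(train_set, test_set):
        """Split test examples by whether any reference answer also appears in training."""
        train_answers = {normalize(a) for ex in train_set for a in ex["answers"]}
        overlapping, non_overlapping = [], []
        for ex in test_set:
            if any(normalize(a) in train_answers for a in ex["answers"]):
                overlapping.append(ex)
            else:
                non_overlapping.append(ex)
        return overlapping, non_overlapping

Reporting accuracy separately on the two buckets is what surfaces the memorization-vs-generalization gap that a single leaderboard number hides.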


Hi, first author here.

> Given that nearest-neighbor outperforms on closed book, is it reasonable to suspect the model is doing NN itself internally (which would explain the good performance on close duplicates?)

I think this is definitely the case for the BART model. It is essentially acting as a QA-pair memorizer over the training data, and at test time, it just matches the question onto those seen at training time. Note that the T5-11B+SSM closed-book model was able to do a little better on NQ, so very large models with task-specific pretraining objectives do seem to do something slightly more interesting than just NN, but still really struggle in some settings.
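For reference, a question nearest-neighbour baseline of this flavour is conceptually very simple (a rough sketch assuming scikit-learn and TF-IDF cosine similarity; the exact similarity function used in the paper may differ):

    # Answer a test question with the answer of its most similar training question.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def nearest_neighbour_qa(train_set, test_questions):
        train_questions = [ex["question"] for ex in train_set]
        vectorizer = TfidfVectorizer().fit(train_questions)
        train_vecs = vectorizer.transform(train_questions)

        predictions = []
        for q in test_questions:
            sims = cosine_similarity(vectorizer.transform([q]), train_vecs)[0]
            best = sims.argmax()  # index of the most similar training question
            predictions.append(train_set[best]["answers"][0])
        return predictions

The fact that something this simple is competitive with a fine-tuned closed-book model on the overlapping test questions is what suggests the model is largely doing question matching.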

> And if this is the case, do you think training-time processing of data, to attempt to convert it into question/answer-form data rather than raw text, would be a reasonable approach towards tackling this?

Great question! Converting sentences into a series of QA pairs is something we're really interested in. The T5-11B+SSM model we evaluate in the paper uses a special "salient span masking" pretraining objective that does this to some extent (only mask words at pretraining time that are likely to be "answers" to factual questions), so in essence the pretraining task becomes pretty standard cloze-question answering, and they find that leads to better downstream results (https://arxiv.org/abs/2002.08910).
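If it helps, the flavour of salient span masking is roughly the following (an illustrative sketch assuming spaCy NER as the span detector; the actual objective masks named entities and dates found by a trained tagger over pretraining text, so treat the details here as simplifications):

    # Toy salient span masking: mask one entity span to form a cloze pair.
    import random
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def salient_span_mask(sentence, mask_token="<extra_id_0>"):
        """Mask one named-entity span, yielding a cloze (input, target) pair."""
        doc = nlp(sentence)
        spans = list(doc.ents)  # candidate "answer-like" spans
        if not spans:
            return None
        ent = random.choice(spans)
        masked = sentence[:ent.start_char] + mask_token + sentence[ent.end_char:]
        return masked, ent.text

    # e.g. salient_span_mask("Marie Curie was born in Warsaw in 1867.")
    # -> ("Marie Curie was born in <extra_id_0> in 1867.", "Warsaw")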


Thanks for the response.

> [BART] is essentially acting as a QA-pair memorizer over the training data, and at test time, it just matches the question onto those seen at training time.

Not super-surprising.

> The T5-11B+SSM model we evaluate in the paper uses a special "Salient span masking" pretraining objective that does this to some extent (only mask words at pretraining time that are likely to be "answers" to factual questions), so in essence the pretraining task becomes pretty standard cloze-question answering

This seems an obvious approach for cloze-type questions, but it's non-obvious how to extend beyond this.

Are you aware of any work probing the differences in the representations learned with this style of masking vs. a more standard language-modelling objective? It would seem to me that this is the key to significant progress here (and of course one would speculate that a representation that works well for this would also work well for all kinds of KB-related tasks).

Thinking about this for a few minutes, masking things like names, colors, and numbers (the things that neural representations often confuse) and then asking questions based on them might be interesting. I wonder if bAbI could be extended for this?
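Something as crude as the following sketch might already be a starting point (the colour word list and the [MASK] convention are just placeholders I'm assuming):

    # Toy probe construction: mask colours and numbers, then query them as cloze items.
    import re

    COLOURS = {"red", "blue", "green", "yellow", "black", "white"}

    def make_probes(sentence):
        """Yield (cloze, answer) pairs for colour and number mentions."""
        probes = []
        for token in re.findall(r"\w+", sentence):
            if token.lower() in COLOURS or token.isdigit():
                cloze = re.sub(rf"\b{re.escape(token)}\b", "[MASK]", sentence, count=1)
                probes.append((cloze, token))
        return probes

    # make_probes("Alice put 3 red blocks in the blue box.")
    # -> [("Alice put [MASK] red blocks in the blue box.", "3"),
    #     ("Alice put 3 [MASK] blocks in the blue box.", "red"),
    #     ("Alice put 3 red blocks in the [MASK] box.", "blue")]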


Hi, first author here. Yes, we evaluated three retrieval-based models: DPR, RAG, and FiD. Check out the paper for the numbers (https://arxiv.org/pdf/2008.02637.pdf).

