Right, the original LLaMA paper. LLaMA 2 and 3 are significantly more capable models trained on far more data (roughly 2T and 15T tokens, respectively, vs. ~1.4T for the original), and those papers notably do not say where the data comes from. The LLaMA 3 paper helpfully mentions that "Much of the data we utilize is obtained from the web," so I guess that's better than nothing!
LLaMA 2: https://arxiv.org/pdf/2307.09288
LLaMA 3: https://arxiv.org/pdf/2407.21783