Basically, keep the repository and the knowledge within it available only to humans who want to work with the tools and concepts directly, without external assistance.
IANAL, but the short answer is you can't use US Copyright law to restrict how content is consumed. The limited protection given to authors by the law is to restrict what others can publish.
The law has been thoroughly litigated over web publishing and search engines, so there is plenty of precedent to read up on if you want to understand why (short of a huge and super unlikely change to the laws) what you want can't happen (and shouldn't - the US Constitution created copyright in order to incentivize creators to publish their works instead of keeping them under lock and key). Just search for things like [copyright and web search engines]:
https://www.google.com/search?q=copyright+and+web+search+eng...
If you want limitations that aren't implemented in copyright law then you'll need to share your content with others only privately and under a contract they've agreed to.
The way the question is presented, it falls under the category of "NP vs P"[2][3].
Historically, sneakernet, a physical/limited-access personal library, or "not for distribution outside the company" was the way.
The simplest way would be to have a private/internal network with no outside internet access. This doesn't prevent sneakernet transfers to machines with outside access, nor does it prevent an LLM on a USB stick from 'scanning' the content, or unintentional 'picture uploads'.
How would one distinguish LLM scanning from non-LLM scanning (beyond 10,000,000 requests per second from a single source)? Checking a site's robots.txt is on the honor system. And related measures that do have a specific way to identify valid/invalid access, such as fail2ban, are a never-ending battle of updates and revisions to remain current.
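To make that concrete: the only "polite" signal most AI crawlers give you is a self-reported user-agent string. A minimal sketch (assuming a plain Python WSGI app; the user-agent names are the ones those crawlers publish as of writing) might look like this:

  # Honor-system filtering: block requests whose user agent matches a
  # known AI crawler. Anything that lies about its user agent gets through.
  AI_CRAWLER_UAS = ("GPTBot", "CCBot", "ClaudeBot")

  def block_ai_crawlers(app):
      def middleware(environ, start_response):
          ua = environ.get("HTTP_USER_AGENT", "")
          if any(bot in ua for bot in AI_CRAWLER_UAS):
              # Refuse the request outright for self-identified AI crawlers.
              start_response("403 Forbidden", [("Content-Type", "text/plain")])
              return [b"Forbidden\n"]
          # Everything else passes through to the wrapped application.
          return app(environ, start_response)
      return middleware

It only stops bots that tell the truth about themselves; anything spoofing a browser user agent walks right past it, which is why this ends up on the same update treadmill as fail2ban rules.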
License or no license, it's sort of a different take on the Turing test of whether an AI can fool a human into believing the AI is a human[0]. CAPTCHA systems[1] to verify you're not a bot are an example of this.
I want to create a bubble of a space that’s free from the direct influence of AI. I believe that choice should exist, just like the choice to not be indexed by search engine bots over the web.
>> want to create a bubble of a space that’s free from the direct influence of AI.
It seems unlikely you will achieve that goal online, since there is no way (online) to differentiate between human and bot. (It's unclear what constitutes AI in your mind, but clearly a "dumb" crawler can gather information that is then scanned by an LLM.)
Of course, nothing requires you to create this space online. Think about creating such a space offline (where identifying humans is easier).
>> I believe that choice should exist,
Of course the choice already exists. There are no LLMs at my farmers market.
Equally, online you can choose not to use search engines, LLMs, social media, Copilot, GitHub, or any other tech you choose not to use. (Expanding that bubble beyond yourself may be harder.)
I'm not on social media. Neither are lots of other people. We tend not to socialise online.
>> just like the choice to not be indexed by search engine bots over the web.
I fear that choice does not exist. You can certainly -indicate- that choice via robots.txt, but you don't really get to "enforce" that choice, much less expect that search engines universally respect it.
Let me say this next bit with respect. I say it with kindness, not malice. It's easier on you mentally if you fight battles you can win. At this point, trying to define, much less live, a life unaffected by AI is like a cyclist railing against the use of cars [1]. Of course you can arrange your life without a car. Of course you can socialise with like-minded folk. But carving out spaces where cars are formally banned is rare.
[1] Yes, I'm aware there are places that are less car-dependent than the US. Yes, in Amsterdam there are a lot of bikes. But bikes tend to share the road with cars - there are very few car-free zones.
robots.txt is actually a really useful way to tell an attacker where to look for juicy content that isn't meant to be indexed, but following it is entirely voluntary. It's easy to imagine a dark web search engine that only has that content.
If you want your stuff to exist in the same way, but for OpenAI training, just block GPTBot in your robots.txt
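That really is just a couple of lines - a sketch assuming OpenAI's documented GPTBot user agent, and assuming the crawler chooses to honor it:

  User-agent: GPTBot
  Disallow: /

Other vendors publish their own tokens (CCBot, ClaudeBot, Google-Extended, and so on), so in practice this tends to grow into a list you keep extending.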
A bit snarky, but if you don't think about or use the things you don't want AI to scan, then there's no possibility of AI scanning or getting the info you don't want it to have access to.
Of course, in order to make sure you're not 'thinking about things AI would scan/get access to', you have to think about the things AI would scan/get access to.