
Is a community LLM possible? We'd have code to dynamically construct the pre-training dataset and use P2P mechanisms to share the acquired dataset. It would involve peer-crawling and other mechanisms to allow many people to contribute chunks to the dataset. Crawling chunks would be dynamically allocated to those contributing to avoid any double-crawling.
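One way to picture the "no double-crawling" allocation is rendezvous (highest-random-weight) hashing: every peer can compute the same owner for a crawl chunk without a central coordinator. This is only a sketch under assumptions. The names `assign_chunk` and `allocate`, and the idea that chunks and peers are identified by plain strings, are hypothetical and not from any existing project.

```python
import hashlib


def assign_chunk(chunk_id: str, peers: list[str]) -> str:
    """Pick the peer that should crawl chunk_id (rendezvous hashing)."""
    if not peers:
        raise ValueError("no peers available")

    def score(peer: str) -> int:
        # Hash the (peer, chunk) pair; the peer with the highest score owns the chunk.
        digest = hashlib.sha256(f"{peer}|{chunk_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(peers, key=score)


def allocate(chunk_ids: list[str], peers: list[str]) -> dict[str, list[str]]:
    """Map every chunk to exactly one peer, so nothing is crawled twice."""
    plan: dict[str, list[str]] = {peer: [] for peer in peers}
    for chunk_id in chunk_ids:
        plan[assign_chunk(chunk_id, peers)].append(chunk_id)
    return plan
```

A useful property of this choice is that when a peer drops out, only the chunks it owned are reassigned; everyone else keeps their work.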

For post-training, the dataset would be a bunch of code that orchestrates the creation of training data via LLMs (needs to be legally sound), plus some kind of Mechanical Turk approach (something like Wikipedia, where volunteers can work on chunks of data).

The main mechanism is this: what is shared is not just code, but also the acquired training data.

Critical aspects:

- A mechanism to peer-validate submissions to the data pool, so that everybody can donate data without the risk of vandalism.

- A mechanism where the weights go through distributed training stages. Somehow devs should be able to get a "lock" on the weights, do a bit of post-training on them, and then get the result approved. The "lock" means that during this brief period (the training run), other devs are informed, so we don't end up with two sets of branched weights. A mechanism auto-evals the weights and accepts them as the new, updated weights. Retroactive discarding of weights (e.g. after revising evals) is possible by branching the weights (this needs some kind of efficient deduplication to avoid many copies of the weights).
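To make the lock / auto-eval / dedup idea a bit more concrete, here is a toy in-memory sketch. Everything in it is an assumption for illustration: the `WeightRegistry` class, the `evaluate` callback standing in for an eval harness, the rule "accept only if the score beats the current head", and storing weights as byte shards keyed by SHA-256. A real system would need a distributed lock and storage layer.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class WeightRegistry:
    evaluate: Callable[[list], float]             # hypothetical eval harness: shards -> score
    blobs: dict = field(default_factory=dict)     # sha256 hex -> shard bytes, stored once
    history: list = field(default_factory=list)   # accepted versions as (manifest, score)
    holder: Optional[str] = None                  # dev currently holding the lock
    expires: float = 0.0                          # lock expiry timestamp

    def acquire(self, dev: str, ttl: float, now: Optional[float] = None) -> bool:
        """Take the lock unless another dev holds a live one."""
        now = time.time() if now is None else now
        if self.holder not in (None, dev) and now < self.expires:
            return False
        self.holder, self.expires = dev, now + ttl
        return True

    def submit(self, dev: str, shards: list, now: Optional[float] = None) -> bool:
        """Auto-eval the submitted weights; accept only if they beat the head."""
        now = time.time() if now is None else now
        if self.holder != dev or now >= self.expires:
            raise PermissionError("submit requires a live lock")
        self.holder = None  # the lock is released whether or not we accept
        score = self.evaluate(shards)
        if self.history and score <= self.history[-1][1]:
            return False
        manifest = []
        for shard in shards:
            key = hashlib.sha256(shard).hexdigest()
            self.blobs.setdefault(key, shard)  # unchanged shards are not stored again
            manifest.append(key)
        self.history.append((manifest, score))
        return True

    def revert_to(self, index: int) -> None:
        """Retroactively discard later versions by re-pointing the head.

        Only the manifest is appended; no weight bytes are copied.
        """
        self.history.append(self.history[index])
```

Because each version is just a list of shard hashes, branching or reverting costs a few bytes of metadata rather than another full copy of the weights.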

I think this is possible. Maybe not with RAM, GPU and power shortages though.

Main benefit: A transparent training set means you know what the model was trained on. This makes it less opaque and less trial-and-error to see what modality the model is good at. This helps harness builders but also any other users of the models. It also decentralizes power.



It is possible and already being worked on at [1], though I have no idea how well any of it is working.

[1] https://bittensor.com/about


Very interesting, thanks for sharing

EDIT: It's completely different though. This is more of a commodities market/auction/inference broker mechanism it seems.


I mean, the financial incentives are structured a bit differently, but it's basically what you're describing, no? It's got projects for data collection, inference, training, etc. It's just that the dollar value of the compute contributed to, say, training is determined by the value of the token rather than as a straight dollar value. But even that is similar to just renting compute directly via fiat currencies, given that every major provider of compute fluctuates its cost based on supply/demand. Consider vast.ai or Hetzner, in which the cost to rent an H100 is not determined by anything but an auction system in which providers set prices and consumers agree to them.

My understanding is that bittensor is just the same market making where providers choose whether to provide and consumers choose to consume; it's just that you don't set your own price, as the price is determined externally via the value of the TAO. Which... tbh, fiat currencies fluctuate in buying power as well, if not quite so drastically. Just because the GPU is "still" $1/hr doesn't mean it actually costs as much as it used to, given that the underlying value of the dollar changes just as the TAO or yen or mark or eth or xrp or whatever does.

And thinking about it more, it's actually really quite similar to mturk, in that via mturk you can pay humans to do surveys, OCR, reviews, UX, etc. Via bittensor you can buy data gathering, training, inference, etc.


We're talking about completely different things. I'm talking about creating an LLM in the open, with individual contributors contributing to training sets as well as portions of the training work itself.



