
Agreed that there are workloads where inference is not expensive, but it's really workload dependent. For applications that run inference over large amounts of data in the computer vision space, inference ends up being a dominant portion of the spend.


The way I see it, every new data point (which the production model runs inference on once) generally becomes part of the dataset used to train every subsequent model. The same data point is then processed many more times in training than in inference, so training unavoidably ends up taking more effort than inference.
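A rough back-of-envelope sketch of that per-data-point argument (the epoch count, number of retrains, and the forward/backward cost ratio are illustrative assumptions, not measurements):

    # Lifetime compute spent on a single data point, normalized to one forward pass.
    # All numbers below are made-up assumptions for illustration.
    inference_flops = 1.0            # one forward pass at serving time
    train_step_flops = 3.0           # forward + backward, roughly ~3x a forward pass
    epochs_per_training_run = 20     # assumed epochs per model version
    retrains_using_this_point = 5    # assumed future models trained on this point

    inference_cost = 1 * inference_flops
    training_cost = retrains_using_this_point * epochs_per_training_run * train_step_flops

    print(inference_cost)  # 1.0
    print(training_cost)   # 300.0 -> per point, training dominates under these assumptions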

Perhaps I'm a bit biased towards all kinds of self-supervised, human-in-the-loop, or semi-supervised models, but the notion of discarding large amounts of good domain-specific data that gets processed only for inference and never used for training afterward feels a bit foreign to me, because you can usually extract an advantage from it. But perhaps that's the difference between data-starved domains and overwhelming-data domains?


What you say re saving all data is the ideal. I'd add a couple of caveats. One is that in many fields you often get lots of redundant data that adds nothing to training (for example, if an image classifier is looking for some rare class, you can be drowning in images of the majority class), or you can just have lots of data that is already unambiguously and correctly classified; some kind of active learning can tell you what is worth keeping.
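A minimal sketch of that kind of active-learning filter, assuming a deployed classifier that exposes softmax probabilities and an arbitrary margin threshold (both are assumptions for illustration):

    import numpy as np

    def select_for_training(probs: np.ndarray, margin_threshold: float = 0.2) -> np.ndarray:
        """Keep only samples the model is unsure about (small top-1 vs top-2 margin).

        probs: (n_samples, n_classes) softmax outputs from the deployed model.
        Returns a boolean mask of samples worth sending back for labeling/training.
        """
        sorted_probs = np.sort(probs, axis=1)
        margin = sorted_probs[:, -1] - sorted_probs[:, -2]  # top-1 minus top-2 confidence
        return margin < margin_threshold

    # Confident majority-class images get dropped, ambiguous ones get kept.
    probs = np.array([[0.98, 0.02],   # confident -> discard
                      [0.55, 0.45]])  # ambiguous -> keep
    print(select_for_training(probs))  # [False  True]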

The other is that, for various reasons, the customer doesn't want to share their data (or at least doesn't want sharing built into the inference system), so even if you'd like to have everything they record, it's just not available. Obviously something to discourage, but it seems common.


There's one piece of the puzzle you're missing: field-deployed devices.

If I play chess on my computer, the games I play locally won't hit the Stockfish models. When I use the feature on my phone that allows me to copy text from a picture, it won't phone home with all the frames.


Yup, exactly. It's a good point that for self-supervised workloads, the training set can become arbitrarily large. For a lot of other workloads in the vision space, most data needs to be labeled before it can be used for training.



