Why not three models? One model does basic feature detection: lines, shapes, etc. A second model takes the first model's output as its input and identifies birds. A third model takes the first model's output as its input and identifies houses.
This is a lesson I've watched people and companies learn for the past 7-8 years.
An end-to-end model will always outperform a sequence of models designed to target specific features. You truncate information when you render the data from feature space (the much richer data inside the model) into output space (the model's output vector). That's the primary reason transfer learning is done the way it is: all layers are frozen, the final layer is chopped off, and the output of the internal layer is fed into the next model, not the output vector itself.
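For readers who haven't done this, here's a minimal PyTorch sketch of what "chop off the final layer and use the internal features" looks like. The model choice and head shape are just illustrative; torchvision's MobileNetV2 happens to expose 1280-dim penultimate features:

    import torch
    import torchvision

    # Pretrained backbone; drop the classifier to expose the feature space.
    backbone = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")
    backbone.classifier = torch.nn.Identity()  # output is now the 1280-dim feature vector

    # Freeze everything so only the new head trains.
    for p in backbone.parameters():
        p.requires_grad = False

    # New task head trained on the rich internal features, not the old logits.
    bird_head = torch.nn.Linear(1280, 2)  # bird / not-bird

    x = torch.randn(1, 3, 224, 224)  # dummy 224x224 RGB image
    features = backbone(x)           # (1, 1280) internal-layer output
    logits = bird_head(features)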
Yes, you can create a large tree of smaller models, but the performance ceiling is still lower.
Please don't tell people to do this. I've seen millions wasted on this.
When you train a vision model, it will already develop a hierarchy of fundamental point and line detectors in the first few layers, and they will be particularly well chosen for the domain. It happens automatically; no need to manually put them there.
I'm genuinely confused at how you made these assumptions about what I'm describing, because the "more correct" design you contrast with the strawman you've concluded I'm describing is actually what I'm talking about, if perhaps imprecisely: a pretrained model like MobileNetV2 with its final layer removed, and custom models trained on bird and house images that take this mobilenetv2[:-1] output as input. MobileNetV2 is 2ish megabytes at 224x224, and these final bird and house layers will be kilobytes. Having two multi-megabyte models that are 95% identical is a giant waste of our embedded target's resources. It also means that a scheme that processed a single image with two full models (instead of one big, two small) would spend 95% of the second full model's processing time redundantly performing the same operations on the same data. Breaking up the models across two stages produces substantial savings of both processing time and flash storage, with a single big model as the "feature detection" first stage of both overall inferences and small specialized models as the second stage.
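A rough sketch of that two-stage scheme, again using torchvision's MobileNetV2 as a stand-in (the head shapes are made up):

    import torch
    import torchvision

    # Stage 1: one shared backbone, run once per image (the multi-megabyte part).
    backbone = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")
    backbone.classifier = torch.nn.Identity()
    backbone.eval()

    # Stage 2: tiny task-specific heads (kilobytes each).
    bird_head = torch.nn.Linear(1280, 2)
    house_head = torch.nn.Linear(1280, 2)

    image = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        shared_features = backbone(image)  # expensive part, computed once
        bird_logits = bird_head(shared_features)
        house_logits = house_head(shared_features)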
Sorry to upset you. It was not clear from your description that this was the process you were referring to. Others will read what you wrote and likely misunderstand as I did. (That was my concern, because I've seen the "mixture of idiots" architecture attempted since 2015. Even now... it's a common misconception, and an argument every ML practitioner has at one point or another with a higher-up.)
As for your amendment: it is good to reduce compute when you can, and to reduce up-front effort for model creation when you can. Reusing models may be valid, but even with your amended process you will still end up not reaching the peak performance of a single end-to-end model trained on the right data. Composite models are simply worse, even when transfer learning is done correctly.
As for the compute cost: if you train an end-to-end model and then minify it to the same size as the sum of your composite models, it will have identical inference cost but higher peak accuracy.
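The comment doesn't name a specific minification method, but as one concrete example, PyTorch's post-training dynamic quantization shrinks the weights without changing the architecture or the inference interface:

    import torch

    # A stand-in for the "one big end-to-end model".
    model = torch.nn.Sequential(
        torch.nn.Linear(1280, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 4),  # e.g. {bird, house, both, neither}
    )

    # Post-training dynamic quantization: int8 weights, roughly 4x smaller
    # Linear layers, same forward() interface.
    small = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )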
You could even do that with the "shared backbone" architecture you've described, where two tail networks share a head network. It has been attempted thoroughly in the deep reinforcement learning subdomain I am most familiar with, and it results in unnecessary performance loss, so it's not generally done anymore.
Man, everyone at work is going to be really bummed when I tell them that some guy on the internet has invalidated our empirical evidence of acceptable accuracy and performance with assumptions and appeals to authority.
I did not say it would not work, nor that it couldn't deliver acceptable performance for a given task.
Just that its peak performance is lower than an end-to-end model's, and that if you're going to encourage model kit-bashing, be clear how you communicate it, so people don't make human-centipede architectures and wonder why feces is what comes out the end.
I was a polite enough "some guy on the internet". Thank you.
As someone not in ML but curious about the field, this is really interesting. Intuitively, it does seem natural to aim for some sort of inspectable composition of models.
Is there specific tooling to inspect intermediate layers, or will they be unintelligible to humans?
The unending quest for "explainability" has yielded some tools, but it has been utterly overrun and outpaced by newer, more complicated architectures and unfathomably large models.
(Banks, insurance, finance, etc. really want explainability for auditing.)
The early layers in a vision model are sort of interpretable. They look like lines and dots and scratchy patterns being composited. You can see the exact same features in the L1 and L2 biological neural networks of cats, monkeys, mice, etc. As you get deeper into the network, the patterns become really abstract. For a human, the best you can do is render a pattern of inputs that maximizes a target internal neuron's activation, to see what it detects.
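If you want to try that yourself, here's a rough sketch of the activation-maximization idea (the layer and channel indices are arbitrary picks; real implementations add regularization so the result isn't pure noise):

    import torch
    import torchvision

    # Gradient ascent on the input to maximize one channel's mean activation.
    model = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
    target_layer, target_channel = 10, 42  # arbitrary example choices

    acts = {}
    model[target_layer].register_forward_hook(
        lambda module, inp, out: acts.update(act=out)
    )

    img = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=0.05)

    for _ in range(200):
        opt.zero_grad()
        model(img)
        # Minimizing the negative activation maximizes the activation.
        loss = -acts["act"][0, target_channel].mean()
        loss.backward()
        opt.step()
    # img now (roughly) shows the pattern that channel responds to.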
You can sort of see what they represent in vision: dogs, fur, signs, faces, happy, sad, etc. But once it's a multimodal model and there is time and language involved, it gets really difficult. And at that point you might as well just use the damn thing, or just ask it.
In finance, you can't tell what the fuck any of the feature detectors are. It's just very abstract.
As for tooling: a little bit of numpy and pytorch, dump some neuron weights to a png, there you go. Download a small pretrained convnet, and I bet GPT-4 can walk you through the process.
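Something like this, for example (resnet18 and the output filename are just placeholder choices):

    import torch
    import torchvision
    from torchvision.utils import save_image

    # Dump the first conv layer's weights to a png as an 8x8 grid of filters.
    model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    filters = model.conv1.weight.detach()  # shape (64, 3, 7, 7): 64 RGB filters
    save_image(filters, "filters.png", nrow=8, normalize=True)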
Is it feasible for someone with a SWE background and a fair number of industry years to transition into ML without a deep dive into a PhD and publications to show?
I am considering following the fast.ai course, or perhaps other MOOC courses, but I am not sure whether any of this would be taken seriously within the field.
It is reasonable. If you have time and are willing to put in the effort, I can force-feed you resources and review code and such. I've raised a few ML babies. MOOCs are probably the wrong way to go; that's where I started, and I got stuck for a while. You really need to be knee-deep in code and a notebook.
As for getting jobs, I can't help you with that part. You'll have to do your own networking, etc.
gibsonmart1i3@gmail.com. Shoot me an email if you're serious and let's schedule a call.
I asked a friend of mine at Google about what's next in ML the other day, and they recommended this post from a friend of theirs. I'm not sure I'd follow it end to end (like many ChatGPT things, it's an unknown 70-90% on target), but it's definitely identified some resources I didn't know about. https://www.linkedin.com/feed/update/urn:li:activity:7150542...
wegfawefgawefg - I bookmarked this and worked through it more carefully when I had time. I appreciated the learnings.