Not sure I buy it. First, computing the SVD to obtain U, Σ, V is computationally expensive, so it would only work if we are not finetuning very big models.
But my real concern is with the results. The "13 parameters" claim looks like bait, because it is a single result from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already saturated for virtually every model. Besides, it seems to happen only for the Qwen family of models... It looks like GSM8K was part of Qwen's training set, and this tiny-LoRA finetuning made the last adjustments to perfectly reflect that overtraining.
Fair points, especially on GSM8K saturation and Qwen possibly already sitting close to the solution. That said, even if this is mostly "last-mile alignment", the fact that it can be done with such a tiny signal is still interesting: it suggests the gap between capability and behavior might be much smaller (and cheaper to bridge) than we assume.
I've done a lot of exploratory work with Stable Diffusion LoRAs, and I actually do buy that there's some juice here, though it's almost certainly not nearly as good as other techniques can be. In particular, this technique will likely avoid the intruder dimension problem which plagues naive LoRA. SVD is expensive, but you only have to do it once at the beginning of training.
I haven't done much research lately, but when I was working on it, I was having substantial success training an adapter of the form U_k @ P @ A, where U_k was the top k left singular vectors of the underlying weight, and then P and A were your typical LoRA projection matrices.
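For the curious, that adapter form can be sketched in a few lines of NumPy. Everything below (dimensions, ranks, init scales) is illustrative placeholder values, not the exact setup from that work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained weight (d_out x d_in) and ranks; all values illustrative.
d_out, d_in, k, r = 64, 48, 8, 4
W = rng.normal(size=(d_out, d_in))

# One-time SVD at the start of training: keep the top-k left singular vectors.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k]                      # frozen (d_out x k)

# Trainable LoRA-style factors; A starts at zero so the adapter is a no-op initially.
P = rng.normal(size=(k, r)) * 0.01  # (k x r)
A = np.zeros((r, d_in))             # (r x d_in)

# Effective weight: the update lives entirely in the span of the top-k singular
# directions of W, which is the intuition for avoiding "intruder dimensions".
delta = U_k @ P @ A
W_adapted = W + delta

x = rng.normal(size=(d_in,))
assert np.allclose(W_adapted @ x, W @ x)  # zero-init adapter changes nothing yet
```

The SVD happens once, before training; only P and A (and in practice a scaling) would receive gradients.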
The 13 parameters are kind of misleading here; the real juice is going to be in the P_i fixed random matrices. My suspicion is that they are overfitting to the benchmark, but they almost certainly are observing a real gain in model capacity that is largely due to avoiding the intruder dimension problem.
In all fairness, most of the unique stuff I can do is probably an artifact of my training process, so it seems unfair to deny an LLM the same accommodation.
This got me thinking, and it might actually even be a comparable amount.
Let's estimate that 12 years of schooling runs at a minimum of $100,000 per student, at least in the US [1]. Add to that whatever else comes afterwards, i.e. a bunch more money for paid (college) or "unpaid" (self-taught skills and improvements) education, and then the likely biggest, yet hard-to-quantify, portion for white-collar workers: the experience and "value" that professional work equips one with.
Now divide the average SOTA LLM's training cost (or a guess, since these numbers aren't always published, as far as I'm aware) by the number of users, or, if you wanted to be stricter, by the number of people it's proven to be useful for (what else would training be for?), and it might not be so far off anymore.
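As a toy illustration of that division (every number below is a made-up placeholder, since the real training costs and user counts aren't public):

```python
# Back-of-envelope only: all figures here are assumptions, not published numbers.
schooling_cost_usd = 100_000        # commenter's estimate for 12 years of US schooling [1]
training_cost_usd = 100_000_000     # hypothetical SOTA LLM training run
user_count = 100_000_000            # hypothetical number of users it proves useful for

per_user_usd = training_cost_usd / user_count
print(f"amortized training cost per user: ${per_user_usd:.2f}")
print(f"schooling cost per student:       ${schooling_cost_usd:,}")
```

With these particular placeholders the per-user figure comes out tiny; plug in your own guesses to see where the comparison actually lands.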
Of course, whether it makes sense to divide and spread out the LLMs' costs across users in order to calculate an "average utility" is debatable.
Very interesting. Soon we will see a rise of training assistants that read our wearable sensors.
Sadly, it seems these foundation models are still not open to the public. I can't find any links in the research page or the paper to tinker with...
It is also about trying to get the most out of that hypothesis testing, defining success and failure as best you can.
I have encountered this "mediocre success" many times in AI solutions due to a lack of problem definition. For instance, with LLMs it is now very easy to write a prompt that gives you the output you want for the 5 or 6 examples you have in mind. The hard part is to build up your testing scenario from there, gathering as much data as possible until it is representative of your use cases.
That is the only way to actually test your prompts, RAG strategies, and so on, instead of buying into the latest CoT-like prompt trend.
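A minimal sketch of what "building up a testing scenario" can look like in practice; the model call is stubbed out, and all names here are hypothetical:

```python
# Hypothetical eval harness: grow `cases` from your initial 5-6 examples toward
# something representative of real usage, then track accuracy as prompts change.
cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_prompt(text):
    # Stand-in for an actual LLM call; swap in your provider's client here.
    return {"2+2": "4", "capital of France": "Paris"}.get(text, "")

def accuracy(cases, run):
    hits = sum(run(c["input"]) == c["expected"] for c in cases)
    return hits / len(cases)

print(accuracy(cases, run_prompt))
```

The point is less the harness itself than the habit: every prompt or RAG tweak gets judged against the whole case set, not the handful of examples it was written for.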
I'm not sure that is a metric you can rely on. LLMs are very sensitive to the position of items in lists along the context, paying extra attention to the beginning and the end of those lists.
See the listwise approach at "Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting", https://arxiv.org/abs/2306.17563
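One cheap way to probe this sensitivity is to score the same items under every ordering and look at the per-item spread. The scorer below is a toy stand-in with a deliberate position bias, just to show what the probe would detect:

```python
from itertools import permutations

def position_sensitivity(items, score_with_context):
    """Score each item under every ordering of the list; a large per-item
    spread suggests the score depends on position, not content."""
    per_item = {it: [] for it in items}
    for order in permutations(items):
        scores = score_with_context(list(order))  # one score per item, in `order`
        for it, s in zip(order, scores):
            per_item[it].append(s)
    return {it: round(max(s) - min(s), 3) for it, s in per_item.items()}

def biased_scorer(order):
    # Toy stand-in for an LLM scorer with position bias: earlier items score higher.
    return [1.0 - 0.1 * i for i, _ in enumerate(order)]

spread = position_sensitivity(["a", "b", "c", "d"], biased_scorer)
print(spread)  # every item's score varies by 0.3 depending on where it sits
```

With a real LLM scorer (and a sample of shuffles rather than all permutations), a large spread per item is a red flag that the metric reflects position more than content.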
I wouldn't be surprised to see it help, along with the "you'll get $200 if you answer this right" trick and a bunch of others :) They're definitely worth trying.
Academic silo with little to no real transfer to business. ML eventually enabled building better continuous and discrete models for inference, control, and prediction.
I would have thought it was the other way around: a marketing buzzword for a non-problem that engineers have solved since Maxwell's time with variations on the concept of hysteresis.
At least my recollections of fuzzy logic are from around the late eighties / early nineties and always involved the example of a thermostat that can only turn fully on or fully off. :)
True, this is the first thing I'm reminded of when I hear the term. Then I wonder what fuzzy logic has to do with washing machines. Modern washing machines have every routine programmed in and are deterministic.
Do they? Mine noticeably takes much longer to run on very dirty kitchen towels compared to clothes, even on the same setting and even though the clothing load is heavier. I had been assuming there's something to detect how soiled the water is.
Random fact of the day: (some?) dishwashers detect soiled water by using a resettable fuse on the water pump and counting how many times the fuse trips; dirty water makes the pump work harder and trip the fuse.
I don't think that would work for clothes washers, though, because there isn't the same kind of recirculation pumping. Somewhat related: my washing machine (upright, high efficiency) measures the load by spinning it with a calibrated input impulse and measuring the rotation of the basket; less rotation means more clothes, which means more water. It's helpful to load the clothes around the edge to get the right amount of water. But that doesn't illuminate how cycle duration is determined.
I have sometimes found myself in that situation. I have to be careful not to overthink where to put my "brain room" for the day; otherwise I carry an overhead burden that rumbles on all day long, questioning whether I should be putting that effort elsewhere.
Definitely, brains are fun. They can be your best ally and worst enemy.
I totally agree. I work in the private sector, also coming from a research position. I too was focused on the "interesting" side of the problem: the modeling, integrating domain knowledge into the analysis, drawing all sorts of plots... But there were other unavoidable and "uninteresting" needs for the research project, like building a data gathering system with its API and everything. This required my best software engineering abilities. Needless to say, my best wasn't precisely THE best, so as the project got bigger, the not-so-temporary fixes piled up, as did poor design choices (if any design at all). This finally led to a complete restructuring and an almost fresh start.
I feel some of it could have been avoided, so I learned the hard way that the whole modeling + software engineering process is a subtle craft. It is important to take care with the implications of your code and, especially, with how it's done, since it may fall back on you eventually. This reconciled me with the more technical stuff (my tools) and eventually let me deliver good work in a more satisfying way.
I believe the article is oriented toward those who feel like the "nice guy" the author describes, which is my case, and it expects you to act on how his words make you feel. I see myself very much reflected in the profile: seeking high affiliation, low power, and high emotional control; however, I am not sure to what extent it should focus only on the work environment.
Over time, I have found myself more willing to fight for my "slice" of power in the workplace, making my demands assertively and showing that I know my value, without traces of regret or false humility. That is something I have had to work on a lot, and it makes me uncomfortable; but I understand that I "deserve" some power. If I am doing things right, I can stake my claim to influence and to being heard. I think this aligns with your point about being responsible for yourself and not just being a bystander.
On the other hand, there are other spheres in which I do not expect such power beforehand, so I'm not usually ready to draw my weapons. This may be a social encounter with barely-known people, friends, or even family. There are power struggles there too, but somehow they feel different. Maybe I'm not as convinced that I deserve such power, or maybe I'm not willing to make the effort anymore and just want to fulfill my affiliation urge.