Great point! I considered using fastText as a baseline, but in practice it really didn't work well at all with the small dataset, much worse than the tfidf baseline. I'm not sure why, but I suspect it's because fastText tries to learn its own embeddings, and there just isn't anywhere near enough data for that. I'd love an outside perspective on this.
Fair enough. That's a useful comparison to know about.
But I'm wondering how you get around that with the neural net. In the post, you said there are only a few hundred labeled examples, right? How can a neural net with hundreds of parameters set those parameters to anything reasonable, and not overfit, when there are about as many parameters as examples?
Great question, and I share your intuition, but I think it all comes down to properly regularizing your model. For neural networks, dropout works really darn well as a regularization strategy. I could have tried checking whether performance dropped significantly without dropout.
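For anyone wanting to try that ablation, here's a minimal sketch of what a small text classifier regularized with dropout might look like; the layer sizes and Keras usage are my own illustrative choices, not the actual model from the post:

    # Minimal sketch: a tiny text classifier regularized mainly by dropout.
    # Layer sizes are illustrative, not the post's actual architecture.
    from tensorflow import keras
    from tensorflow.keras import layers

    def build_model(vocab_size=5000, embed_dim=50, dropout_rate=0.5):
        model = keras.Sequential([
            layers.Embedding(vocab_size, embed_dim),
            layers.GlobalAveragePooling1D(),
            layers.Dropout(dropout_rate),   # randomly zero half the units each training step
            layers.Dense(16, activation="relu"),
            layers.Dropout(dropout_rate),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    # For the ablation: train once as-is and once with dropout_rate=0.0,
    # then compare validation performance on the same split.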
I've used both. fastText will openly tell you that it's basically Vowpal Wabbit but faster (hence the name), with a way to stick word embeddings in without pre-processing, and with fewer ways to shoot yourself in the foot.
If you're interested in Sequence to Sequence tasks (e.g. neural machine translation or abstractive summarization) with small data, check out our recent paper from Google Brain tackling this problem (disclaimer, I'm the first author): https://arxiv.org/abs/1611.02683
Curious how you guys got training data for this. Did someone have to go through and rate whether or not each sentence was quality? And how many training examples did you use? You say it was "difficult to develop a large set", but I'm curious how large that set actually was.
Edit: Also, do you think more data or a "better"/"more sophisticated" model would improve the results? I would guess more data would trump a better model, but I'm not sure.
I actually had this issue recently when trying to get training data for a project of mine as well [0], so I built an app [1] as a way to more easily classify documents.
Basically I have simpler interfaces and the ability for multiple people to quickly answer questions like this on a set of data. Easily exportable in the end as well. If you're interested in using that to get some more data on sentences, let me know. I'm really curious how much better the results get with more data, and this could help.
This is a really great idea. If there is something you can share along these lines, that would be amazing. I know CrowdFlower has a great "internal only" tool, which is kind of similar to what you are designing, but you have to pay for it. I think there is a huge need for a generic tool along the lines of what you have started to build.
Haven't heard of CrowdFlower but yeah this is along those lines. Pretty similar. But I could definitely make something quick to fit this specifically. I've been looking for other uses along with what I'm doing and this fits exactly. Shoot me an email at the address listed on my profile and I can get going.
The comparison between tfidf on words and the char CNN isn't a fair one. You should use char ngrams along with LR, and that will beat all of your classifiers with high probability. This is because your char CNN doesn't have enough data to learn all of the useful char ngrams. Extracting them as preprocessing and passing them to LR is in practice always better on small datasets. You can go one step further, add layers, and test an MLP on your char ngrams.
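For reference, a minimal sketch of that baseline in scikit-learn; the n-gram range and regularization strength are my own guesses rather than tuned values:

    # Character n-grams fed to logistic regression, as suggested above.
    # Hyperparameters here are illustrative, not tuned for this dataset.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    char_lr = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # char n-grams within word boundaries
        LogisticRegression(C=1.0, max_iter=1000),
    )

    # texts: list of sentences, labels: 0/1 quality labels (not shown here)
    # scores = cross_val_score(char_lr, texts, labels, cv=5, scoring="roc_auc")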
It's worth keeping in mind that learning from few examples is not such a big deal. What is really hard to do (and a long-standing problem in machine learning) is learning a model that generalises well to unseen data. So the question is: does the OP really show good generalisation?

It's hard to see how one would even begin to test this, in the case of the OP. The OP describes an experiment where a few hundred instances were drawn from a set of 50K, and used both for training and testing (by holding out a few, rather than cross-validating, if I got that right).

I guess one way to go about it is to use the trained model to label your unseen data (the rest of the 50K) and then go through that model-labelled data by hand, and try to figure out how well the model did.
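In code, that audit might look roughly like this; the model and variable names are placeholders for illustration, not anything from the post:

    # Score the unlabelled descriptions with the trained model, then sample
    # from each predicted bucket for manual review. `model` is any fitted
    # classifier exposing predict_proba; names are illustrative only.
    import random

    def sample_for_review(model, unlabelled_texts, n_per_bucket=50, threshold=0.5):
        probs = model.predict_proba(unlabelled_texts)[:, 1]   # P(label=1) for each description
        good = [t for t, p in zip(unlabelled_texts, probs) if p >= threshold]
        bad = [t for t, p in zip(unlabelled_texts, probs) if p < threshold]
        # Hand-checking these samples gives a rough estimate of how the model
        # behaves on data it never saw during training or testing.
        return (random.sample(good, min(n_per_bucket, len(good))),
                random.sample(bad, min(n_per_bucket, len(bad))))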
We're talking here about natural language, however, where the domain is so vast that even the full 50K instances are very few to learn well. That doesn't have anything to do with the model being trained, deep or shallow. It has everything to do with the fact that you can say the same thing in 100K different ways, and still not exhaust all the ways to say that one thing. So 50K examples are either not enough examples of different ways to say the same thing, or not enough examples of the different things you can say, or, most probably, both.

It's also worth remembering that deep nets can overfit much worse than other methods, exactly because they are so good at memorising training data. It's very hard to figure out what a deep net is really learning, but it would not be at all surprising to find out that your "powerful" model is just a very expensive alternative to Ctrl + C.
There's something strange about the ROC curve here. It seems that the feature engineered and logistic regression methods can pick out some examples very easily (20% true positive rate at a very low false positive rate) but the CNN seems to not be able to make many predictions at a low false positive rate. It then catches up later. It's almost like it can't pick out the easy examples, but does just as good a job on the harder ones.
This is a great point and would be worth further investigation, and I agree with your general interpretation. It would be interesting to look further at where the CNN fails to detect bad descriptions that the feature-engineered model picks up.
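One rough way to dig into that low-false-positive region, sketched below; the probability arrays are assumptions on my part, not outputs from the post:

    # Compare the two models where the ROC curves diverge: the best TPR each
    # achieves at a low FPR, and the true positives that LR scores highly but
    # the CNN does not. `lr_probs` and `cnn_probs` are assumed score arrays
    # on the same held-out examples.
    import numpy as np
    from sklearn.metrics import roc_curve

    def tpr_at_low_fpr(y_true, probs, max_fpr=0.05):
        fpr, tpr, _ = roc_curve(y_true, probs)
        return tpr[fpr <= max_fpr].max()   # best recall at <= 5% false positives

    def lr_only_hits(y_true, lr_probs, cnn_probs, threshold=0.9):
        y_true, lr_probs, cnn_probs = map(np.asarray, (y_true, lr_probs, cnn_probs))
        return np.where((y_true == 1) & (lr_probs >= threshold) & (cnn_probs < threshold))[0]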
The link to LIME looks a bit out of place. LIME is an algorithm for explaining classifier decisions, which is most useful when you can't inspect weights and map them back to features. For TF*IDF + Logistic Regression there is no need to use LIME; one can just use the weights and feature names directly. LIME is more helpful for all the other models (there are a lot of caveats though), not for the basic tfidf + linear classifier model.
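For the linear baseline, that direct inspection is only a few lines; a sketch, assuming a fitted scikit-learn vectorizer and classifier:

    # Pair each tfidf feature name with its logistic-regression coefficient
    # and show the strongest ones in each direction. No LIME needed.
    import numpy as np

    def top_features(vectorizer, clf, n=20):
        names = np.array(vectorizer.get_feature_names_out())
        coefs = clf.coef_.ravel()
        order = np.argsort(coefs)
        most_negative = list(zip(names[order[:n]], coefs[order[:n]]))    # push toward class 0
        most_positive = list(zip(names[order[-n:]], coefs[order[-n:]]))  # push toward class 1
        return most_negative, most_positive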
I can't seem to find where the sample size is mentioned. It mentions that Quid has 50,000 company descriptions, but is n=50,000 tiny in the ML/deep learning world?
I do neuroscience research, and where I'm coming from I have maybe n=150 to 200 per class. And that is not generally regarded as a tiny sample.
The issue is those 50,000 descriptions aren't labeled good/bad. Someone had to pick a subset of them and label them, so my guess is they did this for maybe 100 or 200 descriptions.
n=50000 of tabular data is a good sample size, and results will likely have a low standard error assuming no systemic bias. (Although it's not "big" data)
n=50000 of text data is different, since there will be less repetition of contextual structures and words (particularly with proper nouns). The fact that the dataset only uses "hundreds" as mentioned in the original post is interesting.
50K is not that tiny, but I'd say a few hundred is pretty small. Hundreds of training examples is pretty normal in academia, but typically quite small in industry, particularly for problems that are as abstract as this one (sentiment is another similar problem).
> A downfall of CNNs for text is that unlike for images, the input sequences are varying sizes (i.e., varying size sentences), which means most text inputs must be “padded” with some number of 0’s, so that all inputs are the same size.
Actually, Kim's model that you're using doesn't require padding, because it uses max-over-time pooling.
Also, kudos for NOT updating your word embeddings during training! A lot of people do it, but IMHO it's a mistake most of the time.
Are you sure about the padding? On page 1746, bottom right it says "padded as necessary". And intuitively it makes sense that all your inputs need to be the same size for a CNN.
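One way to reconcile the two readings (my interpretation, not a claim about the paper): with max-over-time pooling the architecture itself never fixes a sequence length, while in practice sentences still get padded to a common length within each batch, which would be the "padded as necessary" part. A minimal Keras-style sketch, with illustrative hyperparameters rather than Kim's exact settings:

    # A Kim-style text CNN with global max pooling over time. Because
    # GlobalMaxPooling1D collapses the time dimension, the input length is
    # left as None; padding only matters when sentences are batched together.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(None,), dtype="int32"),   # variable-length token ids
        layers.Embedding(input_dim=20000, output_dim=100),
        layers.Conv1D(filters=100, kernel_size=3, activation="relu"),
        layers.GlobalMaxPooling1D(),                 # max over time: fixed-size output
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")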
Since word embeddings were the starting point, I'm wondering what would the impact be if they'd stretched the vector sequences to the same length using linear or whatever interpolation as opposed to zero padding the sentences.
Yeah, that's what I was trying to say. Saying you can construct a deep graph using small data feels like saying "I'm gonna become a millionaire in ten years, only by operating this single lemonade stand."
I think deep learning can be seen as a class of more flexible machine learning techniques that use neural networks (usually with quite a few layers).
This post is a joke. Seriously, it amazes me that the entire industry seems fixated on a handful of techniques, just like they were with random forests 10 years ago, just like they were with SVMs ten years before that, and just like they were with basic neural networks before that. There's a simpler way; nature almost requires it.
[1] https://github.com/facebookresearch/fastText