How Quid uses deep learning with small data (quid.com)
113 points by Nimsical on Nov 18, 2016 | 47 comments


The baseline I'd like to see this compared to is the not-very-deep-learning "bag of tricks" that's conveniently implemented in fastText [1].

[1] https://github.com/facebookresearch/fastText


Great point! I considered using fastText as a baseline; however, in practice fastText really didn't work well at all with the small data set, much worse than the tfidf baseline. I think fastText's classification approach might not work well with such a small dataset. I'm not sure, but I suspect it's because it tries to learn embeddings, and there just isn't anywhere near enough data for that. I'd love an outside perspective on this.
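
For anyone who wants to reproduce the comparison, here's roughly what running the fastText baseline looks like with the Python bindings. The file name, label and hyperparameters below are placeholders, not what the post used:

    # Rough sketch of a fastText supervised baseline; values are placeholders.
    import fasttext

    # each line of train.txt looks like: "__label__good Acme builds industrial robots."
    model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)
    print(model.predict("A leading provider of innovative solutions."))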


Fair enough. That's a useful comparison to know about.

But I'm wondering how you get around that with the neural net. In the post, you said there are only a few hundred labeled examples, right? How can a neural net with hundreds of parameters set those parameters to anything reasonable, and not overfit, when there are about as many parameters as examples?


Great question, and I share your intuition, but I think it's all about properly regularizing your model. For neural networks, dropout works really darn well as a regularization strategy. I could have tried to see whether performance dropped significantly without dropout.
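
In case it's useful, here is a minimal Keras-style sketch of "small text CNN + dropout"; the layer sizes, dropout rate and input length are illustrative, not the exact architecture from the post:

    # Minimal sketch: a small text CNN where dropout is the main regularizer.
    from tensorflow.keras import layers, models

    max_len, embed_dim = 100, 300  # padded sentence length, pretrained word-vector size

    model = models.Sequential([
        layers.Conv1D(64, 3, activation="relu", input_shape=(max_len, embed_dim)),
        layers.GlobalMaxPooling1D(),            # max-over-time pooling
        layers.Dropout(0.5),                    # randomly zero half the units each update
        layers.Dense(1, activation="sigmoid"),  # probability the sentence is "good"
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Dropout keeps units from co-adapting, which is why it helps so much when the parameter count dwarfs the number of labeled examples.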


Why not Vowpal Wabbit? fastText is practically an ad-hoc version of the mathematically proven Vowpal Wabbit.


I've used both. fastText will openly tell you that it's basically Vowpal Wabbit but faster (hence the name), with a way to stick word embeddings in without pre-processing, and with fewer ways to shoot yourself in the foot.


If you're interested in Sequence to Sequence tasks (e.g. neural machine translation or abstractive summarization) with small data, check out our recent paper from Google Brain tackling this problem (disclaimer, I'm the first author): https://arxiv.org/abs/1611.02683


Thanks! I'll check it out. I have also been reading about abstractive summarization - hard problem it seems!


Curious how you guys got training data for this. Did someone have to go through and rate whether or not a sentence was quality? And how many training examples did you use? You say it was "difficult to develop a large set" but I'm curious how large that set actually was.

Edit: Also, do you think more data or a "better" or "more sophisticated" model would make the results better? I would guess more data would trump a better model, but I'm not sure.


Thanks for the comment! There were a few hundred sentences of each, collected internally from a wide range of descriptions.

Yes, I'd definitely agree: more data is what we need here for further model improvements.


I actually had this issue recently when trying to get training data for a project of mine as well [0], so I built an app [1] as a way to more easily classify documents.

Basically I have simpler interfaces and the ability for multiple people to quickly answer questions like this on a set of data. Easily exportable in the end as well. If you're interested in using that to get some more data on sentences, let me know. I'm really curious how much better the results get with more data, and this could help.

[0] https://bigishdata.com/2016/11/01/classifying-country-music-...

[1] https://fierce-mountain-21498.herokuapp.com/


This is a really great idea. Actually, if there is something you can share along these lines, that would be amazing. I know CrowdFlower has a great "internal only" tool, which is kind of similar to what you are designing, but you have to pay for it. I think there is a huge need for a generic tool along the lines of what you have started to build.


Haven't heard of CrowdFlower but yeah this is along those lines. Pretty similar. But I could definitely make something quick to fit this specifically. I've been looking for other uses along with what I'm doing and this fits exactly. Shoot me an email at the address listed on my profile and I can get going.


The comparison between tfidf on words and the char CNN is wrong. You should use char ngrams along with LR, and that will beat all your classifiers with high probability. This is because your char CNN doesn't have enough data to learn all the useful char ngrams. Extracting them as preprocessing and passing them to LR is in practice always better on small datasets. You can go one step further, add layers, and test an MLP on your char ngrams.
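
For reference, that baseline is only a few lines in scikit-learn; the n-gram range, regularization strength and toy sentences below are placeholders, not values from the post:

    # Sketch of the suggested baseline: char n-gram tfidf features into logistic regression.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["Acme builds industrial welding robots.",
             "A leading provider of innovative, world-class solutions."]
    labels = [1, 0]  # 1 = informative, 0 = generic

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # char n-grams within word boundaries
        LogisticRegression(C=1.0, max_iter=1000),
    )
    clf.fit(texts, labels)
    print(clf.predict_proba(["We deliver synergistic enterprise value."]))

With so few labeled sentences, the linear model over explicit n-gram features has far fewer effective degrees of freedom than a CNN that has to learn its filters from scratch.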


Thanks, would you mind expanding? I also played around with some char CNNs. They had similar performance.


Could you elaborate? Is LR linear regression?


It's worth keeping in mind that learning from few examples is not such a big deal. What is really hard to do (and a long-standing problem in machine learning) is learning a model that generalises well to unseen data.

So the question is: does the OP really show good generalisation?

It's hard to see how one would even begin to test this, in the case of the OP. The OP describes an experiment where a few hundred instances were drawn from a set of 50K, and used both for training and testing (by holding out a few, rather than cross-validating, if I got that right).

I guess one way to go about it is to use the trained model to label your unseen data (the rest of the 50k) and then go through that model-labelled data by hand, and try to figure out how well the model did.

We're talking here about natural language, however, where the domain is so vast that even the full 50k instances are very few to learn it well. That doesn't have anything to do with the model being trained, deep or shallow. It has everything to do with the fact that you can say the same thing in 100k different ways, and still not exhaust all the ways to say that one thing. So 50k examples are either not enough examples of different ways to say the same thing, or not enough examples of the different things you can say, or, most probably, both.

It's also worth remembering that deep nets can overfit much worse than other methods, exactly because they are so good at memorising training data. It's very hard to figure out what a deep net is really learning, but it would not be at all surprising to find out that your "powerful" model is just a very expensive alternative to Ctrl + C.

It's just memorised your examples, see?


There's something strange about the ROC curve here. It seems that the feature engineered and logistic regression methods can pick out some examples very easily (20% true positive rate at a very low false positive rate) but the CNN seems to not be able to make many predictions at a low false positive rate. It then catches up later. It's almost like it can't pick out the easy examples, but does just as good a job on the harder ones.
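
One way to quantify that observation is to read the best achievable TPR off each curve at a fixed low FPR; a toy scikit-learn sketch (the labels and scores below are made up):

    # Toy sketch: compare models in the low-false-positive-rate region.
    import numpy as np
    from sklearn.metrics import roc_curve

    y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])  # one model's predicted probabilities

    fpr, tpr, _ = roc_curve(y_true, scores)
    mask = fpr <= 0.1
    print("best TPR at FPR <= 0.1:", tpr[mask].max() if mask.any() else 0.0)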


This is a great point; it would be worth further investigation. And I agree with your general interpretation. It would be interesting to look further at where the CNN is failing to detect bad ones and where the feature-engineered one picks them up.


The link to LIME looks a bit out of place: LIME is an algorithm for explaining classifier decisions that is most useful when you can't inspect weights and map them back to features. For TF*IDF + Logistic Regression there is no need to use LIME; one can just use weights and feature names directly. LIME is more helpful for all the other models (there are a lot of caveats though), not for the basic tfidf + linear classifier model.
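
For example, a rough sketch of reading the explanation straight off a tfidf + logistic regression model (the toy data here is made up, not the post's):

    # The "explanation" is just the feature names paired with the learned weights.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["builds welding robots for car plants",
             "leading provider of innovative world class solutions"]
    labels = [1, 0]

    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    clf = LogisticRegression().fit(X, labels)

    terms = np.array(vec.get_feature_names_out())
    order = np.argsort(clf.coef_[0])          # most negative to most positive weights
    print("most 'generic' terms:    ", terms[order[:3]])
    print("most 'informative' terms:", terms[order[-3:]])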


This is actually a great point. Thanks for sharing. I should maybe consider removing LIME in that context or changing the wording.


I can't seem to find where the sample size is mentioned. It mentions that Quid has 50,000 company descriptions, but is n=50,000 tiny in the ML/deep learning world?

I do neuroscience research, and where I'm coming from I have maybe n=150 to 200 per class. And that is not generally regarded as a tiny sample.


The issue is those 50,000 descriptions aren't labeled good/bad. Someone had to pick a subset of them and label them, so my guess is they did this for maybe 100 or 200 descriptions.


This is correct. We had 300-400 examples of each.


n=50000 of tabular data is a good sample size, and results will likely have a low standard error assuming no systemic bias. (Although it's not "big" data)

n=50000 of text data is different, since there will be less repetition of contextual structures and words (particularly with proper nouns). The fact that the dataset only uses "hundreds" as mentioned in the original post is interesting.


50K is not that tiny, but in the hundreds is pretty small. Hundreds of training examples is pretty normal in academia, but typically quite small in industry, particularly for problems that are actually quite abstract, such as this one (sentiment is another similar problem).


> A downfall of CNNs for text is that unlike for images, the input sequences are varying sizes (i.e., varying size sentences), which means most text inputs must be “padded” with some number of 0’s, so that all inputs are the same size.

Actually Kim's model you're using doesn't require padding because it uses k-Max over time pooling.

Also, kudos for NOT updating your word embeddings during training! A lot of people are doing it, but IMHO it's a mistake most of the time.


Are you sure about the padding? On page 1746, bottom right it says "padded as necessary". And intuitively it makes sense that all your inputs need to be the same size for a CNN.
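
For reference, here's roughly what that zero-padding step looks like with Keras' pad_sequences; the word indices and length are toy values, not the post's actual preprocessing:

    # Varying-length index sequences are padded/truncated to one fixed length.
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    seqs = [[4, 7, 12], [9, 3], [5, 8, 2, 6, 1]]  # word indices, varying lengths
    print(pad_sequences(seqs, maxlen=5, padding="post"))
    # [[ 4  7 12  0  0]
    #  [ 9  3  0  0  0]
    #  [ 5  8  2  6  1]]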


About the "detecting generic text that conveys little information" part:

Can I have that for my email? (Seriously.) And as a browser plugin? Oh, and on the telephone, TV, radio, and in real life would also be nice.

It's probably also a nice predictor of startup success, developer quality and sales guy effectiveness.

I just wonder if I would ever read or hear a politician again.

Very inspiring...


Haha, absolutely. It takes a lot of intelligence to detect non-informativeness. You might enjoy: http://journal.sjdm.org/15/15923a/jdm15923a.pdf


What's the difference between a softmax with categorical loss ([0, 1]) and a sigmoid with binary loss ([0/1])?
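
One way to see how they relate: with two classes, a softmax over logits (z1, z2) reduces to a sigmoid of their difference, so the two formulations can express the same decision functions. A toy numpy check (my own example, not from the post):

    # softmax([z1, z2])[0] == sigmoid(z1 - z2) in the binary case
    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    z1, z2 = 1.3, -0.4
    p_softmax = softmax(np.array([z1, z2]))[0]
    p_sigmoid = 1.0 / (1.0 + np.exp(-(z1 - z2)))
    print(p_softmax, p_sigmoid)  # both ~0.8455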


Since word embeddings were the starting point, I'm wondering what the impact would be if they'd stretched the vector sequences to the same length using linear (or whatever) interpolation, as opposed to zero-padding the sentences.


Could you elaborate? I'm not sure I follow.


Why do you not have a dev dataset? Grid search over your test dataset is bound to overfit.
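
In other words, something like a three-way split, where the test set is only touched once at the very end; a rough scikit-learn sketch with stand-in data:

    # Train / dev / test split so grid search never sees the test set.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(300, 5)             # stand-in features
    y = np.random.randint(0, 2, size=300)  # stand-in labels

    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
    # tune hyperparameters against (X_dev, y_dev); evaluate once on (X_test, y_test)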


Wait, isn't "deep learning with small data" just machine learning, after all the buzzwords cancel themselves out?

I thought the whole point of "deep learning" is its approach to using data.


The point of deep learning is using a deep graph, like a neural network with a lot of layers, not the amount of data.

However, picking millions of parameters with small amounts of data is unlikely to work well.


It seems, at first blush, like using a very complex model to fit a very small amount of data is a recipe for some serious overfitting.


I would agree. It's kind of amazing that it does indeed seem to work better than simpler models.


Yeah, that's what I was trying to say. Saying you can construct a deep graph using small data feels like "I'm gonna become a millionaire in ten years, only by operating this single lemonade stand."


Is all machine learning called "deep learning" now? Where is the line between "normal" machine learning algorithms and deep learning?


I think deep learning can be seen as a class of machine learning techniques with more flexibility, built on neural networks (usually with quite a few layers).


I used to know some Quid people back in the day, good to see them here (and that they're international now, congrats to them).


Really nice article -- easy to follow, sensible steps, clean code. I haven't done too much text ML and this was a nice piece to follow - thanks!


I appreciate this!


What about FastText?


This post is a joke. Seriously, it amazes me that the entire industry seems fixated on a handful of techniques, just like they were with random forests 10 years ago, just like they were on SVMs ten years before that, just like they were on basic neural networks before that. There's a simpler way, nature almost requires it.


> There's a simpler way, nature almost requires it

This is a normative statement; do you have empirical evidence?



