These kinds of systems almost always use an n-gram model to generate text. It's a fascinating and surprisingly effective tool (surprising because implementing one is extremely simple).
Basically, it's based on collecting data from a corpus about the probabilities of word transitions -- the probability of the "next word" given the n-1 preceding words. So n-gram models often produce sentences that look locally correct over a window of a few words, but don't make much sense as a whole. The higher the "n", the more the generated sentences come to resemble the writing in the corpus. By the time you get to quadrigrams, the model usually just reproduces sentences verbatim from the text, because the data becomes very sparse (how many times do you see "then I turned" in a corpus of tweets, for instance?), unless you have a very large corpus.
Like many NLP models, the n-gram model has seen plenty of variations and tweaks with differing levels of effectiveness; different tweaks often produce more believable results for different corpora or varieties thereof.
The math is quite simple. Let's take n=2, also called "bigrams" (which gives you a plain Markov chain over words). In Pythonic pseudocode (because math notation doesn't work on HN) you could create a frequency distribution with:
    # count transitions: how often each word follows the previous one
    for i in range(1, len(words)):
        model[words[i-1]][words[i]] += 1
assuming that "model" is a nested dictionary where keys have a default value of 0. then normalise:
    for prior, followers in model.items():
        total = sum(followers.values())
        for word in followers:
            followers[word] /= total
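To actually generate text you then just walk the chain, sampling each next word from the distribution for the previous one. Something like this (a rough sketch; the function name and defaults are just illustrative):

    import random

    def generate(model, start, length=20):
        # walk the chain: at each step pick the next word, weighted by its
        # transition probability from the current word
        out = [start]
        for _ in range(length):
            followers = model.get(out[-1])
            if not followers:
                break  # dead end: this word never appeared as a prior
            choices, weights = zip(*followers.items())
            out.append(random.choices(choices, weights=weights)[0])
        return " ".join(out)

This works on either the raw counts or the normalised probabilities, since random.choices only cares about relative weights.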
Thanks! I'm taking an information retrieval course right now and I'm interested in applying what I've learned to a cool pet project. I don't think we ever touched on n-gram models, for some reason.
This isn't information retrieval. This is data processing. Information retrieval is a subset of data processing.
Retrieval specifically needs an algorithm to determine document relevance. Everything you're learning is to understand how different parts of that algorithm affect the results. It's a very difficult problem, even if you assume that the corpus isn't sapient.
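To give a flavour of what "an algorithm to determine document relevance" looks like in its simplest form, here's a toy TF-IDF plus cosine-similarity sketch (just the textbook baseline; all the names here are mine, not from any particular course or library):

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs is a list of token lists; returns one {term: weight} vector per doc
        df = Counter(term for doc in docs for term in set(doc))
        n = len(docs)
        return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
                for doc in docs]

    def cosine(a, b):
        # score a query vector against a document vector
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * \
               math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

Real engines layer a lot on top of this (BM25, link analysis, learning to rank, ...), which is where the "very difficult problem" part comes in.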
Stuff like n-grams is more about reshuffling the data to expose patterns. It's a bit like fitting a regression to noisy data to see the underlying trend.
I learned about n-grams in a natural language processing class in uni.
A related (but more interesting, imho) concept is the Hidden Markov Model, which is used for things like part-of-speech tagging and areas of speech recognition. It takes a sequence of "observations", like sound vectors or words, and uses a probabilistic model to match them with "hidden states", like phonemes or parts of speech.
I got a job with a part-of-speech tagging website I made as a pet project for that class :)
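If anyone wants to poke at the idea, the decoding half of an HMM (finding the most likely hidden-state sequence for a sequence of observations) is just the Viterbi algorithm, and it fits in about a dozen lines. Rough sketch, assuming the probability tables have already been estimated from tagged data:

    def viterbi(obs, states, start_p, trans_p, emit_p):
        # best[s] = (prob of the best state path ending in state s, that path)
        best = {s: (start_p[s] * emit_p[s].get(obs[0], 1e-12), [s]) for s in states}
        for o in obs[1:]:
            nxt = {}
            for s in states:
                prob, prev = max(
                    (best[p][0] * trans_p[p].get(s, 1e-12) * emit_p[s].get(o, 1e-12), p)
                    for p in states)
                nxt[s] = (prob, best[prev][1] + [s])
            best = nxt
        return max(best.values(), key=lambda v: v[0])[1]

A real tagger would work in log space (these products underflow fast) and smooth unseen words properly instead of the 1e-12 hack.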
This isn't quantum physics, and that isn't the implication of the double-slit experiment. This is new-age pseudoscience BS that uses the buzzword "quantum" to sound smart.
So pretentious. Did a teenager write this?