These kinds of systems almost always use an n-gram model to generate text. It's a fascinating and surprisingly effective tool (surprising because implementing one is extremely simple).
Basically, it's based on collecting data from a corpus about the probabilities of word transitions -- the probability of the "next word" given the n-1 preceding words. So n-gram models often produce sentences that look locally correct over a window of a few words, but don't make much sense as a whole. The higher the "n", the more the generated sentences come to resemble the writing in the corpus. By the time you get to quadrigrams, the model usually just reproduces sentences verbatim from the text, because the data becomes very sparse (how many times do you see "then I turned" in a corpus of tweets, for instance?), unless you have a very large corpus.
Like many NLP models, the n-gram model has seen plenty of variations and tweaks with differing levels of effectiveness; different tweaks often produce more believable results for different corpora or varieties thereof.
The math is quite simple. Let's take n=2, also called "bigrams" (which gives you a plain Markov chain over words). In Pythonic pseudocode (because math notation doesn't work on HN) you could create a frequency distribution with:
    # count transitions: how often each word follows the previous one
    for i in range(1, len(words)):
        model[words[i-1]][words[i]] += 1
assuming that "model" is a nested dictionary where keys have a default value of 0. then normalise:
    for prior, followers in model.items():
        total = sum(followers.values())
        for word in followers:
            followers[word] /= total
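To actually generate text you then just walk the chain, sampling each next word from the distribution for the previous one. Something like this (a rough sketch; the function name and defaults are just illustrative):

    import random

    def generate(model, start, length=20):
        # walk the chain: at each step pick the next word, weighted by its
        # transition probability from the current word
        out = [start]
        for _ in range(length):
            followers = model.get(out[-1])
            if not followers:
                break  # dead end: this word never appeared as a prior
            choices, weights = zip(*followers.items())
            out.append(random.choices(choices, weights=weights)[0])
        return " ".join(out)

This works on either the raw counts or the normalised probabilities, since random.choices only cares about relative weights.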
Thanks! I'm taking an information retrieval course right now and I'm interested in applying what I've learned to a cool pet project. I don't think we ever touched on n-gram models, for some reason.
This isn't information retrieval. This is data processing. Information retrieval is a subset of data processing.
Retrieval specifically needs an algorithm to determine document relevance. Everything you're learning is to understand how different parts of that algorithm affect the results. It's a very difficult problem, even if you assume that the corpus isn't sapient.
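To give a flavour of what "an algorithm to determine document relevance" looks like in its simplest form, here's a toy TF-IDF plus cosine-similarity sketch (just the textbook baseline; all the names here are mine, not from any particular course or library):

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs is a list of token lists; returns one {term: weight} vector per doc
        df = Counter(term for doc in docs for term in set(doc))
        n = len(docs)
        return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
                for doc in docs]

    def cosine(a, b):
        # score a query vector against a document vector
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * \
               math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

Real engines layer a lot on top of this (BM25, link analysis, learning to rank, ...), which is where the "very difficult problem" part comes in.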
Stuff like n-grams is more about reshuffling the data to expose patterns. It's a bit like fitting a regression to noisy data to see the underlying trend.
I learned about n-grams in a natural language processing class in uni.
A related (but more interesting, imho) concept is the Hidden Markov Model, which is used for things like part-of-speech tagging and areas of speech recognition. It takes a sequence of "observations", like sound vectors or words, and uses a probabilistic model to match them with "hidden states", like phonemes or parts of speech.
I got a job with a part-of-speech tagging website I made as a pet project for that class :)
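If anyone wants to poke at the idea, the decoding half of an HMM (finding the most likely hidden-state sequence for a sequence of observations) is just the Viterbi algorithm, and it fits in about a dozen lines. Rough sketch, assuming the probability tables have already been estimated from tagged data:

    def viterbi(obs, states, start_p, trans_p, emit_p):
        # best[s] = (prob of the best state path ending in state s, that path)
        best = {s: (start_p[s] * emit_p[s].get(obs[0], 1e-12), [s]) for s in states}
        for o in obs[1:]:
            nxt = {}
            for s in states:
                prob, prev = max(
                    (best[p][0] * trans_p[p].get(s, 1e-12) * emit_p[s].get(o, 1e-12), p)
                    for p in states)
                nxt[s] = (prob, best[prev][1] + [s])
            best = nxt
        return max(best.values(), key=lambda v: v[0])[1]

A real tagger would work in log space (these products underflow fast) and smooth unseen words properly instead of the 1e-12 hack.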
This isn't quantum physics, and that isn't the implication of the double-slit experiment. This is new-age pseudoscience BS that uses the buzzword "quantum" to sound smart.
So pretentious. Did a teenager write this?