Interesting! I started this project a long while ago and gradually added features and formats after running into performance issues with other file parsers (not Tika). Tika looks like a great solution if you don't mind the Java dependency.
Here's a JRuby wrapper: https://github.com/ricn/rika