Sure, and I did mention command line utilities. The issue I see with the benchmark is that it compares apples and oranges: for some programs it tests JIT performance, and for others it tests startup time. If I want to test startup time, I just use a hello world program. To test JIT/interpreter performance, I try to exclude startup time.
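For example, here's a minimal sketch of how I'd separate the two (the file name and workload are hypothetical):

    # bench.py -- times only the computation, from inside the process
    import time

    def workload():
        # a numeric kernel whose JIT/interpreter speed we want to measure
        total = 0
        for i in range(10 ** 7):
            total += i * i
        return total

    start = time.time()  # taken after the runtime has already loaded
    workload()
    print("compute only: %.3fs" % (time.time() - start))

Running it as `time jython bench.py` gives startup plus compute from the shell, while the printed figure excludes startup.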
Hello world programs only ever rely on the bare runtime, though; real utilities have to load libraries after the runtime loads, which incurs further delays (and those delays might depend on JIT/interpreter speed if the libraries aren't pre-compiled).
The same can be said about benchmarking numeric vs. I/O code. Is it apples and oranges and bananas, then? Most programs have a mix of everything. Whether Euler's problems are a good mix that represents your workload is up to you to decide.
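For what it's worth, most Euler problems boil down to tight numeric loops; here's Problem 1 as a sketch:

    # Project Euler problem 1: sum the multiples of 3 or 5 below 1000.
    # Pure loop-and-arithmetic work: it exercises the interpreter/JIT
    # but tells you nothing about I/O-heavy code.
    print(sum(n for n in range(1000) if n % 3 == 0 or n % 5 == 0))

So if your workload is mostly I/O, these benchmarks won't represent it well.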
In addition to what fauigerzigerk said, if you run a lot of command line utilities where startup time is relevant, there are established solutions for that (Nailgun, for example), so it still isn't a fair comparison. If you want a comparison of state-of-the-art solutions, which this post obviously aims to be, put Nailgun into the mix.
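As a rough sketch of what such a comparison could look like (this assumes the ng client is on PATH and a Nailgun server is already running with Jython on its classpath; hello.py is a placeholder script):

    import subprocess, time

    def avg_wall_time(cmd, runs=10):
        # average wall-clock time per invocation, startup included
        start = time.time()
        for _ in range(runs):
            subprocess.call(cmd)
        return (time.time() - start) / runs

    # Cold starts a fresh JVM per run; warm reuses the Nailgun server's JVM.
    # org.python.util.jython is Jython's main class, which the ng client
    # asks the already-running server to execute.
    cold = avg_wall_time(["jython", "hello.py"])
    warm = avg_wall_time(["ng", "org.python.util.jython", "hello.py"])
    print("cold: %.3fs  warm: %.3fs per run" % (cold, warm))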
Those utilities are often shell-scripted to run many times in a row; at least in interactive use, the multiplied startup times can grow annoyingly long.
You could use one of those Java background daemons to amortize that, but anyway, what I was trying to say is just that this benchmark doesn't test JIT or interpreter performance in the case of Jython.