My guess is that it's for efficiency. You have to dynamically allocate a string and then copy it to the destination if you separate "format" and "output". If you combine them, then you can directly output during formatting to your destination. Eliminating this inefficiency while separating "format" and "output" requires some sort of lazy interface or language hack.
Combining formatting and output is quite common. C has it in the printf family of functions (some of which, like sprintf, write into a character buffer instead of a stream). Similarly, Java’s PrintStream and PrintWriter have a format method, which can write to whatever the stream wraps, including a byte array which you can later convert to a string.
FORMAT executes a program written in a domain-specific language (the control string), and that program writes its output to a stream.
This does a bunch of operations. The print functions are executed on an output stream. If the destination argument is NIL, then FORMAT creates a string output stream and returns the generated string when it is done.
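To make the destination argument concrete, here is a small sketch. The control string and the variable names `*greeting*` and `*captured*` are made up for illustration; only FORMAT and WITH-OUTPUT-TO-STRING are standard.

```lisp
;; Destination NIL: FORMAT collects the output and returns it as a string.
(defparameter *greeting*
  (format nil "~a has ~d item~:p" "cart" 3))

;; Destination T: the same control string writes to *standard-output*.
(format t "~a has ~d item~:p~%" "cart" 3)

;; Destination is a stream: here, a string output stream we set up ourselves.
(defparameter *captured*
  (with-output-to-string (out)
    (format out "~a has ~d item~:p" "cart" 1)))
```

In all three cases the formatting work is identical; only the place the characters end up differs.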
I was replying to your original post and did not notice you had posted something different here. Formatting a string and then either returning it or sending it to stdout (or any other stream) have too much in common to be worth separating. The work is done in the formatting; sending is just an afterthought.
In a sense, this is similar to the parentheses and the prefix notation: it can be distracting if you are not used to it, but after a while it becomes very convenient.
And I do not mean to sound snobbish, but you can easily write a couple of macros to get different syntax for these two tasks if it really bothers you.
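For what it's worth, a rough sketch of what such macros might look like. The names `format-to-string` and `format-to-stream` are hypothetical, not part of the standard, and plain functions would do the job just as well.

```lisp
;; Hypothetical names; neither is defined by standard Common Lisp.
(defmacro format-to-string (control-string &rest args)
  "Format ARGS according to CONTROL-STRING and return the result as a string."
  `(format nil ,control-string ,@args))

(defmacro format-to-stream (stream control-string &rest args)
  "Format ARGS according to CONTROL-STRING and write the result to STREAM."
  `(format ,stream ,control-string ,@args))

;; Usage:
;; (format-to-string "~d + ~d = ~d" 1 2 3)
;; (format-to-stream *standard-output* "~a~%" "hi")
```

Both expand into an ordinary FORMAT call, so nothing is lost compared with calling FORMAT directly.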
#'format knows nothing about files; it only knows about streams, which is what its first argument designates.
A point I try to stress with new CL programmers is that if you write new I/O functions to expect a stream designator, they are much more flexible than if they expect a pathname.
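A minimal sketch of that advice, assuming a made-up report format. `print-report`, `save-report`, and `report-string` are hypothetical names for illustration. The core function takes a stream, and the file and string versions are thin wrappers around it.

```lisp
;; Hypothetical example: the core function only knows about streams.
(defun print-report (items &optional (stream *standard-output*))
  "Write each (NAME . COUNT) pair in ITEMS to STREAM, one per line."
  (dolist (item items)
    (format stream "~a: ~d~%" (car item) (cdr item))))

;; Writing to a file is just a matter of opening a stream first.
(defun save-report (items pathname)
  (with-open-file (out pathname :direction :output :if-exists :supersede)
    (print-report items out)))

;; Getting a string back reuses the same function unchanged.
(defun report-string (items)
  (with-output-to-string (out)
    (print-report items out)))
```

Had `print-report` taken a pathname instead, the string version (and output to a socket, a broadcast stream, and so on) would need its own copy of the formatting logic.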