There's a problem with language productivity comparisons that Armstrong mentions: it's impossible to write the same program in two different languages. If you have the same team write it, their second try will benefit from everything they learned the first time, and that learning effect is so huge a part of programming that it is sure to distort the outcome and may even dwarf any language effects. But if you use different teams instead, you've traded one confounding variable for another, the effect of switching teams, which is also hugely influential. Thus it's impossible to do an apples-to-apples comparison, and most such experiments deserve high skepticism. It's too easy to consciously or unconsciously engineer the outcome you expect, which is presumably why we nearly always hear that the experimenter's pet language won the day. Has the experimenter's pet language ever not won the day?
That makes me think of a more modest way to do these experiments that might return more reliable results: use the same team twice, but have them solve the problem in their favorite language first. That is, if A is the pet language and you want to compare A to B, write the program first in A and then in B. This biases the test in B's favor, because A will get penalized for all the time it took to learn about the problem, while B will get all that benefit for free. Since there's already a major bias in favor of A, this levels the playing field some.
Here's why I think this might be more reliable. If you run the experiment this way and A comes out much better, you now have an answer to the charge that the second time was easier: all that benefit went to B and B still lost. Conversely, if A doesn't come out much better, you now have evidence that the language effect isn't so great once you account for the learning effect.
This approach strikes me as the most realistic. If there is an existing codebase in the preferred language A, the obvious question, if B wins, is "Do we rewrite the code in B?"
The only downside, which I think the suggestion of adding a language C tries to defuse, is that whatever the team writes first will always stick. For example: if I wrote a version with dynamic typing (say in Ruby), then redid it with static typing (Haskell), of course I'm going to try to reuse types. The extreme example (Greenspun's tenth rule) is if my pet language A is Lisp: regardless of what B is, the team could try to write a half-baked Lisp runtime on top of B. The style carries over, and sometimes it doesn't translate exactly. I don't know how to solve this.
Good point—there are more effects than just "learning about the problem" that carry over to the next time you write the program. Once your brain has imprinted on a particular design for solving the problem, you'll probably carry that over to the next implementation. It may not be the design you'd have come up with if you were thinking in B in the first place and, short of erasing your memory and starting over, there's no way to test that.
If you want to compare languages A and B, have them write it in language C first for the domain knowledge, then the two languages. That seems like a better option than giving one a bias on purpose.
That doesn't take into account how good a language is for prototyping and experimenting. Even if I have to write in $blublang, I'll probably still prototype it in $petlang for the increased productivity and then port it once I have the architecture nailed down.
Write the program several times, alternating languages. Eventually, the programs in both languages should converge on their respective optimal lengths.
(I say "optimal" rather than "shortest" so we're not tempted to sacrifice clarity for concision.)
For some reason I assumed that the experiment was to measure development time rather than program length. Obviously that wasn't specified, since you assumed the opposite.