I'm really having trouble with this. My understanding is that EC2 provides an internal and external IP for each instance (http://docs.amazonwebservices.com/AmazonEC2/dg/2007-01-19/in...) as well as a semi-friendly DNS name. Each of these machines can certainly make its own requests to arbitrary URLs. I don't see how this is any different from a bunch of machines sitting in a data center with a shared, dedicated internet connection. Also, from my rather limited experience of crawling sites, the only time I drew negative attention was when I did not throttle my crawler. If you are properly caching to avoid repeated requests and throttling your requests, how are you going to get blocked, and why would it be different on EC2 versus a dedicated hosting center?
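For what it's worth, the "cache and throttle" discipline I mean is simple. Here's a minimal sketch (the class and parameter names are mine, and the fetch function is injected so it runs without a network):

```python
import time
from urllib.parse import urlparse

class PoliteFetcher:
    """Caching, per-domain-throttled fetcher. A sketch, not a full crawler."""

    def __init__(self, fetch, delay=1.0):
        self.fetch = fetch     # injected callable: url -> body
        self.delay = delay     # minimum seconds between hits to one domain
        self.cache = {}        # url -> body; repeat requests never hit the wire
        self.last_hit = {}     # domain -> monotonic time of last request

    def get(self, url):
        if url in self.cache:  # cached: no request made at all
            return self.cache[url]
        domain = urlparse(url).netloc
        wait = self.delay - (time.monotonic() - self.last_hit.get(domain, 0.0))
        if wait > 0:           # throttle: sleep out the remainder of the delay
            time.sleep(wait)
        self.last_hit[domain] = time.monotonic()
        body = self.fetch(url)
        self.cache[url] = body
        return body
```

A crawler built like this hits each domain at most once per `delay` seconds, which is exactly the behavior that kept me from getting blocked.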
So a few points (I assume you're talking about the Elastic IPs):
1. Yes, each instance can have its own IP, but by default, each account is limited to 5 IP addresses.
2. You can increase your limit, but my guess is that it's difficult to do so. You have to put forward a special request and have it approved.
You're right that blocking may not be a big issue, but crawling several different domains quickly will be hard.
Just so you know, we haven't encountered anyone doing large-scale crawling who considers AWS or the cloud in general to be a realistic option. The biggest reason is still the cost: the outbound transfer rates just don't make sense at scale.
Elastic IPs just give you the same functionality as static IPs; every instance has an IP, per the link I posted earlier. Every time you connect a new network device to any typical network, it gets an IP. I'm not sure how that relates to the scalability of the bandwidth.
You are limited to some number of instances (20? 50?) and yes, you have to fill out a form to get more. The earlier Animoto example shows how far you can go. I would wager that finding the funding for a large number of instances is more of a problem than getting the approval.
I don't see why crawling several different domains quickly would be hard. There shouldn't be any difference between a bunch of instances on EC2 and a bunch of machines in a data center, from a technical point of view.
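To make that concrete: you'd partition the domains across your workers the same way in either environment, e.g. with a stable hash, so each box owns a disjoint slice of the crawl. A sketch (function name is mine):

```python
import hashlib

def worker_for(domain, n_workers):
    """Assign each domain to exactly one worker via a stable hash.
    N EC2 instances and N data-center boxes partition the crawl identically;
    nothing here cares where the workers physically run."""
    h = int(hashlib.md5(domain.encode("utf-8")).hexdigest(), 16)
    return h % n_workers
```

Because the assignment depends only on the domain name and the worker count, two workers never fight over the same domain, and per-domain throttling stays local to one machine.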
As far as the cost argument goes, of course I agree with you. If you can project a high level of CPU/bandwidth usage for an extended period of time then of course you should buy dedicated servers.
The only argument I was trying to make was that crawling on EC2 or any other cloud provider is completely possible from a technical point of view; the only limitation is cost. The advantage of utility computing, as I see it, is that it offers a cheap way to handle bursty traffic, which you may well run into if your server utilization projections are off. I don't think you should use it as your primary set of servers if you can project some large volume of traffic.
You're right that the data center and the cloud will be very similar, but our assertion at 80legs is that both are very poor choices.
I'm not arguing that it's impossible to do crawling on the cloud. I'm saying it's near-impossible to do it at large scale on the cloud. 3500 instances is pretty good, but that will still be an order of magnitude slower than what 80legs is capable of.
Now, if you show me someone that has 10,000+ instances on the cloud, I may agree with you!