I'm really having trouble with this. My understanding is that EC2 provides an internal and external IP for each instance (http://docs.amazonwebservices.com/AmazonEC2/dg/2007-01-19/in...) as well as a semi-friendly DNS name. Each of these machines can certainly make its own requests to arbitrary URLs. I don't see how this is any different from a bunch of machines sitting in a data center with a shared, dedicated internet connection. Also, from my rather limited experience of crawling sites, the only time I drew negative attention was when I did not throttle my crawler. If you are properly caching to avoid repeated requests and throttling your requests, how are you going to get blocked, and why would it be different on EC2 versus a dedicated hosting center?
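For what it's worth, the "cache and throttle" discipline I mean is simple. Here's a minimal sketch (the class and parameter names are mine, and the fetch function is injected so it runs without a network):

```python
import time
from urllib.parse import urlparse

class PoliteFetcher:
    """Caching, per-domain-throttled fetcher. A sketch, not a full crawler."""

    def __init__(self, fetch, delay=1.0):
        self.fetch = fetch     # injected callable: url -> body
        self.delay = delay     # minimum seconds between hits to one domain
        self.cache = {}        # url -> body; repeat requests never hit the wire
        self.last_hit = {}     # domain -> monotonic time of last request

    def get(self, url):
        if url in self.cache:  # cached: no request made at all
            return self.cache[url]
        domain = urlparse(url).netloc
        wait = self.delay - (time.monotonic() - self.last_hit.get(domain, 0.0))
        if wait > 0:           # throttle: sleep out the remainder of the delay
            time.sleep(wait)
        self.last_hit[domain] = time.monotonic()
        body = self.fetch(url)
        self.cache[url] = body
        return body
```

A crawler built like this hits each domain at most once per `delay` seconds, which is exactly the behavior that kept me from getting blocked.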
So a few points (I assume you're talking about the Elastic IPs):
1. Yes, each instance can have its own IP, but by default, each account is limited to 5 IP addresses.
2. You can increase your limit, but my guess is that it's difficult to do so. You have to put forward a special request and have it approved.
You're right that blocking may not be a big issue, but crawling several different domains quickly will be hard.
Just so you know, we haven't encountered anyone doing large-scale crawling who considers AWS or the cloud in general to be a realistic option. The biggest reason is still the cost: the outbound transfer rates just don't make sense at scale.
Elastic IPs just give you the same functionality as static IPs; every instance has an IP, per the link I posted earlier. Every time you connect a new network device to any typical network, it gets an IP. I'm not sure how that relates to the scalability of the bandwidth.
You are limited to some number of instances (20? 50?) and yes, you have to fill out a form to get more. The earlier Animoto example shows how far you can go. I would wager that finding the funding for a large number of instances is more of a problem than getting the approval.
I don't see why crawling several different domains quickly would be hard. There shouldn't be any difference between a bunch of instances on EC2 and a bunch of machines in a data center, from a technical point of view.
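To make that concrete: you'd partition the domains across your workers the same way in either environment, e.g. with a stable hash, so each box owns a disjoint slice of the crawl. A sketch (function name is mine):

```python
import hashlib

def worker_for(domain, n_workers):
    """Assign each domain to exactly one worker via a stable hash.
    N EC2 instances and N data-center boxes partition the crawl identically;
    nothing here cares where the workers physically run."""
    h = int(hashlib.md5(domain.encode("utf-8")).hexdigest(), 16)
    return h % n_workers
```

Because the assignment depends only on the domain name and the worker count, two workers never fight over the same domain, and per-domain throttling stays local to one machine.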
As far as the cost argument goes, of course I agree with you. If you can project a high level of CPU/bandwidth usage for an extended period of time then of course you should buy dedicated servers.
The only argument I was trying to make was that crawling on EC2 or any other cloud provider is completely possible from a technical point of view; the only limitation is cost. The advantage of utility computing, as I see it, is that it offers a cheap way to handle bursty traffic, which you may well run into if your server utilization projections are off. I don't think you should use it as your primary set of servers if you can project some large volume of traffic.
You're right that the data center and the cloud will be very similar, but our assertion at 80legs is that both are very poor choices.
I'm not arguing that it's impossible to do crawling on the cloud. I'm saying it's near-impossible to do it at large scale on the cloud. 3500 instances is pretty good, but that will still be an order of magnitude slower than what 80legs is capable of.
Now, if you show me someone that has 10,000+ instances on the cloud, I may agree with you!