Most reverse proxies do the easy stuff, but all the hard stuff is in the client-side javascript/flash code. It'll probably take you a few hours to hack up one of the existing proxies out there and get "almost" what Plurchase has, but you'll soon see that menus won't render on Amazon and links just won't work, because the menu system is constructed completely dynamically -- most proxies won't take you that far. We have an html rewriter written in javascript on the client side to solve some of those issues, and a javascript rewriter, itself written in javascript, to proxify the javascript code. If you right-click and view source on one of the proxified frames, you'll see that we go to great lengths to make things work under the covers.
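To make the point concrete, here's roughly what the static half of that rewriting looks like (an illustrative sketch, not our actual code -- `rewriteLinks` and the `/proxy?url=` route are made up). It only catches URLs that are literally present in the markup, which is exactly why this approach falls over on dynamically constructed menus:

```javascript
// Toy sketch: rewrite absolute links in fetched HTML so they route back
// through the proxy. Real rewriters handle far more: relative URLs,
// <base> tags, srcset, inline event handlers, CSS url(...) references,
// and -- the hard part -- javascript that builds URLs at runtime,
// which a static regex pass can never see.
function rewriteLinks(html, proxyBase) {
  // Match href/src attributes pointing at absolute http(s) URLs.
  return html.replace(
    /\b(href|src)\s*=\s*"(https?:\/\/[^"]*)"/gi,
    (match, attr, url) => `${attr}="${proxyBase}${encodeURIComponent(url)}"`
  );
}
```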
Also, you'll notice that you can't simply ignore all images and static assets: some image requests actually modify cookie information, so you may need a different level of proxying there -- same goes for css/javascript assets. You'd be surprised what websites do under the covers. Even with all of our technology, there are still some sites that don't function properly and require tweaking in our proxy engine. I tested kayak.com yesterday on our proxy and it almost works, but there are still some weird issues going on that I'll have to debug further. I covered most of these details in a blog post -- had this been the web 1.0 days, proxying might actually have been easy, but then we also wouldn't have had the speed of modern javascript interpreters to do the client-side rewriting.
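The per-response decision ends up looking roughly like this (an illustrative sketch; the function name and header shape are made up): a static asset can be served directly unless the origin uses it for cookie side effects, in which case it has to stay on the proxied path so the proxy can see and rewrite the Set-Cookie headers.

```javascript
// Decide whether an asset response must flow through the proxy.
// `headers` is a plain object of response headers, names lowercased.
function mustStayProxied(contentType, headers) {
  const isStaticAsset =
    /^(image\/|text\/css|application\/javascript)/.test(contentType);
  const setsCookies = 'set-cookie' in headers;
  // Non-static content is always proxied; static assets only when they
  // carry cookie side effects the proxy needs to intercept.
  return !isStaticAsset || setsCookies;
}
```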
I agree with you on some of the ToS issues; we'd potentially have to work those out with merchants. And since we're white-labeling based on domains, we can remove a site if they don't want us there. But proxying shopping sites is unlike proxying other sites -- merchants want customers to buy stuff, and anything that helps a customer convert is in their best interest. Large merchants were actually interested in this, but they didn't want partnerships until they saw traction.
It also handles cookies and can dynamically determine whether or not it should proxify images and CSS assets.
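For a sense of what "handles cookies" involves: the browser only ever talks to the proxy's domain, so origin cookies have to be re-scoped to the proxy, commonly by folding the origin into the cookie name for later demultiplexing. A toy sketch of that approach (illustrative only; CGIProxy's actual encoding scheme differs in the details):

```javascript
// Rewrite one Set-Cookie header so the cookie sticks to the proxy's
// domain. Origin-scoping attributes (Domain, Path) are dropped, other
// attributes (Expires, Secure, ...) are kept, and the origin host is
// encoded into the cookie name so the proxy can route it back later.
function rewriteSetCookie(setCookie, originHost) {
  const [pair, ...attrs] = setCookie.split(';').map(s => s.trim());
  const eq = pair.indexOf('=');
  const name = pair.slice(0, eq);
  const value = pair.slice(eq + 1); // may itself contain '='
  const kept = attrs.filter(a => !/^(domain|path)=/i.test(a));
  const encodedName = `${encodeURIComponent(originHost)}--${name}`;
  return [`${encodedName}=${value}`, ...kept].join('; ');
}
```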
CGIProxy also handles flash.
In fact, CGIProxy can fully proxify an SSL-encrypted GMail session out-of-the-box.
CGIProxy is _very_ mature technology. Take a look at the source code -- it handles hundreds of special cases.
Some of the special cases that CGIProxy handles are truly bizarre. Reading through the CGIProxy source code is actually pretty fascinating.
This brings up an interesting point. If you're interviewing a software engineering candidate, ask them what their favorite language is. Next, ask them to write a simple HTTP server in that language. Finally, ask them to write a reverse HTTP proxy that works with the server they just wrote. This is a surprisingly effective way of separating the wheat from the chaff. Most people with CS degrees are completely incapable of doing this.
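For anyone who wants to try the exercise, the usual first stumbling block is parsing the raw request off the socket. A minimal sketch (javascript here for brevity; no chunked bodies, no header continuation lines -- just enough to drive a toy server or proxy):

```javascript
// Parse a raw HTTP/1.x request head into method, path, version,
// and a lowercased header map.
function parseRequest(raw) {
  const [head] = raw.split('\r\n\r\n');           // drop any body
  const [requestLine, ...headerLines] = head.split('\r\n');
  const [method, path, version] = requestLine.split(' ');
  const headers = {};
  for (const line of headerLines) {
    const i = line.indexOf(':');
    headers[line.slice(0, i).toLowerCase()] = line.slice(i + 1).trim();
  }
  return { method, path, version, headers };
}
```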
I downloaded it to take a look at the source. They do things that are similar to what we do, actually, and they have lots of great special-case handling that we might use for reference later. One crazy thing we do that isn't done there is the storage of cookies: we temporarily store cookies on the server side so that we can do fancy things with them -- sending around the shopping cart, transferring shopping state, etc.
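The server-side jar idea is simple to sketch (a toy in-memory version, names made up; a real deployment needs persistence, expiry, and access control). Because the cookies live on the server rather than in one browser, a cart accumulated under one session can be handed to another:

```javascript
// Server-side cookie jar keyed by (session, origin host).
class CookieJar {
  constructor() { this.store = new Map(); }
  key(session, host) { return `${session}|${host}`; }
  set(session, host, name, value) {
    const k = this.key(session, host);
    if (!this.store.has(k)) this.store.set(k, new Map());
    this.store.get(k).set(name, value);
  }
  // Build the Cookie header the proxy sends upstream for this session.
  headerFor(session, host) {
    const cookies = this.store.get(this.key(session, host));
    if (!cookies) return '';
    return [...cookies].map(([n, v]) => `${n}=${v}`).join('; ');
  }
  // Copy one session's cookies for a host to another session --
  // the "send around the shopping cart" trick.
  shareWith(fromSession, toSession, host) {
    const src = this.store.get(this.key(fromSession, host));
    if (src) this.store.set(this.key(toSession, host), new Map(src));
  }
}
```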
However, if you've tried using it, you'll notice that it isn't perfect either... they show slashdot.org as an example, but lots of things are broken on it, and most of the javascript stuff doesn't work. Slashdot was a javascript-light site a year ago, but now it uses tons of fancy javascript like most sites. Their cookie support also seems to be broken -- not sure why. That said, James did quite a bit of amazing work in just 12k lines of perl, and the code is pretty decent.
It's easy to get a proxy that works on 95% of the code; all the hard stuff is in that last 5% -- that's what we spent most of our time on :-)
I evaluated CGIProxy at the beginning of this project, along with a variety of apache mods and other libraries. CGIProxy didn't work for most of the merchant sites that I checked. It was an oversight to ignore it thereafter, though. Like you said, they've put a lot of time into special cases, and we can learn things from them.
Their primary goal is anonymous browsing. Ours is collaborative shopping, or more specifically, adding new functionality to existing sites. Our proxy lets us do amazing things with client state, image scraping, and more. A different tool for a different problem.
Well it doesn't seem to work all the way. I'm not sure what you meant when you said it has "_full_ javascript support" and is "very mature technology". I tried it out with amazon.com and none of the menus seem to be working...
I wonder how many of your resources will be tied up long-term trying to keep up with the ever-changing frontend code of the different shops and sites. It's something that keeps most of the teams doing javascript-injecting browser plugins really busy.
Once we get our Selenium farm going, hopefully not much. It takes about 10-15 minutes to write a script for a specific site; we just need tests that validate our scripts against each site every day.
It's not too hard to do something similar in perl.
1) start off by downloading a copy of CGIProxy: http://www.jmarshall.com/tools/cgiproxy/
2) Set up an Apache server running CGIProxy
3) Hack the interface box at the top of the screen to include facebook integration, a meebo chat box, etc.
The problem is that this sort of thing tends to _really_ upset webmasters. So you'll need to restrict proxy sessions to specific domains (zappos.com, etc.) whose owners have approved your running a proxy against their sites.
It's surprising how far you can take this with one server; what you want to do is to port CGIProxy so that it runs on a custom HTTP epoll server, instead of on Apache. Brad Fitzpatrick has some great perl code on CPAN that shows you how to do this.
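Whichever event loop you end up on, the per-request core doesn't change. As an illustration of the plumbing involved (a sketch in javascript, not CGIProxy's actual perl; names are made up): hop-by-hop headers have to be stripped before the request is replayed upstream, and Host has to be rewritten to point at the origin rather than the proxy.

```javascript
// Headers that are per-connection (RFC 2616 sec. 13.5.1) and must not
// be forwarded end-to-end by a proxy.
const HOP_BY_HOP = new Set([
  'connection', 'keep-alive', 'proxy-authenticate',
  'proxy-authorization', 'te', 'trailers',
  'transfer-encoding', 'upgrade',
]);

// Build the header map for the upstream request from the client's headers.
function upstreamHeaders(clientHeaders, originHost) {
  const out = {};
  for (const [name, value] of Object.entries(clientHeaders)) {
    if (!HOP_BY_HOP.has(name.toLowerCase())) {
      out[name.toLowerCase()] = value;
    }
  }
  out.host = originHost; // the browser sent the proxy's host, not the origin's
  return out;
}
```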
Once you have CGIProxy running on epoll (with epoll enabled in the kernel config, of course) you can easily handle thousands of concurrent clients on a single Amazon EC2 (small) instance.
You can disable proxification for images and media to enhance throughput (and to cut down on your bandwidth usage).
This is basically all that the Facebook platform is -- a glorified reverse HTTP proxy.
Reverse HTTP proxies are a neat hack. But, the excitement tends to wear off once you realize that's all they really are. The underlying websites against which you are proxifying retain all of the true value. Plus, you are beholden to their TOS (Terms of Service).
Some of the earliest applications of reverse HTTP proxies (back in the 90's) were for anonymizing web sessions. That tends not to work too well, because no one wants to assume all of that liability.
An obvious application of reverse HTTP proxies would be to create an "enhanced" Craigslist: facebook integration, chat windows, better site search, LSA-based suggestions, etc. This sounds like a great idea until you realize that the Craigslist TOS prohibits this sort of thing -- as do most websites'. Even if the TOS is vague on the point, it's only a matter of time before you get shut down -- no one wants a middleman proxying traffic between their website and their customers.
Remember when an ISP in Texas was inserting ads into people's browsing sessions? They were using reverse HTTP proxies.
I haven't even discussed the security implications of this. It's trivial to set up an HTTPS reverse proxy; CGIProxy even works with an SSL-encrypted GMail session. Think about that: all of your username/password data is being handed to a third-party middleman in the clear.
Even if the middleman running the reverse HTTP proxy is the nicest guy in the whole world, the security of your data now depends on two different companies getting their security protocols right: (1) the company running the HTTP proxy, and (2) the back-end website.
I'm sure the intentions behind Plurchase are excellent. But, this is a _very_ bad idea.
The only way to fix this situation would be to license an "enterprisey" white-label reverse HTTP proxy to e-commerce sites for internal use behind their firewalls. For instance, Plurchase could license a "reverse HTTP proxy social networking appliance" to Zappos as a way to retrofit "social networking" features onto the existing Zappos website. That would address many of these issues, but it's an entirely different business -- and the "web accelerator appliance" model has never worked very well.
Eventually, all features get folded into the underlying platform.