Most reverse proxies do the easy stuff, but all the hard stuff is in the client-side javascript/flash code. It'll probably take you a few hours to hack up one of the existing proxies out there and get "almost" what Plurchase has, but you'll soon see that menus won't render on Amazon and links just won't work, because the menu system is constructed completely dynamically -- most proxies won't take you that far. We have an html rewriter written in javascript on the client side to solve some of those issues, and a javascript rewriter, itself written in javascript, to proxify the javascript code. If you right-click and view source on one of the proxified frames, you'll see that we go to great lengths to make things work under the covers.
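To make the point concrete, here's roughly what the static half of that rewriting looks like (an illustrative sketch, not our actual code -- `rewriteLinks` and the `/proxy?url=` route are made up). It only catches URLs that are literally present in the markup, which is exactly why this approach falls over on dynamically constructed menus:

```javascript
// Toy sketch: rewrite absolute links in fetched HTML so they route back
// through the proxy. Real rewriters handle far more: relative URLs,
// <base> tags, srcset, inline event handlers, CSS url(...) references,
// and -- the hard part -- javascript that builds URLs at runtime,
// which a static regex pass can never see.
function rewriteLinks(html, proxyBase) {
  // Match href/src attributes pointing at absolute http(s) URLs.
  return html.replace(
    /\b(href|src)\s*=\s*"(https?:\/\/[^"]*)"/gi,
    (match, attr, url) => `${attr}="${proxyBase}${encodeURIComponent(url)}"`
  );
}
```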
Also, you'll notice that you can't simply ignore all images and static assets: some image requests actually modify cookie information, so you may need a different level of proxying there -- same goes for css/javascript assets. You'd be surprised what websites do under the covers. Even with all of our technology, there are still some sites that don't function properly and require tweaking in our proxy engine. I tested kayak.com yesterday on our proxy and it almost works, but there are still some weird issues going on that I'll have to debug further. I covered most of these details in a blog post -- had this been the web 1.0 days, proxying might actually have been easy, but then we also wouldn't have had the speed of modern javascript interpreters to do the client-side rewriting.
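The per-response decision ends up looking roughly like this (an illustrative sketch; the function name and header shape are made up): a static asset can be served directly unless the origin uses it for cookie side effects, in which case it has to stay on the proxied path so the proxy can see and rewrite the Set-Cookie headers.

```javascript
// Decide whether an asset response must flow through the proxy.
// `headers` is a plain object of response headers, names lowercased.
function mustStayProxied(contentType, headers) {
  const isStaticAsset =
    /^(image\/|text\/css|application\/javascript)/.test(contentType);
  const setsCookies = 'set-cookie' in headers;
  // Non-static content is always proxied; static assets only when they
  // carry cookie side effects the proxy needs to intercept.
  return !isStaticAsset || setsCookies;
}
```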
I agree with you on some of the ToS issues; we'd potentially have to work those out with merchants. And since we're white-labeling based on domains, we can remove a site if they don't want us there. But proxying shopping sites is unlike proxying other sites -- merchants want customers to buy stuff, and anything that helps a customer convert is in their best interest. Large merchants were actually interested in this, but they didn't want partnerships until they saw traction.
It also handles cookies and can dynamically determine whether or not it should proxify images and CSS assets.
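For a sense of what "handles cookies" involves: the browser only ever talks to the proxy's domain, so origin cookies have to be re-scoped to the proxy, commonly by folding the origin into the cookie name for later demultiplexing. A toy sketch of that approach (illustrative only; CGIProxy's actual encoding scheme differs in the details):

```javascript
// Rewrite one Set-Cookie header so the cookie sticks to the proxy's
// domain. Origin-scoping attributes (Domain, Path) are dropped, other
// attributes (Expires, Secure, ...) are kept, and the origin host is
// encoded into the cookie name so the proxy can route it back later.
function rewriteSetCookie(setCookie, originHost) {
  const [pair, ...attrs] = setCookie.split(';').map(s => s.trim());
  const eq = pair.indexOf('=');
  const name = pair.slice(0, eq);
  const value = pair.slice(eq + 1); // may itself contain '='
  const kept = attrs.filter(a => !/^(domain|path)=/i.test(a));
  const encodedName = `${encodeURIComponent(originHost)}--${name}`;
  return [`${encodedName}=${value}`, ...kept].join('; ');
}
```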
CGIProxy also handles flash.
In fact, CGIProxy can fully proxify an SSL-encrypted GMail session out-of-the-box.
CGIProxy is _very_ mature technology. Take a look at the source code -- it handles hundreds of special cases.
Some of the special cases that CGIProxy handles are truly bizarre. Reading through the CGIProxy source code is actually pretty fascinating.
This brings up an interesting point. If you're interviewing a software engineering candidate, ask them what their favorite language is. Next, ask them to write a simple HTTP server in that language. Finally, ask them to write a reverse HTTP proxy that works with the server they just wrote. This is a surprisingly effective way of separating the wheat from the chaff. Most people with CS degrees are completely incapable of doing this.
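For anyone who wants to try the exercise, the usual first stumbling block is parsing the raw request off the socket. A minimal sketch (javascript here for brevity; no chunked bodies, no header continuation lines -- just enough to drive a toy server or proxy):

```javascript
// Parse a raw HTTP/1.x request head into method, path, version,
// and a lowercased header map.
function parseRequest(raw) {
  const [head] = raw.split('\r\n\r\n');           // drop any body
  const [requestLine, ...headerLines] = head.split('\r\n');
  const [method, path, version] = requestLine.split(' ');
  const headers = {};
  for (const line of headerLines) {
    const i = line.indexOf(':');
    headers[line.slice(0, i).toLowerCase()] = line.slice(i + 1).trim();
  }
  return { method, path, version, headers };
}
```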
I downloaded it to take a look at the source. They do things that are similar to what we do, actually, and they have lots of great special-case handling that we might use for reference later. One crazy thing we do that isn't done there is the storage of cookies: we temporarily store cookies on the server side so that we can do fancy things with them -- sending around the shopping cart, transferring shopping state, etc.
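The server-side jar idea is simple to sketch (a toy in-memory version, names made up; a real deployment needs persistence, expiry, and access control). Because the cookies live on the server rather than in one browser, a cart accumulated under one session can be handed to another:

```javascript
// Server-side cookie jar keyed by (session, origin host).
class CookieJar {
  constructor() { this.store = new Map(); }
  key(session, host) { return `${session}|${host}`; }
  set(session, host, name, value) {
    const k = this.key(session, host);
    if (!this.store.has(k)) this.store.set(k, new Map());
    this.store.get(k).set(name, value);
  }
  // Build the Cookie header the proxy sends upstream for this session.
  headerFor(session, host) {
    const cookies = this.store.get(this.key(session, host));
    if (!cookies) return '';
    return [...cookies].map(([n, v]) => `${n}=${v}`).join('; ');
  }
  // Copy one session's cookies for a host to another session --
  // the "send around the shopping cart" trick.
  shareWith(fromSession, toSession, host) {
    const src = this.store.get(this.key(fromSession, host));
    if (src) this.store.set(this.key(toSession, host), new Map(src));
  }
}
```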
However, if you've tried using it, you'll notice that it isn't perfect either... they show slashdot.org as an example, but lots of things are broken on it, and most of the javascript stuff doesn't work. Slashdot was a javascript-light site a year ago, but now it uses tons of fancy javascript like most sites. Their cookie support also seems to be broken -- not sure why. That said, James did quite a bit of amazing work in just 12k lines of perl, and the code is pretty decent.
It's easy to get a proxy that works on 95% of the code; all the hard stuff is in that last 5% -- that's what we spent most of our time on :-)
I evaluated CGIProxy at the beginning of this project, along with a variety of apache mods and other libraries. CGIProxy didn't work for most of the merchant sites that I checked. It was an oversight to ignore it thereafter, though. Like you said, they've put a lot of time into special cases, and we can learn things from them.
Their primary goal is anonymous browsing. Ours is collaborative shopping, or more specifically, adding new functionality to existing sites. Our proxy lets us do amazing things with client state, image scraping, and more. A different tool for a different problem.
Well it doesn't seem to work all the way. I'm not sure what you meant when you said it has "_full_ javascript support" and is "very mature technology". I tried it out with amazon.com and none of the menus seem to be working...
I wonder how many of your resources will be tied up long-term trying to keep up with the ever-changing frontend code of the different shops and sites. It's something that keeps most of the teams doing javascript-injecting browser plugins really busy.
Once we get our Selenium farm going, hopefully not much. It takes about 10-15 minutes to write a script for a specific site; we just need tests that validate our scripts against each site every day.
It's not too hard to do something similar in perl.
1) start off by downloading a copy of CGIProxy: http://www.jmarshall.com/tools/cgiproxy/
2) Set up an Apache server running CGIProxy
3) Hack the interface box at the top of the screen to include facebook integration, a meebo chat box, etc.
The problem is that this sort of thing tends to _really_ upset webmasters. So you'll need to restrict proxy sessions to specific domains (zappos.com, etc.) whose owners have approved your running a proxy against their sites.
It's surprising how far you can take this with one server; what you want to do is to port CGIProxy so that it runs on a custom HTTP epoll server, instead of on Apache. Brad Fitzpatrick has some great perl code on CPAN that shows you how to do this.
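Whichever event loop you end up on, the per-request core doesn't change. As an illustration of the plumbing involved (a sketch in javascript, not CGIProxy's actual perl; names are made up): hop-by-hop headers have to be stripped before the request is replayed upstream, and Host has to be rewritten to point at the origin rather than the proxy.

```javascript
// Headers that are per-connection (RFC 2616 sec. 13.5.1) and must not
// be forwarded end-to-end by a proxy.
const HOP_BY_HOP = new Set([
  'connection', 'keep-alive', 'proxy-authenticate',
  'proxy-authorization', 'te', 'trailers',
  'transfer-encoding', 'upgrade',
]);

// Build the header map for the upstream request from the client's headers.
function upstreamHeaders(clientHeaders, originHost) {
  const out = {};
  for (const [name, value] of Object.entries(clientHeaders)) {
    if (!HOP_BY_HOP.has(name.toLowerCase())) {
      out[name.toLowerCase()] = value;
    }
  }
  out.host = originHost; // the browser sent the proxy's host, not the origin's
  return out;
}
```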
Once you have CGIProxy running on epoll (with epoll enabled in the kernel config, of course) you can easily handle thousands of concurrent clients on a single Amazon EC2 (small) instance.
You can disable proxification for images and media to enhance throughput (and to cut down on your bandwidth usage).
This is basically all that the Facebook platform is -- a glorified reverse HTTP proxy.
Reverse HTTP proxies are a neat hack. But, the excitement tends to wear off once you realize that's all they really are. The underlying websites against which you are proxifying retain all of the true value. Plus, you are beholden to their TOS (Terms of Service).
Some of the earliest applications of reverse HTTP proxies (back in the 90's) were for anonymizing web sessions. That tends not to work too well, because no one wants to assume all of that liability.
An obvious application of reverse HTTP proxies would be to create an "enhanced" Craigslist: facebook integration, chat windows, better site search, LSA-based suggestions, etc. This sounds like a great idea until you realize that the Craigslist TOS prohibits this sort of thing -- as do most websites'. Even if the TOS is vague on the point, it's only a matter of time before you get shut down -- no one wants a middleman proxying traffic between their website and their customers.
Remember when an ISP in Texas was inserting ads into people's browsing sessions? They were using reverse HTTP proxies.
I haven't even discussed the security implications of this. It's trivial to set up an HTTPS reverse proxy; CGIProxy even works with an SSL-encrypted GMail session. Think about that: all of your username/password data is being handed to a third-party middleman in the clear.
Even if the middleman running the reverse HTTP proxy is the nicest guy in the whole world, the security of your data now depends on two different companies getting their security protocols right: (1) the company running the HTTP proxy, and (2) the back-end website.
I'm sure the intentions behind Plurchase are excellent. But, this is a _very_ bad idea.
The only way to fix this situation would be to license an "enterprisey" white-label reverse HTTP proxy to e-commerce sites for internal use behind their firewalls. For instance, Plurchase could license a "reverse HTTP proxy social networking appliance" to Zappos as a way to retrofit "social networking" features onto the existing Zappos website. That would address many of these issues, but it's an entirely different business -- and the "web accelerator appliance" model has never worked very well.
Eventually, all features get folded into the underlying platform.