I'm following this story with a lot of interest. I've done (and still do!) a lot of data crawling/scraping. In the past I've worked on so-called "alternative data" collection and analysis for financial forecasting.
Without going into too much detail, a lot of hedge funds have teams constantly searching for kernels of data that can contribute some kind of signal for market movements. This data can come in the form of satellite imagery for oil tankers or manufacturing centers, but it can also come from the very creative use of scraped and aggregated data. It's typically very difficult to identify, collect and analyze on a technical level (as 'chollida1 has lamented in the past: normalization, labeling/bucketing and analysis of disparate data across different formats, sources and processing timeframes is a pernicious problem at this scale). From a compliance standpoint there are also generally strict requirements governing legality of use.
Depending on the specific data, you might be capable of predicting earnings or broader market movements with a <5% margin of error each quarter for years at a time (I've personally seen and worked on projects with <1%, but that's the exception, not the norm). That tactic is usually found at discretionary funds; at quantitative funds the uses are much more abstract and cross-pollinated so as not to target single equities, but rather holistic trends. Regardless, every fund is using data in some way these days; it's just a matter of how sophisticated, creative and abstract they get in their analysis of it.
hiQ Labs doesn't collect data for this specific purpose, but it is absolutely related. In the past I have stayed away from crawling LinkedIn and Yelp precisely because they are very litigious (regardless of the eventual outcome and legality). Now that there's another relatively high profile case out in the open like this, I'm interested in seeing how it proceeds and what the ramifications will be for companies that collect data across a wide range of uses. As Grimmelmann mentioned in the article, this can impact a lot of types of businesses, not just those in the same space as hiQ. Outside of finance I am familiar with many tech companies which (openly or otherwise), kickstarted what are now widely known enterprises through cleverly crawling or scraping massive amounts of data.
Alternative data doesn't even need to be as sexy as satellite photos. Hell, you almost certainly want the data that isn't sexy: the stuff people haven't thought of because it's too boring. Alternative data vendors above all want sales, and even the funds themselves want things to show off to clients. This gives you great opportunities to look at the alternative data they aren't touching.
Given this is predominantly a web development community, it always surprises me how little creativity there is in the articles on investing. Neural networks and machine learning sound cool, but the reality is almost none of the readership would be able to make any money off them.
Simply tracking how many sales or users exist in databases by watching sequential IDs should be the go-to method for any web developer trying to get an edge. I would have expected HN to have articles where people are getting creative on that, i.e. trying to use measures of entropy on usernames to get rough subscriber numbers etc.
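To make the sequential-ID idea concrete, here is a minimal sketch in Python. It assumes you have periodically recorded the newest ID you could observe (an order number, a user ID) along with the time you saw it; the function name `estimate_daily_rate` and the sample numbers are made up for illustration. It also assumes IDs increase by one per record, which is often not true (gaps, sharding, random offsets), so treat the result as a rough upper bound rather than a measurement.

```python
from datetime import datetime


def estimate_daily_rate(observations):
    """Estimate new records per day from (datetime, sequential_id) samples.

    Assumes IDs grow by one per new record; real systems may skip IDs.
    """
    obs = sorted(observations, key=lambda o: o[0])
    (t0, id0), (t1, id1) = obs[0], obs[-1]
    elapsed_days = (t1 - t0).total_seconds() / 86400
    if elapsed_days <= 0:
        raise ValueError("need observations spanning a positive time interval")
    return (id1 - id0) / elapsed_days


# Hypothetical samples: the newest order ID seen on three dates.
samples = [
    (datetime(2017, 8, 1), 100_000),
    (datetime(2017, 8, 8), 103_500),
    (datetime(2017, 8, 15), 107_700),
]

rate = estimate_daily_rate(samples)
print(f"roughly {rate:.0f} new records per day")
```

Comparing the rate across successive windows (week over week, quarter over quarter) is usually more informative than the absolute number, since a constant skip factor in the IDs cancels out.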
Even plain scraping of prices etc. is often full of great insight that is ignored. If a grocery store drops their prices in profitable categories against their competitors, that could be the signal about an incoming price war for an entire sector. There's a lot more information in that than social media feeds and all the other sorts of sexy data that get coverage in the media.
The reason hedge funds look at satellite images and oil tankers is because everyone looks at sequential IDs and price changes, so that doesn't give an edge.
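As a rough illustration of the grocery example, the sketch below compares one store's price against the median of its competitors in each category, across two scraped snapshots, and flags categories where the store got noticeably cheaper relative to the rest. Everything here is an assumption for illustration: the snapshot layout, the store names, the prices, and the 5% threshold.

```python
from statistics import median


def relative_price(prices, store):
    """Store's price divided by the median price of all other stores."""
    others = [p for name, p in prices.items() if name != store]
    return prices[store] / median(others)


def flag_price_cuts(old, new, store, threshold=0.05):
    """Return categories where the store's relative price fell by more than threshold."""
    flagged = []
    for category in old:
        if category not in new:
            continue
        change = relative_price(new[category], store) - relative_price(old[category], store)
        if change < -threshold:
            flagged.append(category)
    return flagged


# Hypothetical snapshots: {category: {store: price}}
last_week = {
    "dairy": {"A": 2.00, "B": 2.00, "C": 2.10},
    "bakery": {"A": 3.00, "B": 3.05, "C": 2.95},
}
this_week = {
    "dairy": {"A": 1.70, "B": 2.00, "C": 2.10},
    "bakery": {"A": 3.00, "B": 3.05, "C": 2.95},
}

print(flag_price_cuts(last_week, this_week, store="A"))
```

Using the competitors' median rather than the raw price means a sector-wide move (everyone raising prices with inflation) does not trigger the flag; only a move against the pack does.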
That's simply not true. Equity analysts can cover anywhere between 5 and 500 stocks; do you really think they have the time or skill set to track all of that? It really is laborious, grueling work.
If you look at the possible returns the equity market is going to make from a stock in a dollar value, and how much research spend is as a percentage of that, you'll quickly see it doesn't pay for much.
You can tell simply by looking at the broker research - that's probably the extent to which analysts take things.
The big stocks obviously have a lot of it happening. (eBay listings, airline pricing etc is obviously touted a lot)
But once you start to go down to the mid caps, you enter a void where there isn't much heavy data-focused research done, and it's very possible you can have a better gauge of the business than any other investor on the planet once you pull this data out.
> Outside of finance I am familiar with many tech companies which (openly or otherwise), kickstarted what are now widely known enterprises through cleverly crawling or scraping massive amounts of data.
Google comes to my mind. I'm only partly joking, as it seems to me that the line between search engine web crawling and other forms of web scraping is very thin.
It's more than thin: just scrape a few more sites, make a public search interface and say you are working on a search engine. In the meantime, you might analyze the data for other (not so) related purposes.
> Inside, are certain sites more worthwhile - and which ones (eg reddit, eBay, trade union websites, whatever)
Yes, absolutely. For many purposes websites that sell their own data are less useful (less signal exclusivity). Specific sources of data will be much more valuable depending on what the data is about.
> How about scraper brokers? Do they exist?
Yes. You're not getting access without an NDA in addition to paying quite a lot.
> Are there scammy scrapers? Make up BS and sell as scraped data?
That depends on how easy it is to verify the data. For most of what you'd term "alternative data" you'll know if it's real in 2 - 12 weeks, and it's not sustainable to sell crap.
But a lot of parties scrape dodgy financial timeseries data (ticks and quotes on equities or options) and sell it, priced as though it were tick data when it's barely accurate OHLC. They mostly sell this sort of data to amateurs who don't realize tick data is expensive for a reason.
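For anyone unsure of the difference, here is a minimal sketch of how tick data collapses into OHLC (open/high/low/close) bars. The function name, the `(epoch_seconds, price)` tick layout and the sample prices are assumptions for illustration, not any vendor's format. The point is that the aggregation is one-way: several ticks go in, four numbers per bar come out, and the sequence and timing of trades inside the bar is gone.

```python
def ticks_to_ohlc(ticks, bar_seconds=60):
    """Aggregate (epoch_seconds, price) ticks into OHLC bars keyed by bar start time."""
    bars = {}
    # Sort by timestamp only; the sort is stable, so same-second ticks keep their order.
    for ts, price in sorted(ticks, key=lambda t: t[0]):
        start = ts - ts % bar_seconds
        bar = bars.get(start)
        if bar is None:
            bars[start] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
    return bars


# Hypothetical ticks: six trades across two one-minute bars.
ticks = [(0, 10.0), (15, 10.4), (30, 9.8), (59, 10.1), (60, 10.2), (90, 10.0)]

for start, bar in ticks_to_ohlc(ticks).items():
    print(start, bar)
```

You can always build bars like these from real ticks, but you cannot go the other way, which is why something reconstructed from scraped bar-like quotes is not a substitute for tick data.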
> How big is this?
Very big. Most hedge funds ingest a lot of data whether they curate it internally or source it from elsewhere.
"satellite imagery for oil tankers" - Interesting.
If hedge funds were floating drones above oil tankers, I'd guess they'd be accused of corporate espionage / spying / invasion of privacy?
Ok, so oil tankers are big and "in the clear". What if $TANKERCORP floats big parachute balloons above its tankers to imply "looking past these is unauthorized viewing"?
Then if a HedgeFund gets a clever angle on a satellite photo.. is that the equivalent of breaking a lock, or violating CFAA?
Satellite imagery like this is legal to within a few feet of practical resolution, pretty much anywhere. The effective countermeasure is hiding things from a satellite, not attempting to sue satellite operators for flying overhead. I'm not aware of a specific law against using anything that is literally viewable from the sky, at least in the United States (someone can correct me if I'm wrong, but last I checked Google Maps blurs out some locations or keeps them outdated because the government requests it, not because of a formal law forcing them to do so).
There are two other notes in response to your question:
1. Drones are different from satellites, and are more susceptible to regulation in the way you're positing because they can be prevented from flying above specific areas. However, most of the same problems with countering them apply, because drones can record better three-dimensional footage. In your specific example, if a tanker disguised itself overhead, it would still be legal to have a drone monitor the tanker from the sides, as long as doing so didn't break any law set by the FAA.
2. Drones are actively used these days for things like monitoring production facilities ("how many cars come out of this factory" for an oversimplified version). If they have to monitor from a distance, so be it, they'll do it. The effective countermeasure here is to have a huge amount of land that can't provide any intelligence, because the drones aren't allowed to fly over it and can't see far enough into the facility.
There's definitely a productive ethics discussion that can be had here, but the legal precedents don't really allow for combatting these techniques right now. If it's public, it can be collected, ingested and used in an algorithm to determine alpha.
> someone can correct me if I'm wrong, but last I checked Google Maps blurs out some locations or keeps them outdated because the government requests it, not because of a formal law forcing them to do so
In the Eastern European country I'm from (a NATO member) the Google StreetView car even got to photograph and publicly put on the Internet the outside of military and air bases with clear signs of "do not take photographs" visible on StreetView itself. It's funny, my company also used to work in this space (local business directory with business addresses, photos of said businesses etc.) and one of my former colleagues got detained for a day because of taking photos of businesses in the downtown area of one of the biggest cities in my country. He hadn't seen that there was a military "objective" in his line of view (probably some military HQ or something, not a proper military base with tanks and trucks). Talk about the advantages of being an internet giant like Google..
Hedge fund researchers have also chartered private airplanes to fly over oil storage facilities and use infrared cameras to check tank levels. In the USA at least this is completely legal as long as they observe regular FAA flight rules. For the oil market as a whole this is a good thing since it helps price discovery.
It's not practical or safe to put a huge parachute or balloon over a tanker to block overhead imagery. Any sailor can tell you it just wouldn't work.
Well, it's pretty simple to legally dodge these kinds of threats.
If you scrape regularly, then pick up a dozen or more machines around the world, in areas less than friendly to US law. Pay with a rechargeable credit card or bitcoin. Then buy servers and set up a Hadoop cluster that handles scan jobs.
The worst case scenario is that LinkedIn, Yelp, and others get some of your servers shut down. Wash, rinse, repeat.
EDIT: please note, this was only a thought-game to bypass rude and destructive laws like the CFAA, which weaponizes ToSes, EULAs, and other implicit contracts of adhesion (as in, you have to agree to see). Ideally, we would be better off if these laws had a sane scope, and for companies to not expect things to happen with content in public.
You're too technical, feels like.
Usually it's done in an even simpler way: you have a 3rd-party provider from one of these countries (or one who aggregates data for you), who does all the scraping and data cleaning for you for a reasonable price.