
They may consider their methods too proprietary to supply details, so I'll speculate:

Information on earlier-in-day flights alone would probably be enough (mixed with historical data) to be fairly accurate about later-in-day flight delays -- even those that aren't continuations of the same plane -- because of giant overlaps in delay-causing conditions (weather, airport/mechanical mishaps, crew issues, etc.).
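To make the "earlier-in-day plus historical" idea concrete, here is a minimal sketch, not anyone's actual method. It blends a historical mean delay with the delays seen so far today. The function name, the `prior_weight` knob, and the numbers are all made up for illustration.

```python
def blended_delay_estimate(historical_mean, delays_today, prior_weight=5.0):
    """Shrink today's observed mean delay toward the historical mean.

    prior_weight behaves like a count of pseudo-observations sitting at the
    historical mean. It is a hypothetical tuning knob, not a known value.
    """
    n = len(delays_today)
    return (prior_weight * historical_mean + sum(delays_today)) / (prior_weight + n)


# Toy numbers (minutes): history says 10, but this morning's flights ran 40-60 late.
estimate = blended_delay_estimate(10.0, [40, 50, 60], prior_weight=3.0)
print(estimate)
```

With no flights observed yet, the estimate falls back to the historical mean. Each delayed morning flight pulls it upward.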

Weather forecasts might give another advantage, especially in predicting 'seed' delays that then hint at later delays.

If there are any other semi-public feeds related to FAA reporting or air-traffic control -- even if mostly meant for other pilots or General Aviation -- those would be incredibly valuable.

If those regional and national maps of planes-in-flight also contain sufficient positioning detail to notice when planes are spending a little extra time on runways, or waiting for/at gates, etc., that would be another useful early signal for predictions.



I don't think it would be harmful for them to give a clue about what algorithms they use. There is so much tuning required to get these algorithms to perform the way you'd expect that I think they could still keep their IP locked up even if they gave a general hint.

With that said, I'll speculate as well: Perhaps you'd need some type of ideal dataset, one that included departure/arrival times, distances, and weather conditions for flights that came in on time as expected. Then you might start introducing some noisy data, i.e., flights that made the same trip under the same weather conditions but came in late or early. Then you'd add in the effect of weather conditions and see how flights fared. I'd speculate that you could get away with doing some type of regression analysis, but maybe you'd need to resort to a more complex algorithm for classifying ("On-time", "Early", "Late") based on a series of features ("Distance", "Weather", "Mechanical", "Time", etc.). An SVM could pull this off, or perhaps even a naive Bayes classifier. For research purposes I'd probably check out an RVM (relevance vector machine) because it might need less information to classify. Not sure if it would be realistic to use it though... this problem is in need of a highly scalable solution.
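As a rough illustration of the naive Bayes option, here is a small categorical classifier with add-one smoothing, written in plain Python. The feature names, the toy rows, and the function names are invented for the example. It says nothing about what Flightcaster actually runs.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count label frequencies and per-label feature-value frequencies."""
    label_counts = Counter(labels)
    value_counts = defaultdict(lambda: defaultdict(Counter))  # label -> feature -> value -> count
    feature_values = defaultdict(set)  # feature -> distinct values seen in training
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[label][feature][value] += 1
            feature_values[feature].add(value)
    return label_counts, value_counts, feature_values


def predict(model, row):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for feature, value in row.items():
            seen = value_counts[label][feature][value]
            n_values = len(feature_values[feature])
            score += math.log((seen + 1) / (count + n_values))  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, purely illustrative.
rows = [
    {"weather": "clear", "time": "morning"},
    {"weather": "clear", "time": "evening"},
    {"weather": "storm", "time": "evening"},
    {"weather": "storm", "time": "morning"},
    {"weather": "clear", "time": "morning"},
]
labels = ["On-time", "On-time", "Late", "Late", "Early"]

model = train_naive_bayes(rows, labels)
print(predict(model, {"weather": "storm", "time": "evening"}))
```

Training is just counting, so it scales easily, which matters given the scalability worry above. The cost is the independence assumption between features, which weather and time of day probably violate.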


You are right that Flightcaster doesn't want to tell everyone the recipe for our special sauce, but we will say that the kinds of features that you mention are the kinds of features and sources that we are looking at.

It is all based on captured real-time data, so we are limited by what we can get access to in real time. You are correct that some is public and some is semi-public. It is not the most efficient space, so there is a lot of data that we will need to screen scrape and such.

A lot of the problem is just obtaining and pre-processing all the data from heterogeneous sources, and performing distributed joins to get it into the proper view for analysis.
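For readers wondering what such a join looks like at toy scale, here is a single-machine sketch that attaches weather observations to flight records by (airport, hour). The field names and records are hypothetical. A real pipeline would run this as a distributed join, which this example does not attempt.

```python
def join_flights_with_weather(flights, weather):
    """Hash join: index weather by (airport, hour), then look up each flight."""
    weather_by_key = {(w["airport"], w["hour"]): w for w in weather}
    joined = []
    for flight in flights:
        obs = weather_by_key.get((flight["origin"], flight["dep_hour"]))
        # Keep flights with no matching observation (a left join), marking the gap.
        joined.append({**flight, "conditions": obs["conditions"] if obs else None})
    return joined


# Made-up records standing in for two differently shaped sources.
flights = [
    {"flight": "XX100", "origin": "SFO", "dep_hour": 9},
    {"flight": "XX200", "origin": "ORD", "dep_hour": 14},
]
weather = [
    {"airport": "SFO", "hour": 9, "conditions": "fog"},
]

for record in join_flights_with_weather(flights, weather):
    print(record)
```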



