The Data Science loop: the day-to-day work of a data scientist (seanjtaylor.com)
72 points by dude_abides on Sept 25, 2012 | hide | past | favorite | 27 comments


The essence of what is a Data Scientist boils down to this awesome tweet by Josh Wills (@cloudera): https://twitter.com/josh_wills/status/198093512149958656

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.


Echoes of A.J. Liebling:

"I can write better than anybody who can write faster, and I can write faster than anybody who can write better."

----

Leading to a generalized strategy for success:

1) Identify two orthogonal metrics by which your work will be judged by relevant stakeholders.

2) Work / practice until you are better than anyone by at least one of those metrics (in practice, better than the large majority of your competitors).

The degenerate case involves being the best in the world by at least one of the metrics; more achievable is to be somewhere in the middle for each. To use our own patio11 as an example: be better at coding than anyone who is better at SEO, and be better at SEO than anyone who is a better coder.


Ooh, I like that even more than I like my usual explanation: the "Venn diagram game." (One circle, gestured with fingers, represents all X; one represents all Y; take the intersection, and there's only like four people in it, three of whom can't be hired for love or money.)

I don't typically sell engagements primarily for SEO, by the way. (I also don't think that it is the case that I would not be strictly dominated by someone, somewhere on those two axes. If you phrase it like "Among the universe of people a potential client could actually hire" then my confidence that there exists a better option than me falls rapidly for each constraint you impose on the hiring pool.)


To be slightly more precise, if your two metrics are A and B, it's

2) Out of the set of people who are at least as good at A, you are better at B; and/or of the set of people who are at least as good at B, you're better at A.

or

2) Nobody in the world is better than you at both A and B simultaneously.


Only n people are the best at something, but n^2 people are the best at pairs of things.


Actually, an infinite number of people can be Pareto optimal at just two things.
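The two-metric criterion above ("nobody is better than you at both A and B simultaneously") is exactly Pareto optimality, and a quick sketch shows how several people can sit on the frontier at once. The names and (coding, SEO) scores below are made up for illustration:

```python
# Pareto optimality on two skill metrics: a person is on the frontier
# if nobody else is at least as good on both axes and strictly better
# on at least one.

def pareto_frontier(people):
    """people: dict of name -> (score_a, score_b). Returns frontier names."""
    frontier = []
    for name, (a, b) in people.items():
        dominated = any(
            oa >= a and ob >= b and (oa > a or ob > b)
            for other, (oa, ob) in people.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Hypothetical (coding, SEO) scores: three people trade off against
# each other, and one is strictly dominated.
people = {
    "alice": (9, 2),   # best coder
    "bob":   (6, 6),   # middle on both
    "carol": (2, 9),   # best at SEO
    "dave":  (5, 5),   # dominated by bob
}

print(pareto_frontier(people))  # → ['alice', 'bob', 'carol']
```

Note that dave drops out only because bob beats him on both axes; anyone with a genuine trade-off survives, which is why the frontier can hold arbitrarily many people.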


I can code better than anybody who understands the domain, knows statistics, is in the right location, available, works well with both business people and technical people, likes AI, and is 5'7'' tall.

(But seriously: I do sell myself as the technical guy who understands business, and it works wonders).


This is a very apt way to describe the job...but is "data scientist" really the right term? Because there are many, many scientists who are terrible at software engineering, despite being great at research and analysis.

Wouldn't we call what @cloudera refers to "data engineers"?


Any software engineer? Any statistician?

Data scientists are to business analysts as Sanitation Engineers are to janitors. And that's all. Everything they think they're discovering, guys already did with supermarket loyalty programmes, frequent flyer programmes, etc 20 or 30 years ago.

Read and learn: http://en.wikipedia.org/wiki/AAdvantage#History


If you want to make an impact as a data scientist, it helps to prototype your ideas as a testable product. Many product managers and engineers ignore their data scientists' work because fighting fires almost always takes precedence. The recent HBR article (posted here) showed how the data scientist at LinkedIn was able to make a huge impact by running his own experiments against live users. At my company, we always provide a non-optimized prototype that can be run directly with our existing product. If the idea is good, we can optimize the code later.


I would be interested to see more "low level" posts about how people approach their data analysis workflow. I do data analysis for a living and find that an efficient workflow has just as much impact on your productivity and results as the choice of technologies you use.


That's a good idea; I will try to write up such a "low level" post this week. As a preview, the things that will feature prominently in the post are RStudio, custom functions/aggregates in Postgres PL/R, and Sweave. (Yes, I am an R fanboy.)


Would definitely like to read it!

I personally find the need to go back and forth quite a bit between different tools when working even on a single dataset. Say, I'll do some work with the text part of the dataset in Python, then use Matlab for some basic quantitative analysis, then use R for some advanced statistics if I have to, port those results back to Matlab/Python for speed, then back to R to Sweave, maybe some spreadsheet stuff too, etc. So for me, every project needs its dataset in an easily accessible format (CSV/someSQL/etc.) and scripts in every language that I use that communicate with the data source and get "up to speed" right away.

It's little tips like that I think people should be sharing more.
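A minimal version of the "get up to speed right away" idea from the comment above, sketched in Python with only the standard library. The file layout and column names here are invented for illustration; the point is that one shared, plain-format dataset plus a tiny per-language bootstrap helper lets every analysis script start from the same data:

```python
# One shared CSV plus a per-language "bootstrap" loader: each tool
# (Python, R, Matlab, ...) gets its own version of this helper.
# The columns ("user", "score") are hypothetical.
import csv
import io

def load_dataset(fileobj):
    """Read the shared CSV into a list of dicts, coercing numeric fields."""
    rows = []
    for row in csv.DictReader(fileobj):
        rows.append({k: float(v) if v.replace(".", "", 1).isdigit() else v
                     for k, v in row.items()})
    return rows

# Stand-in for the shared data file every tool reads.
sample = io.StringIO("user,score\nalice,3.5\nbob,2.0\n")
data = load_dataset(sample)
print(data[0])  # → {'user': 'alice', 'score': 3.5}
```

The same few lines, ported to each language in the toolchain, are what keep the back-and-forth between tools cheap.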


RStudio for life. I would like to hear about your experience with Sweave.


http://www.r-bloggers.com/

make sure you look at some older posts


Double your salary: call what you do "science", not "analysis".


Maybe, but double it again by calling it "analytics".


Can someone explain the affront to your sensibilities that is caused when one of our own (technologists) finds new ways to make money, rebrand and evolve a given subject area or just generally find new, better ways of explaining a thing?

...I'm baffled at the sarc and outrage every time something like this comes up.

Data analyst is a massively broad term that could be anyone from a junior ad-hoc SQL script writer to a guy who sits on a DWH generating leads using complex models and algorithms.

There is a need to distinguish these things from one another, and the new definitions and tags are trying to satisfy that need... what's your boggle?


No outrage, sarc, or boggle in my comment.

New names are all good if they name new concepts. But new names for old things just sound like buzzwords.


It would be great to hear more about the communication section.

For instance, is it better to offer an ambitious summary first, followed by a deep exploration of the data and methods to show how the conclusion was reached?

I am loath to put something out there that might overstate a point and be run with without a sincere understanding, but I also fear that if the details are provided first without a thorough explanation of the nuances, there is a danger they will be used carelessly.

I suppose my question is, have others found it preferable to start with a headline and then support that headline with caveats, or to create a detective story and walk people through arriving at a conclusion?

It's a little like the conundrum of TED Talks (or much of science news reporting). A very highly produced talk will establish an argument for an important adjustment of conventional wisdom in some field, only to create new problems because to people outside the field it sounds like the final word.


Make the strongest claim you can support.


To be honest, I think that it is reckless to disavow any responsibility for what consumers of information you produce will do with it in your own organization.

Going back to TED Talks, ridiculous analysis and comments citing a TED talk are often based on a talk that itself was not ridiculous.

I asked about elaboration on that area of responsibility because I'd assume that he had a relatively small audience of highly capable people, yet even so had more intimate contact with the specifics and had to struggle with just how to present that more full knowledge in a concise but still responsible way.


This post completely neglects the part where data scientists derive statistical models from large sets of data in order to use them for classification, clustering, or prediction purposes. Describing data science as advanced applied accounting is only true if you have to 1) analyze big datasets or 2) find hidden connections between data, a.k.a. data mining.
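To make the clustering side of that concrete, here is a toy sketch of k-means (Lloyd's algorithm) in pure Python. The data points and the choice of two clusters are made up for illustration; a real pipeline would use a library implementation on far larger data:

```python
# Toy k-means: repeatedly assign each point to its nearest center,
# then move each center to the mean of its assigned points.

def kmeans(points, centers, iters=10):
    """points: list of (x, y); centers: initial guesses, one per cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                        (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

# Two obvious blobs; seed one center in each.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(kmeans(points, [(0, 0), (10, 10)]))
```

The centers converge to the two blob means, roughly (0.33, 0.33) and (10.33, 10.33); the same assign-then-average loop underlies the much larger-scale clustering the comment is talking about.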


Thanks for the post, I am starting a role as a data analyst soon and am curious to hear more about two aspects of your role:

1. What are the dynamics between the engineering team, product managers, and data scientists? I understand that LinkedIn's data team plays a huge role in creating and improving features (like People You May Know); what is it like at Facebook, and at other companies in general?

Engineers build things, managers make decisions, data scientists answer questions.

2. What are the tools you use (I see R mentioned in the thread), and what books would you recommend? I mostly write in Python, and use D3 and Gephi for visualization. I am also taking the Coursera course on Social Network Analysis (https://class.coursera.org/sna-2012-001/class/index) and reading the course book, Easley & Kleinberg's Networks, Crowds, and Markets. Thanks for sharing again.


Sorry if I didn't make this clear earlier: I just posted this link to HN, I didn't write this blog post. (Incidentally I'm also a data scientist.)


intersections are nonsexy.

ask the hawkers who ply their wares at intersections. not only do they have to deal with traffic from the north, there is a constant barrage of traffic from the south, not to mention the incessant traffic from the east, and hey, how can we forget the speeding traffic from the west...

but the same intersections are also frequented by pedestrians from the north, and the west, the south and the east, so the hawkers' trade is lucrative.

data science is at the intersection of linear algebra, machine learning, statistics and distributed computing.

if you ask the hawkers sitting in a nicely furnished airconditioned shop at the mall, they will tell you that the hawkers at intersections aren't real hawkers, they are just fly-by-night hustlers selling a stalk of unkempt roses that will wither away, selling evening tabloid newspapers that'll be useless to read tomorrow, unhygienic ice-cream cones and candy, unhealthy street food, braids for the hair that'll snap if you tug at them, fake watches, imitation handbags, ... so if you want to be a real hawker selling real healthy food, you must open a real restaurant in a mall. you wanna sell genuine cartier watches, get a license and open a premium retail outlet in a mall. you want to hawk useful literature, open a barnes and noble bookstore in a mall...

so also the genuine statisticians will mock the data scientists as fake...oh these guys don't grok industrial strength SAS and S-PLUS, they fool around with unproven toys like R.

the genuine linear algebraists are too busy submitting academic papers to the MAA to worry about trivialities like data science.

the genuine distributed computing programmers know that data scientists operate with a very tiny subset of distributed computing - usually just hadoop or bigtable, and even that they dodge with syntactic sugar like pig, cascading and scalding. they are not even real programmers - they don't refactor their code, some of them just write ad-hoc scripts that they don't even check in to the repository, they don't do agile, hell they don't care about readability of code - they call their chebyshev decompositions "def cbd()" instead of "public static void chebyshevDecomposition( DenseDouble2DMatrix inputMatrix)", how can you trust these jokers...

the genuine ML guys work on world-changing technology like genome sequencing and autonomous vehicles, natural language processing and credit card fraud detection, not fluff like mining information out of tweets and facebook likes and linkedin profiles and foursquare check-ins.

but you see, the same intersections are also frequented by pedestrians from the north, and the west, the south and the east, so the data scientists can put food on the table and make rent :)


How do you become a self-made data scientist? What are the appropriate books, courses, and majors to study?



