Hacker News

I couldn't agree more with this. The ML courses that Ng and Koller teach are really missing a lot of the statistical tools you need to do real-world data mining and ML.

My experience: I had basically zero math background, but during my master's degree I took ML with Ng and probabilistic graphical models with Koller, and later TA'd Ng's ML class, so I thought I was all set to go into machine learning jobs. To my surprise, I was consistently stumped in interviews by basic stats questions, particularly significance testing, which people with more traditional stats backgrounds assume is common knowledge (and it should be), but which wasn't taught in any of my ML classes.
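For anyone else who skipped the stats courses: the kind of significance testing interviewers expect can be sketched in a few lines of plain Python. This is just one illustrative approach, a two-sample permutation test on made-up data, not the only (or necessarily the best) test for any given problem:

```python
import random
import statistics

def permutation_test(a, b, n_iter=10_000, seed=0):
    """p-value for the observed difference in means under the
    null hypothesis that the two samples come from one population."""
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        # Re-split the shuffled pool and measure the same statistic.
        diff = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_iter

# Two clearly separated (toy) samples should yield a small p-value.
p = permutation_test([2.1, 2.4, 2.3, 2.5], [1.1, 1.0, 1.3, 1.2])
```

The nice thing about permutation tests is that they need no distributional assumptions, which is exactly the kind of reasoning a stats-heavy interview probes.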

I'm in a job now that involves some machine learning, but the ML component is 50% marshalling data (formatting, cleaning, moving), 40% trying to figure out how to get enough validated training examples, and 10% thinking about the right classifier to use (which someone else already implemented). Which to be honest is not very interesting.

So yeah, becoming a real data scientist is hard, requires a lot more knowledge than you get in one ML course, even from Andrew Ng, and the reality of the work often doesn't make it some dream career. And the competition for jobs isn't from other people who also just took that course -- it's from PhD statisticians and statistical physicists who might have taken one ML class to show them how to use all the mathematical tools they already have to do the new hot thing called machine learning.



My day job is as a data scientist, and most of the applied ML I perform is simply plugging data into some off-the-shelf optimizer/detector, searching for the best algorithm, and running train/test split loops. Most of the work is herding the cats in the business units and justifying my salary to the Board of Directors.
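To be concrete about how mechanical that loop is, here's roughly what it boils down to, a minimal pure-Python sketch with a toy dataset and a trivial majority-class baseline standing in for whatever classifier someone else already implemented:

```python
import random

def train_test_split(rows, test_frac=0.2, seed=0):
    """Shuffle the labeled rows and cut them into train/test partitions."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

def majority_class(train):
    """Baseline 'classifier': always predict the most common training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def accuracy(test, predicted_label):
    """Fraction of test rows whose label matches the constant prediction."""
    return sum(1 for _, y in test if y == predicted_label) / len(test)

# Toy data: feature x, label derived from a simple threshold.
data = [(x, "pos" if x > 5 else "neg") for x in range(10)]
train, test = train_test_split(data)
pred = majority_class(train)
acc = accuracy(test, pred)
```

In practice you'd swap the baseline for a real model and loop this over several random splits, but the shape of the work is the same.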


I was pretty gung ho about getting into ML two years ago and put a lot of time into online courses like Ng's, books, and ground-up implementations of a lot of the common algorithms. I enjoyed it, but after a while it became clear to me that a lot of this stuff is better described as applied statistics.

And this can be powerful, of course, but it doesn't really have much of the magic of AI.


"a lot of this stuff is better described as applied statistics"

This is a key insight. Bravo!

To generalize a bit more, most of ML is applied mathematics. Getting a good grounding in the underlying math is the most illuminating step to learning ML (spoken as someone who wasted a lot of time doing other things thanks to an irrational fear of learning mathematics, and who is still bad at it).

Deep math/stats understanding combined with the engineering bits (like programming, cleaning the data, running clusters) and the communication bits (like visualization) brings you to (what should be) 'data science' (imvvho, ymmv, etc.).

I am still not sure one person can pull it all off; it probably needs a solid team of specialists. But hey, 'data scientist' is a hot job description, and so you can't blame people who know bits and pieces (sometimes very small bits and pieces ;) ) for calling themselves 'data scientists' or whatever. "Machine Learning for Hackers" and all that jazz. We've seen all this before with "HTML coders" in the nineties.


Work as a search quality engineer at Google and you do pretty much all of that.

Except for running the clusters[1], I've done pretty much all of those steps myself. I started with a nice statistical idea, built some simple models, played with feature selection and learning algorithms, built model viewers, built classifiers, validated classifiers, built demos, validated demos, built a production implementation[2], optimized the production implementation to make it small/fast enough, and finally launched a big search quality improvement.

[1] I certainly write distributed code that runs on them, but maintaining the DCs definitely isn't part of my job description.

[2] Validation of the final quality in prod is actually someone else's job, not because I couldn't do it, but you might not want me to tell you how good my stuff is, cause you know, I might be biased.


Right, it's less sexy than people think. That was my reaction when taking an Artificial Intelligence class and a Machine Learning class over 10 years ago as an undergrad. I was like, "these are unprincipled hacks". I liked graphics better. There were actual algorithms.

But in college you never get to apply them to real problems. I think if you apply them to real problems you'll have the revelation. Especially when you try other approaches first. But actually applying them requires domain knowledge, data cleaning skills, and programming skills beyond what many people have (certainly myself as an undergraduate).


Not disagreeing with your second paragraph, but just wanted to point out that Machine Learning has matured way beyond the "unprincipled hacks" phase, and as was correctly pointed out above, can be seen as a direction in applied statistics. If you look at a modern course in multivariate statistics, there's a significant overlap with ML (http://goo.gl/GTDUC).

I think expecting a new scientific or engineering discipline to be "sexy" or "magic" is simply a sign of widespread ignorance of the field, so it's a good thing when something becomes less "magic" and more "real". I bet airplanes were "magic" until we learned how to fly them consistently and safely :)


Machine learning is full of hacks to connect theory to producing results given constraints. That's not too different from hacks that are done for the sake of performance in graphics rendering.


There isn't much in AI that has "the magic of AI".


There is a successful school of thought that says significance testing is hogwash, and that school may be found near Palo Alto.


Specifically in the Gates building, I believe... but what about the tribe across the road in Sequoia Hall?


So some social skills and some database skills. That's pretty much the description of every IT job. What's so hard?

Predictive analytics has the possibility to radically transform many industries. This trend of hyping up its difficulty is a dishonest tactic to inflate salaries. Pretty soon we'll see the fraudulent practices of lawyers and colleges, with their accreditation requirements, if this trend isn't kept in check.


I'd add math skills, but I really appreciate this comment. Hyping the difficulty and complexity of these tasks to protect economic turf - and insisting on accreditation and experience vs. actual results - is never a good long-run play.

The only value any professional can reliably claim to add - data scientist, software architect, CEO, etc. - is the value they can prove. One must be able to justify their job or salary by superior results (quantitatively better predictions, lower costs, faster development cycles). If you can't do this, but instead insist your superior training protects the company from some scary, unseen, vague 'bad consequences' of not having a 'true professional' performing your work, then you're just as big a drain on enterprise value as a lawyer.

Or worse, an MBA. :)


You missed the stats and the math.


As practicing data scientists have pointed out in this thread, that is the least part of the job. And, really, how hard is it? As icelander points out, it comes down to plugging data into standard algos.


'As practicing data scientists have pointed out'

The OP and resulting discussion is about whether people calling themselves (or having titles saying) 'data scientists' actually are 'scientists' or know anything about 'data'.

So one guy having a 'data scientist' title and spending his time plugging data into black boxes says very little about what 'the least part of the job' in general is.

If anything your "some social skills and some database skills" is an even poorer description. By that measure, the business analyst next door who can write some SQL scripts is a 'data scientist'. But then otoh, maybe he is, who knows? ;)

Edit: I just noticed you are the "Are there practical applications for proving theorems? It seems like it's the full employment act for pencil pushers." guy.

http://news.ycombinator.com/item?id=4634969

My apologies for replying to your comments. Won't happen again. (beware @jacquesm !)


> (beware @jacquesm !)

I should log out of here anyway, thanks for the reminder

chmod 444 news.ycombinator.com


edit: I thought you were agreeing with that slandering fool plinkplonk so I got a little out of hand. Sorry jacquesm, I've edited it all out.


There isn't a thing in what I wrote here that warrants this comment.


I don't think there's such a huge body of knowledge in "data science" as people claim. It's not astronomy or biology etc. As a "data scientist" the predictive ability of your model is all that counts. So social skills, basic knowledge of stats, and your programming skills/creativity are what's important (and the Kaggle competitions have borne this out).

edit: Ok, ad hominem attack. And in that thread no one managed to make a convincing case for the utility of millions/billions of people learning maths as traditionally taught. You're probably a data scientist who wants to inflate your salary (there's my attack).

edit: LOL. You're a joke plinkplonk, hahaha - being called out on your lies about your chosen profession being hard (so that you can feel special) really gets you, doesn't it! http://www.urbandictionary.com/define.php?term=Wiener%20Didd...


Keep it civil man. You can argue your points without calling names. As a party with no horse in this race, I find the conversation interesting enough without the need to get nasty.


A bit of stats and a bit of math are not that hard.

Really knowing your stuff in either one of those fields is hard, knowing them very well and knowing enough computer science to apply it all (properly) is more the work of a small team than a single individual.

People like that are rare. There is nothing stopping anybody from calling themselves 'data scientist'. Just like there is nothing stopping anybody from calling themselves software architect or system administrator.

In the end that's just people marketing themselves as well as they know how, but that does not mean there isn't a sliding scale between warm body and excellence. I think that is the distinction the article tries to make.


The evidence coming out of Kaggle doesn't support the claim that teams of experienced specialists are needed. Teams of a single student have beaten entire industries' worth of companies (the essay-scoring competition, for example).


Kaggle competitors work on small data sets, so algorithmic problems don't really surface there.

Just because your algorithm can "beat" an industry's worth of work doesn't mean it can be implemented in a practical or efficient way.



