A large number of Show HNs are not startups; they are personal projects.
Relatedly, using "the startup is not dead" as a metric for startup success is a bad idea. People do not shut down everything just because a Hacker News submission does not get upvotes (especially in the case of Show HN, where many projects are hosted on GitHub Pages and are free to host, although you account for that in the regression).
The regression has a few issues:
> It's a classification problem (alive or dead) so OLS doesn't make sense.
> Score and Comments are multicollinear and cannot both be in the same model.
> You don't give statistics on how well the model predicts.
> Related to the comment earlier, you don't comment on the magnitude of the on_GitHub coefficients, which are huge and skew the entire result of the regression!
While I always appreciate analyses of HN data, the conclusions raise more questions than answers.
0. I measure Show HN projects, some of which are startups. Dead/alive status is a proper measure of survival, not of success; for success, see the section on commercial success.
> OLS works fine in classification problems. And it has advantages.
Do you have more explanation of these advantages? I read through the link you sent, and a bit more about linear probability models. Such things were never discussed in my statistics curriculum (BS, MS, PhD), except for motivating why logistic regression was necessary, and I'm not sure I understand the economist's arguments in favor of the LPM. Both the interpretation and the distribution of the test statistics are totally different under OLS versus logistic regression, and the overall probability of a defunct project is pretty small ( \hat{P}(y=0) = 0.07 ), small enough that there would be pretty big differences. To be clear, my reservation is with the p-values in the OLS model, not the predictions it generates. While the models agree on the direction of the covariates, the magnitudes are quite different, even when you convert the logit/probit estimates to the same scale as the LPM.
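To make that scale conversion concrete: a logit coefficient translates to a marginal effect on the probability of roughly β·p(1−p) at a given baseline probability, which is the quantity comparable to an LPM coefficient. A minimal sketch, using the thread's baseline P(y=0) ≈ 0.07 and a hypothetical logit coefficient:

```python
import math

def logistic(x):
    """Standard logistic (inverse-logit) function."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logit coefficient for one covariate; the baseline
# probability of a defunct project is the 0.07 quoted in the thread.
beta_logit = 0.5
p_baseline = 0.07

# Marginal effect on the probability, evaluated at the baseline:
# dP/dx = beta * p * (1 - p).  This is the number on the same scale
# as an LPM (OLS) coefficient.
marginal_effect = beta_logit * p_baseline * (1 - p_baseline)
print(marginal_effect)
```

With p so far from 0.5, the factor p(1−p) ≈ 0.065 is much smaller than the 0.25 it would be at p = 0.5, which is one reason the converted magnitudes can still diverge from the LPM estimates.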
> Multicollinearity refers to perfect multicollinearity.
Perfect multicollinearity will definitely mess up the estimation, but even if Score and Comments are not perfectly collinear, it's difficult to talk about each one's effect on the probability individually, which is exactly how coefficients in a (logistic) regression are interpreted. What do the VIFs look like for Score and Comments in particular?
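For context, with exactly two predictors the VIF reduces to 1/(1−r²) where r is their correlation, so it's quick to check. A minimal sketch with made-up score/comment pairs (real HN data would go here):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def vif_two_predictors(xs, ys):
    """With exactly two predictors, VIF = 1 / (1 - r^2) for both of them."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical score/comment pairs -- highly correlated, as HN data tends to be.
score    = [10, 25, 40, 80, 150, 300]
comments = [ 4, 12, 18, 35,  60, 140]
print(vif_two_predictors(score, comments))
```

A common rule of thumb is that a VIF above 5–10 signals enough collinearity to make the individual coefficients unstable.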
> I mention R² as one measure of predictive power.
But the outcome is binary, so you'll have a similar issue to Minimaxir's first point about OLS. If you wanted to talk about prediction accuracy, what about a confusion matrix, misclassification rate, or specificity/sensitivity/F1? Granted, you won't want to predict on the same tagged examples you trained the model on, but maybe you could split it 80-20? Or tag another 20-50? There are also R²-like measures you can use when the dependent variable is binary (a whole class of pseudo-R² measures).
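The metrics mentioned above are all cheap to compute from a 2×2 confusion matrix, and McFadden's pseudo-R² needs only the model and null log-likelihoods. A minimal sketch with hypothetical held-out counts (e.g. from an 80/20 split):

```python
def classification_metrics(tp, fp, fn, tn):
    """Basic metrics from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    misclassification = (fp + fn) / total
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return misclassification, sensitivity, specificity, f1

def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo-R^2: 1 - LL(model) / LL(null)."""
    return 1.0 - ll_model / ll_null

# Hypothetical counts and log-likelihoods; real values would come from
# the fitted model and a held-out set.
mis, sens, spec, f1 = classification_metrics(tp=40, fp=10, fn=5, tn=145)
print(mis, sens, spec, f1)
print(mcfadden_r2(ll_model=-80.0, ll_null=-120.0))
```

With only 7% of projects defunct, a model that always predicts "alive" already gets 93% accuracy, which is why sensitivity/specificity matter more than raw accuracy here.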
I would be curious to see the relationship between these predictors and the response. In my experience, linearity is a strong assumption to make, and for something like comments or score I'd expect that once it reached a certain threshold, extra comments/score added no further value. Are the log-score and log-comments linear over their entire support?
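One quick way to eyeball this is to bin the predictor and compare the empirical alive-rate per bin: a roughly linear trend across bins supports the linearity assumption, while a plateau suggests the threshold effect described above. A minimal sketch with made-up data:

```python
def binned_alive_rate(log_scores, alive, n_bins=4):
    """Split observations into equal-width bins of log-score and report
    the empirical alive-rate in each bin (None for empty bins)."""
    lo, hi = min(log_scores), max(log_scores)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero if constant
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(log_scores, alive):
        i = min(int((x - lo) / width), n_bins - 1)  # clamp the max into last bin
        bins[i].append(y)
    return [sum(b) / len(b) if b else None for b in bins]

# Hypothetical data where the alive-rate saturates past a threshold.
log_scores = [0.5, 0.8, 1.2, 1.5, 2.1, 2.4, 3.0, 3.3, 3.8, 4.0]
alive      = [0,   0,   1,   0,   1,   1,   1,   1,   1,   1]
print(binned_alive_rate(log_scores, alive))
```

If the top bins all sit near the same rate, a spline or a simple threshold term would fit better than a linear term in log-score.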
The dataset interpreted my site as dead because it 301-redirects to the HTTPS version. This is probably quite common, so take the living/dead stats with a grain of salt.
Came here to say exactly this. Most stats should probably either treat 3XX as "success" or follow the link to figure out the status at the next location.
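Following the redirect chain before classifying a site is only a few lines. A minimal sketch, with a hypothetical `fetch` callable standing in for a real HTTP HEAD request (so a 301 to HTTPS isn't miscounted as dead):

```python
def resolve_status(url, fetch, max_hops=5):
    """Follow 3xx redirects (via the Location header) until a final,
    non-redirect status is reached.  `fetch` is any callable that
    returns (status_code, location_or_None) for a URL."""
    status = None
    for _ in range(max_hops):
        status, location = fetch(url)
        if 300 <= status < 400 and location:
            url = location  # hop to the redirect target
            continue
        return status
    return status  # give up after too many hops (likely a redirect loop)

# Hypothetical responses standing in for live HTTP requests.
responses = {
    "http://example.com":  (301, "https://example.com"),
    "https://example.com": (200, None),
}
print(resolve_status("http://example.com", responses.__getitem__))  # 200
```

The `max_hops` cap guards against redirect loops, which do show up in old Show HN links.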
Not complaining, just a note: those counts come from different points in Hacker News's history, which means the chance of getting points grows with the user base. Something extremely popular in 2011 can't compete with something from 2016.
It's probably hard to know how many users HN had at each of those points in time, but maybe adding a section broken down by year could help a bit.
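One way to sidestep not knowing the user counts is to rank each submission's score against other submissions from the same year, so a 2011 hit and a 2016 hit become comparable. A minimal sketch with hypothetical (year, score) pairs:

```python
from collections import defaultdict

def percentile_within_year(submissions):
    """Convert each submission's score to a within-year percentile rank.
    `submissions` is a list of (year, score) tuples; returns one
    percentile per submission, in input order."""
    by_year = defaultdict(list)
    for year, score in submissions:
        by_year[year].append(score)
    out = []
    for year, score in submissions:
        peers = by_year[year]
        rank = sum(1 for s in peers if s <= score)  # how many peers it beats or ties
        out.append(rank / len(peers))
    return out

# Hypothetical scores: 120 points in 2011 vs 300 points in 2016 both
# land at the top of their own year.
subs = [(2011, 120), (2011, 30), (2011, 10), (2016, 300), (2016, 90), (2016, 40)]
print(percentile_within_year(subs))
```

This keeps the cross-year comparison fair without needing the actual user-base numbers, at the cost of assuming each year has enough submissions for stable percentiles.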
Heh, I have 3 projects on that list and I'm still running two of the three. One was a Flappy Bird clone in Swift; another is a touch visualizer for iOS; and the third is https://www.gitignore.io. Pretty interesting analytics; it's funny because I'm in the process of migrating the touch visualizer to a Swift project and a new project owner.