> OLS works fine in classification problems. And it has advantages.
Do you have more explanation of these advantages? I read through the link you sent, and a bit more about linear probability models. Such things were never discussed in my statistics curriculum (BS, MS, PhD), except to motivate why logistic regression was necessary. I'm not sure I understand the economist's arguments in favor of the LPM. Both the interpretation and the distribution of the test statistics will be totally different with OLS versus logistic regression, and the overall probability of a defunct project is pretty small ($\hat{P}(y=0) = 0.07$), small enough that there would be pretty big differences. To be clear, my reservation is with the p-values in the OLS model, not the predictions it generates. While the models agree on the direction of the covariates, the magnitudes are quite different, even when you convert the logit/probit coefficients to the LPM's scale.
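To make the scale comparison concrete, here's a sketch on simulated data (not your HN sample; every number below is invented for illustration): fit the LPM slope by OLS, fit the logit by Newton-Raphson, and put the logit slope on the LPM's scale via the average marginal effect, mean(p(1−p))·β.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-2.0, 0.8])              # made-up DGP
p_true = 1 / (1 + np.exp(-X @ true_beta))
y = rng.binomial(1, p_true)

# LPM: OLS of the 0/1 outcome on x
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Logistic regression via Newton-Raphson (IRLS)
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta = beta + np.linalg.solve(hess, grad)

# Average marginal effect: puts the logit slope on the LPM's probability scale
ame = np.mean(p * (1 - p)) * beta[1]
print(beta_ols[1], ame)
```

With a well-specified logistic DGP the two land close together; the interesting cases are exactly the ones like yours, where the baseline probability is near the boundary and they drift apart.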
> Multicollinearity refers to perfect multicollinearity.
Perfect multicollinearity will definitely mess up the estimation, but even if Score and Comments are not perfectly collinear, it's difficult to talk about each one's effect on the probability individually, which is exactly how coefficients in a (logistic) regression are interpreted. What do the VIFs look like for Score and Comments, in particular?
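For just two predictors the VIF reduces to 1/(1 − r²) from regressing one predictor on the other. A quick numpy sketch, using made-up correlated stand-ins for log-score and log-comments:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Hypothetical stand-ins, deliberately correlated
log_score = rng.normal(size=n)
log_comments = 0.9 * log_score + 0.3 * rng.normal(size=n)

def vif(target, others):
    """VIF_j = 1 / (1 - R^2) from regressing predictor j on the others."""
    X = np.column_stack([np.ones(len(target))] + others)
    resid = target - X @ np.linalg.lstsq(X, target, rcond=None)[0]
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

v1 = vif(log_score, [log_comments])
v2 = vif(log_comments, [log_score])
print(v1, v2)
```

The usual rule of thumb flags VIFs above 5 or 10; with only two predictors both VIFs are identical, so one number settles it.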
> I mention R² as one measure of predictive power.
But the outcome is binary, so you'll run into an issue similar to Minimaxir's first point about OLS. If you want to talk about prediction accuracy, what about a confusion matrix, misclassification rate, or specificity/sensitivity/F1? Granted, you won't want to predict on the same tagged examples you trained the model on, but maybe you could split it 80-20? Or tag another 20-50? There are also R²-like measures for when the dependent variable is binary (a whole class of pseudo-R² measures).
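Getting those metrics on a holdout takes only a few lines. A numpy-only sketch with made-up fitted probabilities (the ~7% defunct rate mirrors the thread, everything else is invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
y = rng.binomial(1, 0.93, size=n)                        # ~7% in the rare class
p_hat = np.clip(y * 0.9 + rng.normal(0, 0.2, n), 0, 1)   # stand-in fitted probabilities

# 80-20 split: evaluate only on the held-out 20%
idx = rng.permutation(n)
test = idx[: n // 5]
pred = (p_hat[test] >= 0.5).astype(int)
truth = y[test]

# Confusion-matrix cells
tp = np.sum((pred == 1) & (truth == 1))
tn = np.sum((pred == 0) & (truth == 0))
fp = np.sum((pred == 1) & (truth == 0))
fn = np.sum((pred == 0) & (truth == 1))

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(sensitivity, specificity, f1)
```

With a class this imbalanced, raw accuracy is nearly useless (always predicting "alive" already gets ~93%), which is why sensitivity/specificity on the rare class are the numbers to watch.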
I would be curious to see the relationship between these predictors and the response. In my experience linearity is a strong assumption, and for something like comments or score I'd expect that once it passes a certain threshold, additional comments/score add no extra value. Are log-score and log-comments linear over their entire support?
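A cheap way to eyeball that is to bin the predictor and look at the mean outcome per bin. Here's a sketch with a simulated threshold effect baked in (all values below are invented for illustration; swap in the real log-score and tagged outcome):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
log_score = rng.uniform(0, 6, n)
# Hypothetical DGP: score helps up to ~3, then the effect flattens out
p = 1 / (1 + np.exp(-(-1.5 + 1.2 * np.minimum(log_score, 3.0))))
y = rng.binomial(1, p)

# Mean outcome within equal-width bins of the predictor
bins = np.linspace(0, 6, 13)
which = np.digitize(log_score, bins) - 1
means = np.array([y[which == b].mean() for b in range(12)])
for b in range(12):
    print(f"log-score in [{bins[b]:.1f}, {bins[b+1]:.1f}): "
          f"P(alive) = {means[b]:.2f}  (n={np.sum(which == b)})")
```

If the binned means climb and then plateau like this, a linear term in log-score is mis-specified, and something like a spline or a capped/threshold term would fit better.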