We were interested in using each author's word frequencies to predict (1) whether the author is older than 22 and (2) whether the author self-identified as female. The goal of this project was to find statistical methods that predict both well. With 16432 authors and 23277 words, this is a typical high-dimensional binary classification problem. I used logistic regression with an Elastic Net penalty, fitted with an R package, for this analysis. I got an A for this course at the end of the semester.
Code: click here.
I divided the project into two parts, sex and age, with a separate model for each. The modeling process is the same for both models: the predictors are the term frequencies of 23277 terms, and the response is the binary variable sex or age as given in the dataset. The dataset therefore has 23277 variables and 16432 observations.
Since this is a binary classification problem, I considered logistic regression. However, because the number of variables (23277) exceeds the number of observations (16432), I used regularized logistic regression with an Elastic Net penalty.
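For reference, one common way to write the Elastic Net penalized logistic regression objective is shown below. This is a sketch under an assumption: it follows the parameterization used by the R package glmnet, and the write-up does not say which package was actually used. Here N is the number of observations, x_i the term frequencies of author i, y_i the binary response, β_0 the intercept and β the coefficient vector.

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] \;+\; \lambda\Big[\, \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1 \Big]
$$

In this form α = 0 gives the ridge penalty, α = 1 gives the lasso penalty, and λ ≥ 0 controls the overall strength of the penalty.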
The rest of this write-up has two parts: (1) how I chose the best α and λ for Elastic Net regularization, and (2) how I chose the best classification threshold for the logistic regression.
2. Modeling Process
2.1 Choosing α and λ
A recipe for choosing the best pair of α and λ:
- (a) Randomly split the complete dataset (excluding the observations whose responses must be predicted) into a training set and a testing set in a 7:3 ratio. Since the complete dataset is balanced in sex and age, I also kept the two sets balanced in the response.
- (b) Set a sequence of α = seq(0, 1, length.out = 10). For each α, use the Area Under the Curve (AUC) to choose the optimal λ.
- (c) From the list of α's, choose the α and its corresponding λ that maximize AUC.
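The selection logic in steps (a) to (c) can be sketched in plain Python as follows. This is an illustration only, not the original R code: the helper names `stratified_split`, `auc` and `pick_best` are hypothetical, and `score_fn` stands in for whatever routine fits the Elastic Net model for a given (α, λ) and returns its AUC.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Step (a): return (train_idx, test_idx), splitting each class
    separately so both sets keep the class balance of the full data."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count one half.
    Quadratic in the sample size, which is fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pick_best(alphas, lambdas, score_fn):
    """Steps (b) and (c): for each alpha keep the lambda with the highest
    score, then keep the best (alpha, lambda, score) triple overall.
    score_fn(alpha, lam) is a hypothetical stand-in for fit-and-evaluate."""
    best = None
    for a in alphas:
        lam, score = max(((l, score_fn(a, l)) for l in lambdas),
                         key=lambda pair: pair[1])
        if best is None or score > best[2]:
            best = (a, lam, score)
    return best

# Ten evenly spaced alphas, mirroring seq(0, 1, length.out = 10)
alphas = [i / 9 for i in range(10)]
```

Splitting within each class is what keeps the training and testing sets balanced in the response, as described in step (a).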
2.2 Choosing Logistic Regression Threshold
A recipe for choosing the threshold for regularized logistic regression, given the best pair of α and λ:
- (a) Randomly split the complete dataset into 4 folds. In each fold, split the data into a training set and a testing set in a 1:1 ratio. Since the complete dataset is balanced in sex and age, I also kept the two sets balanced in the response.
- (b) Set a sequence of thresholds = seq(0.4, 0.6, length.out = 100). For each threshold, use 4-fold cross-validation to compute the average prediction accuracy on the testing sets. Accuracy is defined as:
Accuracy = (True Positive + True Negative) / #(All Observations)
- (c) From the list of thresholds, choose the threshold that maximizes the cross-validated accuracy.
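Steps (b) and (c) can be sketched in plain Python as follows. This is an illustration rather than the original R code: the function names `accuracy` and `best_threshold` are hypothetical, and `folds` is assumed to already hold, for each fold, the true labels and the fitted model's predicted probabilities on that fold's testing set.

```python
def accuracy(labels, probs, threshold):
    """(TP + TN) / number of observations, predicting class 1
    whenever the predicted probability is >= threshold."""
    correct = sum((p >= threshold) == (y == 1) for y, p in zip(labels, probs))
    return correct / len(labels)

def best_threshold(folds, thresholds):
    """folds: list of (labels, predicted_probabilities) pairs, one per
    fold's testing set. Returns the threshold with the highest accuracy
    averaged over folds; on ties the smallest-index threshold wins."""
    def cv_accuracy(t):
        return sum(accuracy(y, p, t) for y, p in folds) / len(folds)
    return max(thresholds, key=cv_accuracy)

# 100 evenly spaced thresholds, mirroring seq(0.4, 0.6, length.out = 100)
thresholds = [0.4 + 0.2 * i / 99 for i in range(100)]
```

Because the classes are balanced, the selected threshold would be expected to land near 0.5, which is presumably why the search grid is limited to the range 0.4 to 0.6.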