Life Satisfaction Predictor
- Language: R
- Report: Report
- Github: Predicting Life Satisfaction
Overview: 2-3 minute read
Life satisfaction is difficult to define or measure accurately. In our study we were given 270 possible determinants of life satisfaction. Several classification techniques were employed to find the combination of determinants that best explains a person’s satisfaction in Europe.
Domain knowledge was first used to eliminate variables that seemed irrelevant to the study, such as the timing of the interview and the gender of the third to thirteenth person at home. Variables with over 50% missing data were also removed to avoid biasing the results. This left us with about 184 variables. LASSO further reduced this to 136 variables. A combination of backward stepwise regression and the importance function in random forests then reduced the predictors to a subset of 19 variables. The analysis was run on both the 136-variable and the 19-variable sets to check that the smaller set did not lose predictive information.
Boosting, logistic regression, trees, random forests, and neural networks were used, and almost all yielded similar results. Our highest result, an accuracy of 89.6% on 5-fold cross-validation, came only from an ensemble of trees, random forests, and boosting.
In depth: 15 - 20 minute read
Abstract
Life satisfaction is difficult to define or measure accurately. In our study we were given 270 possible determinants of life satisfaction. Several classification and learning techniques were employed to find the combination of determinants that best explains a person’s satisfaction in Europe. Domain knowledge was first used to eliminate variables that seemed irrelevant to the study, such as the timing of the interview and the gender of the third to thirteenth person at home. Variables with over 50% missing data were also removed to avoid biasing the results. This left us with about 184 variables. LASSO further reduced this to 136 variables. A combination of backward stepwise regression and the importance function in random forests then reduced the predictors to a subset of 19 variables. The analysis was run on both the 136-variable and the 19-variable sets to check that the smaller set did not lose predictive information. Boosting, logistic regression, trees, random forests, and neural networks were used, and almost all yielded similar results. Our highest result, 0.88741 on Kaggle and 0.896 on our test data, came only from an ensemble of trees, random forests, and boosting.
Exploratory Data Analysis and Data Cleaning
Our main resource for exploring the data was the codebook, which contained all the necessary information about each variable. We used domain knowledge to eliminate variables that seemed irrelevant, such as the time of the interview, the gender of the third to thirteenth person at home, the relation of the third to thirteenth person in the household to the respondent, and other variables. Our domain knowledge comes mainly from the World Bank and OECD definitions of satisfaction, as well as some research on the predictors and determinants of citizen satisfaction. Predictors of life satisfaction are generally divided into five categories: economic, trust, safety, demographic, and happiness variables. Economic variables include income level, satisfaction with the country’s economy, etc. Trust variables concern trust in institutions such as parliament, police, political parties, etc. Demographic variables cover gender, age, country, etc. Safety variables concern feeling safe when walking in the streets, being able to take a role in politics, freedom of speech, etc. Finally, happiness variables ask respondents directly about their state of mind and about aspects of satisfaction in their daily life. Each of these categories is represented by variables in our dataset. Accordingly, the following variables were deemed irrelevant to our definitions of satisfaction and thus were removed.
Exploring Missing Data
24.35% of variables had some missing data
3.3% of variables had over 50% missing data
1.8% of variables had 80% or more missing data
A common rule of thumb is to eliminate variables with 80% or more missing data, but eliminating variables with more than 50% missing data gave better results.
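The missing-data screen described above can be sketched in base R as follows. The data frame `survey`, its column names, and the helper `drop_sparse_columns` are hypothetical stand-ins for illustration; only the 50% cutoff comes from the text.

```r
# Toy data frame standing in for the survey data (hypothetical columns).
survey <- data.frame(
  v1 = c(1, 2, NA, 4, 5, 3),
  v2 = c(NA, NA, NA, NA, 2, 1),
  v3 = c("a", "b", "a", NA, "b", "a"),
  stringsAsFactors = FALSE
)

# Share of missing values in each column.
missing_share <- colMeans(is.na(survey))

# Keep only the columns whose missing share does not exceed the cutoff.
drop_sparse_columns <- function(df, cutoff = 0.5) {
  df[, colMeans(is.na(df)) <= cutoff, drop = FALSE]
}

cleaned <- drop_sparse_columns(survey, cutoff = 0.5)
print(round(missing_share, 2))
print(names(cleaned))
```

Changing `cutoff` to 0.8 would reproduce the more lenient rule of thumb for comparison.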
Treating Missing Data
All categorical values were imputed with the mode
All numerical values were imputed with the median
We also made sure that categories were equally spaced (that is, the difference between adjacent categories is one unit)
Variables with inconsistently numbered categories were renumbered. For example, v56, v58, v60, v62, v65, v66, v67, and v68, which all record the level of education of the respondent or the respondent’s parents, had a category numbered “55” while the other categories ran from 1 to 5 or 1 to 7. Categories were also recoded to point in the same direction, so that a higher category is always more positive (e.g. in v1, “Able to take active role in political life”, a higher response is more positive).
Ordinal variables were also converted to ordered factors to retain the significance and meaning of the category order.
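A minimal base R sketch of these cleaning steps is shown below. The helpers `stat_mode`, `impute_simple`, and `recode_value`, the toy columns, and in particular the target code `4` for the stray category `55` are assumptions made for illustration; the text does not say which value the category was recoded to.

```r
# Most frequent non-missing value of a vector (ties go to the first level).
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Impute numeric columns with the median and all other columns with the mode.
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- stat_mode(df[[col]])
    }
  }
  df
}

# Replace one category code with another, leaving NAs untouched.
recode_value <- function(x, from, to) {
  x[!is.na(x) & x == from] <- to
  x
}

toy <- data.frame(
  income = c(10, NA, 30, 20),                  # numeric variable
  region = c("north", "south", NA, "north"),   # categorical variable
  educ   = c(1, 55, 3, 2),                     # ordinal variable with a stray code
  stringsAsFactors = FALSE
)

toy <- impute_simple(toy)
toy$educ <- recode_value(toy$educ, from = 55, to = 4)  # target code is an assumption
toy$educ <- factor(toy$educ, levels = sort(unique(toy$educ)), ordered = TRUE)
str(toy)
```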
Feature Selection
LASSO
LASSO is a penalized regression method used to select a subset of variables. It constrains the sum of the absolute values of the model coefficients to stay below a specified constant. Because of this constraint, the regression coefficients of some variables become exactly zero, which allowed us to identify the variables most strongly associated with life satisfaction. To optimize LASSO, we need to find a suitable tuning parameter 𝜆; as 𝜆 increases, bias increases and variance decreases. We started with a default value of 𝜆=100 and then plotted the misclassification error for different values of 𝜆 to find the ideal tuning parameter. The tuning parameter matters because it controls the strength of the penalty: a larger 𝜆 sets a higher threshold that a variable must clear to stay in the model. LASSO excluded about 48 variables and left us with 136 variables.
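The selection step described above could look roughly like this with the `glmnet` package. This is a sketch under assumptions: the text does not name the package used, the response is assumed to be binary, and `select_lasso` and the simulated data are hypothetical. The calls `cv.glmnet` (with `type.measure = "class"` for misclassification error) and `coef(..., s = "lambda.min")` are part of glmnet.

```r
# Cross-validated LASSO for a binary response. Returns the chosen lambda,
# the predictors with non-zero coefficients, and the fitted CV object.
select_lasso <- function(x, y, nfolds = 5) {
  cv_fit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1,
                              type.measure = "class", nfolds = nfolds)
  coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))
  kept <- rownames(coefs)[coefs[, 1] != 0]
  list(lambda = cv_fit$lambda.min,
       kept = setdiff(kept, "(Intercept)"),
       cv = cv_fit)
}

# Demo on simulated data; runs only if glmnet is installed.
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  x <- matrix(rnorm(n * 10), n, 10,
              dimnames = list(NULL, paste0("v", 1:10)))
  # Only v1 and v2 truly drive the outcome in this toy example.
  y <- rbinom(n, 1, plogis(2 * x[, "v1"] - 2 * x[, "v2"]))
  result <- select_lasso(x, y)
  print(result$kept)
}
```

Calling `plot(result$cv)` draws the misclassification error against log(𝜆), which is one way to produce the kind of plot mentioned in the text.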
Random Forest
We calculated the importance of each variable in the random forest using two criteria: permutation importance, which shows how much the score decreases when a variable’s values are randomly shuffled, and Gini impurity. This left us with 6 variables as the most important (Gini >= 13 and Permutation >= 8).
Note that the most important variables are ones that literally ask about happiness, enjoyment, depression, etc., all of which reflect the secondary satisfaction described in our domain knowledge. The other variables ask about the economy, which corresponds to the economic part of our domain knowledge. Also note that v98, which simply asks how happy a person is, relates directly to our question of satisfaction from both a common-sense and an analytical perspective: it has a score of 100 on both criteria and, according to our results, much greater influence than the other 5 variables combined.
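The importance screen can be sketched with the `randomForest` package as below. Several things here are assumptions: the text does not name the package, `rank_importance` is a hypothetical helper, and the rescaling of both scores to a 0 to 100 range is a guess at how a score of 100 arises. Only the two thresholds are taken from the text.

```r
# Permutation and Gini importance from a random forest, each rescaled to
# 0-100 (the rescaling is an assumption, not confirmed by the write-up).
rank_importance <- function(x, y, ntree = 200) {
  rf <- randomForest::randomForest(x = x, y = as.factor(y),
                                   ntree = ntree, importance = TRUE)
  perm <- randomForest::importance(rf, type = 1)[, 1]  # mean decrease in accuracy
  gini <- randomForest::importance(rf, type = 2)[, 1]  # mean decrease in Gini
  rescale <- function(v) 100 * (v - min(v)) / (max(v) - min(v))
  data.frame(variable = names(perm),
             permutation = rescale(perm),
             gini = rescale(gini),
             row.names = NULL)
}

# Demo on simulated data; runs only if randomForest is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(3)
  n <- 300
  x <- data.frame(v1 = rnorm(n), v2 = rnorm(n), v3 = rnorm(n))
  y <- rbinom(n, 1, plogis(3 * x$v1))  # only v1 carries signal
  imp <- rank_importance(x, y)
  top <- subset(imp, gini >= 13 & permutation >= 8)
  print(top)
}
```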
Backward Stepwise Regression
Backward stepwise regression is a popular method for eliminating irrelevant variables. It starts with all the variables in the model and then removes the least significant one at each step. The elimination continues until no insignificant variables remain. Using backward stepwise regression at its default values, we were left with 19 variables.
Note: The 6 variables calculated by Random Forest were a subset of the 19 variables calculated by Backward Stepwise Regression.
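As an illustration of backward elimination, the sketch below uses base R's `step()` on a logistic regression. It assumes a binary response and simulated data with hypothetical column names; `step()` at its defaults drops terms by AIC, which may differ from the exact criterion used in the study.

```r
set.seed(42)
n <- 300

# Simulated predictors: only x1 and x2 truly affect the outcome.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$satisfied <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.0 * toy$x2))

# Start from the full model and drop terms one at a time while AIC improves.
full_model <- glm(satisfied ~ ., data = toy, family = binomial)
reduced_model <- step(full_model, direction = "backward", trace = 0)

kept <- setdiff(names(coef(reduced_model)), "(Intercept)")
print(kept)
```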
Final Result of Feature Selection
We decided to work with 2 datasets:
Dataset with 19 variables from Backward Stepwise Regression
Dataset with 136 variables from LASSO
Analysis
Initial Analysis
Initially, we tried various classifiers. After tuning and testing each one, we got the following (based on our train/test split):
Tuning Parameters
Decision Tree – The tree was pruned until the test error reached its minimum.
Random Forest – A grid search over 1–30 trees with node sizes of 10 or 20, using threefold cross-validation, selected 19 trees with a node size of 20 for the 136 variables and 8 trees for the 19 variables.
Boosting – The number of trees was set to 1000 and the shrinkage factor to 0.1.
KNN – Accuracy was highest at K = 5.
Final Analysis
Random forests and boosting had the highest accuracies. Trees belong to the same family of classifiers but had a much lower accuracy, so we built an ensemble of random forests, boosting, and trees.
Used 50% of the data to train a random forest, a boosting model, and a decision tree.
Combined the three trained models in an ensemble classifier, using logistic regression to determine their weights.
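The weighting step can be sketched in base R as a logistic regression fitted on the base models' predicted probabilities. This is an assumption-laden illustration: `fit_stack` and `predict_stack` are hypothetical helpers, the columns `rf`, `gbm`, and `tree` hold simulated probabilities rather than output from real models, and the split used to fit the meta-model is a guess.

```r
# Fit a logistic-regression meta-model on base-model probabilities.
fit_stack <- function(base_probs, y) {
  glm(y ~ ., data = cbind(base_probs, y = y), family = binomial)
}

# Turn the meta-model's probabilities into 0/1 class predictions.
predict_stack <- function(meta, base_probs, threshold = 0.5) {
  as.integer(predict(meta, newdata = base_probs, type = "response") > threshold)
}

set.seed(7)
n <- 400
y <- rbinom(n, 1, 0.5)

# Simulated stand-ins for held-out probabilities from three trained models;
# larger noise means a weaker base model.
noisy_prob <- function(y, noise) {
  plogis((2 * y - 1) * 1.5 + rnorm(length(y), sd = noise))
}
base_probs <- data.frame(rf = noisy_prob(y, 1),
                         gbm = noisy_prob(y, 1),
                         tree = noisy_prob(y, 3))

# Fit the weights on one half and evaluate on the other half.
fit_rows <- 1:200
test_rows <- 201:400
meta <- fit_stack(base_probs[fit_rows, ], y[fit_rows])
pred <- predict_stack(meta, base_probs[test_rows, ])
accuracy <- mean(pred == y[test_rows])

print(coef(meta))  # the learned weight for each base model
print(accuracy)
```

Fitting the weights on rows the base models did not train on keeps the meta-model from simply rewarding whichever base model overfits most.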
Results and Conclusion
Accuracy on our test data – 89.6% (136 variables)
Accuracy on Kaggle – 88.74% (136 variables) and 88.484% (19 variables)
Position on leaderboard – 30
v98 (“How happy are you?”) is the strongest predictor of satisfaction.