2020

Predicting World Happiness

My final project for my undergraduate Machine Learning class, in which I used KNeighborsRegressor and Linear Regression to predict the happiness scores of countries around the world.

Overview

For my final project for Machine Learning, I looked at 2015 through 2019 happiness data from the World Happiness Report. To predict a country's happiness score for a particular year, I used the Linear Regression and K Neighbors Regressor models from scikit-learn.

Dataset

Data for the years 2015 through 2020 were made available on Kaggle^*. Because the 2020 data was much less consistent with the other years than the 2015 through 2019 data were with each other, this project uses only the data from 2015 through 2019. The columns I looked at are as follows –

Country

Region – Regions used are from the 2015 dataset and are mapped to all countries in the other datasets to ensure consistency and add regions to datasets that didn't previously have them

Happiness Score

Economy (GDP per Capita)

Health (Life Expectancy) – The values are based on the data extracted from the World Health Organization's Global Health Observatory data repository

Freedom – The values are the national average of responses to "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"

Trust (Government Corruption) – The questions asked were, "Is corruption widespread throughout the government or not?" and "Is corruption widespread within businesses or not?" The overall perception is just the average of the two 0-or-1 responses.

Generosity – The values are the national average of responses to the Gallup World Poll question "Have you donated money to a charity in the past month?"

The dataset is stored in a Pandas DataFrame. Certain years did not have a region column, so the regions used in the 2015 dataset are mapped to every country in all of the other datasets. A year column was also added to each year's DataFrame, so that the DataFrames could be concatenated into one larger DataFrame while retaining the year each datapoint came from.
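The assembly described above could look roughly like the sketch below. It is not the project's actual code: the helper name `combine_years`, the column names, and the tiny made-up frames are all assumptions for illustration, and the real Kaggle files use different column names from year to year.

```python
import pandas as pd

# Tiny made-up frames standing in for two of the yearly files.
# Country names, regions, and scores are placeholders, not real data.
df_2015 = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Region": ["Region 1", "Region 2"],
    "Happiness Score": [7.0, 4.0],
})
df_2019 = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Happiness Score": [7.1, 4.2],
})


def combine_years(frames_by_year, region_source_year=2015):
    """Tag each yearly frame with its year and a shared region, then stack them."""
    # Country -> Region lookup taken from the one year that has regions.
    source = frames_by_year[region_source_year]
    region_by_country = source.set_index("Country")["Region"]

    tagged = []
    for year, frame in frames_by_year.items():
        frame = frame.copy()
        # Create (or overwrite) Region so every year uses the same labels.
        # Countries missing from the source year end up with NaN here.
        frame["Region"] = frame["Country"].map(region_by_country)
        frame["Year"] = year
        tagged.append(frame)

    return pd.concat(tagged, ignore_index=True)


combined = combine_years({2015: df_2015, 2019: df_2019})
print(combined)
```

Countries that do not appear in the 2015 data would need their regions filled in by hand, since the lookup has no entry for them.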

^* Since I worked on the project in December 2020, more recent data has become available.

Observations

I compared the results for five seeds: 10, 18, 26, 34, and 42.

Common Observations among All Seeds

The K Neighbors Regressor model had a higher R² than the Linear Regression model, no matter which seed was used for shuffling, even though linear regression is known to work well with continuous data and K Nearest Neighbors is not known to be a very effective algorithm.
While a few outliers were predictions lower than the actual happiness score, most outliers were predictions higher than the actual score, regardless of whether the model was K Neighbors Regressor or Linear Regression.
As the seed increased, the coefficient of determination R² for both models decreased.
The models performed about the same, regardless of whether or not the feature values were scaled using the MinMaxScaler.
Both models had a generally high coefficient of determination R² (> 0.7), and most points on the scatterplots were somewhere around the ideal line of y = x.
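A comparison like the one summarized above can be sketched as follows. The write-up does not say how the seeds were applied, so this assumes they were passed as `random_state` to `train_test_split`. The 80/20 split, `n_neighbors=5`, the helper name `r2_by_seed`, and the synthetic data are also assumptions, so the printed numbers will not match the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for five feature columns and the happiness score.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.5, size=(300, 5))
y = 2.0 + X @ np.array([1.0, 0.9, 0.8, 0.5, 0.3]) + rng.normal(0.0, 0.3, size=300)

SEEDS = [10, 18, 26, 34, 42]


def r2_by_seed(X, y, seeds=SEEDS, test_size=0.2, n_neighbors=5):
    """Return {seed: {"linear": R², "knn": R²}} measured on the held-out split."""
    results = {}
    for seed in seeds:
        # The seed only controls how rows are shuffled into train and test sets.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, shuffle=True, random_state=seed
        )
        models = {
            "linear": make_pipeline(MinMaxScaler(), LinearRegression()),
            "knn": make_pipeline(
                MinMaxScaler(), KNeighborsRegressor(n_neighbors=n_neighbors)
            ),
        }
        # For regressors, score() returns R² on the data it is given.
        results[seed] = {
            name: model.fit(X_train, y_train).score(X_test, y_test)
            for name, model in models.items()
        }
    return results


scores = r2_by_seed(X, y)
for seed, row in scores.items():
    print(seed, {name: round(value, 3) for name, value in row.items()})
```

Removing `MinMaxScaler()` from each pipeline gives the unscaled variant for comparison. Since the seed only changes which rows land in the test set, differences between seeds reflect the particular split rather than the size of the seed value.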

Other Observations

Before I colored each point on the scatterplot by the country's region, it seemed like the models were working well and the data I found on Kaggle was good data. After coloring each point by region, it seems like the dataset may assess each country's overall happiness score based only on Western Europe's definition of happiness.
Countries with darker-skinned or Black populations tend to be on the lower end of the ranking. This makes me curious about how the dataset could be changed to include multiple definitions of happiness, taking into account the different ways that different cultures and individuals think about what it means to be happy, as well as other factors that could be measured instead.
Although a country's region was not a feature used to predict its happiness score, it seems like a lot of this computation was based on a definition of happiness held by a certain group of people. Under this definition, since different countries in the same region do have similarities, it seems to make sense, in a way, that countries in the same region end up in the same part of the scatterplots.
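One way to put numbers on these regional patterns, rather than reading them off the colored scatterplots, would be to summarize the prediction error per region. This is only a sketch of a possible follow-up and was not part of the project. The column names `Actual` and `Predicted`, the helper `residuals_by_region`, and the values below are made up for illustration.

```python
import pandas as pd

# Made-up predictions for four placeholder countries in two placeholder regions.
predictions = pd.DataFrame({
    "Region": ["Region 1", "Region 1", "Region 2", "Region 2"],
    "Actual": [7.0, 6.5, 4.0, 3.5],
    "Predicted": [6.8, 6.6, 4.6, 4.1],
})


def residuals_by_region(frame):
    """Mean residual (predicted minus actual) and row count for each region."""
    frame = frame.assign(Residual=frame["Predicted"] - frame["Actual"])
    return frame.groupby("Region")["Residual"].agg(["mean", "count"])


summary = residuals_by_region(predictions)
print(summary)
```

A positive mean residual for a region means the model tends to predict a higher happiness score than the reported one for countries in that region. A negative mean means it tends to predict lower.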

Goals for the Future

– Update the dataset and predict 2021 data (and later)
– Learn more about how to collect data
– Put together different data sources
→ Consider other potential factors of happiness including but not limited to
⇒ Volunteer work
⇒ Scale of self-appreciation
⇛ 1 = no self-appreciation
⇛ 10 = lots of self-appreciation
⇒ How much natural light people are exposed to
⇒ Air quality and pollution – scale
⇛ 1 = no pollution
⇛ 10 = high pollution
⇒ Personal assessment of work-life balance
⇒ How much experiences versus material goods are valued
→ Compare similar measures from different data sources
– Learn more about positive psychology to help determine what leads to better perceptions of personal happiness
– Make a web app to make the data and results more accessible
– Combine with the Happy Journal web app project