Team 7
Ghoul Pool



Aashray Arora | Varsha Paidi | Sravanthi Pereddy | Udit Kumar Gupta


Challenge / background descriptions

Literature review

In our challenge, we have to predict the age at which a given celebrity will die. Since the output variable is not a class label, this is not a classification problem. Here the output variable, the age at which a celebrity will die, is a continuous value, which makes this clearly a regression problem. We therefore reviewed several papers to understand how age can be used as the output variable in various machine learning regression models.


“Age Likes Some Years”
Siddiqi, Ahmed. "Age Likes Some Years." Scientometrics Vol. 69.2 (2006): 315-21. Akadémiai Kiadó. Web.

Hypothesis:
Method:
Results:
We compared the results described in the paper with our results (on our dataset), using an equal number of people from each profession (15,000). The results are shown below. We did not observe the peaks mentioned by the author. The author used a very small sample; on a large dataset like ours, such claims do not hold.

Number of deaths at different age groups for different professions, to evaluate the claim made in [22]. We do not observe the peaks mentioned by the author on a larger dataset like ours.


"Demographic Research"
De Beer, Joop. "Smoothing and Projecting Age-specific Probabilities of Death by TOPALS." Demographic Research Vol. 27.20 (2012): 543-92. Max Planck Institute for Demographic Research. Web.

Life Tables
Arias, Elizabeth. "United States Life Tables, 2008." National Vital Statistics Reports Vol. 61.3 (2012). Web.

The cohort life table:
Period Life Table:
Probability of dying:
“Demographic Prediction Based on User's Browsing Behavior”
Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen

Data sets

To download the data set, click on the following link.
Data: Data


Possible Data Sources


Initially we explored various sources for gathering our data: the Notable Names Database, Wikipedia, DBpedia, and Freebase. The Notable Names Database (NNDB) contains data on more than 40,000 very famous people, but this was too small for our needs. We considered parsing Wikipedia, but due to its unstructured nature even a robust web crawler failed to extract all the information we needed. This led us to DBpedia, which exposes Wikipedia as a structured database. This seemed adequate for our needs until we explored Freebase. Freebase is a community-curated database of well-known people, places, and things. It consists of data from Wikipedia as well as other sources, for example biographies of people; the persons in Wikipedia are a subset of those in Freebase. Using Freebase effectively increases our dataset size, and hence we settled on it for our dataset.



WHO Life Table

SSA Life Table

For the life tables our possible data sources were the World Health Organization (WHO) life tables and the Social Security Agency (SSA) actuarial life tables. We have compared the two in the "Evaluation of Life Tables" section and justified our decision to use the SSA actuarial life tables for the life expectancy predictions.


Freebase as Data Source

Freebase classifies objects using an ontology. We are interested only in famous people, so the person class is the one useful in our case. Freebase has information on more than 2 million people. It was acquired by Google, and we use the Google APIs for Freebase to collect information from it.

Along with the APIs we use the Metaweb Query Language (MQL) to query the data we need from Freebase. MQL is similar to the Structured Query Language (SQL) and lets us filter queries so that only the necessary data is returned.

Figure below shows a block diagram of the information extractor we developed to gather our dataset. First we use MQL to make a query for all unique persons. We use the APIs to send this request and it returns to us unique IDs for about 1000 people at a time. We use an iterator to get 1000 IDs at a time until we have all the IDs. We collected 2,274,435 IDs. We then send a request to get information about each person using the person’s ID, one by one. We get fields like Name, Date of Birth, Date of Death (if dead), Age, Gender, Nationality, Cause of death, Profession, etc. Freebase imposes a limit of 100K queries per user per day, hence we used multiple user accounts over 2 weeks to collect information about each person.
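The ID-collection loop described above can be sketched as follows. The query shape and the `run_query` callable are hypothetical stand-ins (the Freebase MQL-read service has since been shut down), but the cursor-based paging mirrors what our extractor did:

```python
# Hypothetical sketch of the cursor-based paging loop we used against the
# Freebase MQL-read service (the service itself is no longer available).
PEOPLE_QUERY = [{"id": None, "type": "/people/person", "limit": 1000}]

def collect_ids(run_query):
    """Page through all /people/person IDs, about 1000 at a time.

    run_query(query, cursor) -> (rows, next_cursor); next_cursor is None
    when there are no more pages, mirroring MQL-read's cursor semantics.
    """
    ids, cursor = [], ""   # an empty cursor requests the first page
    while cursor is not None:
        rows, cursor = run_query(PEOPLE_QUERY, cursor)
        ids.extend(row["id"] for row in rows)
    return ids
```

With about 2.27 million IDs to collect and a 100K-query daily cap, the same loop was simply resumed under different user accounts over the two weeks.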


Size of Dataset

We have a dataset of 2,274,435 rows (people) in total. We have collected many fields (columns) for each person. In total our dataset has 63 columns. Some of them are Name, Date of Birth, Date of Death, Age, Birth Place, Nationality, Gender, Cause of death, Profession(s), Religion, Ethnicity, Spouse/Partner, Education, etc. Clearly, not all fields are relevant for models currently, but since we had almost no extra cost in collecting them we have included them. This will help us in our next steps in case we find interesting ways to use them in our models. We remove the (currently) irrelevant columns in the data filtering phase.

Data Cleaning

We first did some basic cleaning on this dataset. We deleted rows that do not contain a date of birth. For people classified under many professions, we kept one main profession. We also performed some other minor cleanup, such as removing null rows and random strings in numeric columns.

Preprocessing & Filtering

Next, we had to do a lot of preprocessing on our dataset before it could be used with our machine learning models.

Firstly, we needed a way to specify the profession a person belongs to numerically. For this, we made a new column for each profession that we are interested in modeling and marked a 1 there if the person belongs to that profession and a 0 if he/she does not.

Secondly, we needed a way to specify the nationality of a person numerically. For this we hashed each country to a unique numeric ID and created a new column for each person identifying the nationality id he/she belongs to.

Third, we needed a way to specify the gender numerically. We used 0 to represent a male and 1 in case of a female.
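A minimal sketch of these three encoding steps, using hypothetical column names on a toy pandas frame:

```python
import pandas as pd

# Toy frame with hypothetical column names standing in for our dataset.
df = pd.DataFrame({
    "profession":  ["musician", "scientist", "musician"],
    "nationality": ["USA", "UK", "USA"],
    "gender":      ["male", "female", "male"],
})

# 1) one 0/1 indicator column per profession of interest
for prof in ["musician", "scientist"]:
    df["is_" + prof] = (df["profession"] == prof).astype(int)

# 2) hash each country to a unique numeric ID
country_id = {c: i for i, c in enumerate(sorted(df["nationality"].unique()))}
df["nationality_id"] = df["nationality"].map(country_id)

# 3) gender: 0 for male, 1 for female
df["gender_num"] = (df["gender"] == "female").astype(int)
```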

Fourth, we decided to use the dead people in our dataset for training purposes, and hence filtered to include only dead people. The 2010 life tables would not yield good predictions for persons born very early in the timeline, say in 1800. Hence we further filtered to include only dead people born after the year 1890.

Finally, and most importantly: after seeing incorrect results from models that predict the year a person would die and/or the age at which he/she would die, and after discussing this with Professor Skiena [11], we decided to manipulate our dataset so that we predict the number of remaining years in a person's life. For this we took 4 different years into consideration, namely 1915, 1940, 1965 and 1990. For each person we calculate the remaining years from each of the aforementioned years. Thus we have the current age of the person at the year under consideration and the number of additional years the person lived from that age. Our code is robust enough to do this for more than the 4 years taken into consideration, if necessary. This process effectively increases the size of our dataset by up to 4 times.
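The expansion step can be sketched as follows, with hypothetical field names; each dead person contributes one (current age, remaining years) pair per reference year at which they were already born and still alive:

```python
REFERENCE_YEARS = [1915, 1940, 1965, 1990]

def expand_person(birth_year, death_year, ref_years=REFERENCE_YEARS):
    """One training row per reference year the person was alive at."""
    rows = []
    for year in ref_years:
        # born by, and not yet dead at, this reference year
        if birth_year <= year < death_year:
            rows.append({"current_age": year - birth_year,
                         "remaining_years": death_year - year})
    return rows
```

For someone born in 1900 who died in 1970, this yields three rows (at ages 15, 40 and 65), which is how the dataset grows by up to 4 times.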


We are quite satisfied with our data set: we have used an apt source given our project domain, and the data set is large enough for us to sample and develop machine learning based models.


Development Environment

We have heavily modularized our development and evaluation environment; Figure below shows a block diagram of it. Basically we run a single command, which trains and tests on our data, generates statistics for each model we have implemented, and reports which one performs best based on these statistics.

Figure 4: Development Environment

After the steps taken on our dataset described in the previous section we divided our data into train and test parts. We have used 80% of our data for training and 20% as test data for evaluation. After this we pass the training and test data as parameters to one of the many machine learning algorithms we use which we describe in the next section. Once the training is done, we generate a summary, which consists of the coefficients assigned by the machine learning algorithm and statistics about the training.
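The split-train-evaluate cycle above can be sketched like this on synthetic stand-in data (in our pipeline the feature matrix and target come from the preprocessing steps described earlier):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and target; in our pipeline these are the
# encoded person rows and the remaining-years column.
rng = np.random.default_rng(0)
X = rng.uniform(0, 90, size=(500, 3))
y = 90 - X[:, 0] + rng.normal(0, 5, size=500)   # toy "remaining years"

# 80% of the data for training, 20% held out as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
```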

We then evaluate the model on our test data, and the stats function generates statistics about the evaluation. It compares the test-data predictions of the remaining years of a person's life to the actual remaining years the person had. The evaluation statistics we report are:

The stats() function also generates a scatter plot comparing the actual and predicted values, as well as an error plot of the prediction errors. Another function then uses the model to make predictions of remaining years on our final dataset (the 32 persons given by Prof. Skiena).

Observations


1. Correlation Matrix



  • We observed a high negative correlation between the expected remaining years from the life tables (life_expectancy_rem) and the current age (curr_age) of a person. Similarly, we observed a high negative correlation between the actual remaining years (remaining_years) and current age (curr_age). This is intuitive.


  • More importantly, we observed a very high positive correlation between the expected remaining years from the life tables (life_expectancy_rem) and the actual remaining years (remaining_years). This shows that the life tables make good enough predictions, and we could potentially use them for our baseline predictions.
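The pattern in these two bullets can be reproduced on toy data: both remaining-years measures fall as current age rises, so they correlate negatively with age and positively with each other. The simulated columns below are stand-ins for our dataset's curr_age, life_expectancy_rem and remaining_years:

```python
import numpy as np

rng = np.random.default_rng(1)
curr_age = rng.uniform(20, 80, 1000)
# both measures are roughly (lifespan - current age) plus noise
life_expectancy_rem = 85 - curr_age + rng.normal(0, 2, 1000)
remaining_years = 85 - curr_age + rng.normal(0, 6, 1000)

# rows/columns: curr_age, life_expectancy_rem, remaining_years
corr = np.corrcoef([curr_age, life_expectancy_rem, remaining_years])
```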




2. Observation of Death Age on different Professions


  • We plotted a box plot to make observations on the death age of persons belonging to various professions.


  • Comparing the medians, we observe that persons in the music industry (rock stars) tend to have shorter lives than other famous persons, while scientists (including professors) tend to live longer.





3. Observation on Nationality



  • We plotted a data map to observe the life expectancy of the people in our dataset across different countries.


  • We observe that people from some African countries have shorter lifespans than others, while people from more developed countries such as the USA, Canada, and the U.K. tend to live longer.


Baseline model




In this section we will explain our baseline model.

Initially, we used the WHO life tables for our baseline model, but we are now using the SSA actuarial life tables instead. We also make predictions of the remaining life a person has, rather than the probability of death of the person.

The reason for this is presented in the "Evaluation of Life Tables" section below.

The baseline model gets the expected remaining life span of a person from the life tables, given the person's age and gender. We evaluated our baseline model; the figure shows the evaluation statistics.
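A minimal sketch of the baseline lookup; the two-age table below is made up for illustration, with the real numbers coming from the SSA 2010 actuarial table:

```python
# Made-up sample of a gender-keyed life table: age -> expected remaining years.
# Real entries come from the SSA 2010 actuarial life table.
LIFE_TABLE = {
    "male":   {60: 21.5, 70: 14.0},
    "female": {60: 24.5, 70: 16.3},
}

def baseline_remaining_years(gender, age):
    """Baseline prediction: expected remaining life span from the table."""
    return LIFE_TABLE[gender][age]

def baseline_death_age(gender, age):
    """Predicted age at death: current age plus expected remaining years."""
    return age + baseline_remaining_years(gender, age)
```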



Evaluation of Life Tables

  • A life table shows, for each age and gender, the probability that a person of that age will die before his or her next birthday. It also depicts survivorship and life expectancy at different ages.
WHO Life Table Evaluation on our data set

SSA Life Table Evaluation on our data set



  • Our initial baseline model assigned the probability of death from the life tables to each of the personalities based on their respective current age and gender. From the assigned probabilities, predictions were made as to which celebrity would kick the bucket first. This model used the probability of death obtained from the World Health Organization (WHO) life table. We evaluated these life tables and realized that the values obtained were not suitable for our needs. For example, the WHO life tables give a death probability of 1 for each person above the age of 100, which is clearly not appropriate; probably the WHO focuses more on young people than on modeling the risks that come with aging.

  • We then moved from the WHO life tables to the Actuarial Life Table for 2010 given by the Social Security Agency. These tables looked more like what an insurance company would use; for example, the probability of death for a 100-year-old male given by them is around 0.35. We evaluated these tables on our data set.

  • For each person in the dataset, we calculated the death probability at each integral age until the person died, or up to their current age if still alive, and flagged whether the person was alive or dead at each of those ages. The assigned probabilities were then grouped into 17 categories, ranging from less than 0.0002 all the way up to between 0.8 and 1. For each of the 17 probability categories, we counted the number of alive people, the number of dead people, and the total number of people, and computed the ratio of dead people to the total. Analyzing these ratios, we found that for the higher probability categories the ratio fell within the category's probability range, while for some categories there was a small deviation from the expected value.
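The binning-and-ratio check can be sketched as follows (the two-bin example stands in for our 17 probability categories):

```python
def death_ratio_by_bin(observations, bin_edges):
    """observations: (death_probability, died) pairs, with died 0 or 1.

    Returns {bin_index: (dead, total, dead/total)} for bins defined by
    consecutive [lo, hi) pairs in bin_edges.
    """
    counts = {}
    for p, died in observations:
        # find the bin whose [lo, hi) range contains this probability
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= p < bin_edges[i + 1]:
                dead, total = counts.get(i, (0, 0))
                counts[i] = (dead + died, total + 1)
                break
    return {i: (d, t, d / t) for i, (d, t) in counts.items()}
```

If the life table is well calibrated, the observed ratio in each bin should fall inside that bin's probability range, which is what we observed for the higher-probability bins.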


Advanced model


All the machine learning methods were implemented using either the Scikit-learn python library or the statsmodels python library. The Summary section shows the evaluation details (Performance Stats, Scatter Plot, Error Plot) of each algorithm. We have summarized all the machine learning algorithms below.


Linear Regression

We used Linear Regression (also called Ordinary Least Squares, a type of supervised learning method) on our dataset. Linear Regression fits a linear model with coefficients w = (w1, ..., wp) to minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.




Ridge Regression

We used Ridge Regression on our dataset. It addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. Here, α ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of α, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity. We have selected an α value of 1.
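The shrinkage effect is easy to see on toy collinear features: with α = 1 (the value we use), ridge keeps the coefficient pair small and balanced where OLS would leave it unstable. The data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two nearly identical columns: a classic collinearity trap for OLS.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
X = np.column_stack([x, x + rng.normal(0, 1e-4, 200)])
y = 3 * x + rng.normal(0, 0.5, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
coef_sum = ridge.coef_.sum()                     # recovers the total effect (about 3)
coef_gap = abs(ridge.coef_[0] - ridge.coef_[1])  # penalty splits it evenly
```

OLS would let the two coefficients take large opposite-signed values whose noise cancels; the ridge penalty removes that degree of freedom.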




K Nearest Neighbours (KNN)

Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors. The basic Nearest Neighbors Regression uses uniform weights; that is, each point in the local neighborhood contributes equally to the prediction for a query point.

Under some circumstances, it can be advantageous to weight points such that nearby points contribute more to the regression than faraway points. This can be accomplished through the weights keyword. The default value, weights = ’uniform’, assigns equal weights to all points. weights = ’distance’ assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied, which will be used to compute the weights.

We found k=10 to be the optimal value during our testing.
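A toy run of the regressor with k = 10 under both weighting schemes; the one-dimensional data is synthetic:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.5, 300)

# k = 10 with equal weights vs. inverse-distance weights
uniform = KNeighborsRegressor(n_neighbors=10, weights="uniform").fit(X, y)
distance = KNeighborsRegressor(n_neighbors=10, weights="distance").fit(X, y)

# both average the 10 nearest labels; near x = 5 the target is about 10
pred_uniform = uniform.predict([[5.0]])[0]
pred_distance = distance.predict([[5.0]])[0]
```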




Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.




Random Forest

A random forest is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
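The averaging idea can be demonstrated on toy data: a single fully grown tree memorizes the noise in its training sample, while a forest averaged over bootstrap sub-samples generalizes better:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve: enough structure to learn, enough noise to overfit.
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 400)
X_test = rng.uniform(0, 10, size=(2000, 1))
y_test = np.sin(X_test[:, 0]) + rng.normal(0, 0.3, 2000)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

tree_mae = np.mean(np.abs(tree.predict(X_test) - y_test))
forest_mae = np.mean(np.abs(forest.predict(X_test) - y_test))
```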




Summary of performance of ML models


In total, we evaluated 6 machine learning algorithms. Figure below shows the evaluation statistics for all the machine learning models, summarizes the performance of each algorithm, and highlights the best one based on the Mean Absolute Error (MAE). Ridge Regression performed the best.



Scatter Plot: Baseline Model

Error Plot: Baseline Model



Scatter Plot: Linear Regression

Error Plot: Linear Regression



Scatter Plot: Ridge Regression

Error Plot: Ridge Regression



Scatter Plot: K-Nearest Neighbour

Error Plot: K-Nearest Neighbour



Scatter Plot: Decision Tree

Error Plot: Decision Tree



Scatter Plot: Random forest

Error Plot: Random Forest



Overall, our models are better than the baseline (if at all) by only a small margin; our best model, ridge regression, achieved about a 12% improvement over the baseline. We understand that we need to improve our models significantly to achieve much better performance.



Final prediction


Final Report

Ridge Regression is currently the best model. The current predictions as per the ridge regression are shown in the left figure.

These predictions do look good to us for a few reasons.

Firstly, we observe that rock/music stars, and to some extent movie stars, are penalized by the model, which predicts they would have shorter lives than others.

Secondly, scientists are predicted to live a little longer than others, as seen with Stephen Hawking.




Calculator

Click Here to Calculate

IPython Notebooks


References

  1. Wikipedia Entry on Death Pool, http://en.wikipedia.org/wiki/Dead_pool
  2. Wikipedia Entry on Life Tables, http://en.wikipedia.org/wiki/Life_table
  3. Wikipedia Entry on Actuary, http://en.wikipedia.org/wiki/Actuary
  4. Predicting When You’ll Die, The Weather Channel, http://www.weather.com/health/predicting-your-death-20130306 (Accessed: October 21, 2014)
  5. Actuarial science, Wikipedia, https://en.wikipedia.org/wiki/Actuarial_science/ (Accessed: October 21, 2014)
  6. World Health Organization life tables, http://www.who.int/gho/mortality_burden_disease/life_tables/life_tables/en/
  7. Social Security Agency life tables, http://www.ssa.gov/oact/STATS/table4c6.html
  8. Notable Names Database, http://www.nndb.com/
  9. DBpedia, http://dbpedia.org/
  10. Freebase, https://www.freebase.com/
  11. Distinguished Teaching Professor Skiena, Computer Science, Stony Brook University, http://www3.cs.stonybrook.edu/~skiena/
  12. Scikit-learn: Machine Learning in Python, http://scikit-learn.org/stable/index.html
  13. Statsmodels: a Python module that allows users to explore data, estimate statistical models, and perform statistical tests, http://statsmodels.sourceforge.net/
  14. Mehta, Rajendra H., Toru Suzuki, Peter G. Hagan, Eduardo Bossone, Dan Gilon, Alfredo Llovet, Luis C. Maroto et al. ”Predicting death in patients with acute type A aortic dissection.” Circulation 105, no. 2 (2002): 200-206.
  15. Farr, Barry M., Andrew J. Sloman, and Michael J. Fisch. ”Predicting death in patients hospitalized for community-acquired pneumonia.” Annals of Internal Medicine 115.6 (1991): 428-436.
  16. Mortality and global health estimates, WHO, http://www.who.int/gho/mortality_burden_disease/en/ (Accessed: October 21, 2014)
  17. Social Security, Actuarial Life Table, http://www.ssa.gov/oact/STATS/table4c6.html (Accessed: October 21, 2014)
  18. NNDB: Tracking the entire world, http://www.nndb.com/ (Accessed: October 21, 2014)
  19. Wikimedia Foundation, Wikipedia, https://en.wikipedia.org/wiki/Wikipedia (Accessed: October 21, 2014)
  20. wiki.dbpedia.org: About, http://dbpedia.org/About (Accessed: October 21, 2014)
  21. DBpedia: Distributed Extraction, nileshc.com, http://nileshc.com/blog/2014/06/dbpedia_distributed_extraction/ (Accessed: October 21, 2014)
  22. Siddiqi, Ahmed. ”Age Likes Some Years.” Scientometrics Vol. 69.2 (2006): 315-21. Akadémiai Kiadó. Web.
  23. Vierck, E., and K. Hodges. Aging: Demographics, Health, and Health Services. Westport, Conn., Greenwood Press. (2003)
  24. De Beer, Joop. ”Smoothing and Projecting Age-specific Probabilities of Death by TOPALS.” Demographic Research Vol. 27.20 (2012): 543-92. Max Planck Institute for Demographic Research. Web.
  25. Arias, Elizabeth. ”United States Life Tables, 2008.” National Vital Statistics Reports Vol. 61.3 (2012). Web.
  26. Campbell, John, C. Diep, J. Reinken, and L. McCosh. ”Factors Predicting Mortality in a Total Population of the Elderly.” Journal of Epidemiology and Community Health (1985): 337-42. Web.
  27. De La Croix, David, and Omar Licandro. ”The Longevity of Famous People from Hammurabi to Einstein.” Barcelona GSE Working Paper Series. Web. (2012)
  28. Louis, D. Z., M. Robeson, J. McAna, et al. ”Predicting risk of hospitalisation or death: a retrospective population-based analysis.” (2014)