Ideally, inpatient healthcare services would be of high quality and readily available for all. However, several academic studies suggest that the quality of services may vary. For example, Swedish researchers have shown that inpatient services are less accessible during hospital overcrowding (i.e. when hospital beds are scarce) [1]. Moreover, it has been shown that outcomes in patients seeking emergency care at times of high Emergency Department utilization are worse than those of patients presenting at other times [2].
Emergency Department overcrowding has also been related to an increased incidence of in-hospital deaths [3, 4, 5]. Although not explicitly addressed in these studies, it is not unlikely that part of the increase in in-hospital mortality could be attributed to an admission bias: overcrowding in the Emergency Department is a symptom of hospital overcrowding, which in turn causes only the sickest patients to be admitted to a hospital bed [1].
Although connections have been established between the level of stress of the healthcare organization and patient outcomes (as above), there may be other, potentially time-varying confounders that impact patient outcomes. Few studies have addressed these, however. This is of academic as well as public interest, as such confounders may be potential targets for managerial decisions.
The data used for this project was gathered from an observational study on prospectively collected (but perturbed) data from a large academic hospital Emergency Department somewhere in Europe. The data material encompassed almost three years' worth of data. The study is unique in that it allowed for predicting in-hospital deaths using a granular set of variables, including disease severity (as expressed by a standardized triage algorithm with levels 1-4, where 1 indicates the most severe), an array of laboratory results (Troponin T, Lactate, CRP, White Blood Cells, Hemoglobin and pH) and various patient demographics (age, gender, etc.).
Initially, we considered two anomalies: the number of patients admitted and the proportion of those patients who then died on a particular day. An interesting question was whether we could detect days or time frames in which the data did not conform to a model of "normal behavior", i.e. days when more or fewer patients died than expected. Our initial data analysis and evaluation of rolling means led us to conclude that the proportion of patient deaths over time did not yield any meaningful anomalies, as the changes over time were small.
Because we were interested in analyzing this dataset specifically, we discussed moving away from detecting anomalies in the proportion of patients who were admitted or who died, and towards building classifiers to predict patient outcome and investigate the importance of time variables in our classifiers. Once we came up with interesting models, we continued with error analysis. Our models misclassified ~25%-30% of the patients, and we decided to examine the characteristics of those patients that were misclassified to understand where our classifier fails.
The ED dataset contains information on 34,812 patients, including laboratory test results and date and time of hospital admittance. Data exploration was done to determine and visualize the distributions of various characteristics of the patients.
from __future__ import division
import datetime
import fnmatch
import glob
import itertools
import pickle
import sqlite3
import time
import StringIO
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import scipy as sp
import matplotlib
import matplotlib.cm as cmx
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.pylab import rcParams
import seaborn as sns
import sklearn
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split, KFold
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.grid_search import RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier as DecisionTree
from sklearn.tree import export_graphviz
from statsmodels.tsa.stattools import adfuller
%matplotlib inline
rcParams['figure.figsize'] = 10, 6
#input the dataset
ed_dataset = pd.read_csv("ED_with_time_indicators_2nd.csv",sep=',')
ed_dataset.head(5)
ed_dataset.shape
ed_dataset.columns
Hospital admission information was converted to the type DateTime.
ed_dataset['ADMISSION_DATE'] = pd.to_datetime(ed_dataset['ADMISSION_DATE']) #changes string format to datetime
Check the data types to confirm that ADMISSION_DATE was converted from a string to datetime.
ed_dataset.dtypes
We are also interested in the proportion of hospital deaths over time, to see whether there are any patterns worth examining further.
###CREATING FRAME WITH DAY RESOLUTION AND PLOTTING TIME SERIES
ed_dataset['date'] = ed_dataset.ADMISSION_DATE.apply(lambda d: d.date()) #creating date in yyyy-mm-dd format to use later
ed_dataset['date'] = pd.to_datetime(ed_dataset['date'])
death_admits_per_day = ed_dataset.groupby("date")[["hospital_death"]].sum().reset_index()
visits_per_day = ed_dataset.groupby("date")[["Unnamed: 0"]].count().reset_index()
per_day_frame = pd.merge(visits_per_day, death_admits_per_day, on="date", how="left")
per_day_frame.rename(columns={'Unnamed: 0': 'number_visits', 'hospital_death': 'hospital_deaths'}, inplace=True)
per_day_frame["fraction_dead"] = per_day_frame["hospital_deaths"]/per_day_frame["number_visits"]
###TIME SERIES ANALYSIS###
def test_stationarity(timeseries, window):
    #Determine rolling statistics
    rolmean_fraction_of_admitted_dead = timeseries.fraction_dead.rolling(window=window).mean()
    rolstd_fraction_of_admitted_dead = timeseries.fraction_dead.rolling(window=window).std()
    #Plot original series together with rolling statistics:
    plt.plot(timeseries.date, timeseries.fraction_dead, color='green', label='Original proportion dead')
    plt.plot(timeseries.date, rolmean_fraction_of_admitted_dead, color='cyan', label='Rolling Mean proportion dead')
    plt.plot(timeseries.date, rolstd_fraction_of_admitted_dead, color='magenta', label='Rolling Std proportion dead')
    plt.legend()
    plt.title('Rolling Mean & Standard Deviation ' + str(window) + " days window")
    plt.show(block=False)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
test_stationarity(per_day_frame,30)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
test_stationarity(per_day_frame,90)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
test_stationarity(per_day_frame,180)
Based on this visualization, it does not appear that there are meaningful anomalies in the proportion dying over time, and anomaly detection on this series would amount to detecting noise. Instead, we will build classifiers for our dataset and analyze the importance of the time variables in this classifier building process.
Before we start building a classifier, we are interested in some of the underlying characteristics of the patients in our dataset. Data visualization would be an efficient way to examine this.
print 'number of patients: ' , ed_dataset.shape[0]
Generate table of statistics with max, min, range, mean of dataset values, including age, emergency department length of stay, and laboratory test results.
# Table of statistics:
cols = [23, 24, 27, 28, 29, 30, 31, 32, 34, 35, 36]  # column positions of age, ED length of stay and lab values
df_cols = ed_dataset.iloc[:, cols]  # select columns by position
# get column names from important columns
column_names = df_cols.columns.values
print column_names
# create pandas df with column names
stats = pd.DataFrame(columns=column_names)
#create rows 'max', 'min', 'range' and 'mean' and store the corresponding statistics
stats.loc['max'] = df_cols.max()
stats.loc['min'] = df_cols.min()
stats.loc['range'] = df_cols.max() - df_cols.min()
stats.loc['mean'] = df_cols.mean()
stats.head(n=5)
Create new dataframes with subsets of the original emergency department dataset. Use these new dataframes for visualization via pie charts and histograms.
women_ed = ed_dataset[ed_dataset['sex'] == 0]
men_ed = ed_dataset[ed_dataset['sex'] == 1]
death_ed = ed_dataset[ed_dataset['hospital_death'] == 1]
nodeath_ed = ed_dataset[ed_dataset['hospital_death'] == 0]
weekend_ed_deaths = ed_dataset[(ed_dataset['weekday'] == 0) &(ed_dataset['hospital_death'] == 1)]
weekend_ed_nodeaths = ed_dataset[(ed_dataset['weekday'] == 0) &(ed_dataset['hospital_death'] == 0)]
weekday_ed_deaths = ed_dataset[(ed_dataset['weekday'] == 1) & (ed_dataset['hospital_death'] == 1)]
weekday_ed_nodeaths = ed_dataset[(ed_dataset['weekday'] == 1) & (ed_dataset['hospital_death'] == 0)]
monday_ed_deaths = ed_dataset[(ed_dataset['day'] == 0) &(ed_dataset['hospital_death'] == 1)]
monday_ed_nodeaths = ed_dataset[(ed_dataset['day'] == 0) &(ed_dataset['hospital_death'] == 0)]
tf_ed_deaths = ed_dataset[(ed_dataset['day'] != 0) & (ed_dataset['hospital_death'] == 1)]
tf_ed_nodeaths = ed_dataset[(ed_dataset['day'] != 0) & (ed_dataset['hospital_death'] == 0)]
ed_1_deaths = ed_dataset[(ed_dataset['first_1_to_4_prio'] == 1) &(ed_dataset['hospital_death'] == 1)]
ed_1_nodeaths = ed_dataset[(ed_dataset['first_1_to_4_prio'] == 1) &(ed_dataset['hospital_death'] == 0)]
ed_2_deaths = ed_dataset[(ed_dataset['first_1_to_4_prio'] == 2) &(ed_dataset['hospital_death'] == 1)]
ed_2_nodeaths = ed_dataset[(ed_dataset['first_1_to_4_prio'] == 2) &(ed_dataset['hospital_death'] == 0)]
ed_3_deaths = ed_dataset[(ed_dataset['first_1_to_4_prio'] == 3) &(ed_dataset['hospital_death'] == 1)]
ed_3_nodeaths = ed_dataset[(ed_dataset['first_1_to_4_prio'] == 3) &(ed_dataset['hospital_death'] == 0)]
ed_4_deaths = ed_dataset[(ed_dataset['first_1_to_4_prio'] == 4) &(ed_dataset['hospital_death'] == 1)]
ed_4_nodeaths = ed_dataset[(ed_dataset['first_1_to_4_prio'] == 4) &(ed_dataset['hospital_death'] == 0)]
Pie charts showing the proportion of men and women in the ED dataset, and the proportions of hospital deaths and hospital survival. 3% of all patients who were admitted to the hospital from the emergency department died during their hospital stay. Therefore, this dataset is unbalanced.
# Pie Charts:
# Men vs. Women
men = men_ed.shape[0]
women = ed_dataset.shape[0] - men
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(121)
ax1.pie([men, women],
labels=['men: ' + str(men), 'women: ' + str(women)],
colors=['yellowgreen', 'lightskyblue'],
autopct='%1.1f%%', shadow=True, startangle=90)
# Death vs. No Death
death = death_ed.shape[0]
no_death = ed_dataset.shape[0] - death
ax2 = fig.add_subplot(122)
ax2.pie([death, no_death],
labels=['hospital deaths: ' + str(death), 'hospital survival: ' + str(no_death)],
colors=['lightsteelblue', 'lightsalmon'],
autopct='%1.1f%%', shadow=True, startangle=90)
plt.show()
#fig.savefig('piesa.png')
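A quick illustration of why this imbalance matters: a trivial classifier that always predicts "no death" is already about 97% accurate while identifying zero deaths, which is why per-class metrics are used later on. The numbers below are synthetic, chosen to match the ~3% death proportion:

```python
import numpy as np

# Toy labels matching the ~3% hospital-death proportion (illustrative numbers)
n_patients = 34812
n_deaths = int(round(0.03 * n_patients))
y = np.zeros(n_patients, dtype=int)
y[:n_deaths] = 1

# A classifier that always predicts "no death" is ~97% accurate
# while catching none of the deaths, so raw accuracy is misleading here
majority_pred = np.zeros(n_patients, dtype=int)
accuracy = np.mean(majority_pred == y)
print("Majority-class accuracy: %.3f" % accuracy)
```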
The pie charts given below display, within subsets of the original dataset, the proportion of patients who died in the hospital alongside the proportion of patients who did not. These proportions are shown for patients admitted on a weekday, on a weekend, on a Monday, and on a weekday other than Monday. Overall, the proportion of hospital deaths remained around 3% for each of these categories of hospital admittance.
fig = plt.figure(figsize=(10, 10))
# Weekend deaths vs. Weekday deaths
weekend = weekend_ed_nodeaths.shape[0]
weekend_death = weekend_ed_deaths.shape[0]
ax1 = fig.add_subplot(221)
ax1.pie([weekend, weekend_death],
labels=['hospital survival: ' + str(weekend), 'hospital deaths: ' + str(weekend_death)],
colors=['dodgerblue', 'tomato'],
autopct='%1.1f%%', shadow=True, startangle=90)
plt.title("Weekend", fontsize = 15)
weekday = weekday_ed_nodeaths.shape[0]
weekday_death = weekday_ed_deaths.shape[0]
ax2 = fig.add_subplot(222)
ax2.pie([weekday, weekday_death],
labels=['hospital survival: ' + str(weekday), 'hospital deaths: ' + str(weekday_death)],
colors=['dodgerblue', 'tomato'],
autopct='%1.1f%%', shadow=True, startangle=90)
plt.title("Weekday", fontsize = 15)
# Monday vs. Tuesday - Friday
monday = monday_ed_nodeaths.shape[0]
monday_death = monday_ed_deaths.shape[0]
ax3 = fig.add_subplot(223)
ax3.pie([monday, monday_death],
labels=['hospital survival: ' + str(monday), 'hospital deaths: ' + str(monday_death)],
colors=['dodgerblue', 'tomato'],
autopct='%1.1f%%', shadow=True, startangle=90)
plt.title("Monday", fontsize = 15)
tf = tf_ed_nodeaths.shape[0]
tf_death = tf_ed_deaths.shape[0]
ax4 = fig.add_subplot(224)
ax4.pie([tf, tf_death],
labels=['hospital survival: ' + str(tf), 'hospital deaths: ' + str(tf_death)],
colors=['dodgerblue', 'tomato'],
autopct='%1.1f%%', shadow=True, startangle=90)
plt.title("Tuesday - Friday", fontsize = 15)
plt.show()
#fig.savefig('piesb.png')
The pie charts given below display the breakdown of severity of disease among patients who survived the hospital visit and patients who did not. Severity of disease is separated into four categories, and this number is assigned to each patient upon evaluation in the emergency department. One is the most acute and four is the least acute.
# Pie Charts:
# Severity of Disease
one = ed_1_nodeaths.shape[0]
one_death = ed_1_deaths.shape[0]
two = ed_2_nodeaths.shape[0]
two_death = ed_2_deaths.shape[0]
three = ed_3_nodeaths.shape[0]
three_death = ed_3_deaths.shape[0]
four = ed_4_nodeaths.shape[0]
four_death = ed_4_deaths.shape[0]
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(121)
ax1.pie([one, two, three, four],
labels=['one (most acute): ' + str(one), 'two: ' + str(two), 'three: ' + str(three), 'four: ' + str(four)],
colors=['tomato', 'gold', 'mediumseagreen', 'deepskyblue'],
autopct='%1.1f%%', shadow=True, startangle=90)
plt.title("Hospital Survival: Severity of Disease", fontsize = 15)
ax2 = fig.add_subplot(122)
ax2.pie([one_death, two_death, three_death, four_death],
labels=['one (most acute): ' + str(one_death), 'two: ' + str(two_death), 'three: ' + str(three_death), 'four: ' + str(four_death)],
colors=['tomato', 'gold', 'mediumseagreen', 'deepskyblue'],
autopct='%1.1f%%', shadow=True, startangle=90)
plt.title("Hospital Deaths: Severity of Disease", fontsize = 15)
plt.show()
#fig.savefig('piesc.png')
Histograms were then plotted to visualize the distribution of various characteristics of patients, separated by hospital survival and death. Histograms were then overlaid in order to better visualize and present the data and the differences between the distributions.
# Adjusted Individual Histograms
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(10, 10))
ax1.hist(nodeath_ed['EDLOS'].values, bins=50, normed=1,
facecolor="green", alpha=0.75)
ax2.hist(death_ed['EDLOS'].values, bins=50, normed=1,
facecolor='red', alpha=0.75)
ax1.set_title('Hospital Survival: ED Length of Stay')
ax2.set_title('Hospital Deaths: ED Length of Stay')
ax1.set_xlabel('hours')
ax1.set_ylabel('frequency')
ax1.set_xlim([0, 20])
ax2.set_xlabel('hours')
ax2.set_ylabel('frequency')
ax2.set_xlim([0, 20])
ax3.hist(nodeath_ed['IPLOS'].values, bins=50, normed=1,
facecolor="green", alpha=0.75)
ax4.hist(death_ed['IPLOS'].values, bins=50, normed=1,
facecolor='red', alpha=0.75)
ax3.set_title('Hospital Survival: In-Patient Length of Stay')
ax3.set_xlabel('hours')
ax3.set_ylabel('frequency')
ax3.set_xlim([0, 1400])
ax4.set_title('Hospital Deaths: In-Patient Length of Stay')
ax4.set_xlabel('hours')
ax4.set_ylabel('frequency')
ax4.set_xlim([0, 1400])
ax5.hist(nodeath_ed['Age'].values, bins=50, normed=1,
facecolor="green", alpha=0.75)
ax6.hist(death_ed['Age'].values, bins=50, normed=1,
facecolor='red', alpha=0.75)
ax5.set_title('Hospital Survival: Age')
ax5.set_xlabel('years')
ax5.set_ylabel('frequency')
ax5.set_xlim([0, 110])
ax6.set_title('Hospital Deaths: Age')
ax6.set_xlabel('years')
ax6.set_ylabel('frequency')
ax6.set_xlim([0, 110])
plt.tight_layout()
plt.show()
#fig.savefig('histc.png')
# Overlaid Histograms
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.hist(nodeath_ed['EDLOS'].values, bins=70, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax1.hist(death_ed['EDLOS'].values, bins=40, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax1.set_title('Emergency Department Length of Stay', fontsize='x-large')
ax1.set_xlabel('hours', fontsize='large')
ax1.set_ylabel('frequency', fontsize='large')
ax1.set_xlim([0, 20])
ax1.legend(loc='best')
ax2.hist(nodeath_ed['IPLOS'].values, bins=105, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax2.hist(death_ed['IPLOS'].values, bins=30, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax2.set_title('In-Patient Length of Stay', fontsize='x-large')
ax2.set_xlabel('hours', fontsize='large')
ax2.set_ylabel('frequency', fontsize='large')
ax2.set_xlim([0, 1400])
ax2.legend(loc='best')
ax3.hist(nodeath_ed['Age'].values, bins=50, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax3.hist(death_ed['Age'].values, bins=50, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax3.set_title('Patient Age', fontsize='x-large')
ax3.set_xlabel('years', fontsize='large')
ax3.set_ylabel('frequency', fontsize='large')
ax3.set_xlim([0, 110])
ax3.legend(loc='best')
plt.tight_layout()
plt.show()
#fig.savefig('histb.png')
More histograms were plotted to visualize the distributions of the various laboratory test results among patients who died during their hospital stay and those who did not.
# Histograms of lab values
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize= (15, 10))
# Missing Values -- Remove outliers from lab values
# Write over this variable for every lab value
# Troponin-T : > 2000 [25 observations]
dummy_death = ed_dataset[(ed_dataset['TroponinT'] < 2000) & (ed_dataset['hospital_death'] == 1)]
dummy_nodeaths = ed_dataset[(ed_dataset['TroponinT'] < 2000) & (ed_dataset['hospital_death'] == 0)]
ax1.hist(dummy_nodeaths['TroponinT'].values, bins=50, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax1.hist(dummy_death['TroponinT'].values, bins=50, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax1.set_title('Biomarker Troponin-T Levels', fontsize='x-large')
ax1.set_xlabel('units', fontsize='large')
ax1.set_ylabel('frequency', fontsize='large')
ax1.set_xlim([0, 400])
ax1.legend(loc='best')
# Lactate : > 20 [13 observations]
dummy_death = ed_dataset[(ed_dataset['Lactate'] < 20) & (ed_dataset['hospital_death'] == 1)]
dummy_nodeaths = ed_dataset[(ed_dataset['Lactate'] < 20) & (ed_dataset['hospital_death'] == 0)]
ax2.hist(dummy_nodeaths['Lactate'].values, bins=50, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax2.hist(dummy_death['Lactate'].values, bins=50, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax2.set_title('Biomarker Lactate Levels', fontsize='x-large')
ax2.set_xlabel('units', fontsize='large')
ax2.set_ylabel('frequency', fontsize='large')
ax2.set_xlim([0, 15])
ax2.legend(loc='best')
# Hemoglobin : No change
ax3.hist(nodeath_ed['Hemoglobin'].values, bins=50, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax3.hist(death_ed['Hemoglobin'].values, bins=50, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax3.set_title('Biomarker Hemoglobin Levels', fontsize='x-large')
ax3.set_xlabel('units', fontsize='large')
ax3.set_ylabel('frequency', fontsize='large')
#ax3.set_xlim([0, 110])
ax3.legend(loc='best')
# CRP
dummy_death = ed_dataset[(ed_dataset['CRP'] < 9000) & (ed_dataset['hospital_death'] == 1)]
dummy_nodeaths = ed_dataset[(ed_dataset['CRP'] < 9000) & (ed_dataset['hospital_death'] == 0)]
ax4.hist(dummy_nodeaths['CRP'].values, bins=50, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax4.hist(dummy_death['CRP'].values, bins=50, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax4.set_title('Biomarker CRP Levels', fontsize='x-large')
ax4.set_xlabel('units', fontsize='large')
ax4.set_ylabel('frequency', fontsize='large')
ax4.set_xlim([0, 400])
ax4.legend(loc='best')
# WBC
dummy_death = ed_dataset[(ed_dataset['WBC'] < 9000) & (ed_dataset['hospital_death'] == 1)]
dummy_nodeaths = ed_dataset[(ed_dataset['WBC'] < 9000) & (ed_dataset['hospital_death'] == 0)]
ax5.hist(dummy_nodeaths['WBC'].values, bins=110, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax5.hist(dummy_death['WBC'].values, bins=50, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax5.set_title('White Blood Cell Count', fontsize='x-large')
ax5.set_xlabel('Count', fontsize='large')
ax5.set_ylabel('frequency', fontsize='large')
ax5.set_xlim([0, 80])
ax5.legend(loc='best')
# pH
dummy_death = ed_dataset[(ed_dataset['pH'] < 9) & (ed_dataset['hospital_death'] == 1)]
dummy_nodeaths = ed_dataset[(ed_dataset['pH'] < 9) & (ed_dataset['hospital_death'] == 0)]
ax6.hist(dummy_nodeaths['pH'].values, bins=50, normed=1,
facecolor="green", alpha=0.65, label='Survival')
ax6.hist(dummy_death['pH'].values, bins=50, normed=1,
facecolor='red', alpha=0.35, label='Death')
ax6.set_title('pH', fontsize='x-large')
ax6.set_xlabel('pH', fontsize='large')
ax6.set_ylabel('frequency', fontsize='large')
ax6.set_xlim([6.5, 7.8])
ax6.legend(loc='best')
plt.tight_layout()
plt.show()
#fig.savefig('histd.png')
In a clinical setting, laboratory values are entered into the electronic health record (EHR) system manually by laboratory staff. If there is an error in the analysis process, the laboratory staff enters an error message into the system; such values were replaced with NaN in this dataset for study purposes. Some patients also lacked entries for laboratory values altogether. This is relatively common, as not all patients have all lab values analyzed (i.e. there must be a rationale or an indication for drawing a certain laboratory test). Hence, cases with no entry for a particular laboratory test in the EHR database were considered not to have needed that test, and were therefore assumed to have a low probability of the disease that the test is used to detect. Such missing cases had a random value in the normal range of the lab parameter imputed. Both of these types of missing values were encountered for all six variables describing laboratory results:
No lab value registered: Troponin T (29,059/40,000), Lactate (12,679/40,000), CRP (13,160/40,000), White Blood Cells (13,206/40,000), Hemoglobin (7,836/40,000), pH (9,565/40,000)
Error registered: Troponin T (167/40,000), Lactate (25/40,000), CRP (18/40,000), White Blood Cells (46/40,000), Hemoglobin (67/40,000), pH (33/40,000)
In the case of disease severity (expressed by the triage algorithm), missing values were indicated as a separate category. No missing values were encountered for the other variables included in the study. No service interruptions were reported for the EHR system during the study period.
first_1_to_4_prio (780/40000)
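The imputation strategy described above can be sketched as follows. The toy frame stands in for the real dataset, and the normal range used here (120-160 g/L for hemoglobin) is an assumed reference range for illustration only; the study's actual ranges are not stated:

```python
import numpy as np
import pandas as pd

# Toy frame with missing hemoglobin values; the normal range below
# (120-160 g/L) is an assumed reference range for illustration only
rng = np.random.RandomState(42)
df = pd.DataFrame({"Hemoglobin": [135.0, np.nan, 128.0, np.nan, 151.0]})

low, high = 120.0, 160.0  # assumed normal range
missing = df["Hemoglobin"].isnull()
# replace each missing value with a random draw from the normal range
df.loc[missing, "Hemoglobin"] = rng.uniform(low, high, size=missing.sum())
print(df)
```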
#READING IN THE DATASET
ed_data = pd.read_csv("ED_with_time_indicators_2nd.csv", delimiter=',')
#changing data-type of time variable to datetime
ed_data.ADMISSION_DATE = pd.to_datetime(ed_data['ADMISSION_DATE'])
ed_data = ed_data.drop("admission_datetime",axis=1) #dropping unnecessary variable
for var in ed_data.columns.values: #changing dtype of other variables to float
    if var != "ADMISSION_DATE":
        ed_data[var] = ed_data[var].astype(float)
#Investigating missing values
for var in ed_data.columns.values:
    print "#####", var, "-----number missing", ed_data[var].isnull().sum()
###PREPARING TO CREATE CORRELATION MATRIX####
ed_data_outcomes = ed_data.iloc[:,-1:].values.ravel()
ed_data_features = ed_data.drop(["hospital_death"],axis=1)
ed_data_features_wo_timevar = ed_data_features.drop(["ADMISSION_DATE"],axis=1)
x = np.asarray(ed_data_features)
x2_ = np.asarray(ed_data_features_wo_timevar)
y = ed_data_outcomes
print x.shape, x2_.shape, y.shape
corr_matrix = np.corrcoef(x2_.T)
corrmin, corrmax = -np.abs(corr_matrix).max(), np.abs(corr_matrix).max()
fig, ax = plt.subplots(1, 1)
ax.pcolor(corr_matrix, cmap=matplotlib.cm.Blues, vmin=corrmin, vmax=corrmax)
ax.set_title('Heatmap of ED dataset correlation matrix')
ax.set_xlim((0,35))
ax.set_ylim((0,35))
plt.show()
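As a possible follow-up to the heatmap, the strongest pairwise correlations can be read off the correlation matrix directly. A small sketch on toy data; the column names are hypothetical, not the dataset's actual columns:

```python
import numpy as np

# Toy features: the first two are strongly correlated by construction;
# the column names are hypothetical, for illustration only
rng = np.random.RandomState(1)
base = rng.randn(200)
features = np.column_stack([base, base + 0.1 * rng.randn(200), rng.randn(200)])
names = ["CRP", "WBC", "Age"]

corr = np.corrcoef(features.T)
iu = np.triu_indices_from(corr, k=1)   # upper triangle, excluding the diagonal
order = np.argsort(-np.abs(corr[iu]))  # pairs sorted by |correlation|, strongest first
for idx in order:
    i, j = iu[0][idx], iu[1][idx]
    print("%s vs %s: r = %.2f" % (names[i], names[j], corr[i, j]))
```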
This concludes our preliminary data exploration and visualization.
Now, we are interested in building a classifier to predict an unfortunate outcome (hospital death). We first try to exclude any time variables, assuming that the time of admission into the hospital is not a good predictive factor for hospital death. We can then compare performance against a classifier built including the time variables in the training set to see if there is a dramatic difference in performance.
total_data = pd.read_csv('ED_EDUCATIONAL.csv')
For this project, we will only consider those in the dataset who are adults. This is because children and adults can differ in biomarker levels, and children are likely to be treated with extra caution upon presentation to the emergency room (and thus become hospitalized). The age has been perturbed, i.e. a fixed number has been subtracted from the true age. This fixed number is ~3 (as per Mathias), so the adults are people with age >= 15. This will remove 4,212 observations from our dataset.
# How many in this dataset are children? We want to work on subset of dataframe that is adults.
np.sum(total_data['Age'] < 15)
Subset the data to only include adults.
adults_data = total_data.loc[total_data['Age'] >= 15]
adults_data.head()
print adults_data.shape
This leaves us with 35,788 observations to work with.
Convert admission date to datetime object.
# convert to datetime object
adults_data['ADMISSION_DATE'] = pd.to_datetime(adults_data['ADMISSION_DATE'])
type(adults_data['ADMISSION_DATE'][0])
adults_data.drop(adults_data.columns[[0]], axis=1, inplace=True)
In our final step of data cleaning, we exclude points that have lab test values of 9000 or greater. These high values were entered because some error in the testing process prevented accurate results from being obtained (e.g. coagulation in the test tube). We see from the description of our data file above that there are four variables with maximum values exceeding 9000: TroponinT, CRP, WBC, and pH. Thus, we will subset the data to only the adults with values < 9000 for these lab tests.
adults_data = adults_data[adults_data["TroponinT"] < 9000]
adults_data = adults_data[adults_data["CRP"] < 9000]
adults_data = adults_data[adults_data["WBC"] < 9000]
adults_data = adults_data[adults_data["pH"] < 9000]
adults_data.describe()
This leaves us with 34,812 adults to start our analysis/ classifier building.
adults_data.head()
We split our cleaned data into predictors and response (hospital death). Printing out corresponding shapes allows us to double check that we split the dataframe correctly. We then split the data into training and test sets, using 65% of our data for training and 35% of our data for testing.
## Split into train, test
# predictors = adults_data.iloc[:, 1:-4].values
predictors = adults_data.values[:, 1:-1]
print predictors.shape #Check correct shape
print predictors[0][0], predictors[0][-1] #Check we got the correct columns
# response = adults_data.iloc[:, -4]
response = adults_data.values[:, -1]
print response.shape # Check this is the correct column
print response[:10] # Check this is the correct column
x_train, x_test, y_train, y_test = train_test_split(predictors, response, train_size = .65)
print 'Predictors:', x_train.shape, x_test.shape
print 'Response:', y_train.shape, y_test.shape
# Compare proportions of hospital death in training and test
# Small, but should be equal
print 'Proportion of hospital deaths in training set:', np.sum(y_train)/ len(y_train)
print 'Proportion of hospital deaths in test set:', np.sum(y_test)/ len(y_test)
The outcome we are trying to build a classifier for, hospital deaths, happens in only approximately 3% of our dataset. We have quite an unbalanced dataset, and we suspect that this will create certain problems in the model building process.
# classifiers need integer labels
y_train = y_train.astype(int)
y_test = y_test.astype(int)
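For reference, the class_weight='balanced' option used below weights each class by n_samples / (n_classes * n_class_samples), so on a ~3%-positive outcome like ours the death class receives roughly 16 times the weight of the survival class. A small check on toy labels, using sklearn's compute_class_weight utility:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 3% positive rate, mirroring the hospital-death proportion
y = np.array([0] * 97 + [1] * 3)

# 'balanced' weights are n_samples / (n_classes * n_class_samples):
# class 0 -> 100 / (2 * 97) ~ 0.52, class 1 -> 100 / (2 * 3) ~ 16.67
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)
```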
I start with some baseline models: unweighted logistic regression, weighted logistic regression, LDA, QDA, a decision tree, and a random forest. I decided not to include kNN or SVM because of computing time. For each model, I print out a classification report that gives the precision and recall on each class.
class_labels = ['No Hospital Death', 'Hospital Death']
unweighted_logistic = LogisticRegression()
unweighted_logistic.fit(x_train, y_train)
unweighted_log_y_pred = unweighted_logistic.predict(x_test)
print('Unweighted Logistic')
print(classification_report(y_test, unweighted_log_y_pred, target_names = class_labels))
print('\n')
weighted_logistic = LogisticRegression(class_weight='balanced')
weighted_logistic.fit(x_train, y_train)
weighted_log_y_pred = weighted_logistic.predict(x_test)
print('Weighted Logistic')
print(classification_report(y_test, weighted_log_y_pred, target_names = class_labels))
print('\n')
lda = LDA()
lda.fit(x_train, y_train)
lda_y_pred = lda.predict(x_test)
print('LDA')
print(classification_report(y_test, lda_y_pred, target_names = class_labels))
print('\n')
qda = QDA()
qda.fit(x_train, y_train)
qda_y_pred = qda.predict(x_test)
print('QDA')
print(classification_report(y_test, qda_y_pred, target_names = class_labels))
print('\n')
tree = DecisionTree(max_depth=6)
tree.fit(x_train, y_train)
tree_y_pred = tree.predict(x_test)
print('Decision Tree')
print(classification_report(y_test, tree_y_pred, target_names = class_labels))
print('\n')
rf = RandomForest()
rf.fit(x_train, y_train)
rf_y_pred = rf.predict(x_test)
print('Random Forest')
print(classification_report(y_test, rf_y_pred, target_names = class_labels))
print('\n')
Based on the results from these baseline models, I decide to explore weighted logistic regression and decision trees/random forests. The baseline weighted logistic regression model yielded the highest recall on hospital deaths, so I have a hunch that this will be the most useful model.
However, while decision trees and random forests yielded very low recall on hospital deaths, the decision tree structure might represent how a clinician carries out decisions. I am interested in which variables the decision tree treats as important, so I will first quickly explore that. Since I'm working with the trees, I might as well try to tune them to see if they can perform any better on class 1.
Then, I will move on to working on improving the weighted logistic regression model.
# Explore the simple decision tree model
def display_dt(model):
    dummy_io = StringIO.StringIO()
    export_graphviz(model, out_file=dummy_io)
    print dummy_io.getvalue()
simple_tree = DecisionTree(max_depth=6)
simple_tree.fit(x_train, y_train)
display_dt(simple_tree)
The first node in the decision tree above is blood pH, with threshold value <= 7.255. Medically, a blood pH of 7.2 requires urgent action. If blood pH is <= 7.255, the next node is IPLOS, or inpatient length of stay; if blood pH > 7.255, the next node is age.
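Rather than reading the graphviz dump, the root split can also be pulled from the fitted tree object directly. A self-contained sketch on synthetic data, where column 0 plays the role of blood pH; the data and threshold here are illustrative, not the study's:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: column 0 plays the role of blood pH (illustrative values)
rng = np.random.RandomState(0)
X = rng.uniform(6.8, 7.6, size=(500, 2))
y = (X[:, 0] <= 7.255).astype(int)  # outcome driven entirely by low "pH"

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
root_feature = tree.tree_.feature[0]      # feature index used at the root
root_threshold = tree.tree_.threshold[0]  # split threshold at the root
print("root splits on feature %d at %.3f" % (root_feature, root_threshold))
```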
Might as well try to tune the trees to see if accuracy on class 1 improves! First, we try tuning the depth of the tree. We also try an adaptive-boosting method, in which the results from many trees are combined in the final output.
# try different depths, get precision / recall for each class
depths = range(2, 10)
for depth in depths:
    # fit simple tree
    print("Tree")
    dt = DecisionTree(max_depth=depth)
    dt.fit(x_train, y_train)
    print depth
    print classification_report(y_test, dt.predict(x_test), target_names=class_labels)
    print '\n'
    print("Adaboosted tree")
    # fit adaboosted tree
    bdt = AdaBoostClassifier(DecisionTree(max_depth=depth))
    bdt.fit(x_train, y_train)
    print classification_report(y_test, bdt.predict(x_test), target_names=class_labels)
I'm not impressed by the performance of the decision tree on accurately classifying class 1, even after trying out various depths and implementing adaptive boosting. Maybe a random forest, a cohort of decision trees, will give us a better classifier.
I will implement some random forests and tune them.
score = lambda model, x_test, y_test: pd.Series([model.score(x_test, y_test),
model.score(x_test[y_test==0], y_test[y_test==0]),
model.score(x_test[y_test==1], y_test[y_test==1])],
index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])
# Tune the random forests: what is the optimal number of trees?
ntrees = range(1, 20)
kf = KFold(n_splits=5)
class1_accuracies = []
for n in ntrees:
    print('Trees:', n)
    accuracies = []
    for train_index, test_index in kf.split(x_train):
        x_validate_train, x_validate_test = x_train[train_index], x_train[test_index]
        y_validate_train, y_validate_test = y_train[train_index], y_train[test_index]
        # Use the best depth found in the depth tuning above
        rf = RandomForest(n_estimators=n, max_depth=8)
        rf.fit(x_validate_train, y_validate_train)
        scores = score(rf, x_validate_test, y_validate_test)
        accuracies.append(scores['accuracy on class 1'])
    class1_accuracies.append(np.mean(accuracies))
print(class1_accuracies)
best_tree = np.argmax(class1_accuracies) + 1
print('Best forest has this many trees:', best_tree)
The random forest that gives the best accuracy on class 1 has two trees, and even then, it gives a 7.7% accuracy on classifying class 1. Given these disappointing results, it's time to move on to tuning the weighted logistic regression classifier.
First, I try a weighted logistic regression with cross-validation to tune for best C, or regularization parameter. The regularization parameter will help avoid overfitting. Smaller values specify stronger regularization.
logistic = LogisticRegressionCV(Cs = 10, class_weight='balanced')
logistic.fit(x_train, y_train)
score(logistic, x_test, y_test)
Off the bat, this is not a bad place to start: we have 78.6% accuracy on class 1! We will try some more values of C.
logistic = LogisticRegressionCV(Cs = [1, 10, 100, 1000], class_weight='balanced')
logistic.fit(x_train, y_train)
print(score(logistic, x_test, y_test))
print(classification_report(y_test, logistic.predict(x_test)))
conf_mat = confusion_matrix(y_test, logistic.predict(x_test))
# Create dataframe from confusion matrix
conf_df = pd.DataFrame(conf_mat, columns=['y_hat=0', 'y_hat=1'], index=['y=0', 'y=1'])
print(conf_df)
I try an AdaBoosted implementation to see whether it improves on our cross-validated logistic regression model.
logistic_boost = AdaBoostClassifier(LogisticRegressionCV(Cs = [1, 10, 100, 1000], class_weight='balanced'))
logistic_boost.fit(x_train, y_train)
score(logistic_boost, x_test, y_test)
The best way to improve our weighted logistic regression classifier for such a rare outcome is to change the threshold probability at which the classifier predicts hospital death. By default, if the predicted probability of an individual is > 0.5, then the classifier will classify that individual as a hospital death. However, we might be able to capture more true hospital deaths by lowering this threshold.
#-------- thresholded_accuracy
# From CS109 Challenge homework
# A function to evaluate accuracy of LogReg model with threshold 't'
# Input:
# model (LogReg model)
# t (threshold)
# x (n x d array of predictors)
# y (n x 1 array of binary response variable vals)
def thresholded_accuracy(model, t, x, y):
    # Get the class probabilities rather than class predictions from log reg
    y_prob = model.predict_proba(x)[:, 1]
    # Make the class predictions based on threshold
    y_pred = np.zeros((x.shape[0], 1)).reshape(-1, )
    # If P(y=1|x) > threshold t then we set the label to 1: y = 1
    y_pred[y_prob > t] = 1
    # Compute accuracy of our predictions
    accuracy = np.mean(y_pred == y)
    return accuracy
logistic = LogisticRegressionCV(Cs = [1, 10, 100, 1000], class_weight='balanced')
logistic.fit(x_train, y_train)
thresholds = [0.05, 0.10, 0.15, 0.2, 0.25, 0.30, 0.35, 0.40, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.55]
for threshold in thresholds:
    print('\n')
    print(threshold)
    print('Train set:\n\t total accuracy:', thresholded_accuracy(logistic, threshold, x_train, y_train),
          '\n\t accuracy on class 0:', thresholded_accuracy(logistic, threshold, x_train[y_train==0, :],
                                                            y_train[y_train==0]),
          '\n\t accuracy on class 1:', thresholded_accuracy(logistic, threshold, x_train[y_train==1, :],
                                                            y_train[y_train==1]))
    print('Test set:\n\t total accuracy:', thresholded_accuracy(logistic, threshold, x_test, y_test),
          '\n\t accuracy on class 0:', thresholded_accuracy(logistic, threshold, x_test[y_test==0, :],
                                                            y_test[y_test==0]),
          '\n\t accuracy on class 1:', thresholded_accuracy(logistic, threshold, x_test[y_test==1, :],
                                                            y_test[y_test==1]))
We can see the trade-offs of different thresholds. An extremely low threshold classifies most points as hospital deaths, so we will likely capture all true hospital deaths, but at the cost of many more false positives and a reduced ability to accurately predict hospital survival.
The threshold p = 0.45 seems to hit a good sweet spot of overall accuracy and high accuracy on class 1 (hospital death).
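The same trade-off can be summarized with a confusion matrix computed at the custom threshold instead of the default 0.5. A small sketch, with made-up probabilities standing in for `logistic.predict_proba(x_test)[:, 1]`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up probabilities and labels; in the notebook these come from
# logistic.predict_proba(x_test)[:, 1] and y_test.
probs = np.array([0.10, 0.30, 0.44, 0.46, 0.60, 0.90])
truth = np.array([0, 0, 1, 0, 1, 1])

threshold = 0.45
preds = (probs > threshold).astype(int)  # 1 = predicted hospital death

# Rows are the true class, columns the predicted class
cm = confusion_matrix(truth, preds)
print(cm)  # [[2 1]
           #  [1 2]]
```

Lowering the threshold moves mass from the false-negative cell (bottom left) to the false-positive cell (top right), which is exactly the trade we want for a rare, high-stakes outcome.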
After this exploration of model building, I decide to stick with a cross-validated, weighted logistic regression model with a custom threshold of p = 0.45. Now it's time to run this classifier on the entire dataset and save the outputs.
Implement classifier on test data, return predicted class label and predicted probability.
def predict_best_classifier(x_test):
    # Return predicted labels and probabilities from the best classifier:
    # weighted logistic regression, cross-validated, with a custom threshold
    threshold = 0.45
    y_prob = logistic.predict_proba(x_test)[:, 1]
    # Make class predictions based on the chosen threshold
    y_pred = np.zeros((x_test.shape[0], 1)).reshape(-1, )
    # If P(y=1|x) > threshold t, we set label y = 1
    y_pred[y_prob > threshold] = 1
    return y_pred, y_prob
Test this function out on the test set. We print out the total number of points labeled as hospital death from the classifier's prediction, and the true total number of points labeled as hospital death.
y_pred, y_prob = predict_best_classifier(x_test)
print(sum(y_pred))
print(sum(y_test))
Our classifier gives us 3,673 hospital deaths in the test set while there are actually only 374. We are overestimating hospital death, but in this case it is better to have more false positives than many false negatives, which could lead to patient negligence (better safe than sorry!).
death_pred, death_prob = predict_best_classifier(predictors)
print(len(death_pred))
I ran our entire working data through the classifier, and made sure we got the appropriate length of output (34,812 patients). I now append the predicted probability and the predicted label to the dataframe, and export this dataframe to a csv file.
adults_data['class prediction'] = death_pred
adults_data['class prob'] = death_prob
adults_data.head()
adults_data.to_csv('logreg_predicted.csv', sep=',')
Plotting the ROC curve allows us to visualize the true positive rate against the false positive rate at different thresholds.
from sklearn.metrics import roc_curve, auc
y_score = logistic.fit(x_train, y_train).decision_function(x_test)
# Compute ROC curve and ROC area for each class
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
fig = plt.figure(figsize=(15,8))
plt.plot(fpr, tpr, color='magenta', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.legend(loc='lower right')
plt.title('Receiver Operating Characteristic', fontsize='xx-large')
plt.show()
#fig.savefig('ROC.png')
The area under the curve is well above the 0.5 of a random classifier, which suggests the model discriminates well between survival and death.
We are interested in exploring the patients that the weighted logistic regression classifier has misclassified. In particular, because this classifier was trained without including any time variables, I am interested in seeing if there is a pattern within the time variables among the misclassified patients.
nancy = pd.read_csv("logreg_predicted.csv") #input file from weighted logreg jupyter notebook. This may be named something different
nancy.head()
In order to analyze patterns in time variables, we need to make sure the time variables are in a format we can work with. The code below cleans up the time data, creates a weekday variable indicating the day of the week the patient was admitted, and creates an indicator variable for whether that day was a weekday or a weekend.
nancy['ADMISSION_DATE'] = pd.to_datetime(nancy['ADMISSION_DATE'])
nancy['weekday'] = nancy['ADMISSION_DATE'].apply(lambda x: x.weekday())
nancy.head()
def weekday(row):
    # 1 if admitted Monday-Friday, 0 if on a weekend
    if row['weekday'] < 5:
        val = 1
    else:
        val = 0
    return val

nancy['weekday_binary'] = nancy.apply(weekday, axis=1)
nancy.head()
Now, we separate out the people who were misclassified by the weighted logistic regression classifier.
misclassified_df = nancy[nancy['class prediction'] != nancy['hospital_death']].copy()  # .copy() lets us add columns later without a SettingWithCopyWarning
misclassified_df.head()
We further separate the misclassified into those who truly survived (but were classified as dead), and those who truly died (but were classified as alive).
misclassified0 = misclassified_df[misclassified_df['hospital_death'] == 0.0].copy()  # People who actually survived
misclassified1 = misclassified_df[misclassified_df['hospital_death'] == 1.0].copy()  # People who actually died
misclassified0.describe()
misclassified1.describe()
9,820 patients actually survived, but were misclassified as a hospital death, whereas 156 people actually died but were misclassified as a survival. This imbalance makes sense, since the classifier was modified to increase false positives.
I plot a histogram of the predicted probabilities output by the classifier, stratified by those who truly survived and those who truly died. The vertical line indicates the chosen threshold for the weighted logreg classifier, p = 0.45.
fig = plt.figure(figsize=(18,10))
plt.hist(nancy[nancy['hospital_death'] == 0.0]['class prob'].values, alpha=0.5, density=True, bins=50, color='green', label='Survival')
plt.hist(nancy[nancy['hospital_death'] == 1.0]['class prob'].values, alpha=0.5, density=True, bins=50, color='red', label='Death')
plt.axvline(x=0.45)
plt.legend(fontsize='xx-large')
plt.xlabel('Predicted Probability from Classifier', fontsize='xx-large')
plt.title('Histogram of Predicted Probabilities', fontsize='xx-large')
plt.show()
#fig.savefig('predicted_probabilities.png')
We see that this threshold chosen is nearly at the minimum of the curve outlined by both probability distributions. We do not have a clear separation of the distributions of predicted probabilities stratified by true outcome.
It would be interesting to further examine our classifier to determine whether there are patterns in which kinds of patients were likely to be misclassified. We look at the distributions of certain biomarkers and lab test results for those misclassified by the weighted logreg classifier, stratified on their true outcome. These properties were selected for their importance in the random forest classifier: age, pH, Troponin, Lactate, Hemoglobin, CRP. We also look at measures of hospital experience: hospital occupancy rate, emergency department length of stay, and inpatient length of stay.
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(3,2,1)
ax1.hist(misclassified0['Age'].values, alpha=0.65, density=True, color='green', label='True Survival')
ax1.hist(misclassified1['Age'].values, alpha=0.35, density=True, color='red', label='True Death')
ax1.legend(loc='best')
ax1.set_title('Age of Misclassified', fontsize='xx-large')
ax2 = fig.add_subplot(3,2,2)
ax2.hist(misclassified0['pH'].values, alpha=0.65, density=True, color='green', label='True Survival')
ax2.hist(misclassified1['pH'].values, alpha=0.35, density=True, color='red', label='True Death')
ax2.legend(loc='best')
ax2.set_title('pH of Misclassified', fontsize='xx-large')
ax3 = fig.add_subplot(3,2,3)
ax3.hist(misclassified0['TroponinT'].values, alpha=0.65, density=True, color='green', bins=5000, label='True Survival')
ax3.hist(misclassified1['TroponinT'].values, alpha=0.35, density=True, color='red', bins=500, label='True Death')
ax3.set_xlim((0,100))
ax3.legend()
ax3.set_title('TroponinT of Misclassified', fontsize='xx-large')
ax4 = fig.add_subplot(3,2,4)
ax4.hist(misclassified0['Lactate'].values, alpha=0.65, density=True, color='green', bins=100, label='True Survival')
ax4.hist(misclassified1['Lactate'].values, alpha=0.35, density=True, color='red', bins=10, label='True Death')
ax4.set_xlim((0,25))
ax4.legend()
ax4.set_title('Lactate of Misclassified', fontsize='xx-large')
ax5 = fig.add_subplot(3,2,5)
ax5.hist(misclassified0['Hemoglobin'].values, alpha=0.65, density=True, color='green', label='True Survival')
ax5.hist(misclassified1['Hemoglobin'].values, alpha=0.35, density=True, color='red', label='True Death')
ax5.legend()
ax5.set_title('Hemoglobin of Misclassified', fontsize='xx-large')
ax6 = fig.add_subplot(3,2,6)
ax6.hist(misclassified0['CRP'].values, alpha=0.65, density=True, color='green', label='True Survival')
ax6.hist(misclassified1['CRP'].values, alpha=0.35, density=True, color='red', label='True Death')
ax6.legend()
ax6.set_title('CRP of Misclassified', fontsize='xx-large')
plt.show()
#fig.savefig('lab_metrics.png')
fig = plt.figure(figsize=(4.5,10))
ax1 = fig.add_subplot(3,1,1)
ax1.hist(misclassified0['occupancy'].values, alpha=0.65, density=True, color='green', label='True Survival')
ax1.hist(misclassified1['occupancy'].values, alpha=0.35, density=True, color='red', label='True Death')
ax1.legend(loc='best')
ax1.set_title('Hospital Occupancy', fontsize='xx-large')
ax2 = fig.add_subplot(3,1,2)
ax2.hist(misclassified0['IPLOS'].values, alpha=0.65, density=True, color='green', bins=100, label='True Survival')
ax2.hist(misclassified1['IPLOS'].values, alpha=0.35, density=True, color='red', label='True Death')
ax2.set_xlim((0,1000))
ax2.legend(loc='best')
ax2.set_title('In-Patient Length of Stay', fontsize='xx-large')
ax3 = fig.add_subplot(3,1,3)
ax3.hist(misclassified0['EDLOS'].values, alpha=0.65, density=True, color='green', bins=20, label='True Survival')
ax3.hist(misclassified1['EDLOS'].values, alpha=0.35, density=True, color='red', bins=20, label='True Death')
ax3.set_xlim((0,24))
ax3.legend(loc='best')
ax3.set_title('ED Length of Stay', fontsize='xx-large')
plt.show()
#fig.savefig('hospital_metrics.png')
We can see that the age of those who truly survived is slightly higher than the age of those who actually experienced hospital death. This makes sense: age is an important predictor of outcome, so older individuals are more likely to be very sick and hence more likely to be classified as a hospital death.
Now, the question of interest is: is there a pattern in what weekdays, or other time factors, are more likely to be misclassified than others?
misclassified_by_weekday = misclassified_df.groupby('weekday').size().values
visits_weekday = nancy.groupby('weekday').size().values
First, for each weekday, we calculate the number of people who survived and died.
# Count the number of people who survived and died on each weekday
no_survived_weekday = []
no_died_weekday = []
for i in range(7):
    survived = np.sum(nancy[nancy['weekday'] == i]['hospital_death'] == 0.0)
    died = np.sum(nancy[nancy['weekday'] == i]['hospital_death'] == 1.0)
    no_survived_weekday.append(survived)
    no_died_weekday.append(died)
print(no_survived_weekday)
print(no_died_weekday)
Plot proportion of people misclassified per weekday.
plt.bar(list(range(7)), np.divide(misclassified_by_weekday.astype('float'), visits_weekday.astype('float')), align='center', color='thistle')
plt.xticks(list(range(7)), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.title('Proportion of people misclassified')
plt.show()
The proportion of people misclassified looks pretty similar across all weekdays. We would need to do a hypothesis test to conclude anything of significance.
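One such test is a chi-square goodness-of-fit test of the observed misclassification counts per weekday against the counts expected under a constant misclassification rate. A sketch with made-up counts; in the notebook, substitute `misclassified_by_weekday` and `visits_weekday` from the groupby above (scipy is assumed to be available):

```python
import numpy as np
from scipy.stats import chisquare

# Made-up counts standing in for misclassified_by_weekday / visits_weekday
misclassified_counts = np.array([310, 295, 305, 300, 290, 280, 285])
visit_counts = np.array([1100, 1050, 1080, 1060, 1020, 980, 1000])

# Expected misclassifications per day if the overall rate were constant
overall_rate = misclassified_counts.sum() / visit_counts.sum()
expected = overall_rate * visit_counts

# H0: the per-weekday misclassification rate is constant
stat, p = chisquare(misclassified_counts, f_exp=expected)
print('chi2 = %.2f, p = %.3f' % (stat, p))
```

A large p-value here would mean the per-weekday variation is consistent with chance, which is what these made-up counts (and, by eye, the bar chart above) suggest.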
We should stratify the above plot by those who actually survived and those who actually died.
misclassified0_by_weekday = misclassified0.groupby('weekday').size().values
misclassified1_by_weekday = misclassified1.groupby('weekday').size().values
fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(1,2,1)
ax1.bar(list(range(7)), np.divide(misclassified0_by_weekday.astype('float'), no_survived_weekday), align='center', color='thistle')
ax1.set_xticks(list(range(7)))
ax1.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
ax1.set_title('Proportion of people misclassified who actually survived')
ax2 = fig.add_subplot(1,2,2)
ax2.bar(list(range(7)), np.divide(misclassified1_by_weekday.astype('float'), no_died_weekday), align='center', color='thistle')
ax2.set_xticks(list(range(7)))
ax2.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
ax2.set_title('Proportion of people misclassified who actually died')
plt.show()
Among those who actually survived, the misclassification looks pretty similar across all weekdays. This is not necessarily the case for people who actually died: the number of people misclassified who actually died, divided by the total number of people who died that day, varies across the weekdays. However, the sample of people who actually died is small, so these differences may not be statistically significant.
To get an overall sense of the hospital admission over time, I plot a histogram of the admission dates in our dataset.
admit_dates = list(nancy['ADMISSION_DATE'].values)
hospital_death = nancy['hospital_death'].values
predicted = nancy['class prediction'].values
fig = plt.figure(figsize=(25,5))
plt.hist(admit_dates, bins=100, color='lightblue', density=True)
plt.title('Admission Dates: Frequency')
plt.show()
The distribution of admission dates in our dataset looks uniform.
Now, we want to analyze the distribution of misclassified points over time. This will involve some preliminary data cleaning again, so that we can sort the data by specific month and year.
First, strip time- we are just interested in the date.
misclassified_df['just_date'] = misclassified_df['ADMISSION_DATE'].dt.date
misclassified_df.head()
Make a new column that labels the month and the year the patient entered the dataset.
misclassified_df['month_year'] = misclassified_df['ADMISSION_DATE'].apply(lambda x: x.strftime('%B-%Y'))
misclassified_df.head()
Get the counts of patients in each month-year category, i.e. how many patients were admitted in February 2003? In March 2003? And so on. We do this for the entire dataset and for those misclassified.
misclassified_by_monthyear = misclassified_df.groupby('month_year').size() #Pandas series
misclassified_by_monthyear_values = misclassified_by_monthyear.values
# Do this for main dataset
nancy['month_year'] = nancy['ADMISSION_DATE'].apply(lambda x: x.strftime('%B-%Y'))
visits_by_monthyear = nancy.groupby('month_year').size() #Pandas series
visits_by_monthyear_values = nancy.groupby('month_year').size().values
Double check that our counting process worked.
print(misclassified_by_monthyear)
Now, we just have to order these values chronologically. I could not find a quick way to do this automatically; it ended up being faster to just reorder these by hand, which I did not enjoy, but the job was done.
monthyear_ordered = [
misclassified_by_monthyear.index.values[11], #jan 2003
misclassified_by_monthyear.index.values[8],
misclassified_by_monthyear.index.values[20],
misclassified_by_monthyear.index.values[0], # april 2003
misclassified_by_monthyear.index.values[23],
misclassified_by_monthyear.index.values[17],
misclassified_by_monthyear.index.values[14],
misclassified_by_monthyear.index.values[3],
misclassified_by_monthyear.index.values[30], #september
misclassified_by_monthyear.index.values[28],
misclassified_by_monthyear.index.values[26],
misclassified_by_monthyear.index.values[6], # December 2003
misclassified_by_monthyear.index.values[12], # January 2004
misclassified_by_monthyear.index.values[9],
misclassified_by_monthyear.index.values[21],
misclassified_by_monthyear.index.values[1],
misclassified_by_monthyear.index.values[24],
misclassified_by_monthyear.index.values[18],
misclassified_by_monthyear.index.values[15],
misclassified_by_monthyear.index.values[4], # august 2004
misclassified_by_monthyear.index.values[31],
misclassified_by_monthyear.index.values[29],
misclassified_by_monthyear.index.values[27],
misclassified_by_monthyear.index.values[7], # December 2004
misclassified_by_monthyear.index.values[13], # January 2005
misclassified_by_monthyear.index.values[10], # February 2005
misclassified_by_monthyear.index.values[22],
misclassified_by_monthyear.index.values[2],
misclassified_by_monthyear.index.values[25],
misclassified_by_monthyear.index.values[19],
misclassified_by_monthyear.index.values[16],
misclassified_by_monthyear.index.values[5],
]
print(monthyear_ordered)
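For what it's worth, pandas can produce this ordering automatically if the group key is a monthly `Period` (which sorts chronologically) rather than a 'Month-Year' string. A toy sketch; in the notebook, `misclassified_df['ADMISSION_DATE']` would play the role of the stand-in column:

```python
import pandas as pd

# Toy frame standing in for misclassified_df
toy = pd.DataFrame({'ADMISSION_DATE': pd.to_datetime(
    ['2003-03-05', '2003-01-14', '2004-01-02', '2003-01-30'])})

# Group by a monthly Period: groupby sorts keys, and Periods sort in time order
counts = toy.groupby(toy['ADMISSION_DATE'].dt.to_period('M')).size()
print(counts)

# Recover the readable 'Month-Year' labels for plotting
labels = list(counts.index.strftime('%B-%Y'))
print(labels)  # ['January-2003', 'March-2003', 'January-2004']
```

The same pattern applied to both the misclassified frame and the full dataset would make the by-hand index lists unnecessary.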
We can now find the proportion misclassified in each month by dividing the number misclassified in each month by the number of visits in each month. I will have to again reorder these by hand, but the good news is we can keep the ordering structure used above, and just change the name of the list we want to reorder.
proportion_misclassified_by_monthyear = np.divide(misclassified_by_monthyear_values.astype('float'), visits_by_monthyear_values.astype('float'))
print(proportion_misclassified_by_monthyear)
print('length', len(proportion_misclassified_by_monthyear))
proportion_misclassified_ordered = [
proportion_misclassified_by_monthyear[11], #jan 2003
proportion_misclassified_by_monthyear[8],
proportion_misclassified_by_monthyear[20],
proportion_misclassified_by_monthyear[0], # april 2003
proportion_misclassified_by_monthyear[23],
proportion_misclassified_by_monthyear[17],
proportion_misclassified_by_monthyear[14],
proportion_misclassified_by_monthyear[3],
proportion_misclassified_by_monthyear[30], #september
proportion_misclassified_by_monthyear[28],
proportion_misclassified_by_monthyear[26],
proportion_misclassified_by_monthyear[6], # December 2003
proportion_misclassified_by_monthyear[12], # January 2004
proportion_misclassified_by_monthyear[9],
proportion_misclassified_by_monthyear[21],
proportion_misclassified_by_monthyear[1],
proportion_misclassified_by_monthyear[24],
proportion_misclassified_by_monthyear[18],
proportion_misclassified_by_monthyear[15],
proportion_misclassified_by_monthyear[4], # august 2004
proportion_misclassified_by_monthyear[31],
proportion_misclassified_by_monthyear[29],
proportion_misclassified_by_monthyear[27],
proportion_misclassified_by_monthyear[7], # December 2004
proportion_misclassified_by_monthyear[13], # January 2005
proportion_misclassified_by_monthyear[10], # February 2005
proportion_misclassified_by_monthyear[22],
proportion_misclassified_by_monthyear[2],
proportion_misclassified_by_monthyear[25],
proportion_misclassified_by_monthyear[19],
proportion_misclassified_by_monthyear[16],
proportion_misclassified_by_monthyear[5],
]
fig = plt.figure(figsize=(15,10))
plt.bar(list(range(32)), proportion_misclassified_ordered, align='center', color='thistle')
plt.xticks(list(range(32)), monthyear_ordered, rotation='vertical')
plt.xlim([-0.5, 31.5])
plt.title('Proportion of Hospital Visits Misclassified')
plt.show()
#fig.savefig('misclassified_overtime.png')
Now, we break down this analysis by those who actually survived and those who actually died.
misclassified0['month_year'] = misclassified0['ADMISSION_DATE'].apply(lambda x: x.strftime('%B-%Y'))
misclassified1['month_year'] = misclassified1['ADMISSION_DATE'].apply(lambda x: x.strftime('%B-%Y'))
misclassified0_by_monthyear = misclassified0.groupby('month_year').size() #Pandas series
misclassified0_by_monthyear_values = misclassified0_by_monthyear.values
misclassified1_by_monthyear = misclassified1.groupby('month_year').size() #Pandas series
misclassified1_by_monthyear_values = misclassified1_by_monthyear.values
In order to do this, we need to tabulate the number of survivals and deaths in each month over the 3 years. We can then calculate the proportion of those who survived that were misclassified, and the proportion of those who died that were misclassified.
# Need to create survivals, deaths by monthyear
survivals_by_monthyear = nancy[nancy['hospital_death'] == 0.0].groupby('month_year').size() #Pandas series
survivals_by_monthyear_values = survivals_by_monthyear.values
deaths_by_monthyear = nancy[nancy['hospital_death'] == 1.0].groupby('month_year').size()
deaths_by_monthyear_values = deaths_by_monthyear.values
proportion_misclassified0_by_monthyear = np.divide(misclassified0_by_monthyear_values.astype('float'), survivals_by_monthyear_values.astype('float'))
proportion_misclassified1_by_monthyear = np.divide(misclassified1_by_monthyear_values.astype('float'), deaths_by_monthyear_values.astype('float'))
print(proportion_misclassified0_by_monthyear)
print(proportion_misclassified1_by_monthyear)
Same as we've done previously, we now need to order these lists.
proportion_misclassified0_ordered = [
proportion_misclassified0_by_monthyear[11], #jan 2003
proportion_misclassified0_by_monthyear[8],
proportion_misclassified0_by_monthyear[20],
proportion_misclassified0_by_monthyear[0], # april 2003
proportion_misclassified0_by_monthyear[23],
proportion_misclassified0_by_monthyear[17],
proportion_misclassified0_by_monthyear[14],
proportion_misclassified0_by_monthyear[3],
proportion_misclassified0_by_monthyear[30], #september
proportion_misclassified0_by_monthyear[28],
proportion_misclassified0_by_monthyear[26],
proportion_misclassified0_by_monthyear[6], # December 2003
proportion_misclassified0_by_monthyear[12], # January 2004
proportion_misclassified0_by_monthyear[9],
proportion_misclassified0_by_monthyear[21],
proportion_misclassified0_by_monthyear[1],
proportion_misclassified0_by_monthyear[24],
proportion_misclassified0_by_monthyear[18],
proportion_misclassified0_by_monthyear[15],
proportion_misclassified0_by_monthyear[4], # august 2004
proportion_misclassified0_by_monthyear[31],
proportion_misclassified0_by_monthyear[29],
proportion_misclassified0_by_monthyear[27],
proportion_misclassified0_by_monthyear[7], # December 2004
proportion_misclassified0_by_monthyear[13], # January 2005
proportion_misclassified0_by_monthyear[10], # February 2005
proportion_misclassified0_by_monthyear[22],
proportion_misclassified0_by_monthyear[2],
proportion_misclassified0_by_monthyear[25],
proportion_misclassified0_by_monthyear[19],
proportion_misclassified0_by_monthyear[16],
proportion_misclassified0_by_monthyear[5],
]
proportion_misclassified1_ordered = [
proportion_misclassified1_by_monthyear[11], #jan 2003
proportion_misclassified1_by_monthyear[8],
proportion_misclassified1_by_monthyear[20],
proportion_misclassified1_by_monthyear[0], # april 2003
proportion_misclassified1_by_monthyear[23],
proportion_misclassified1_by_monthyear[17],
proportion_misclassified1_by_monthyear[14],
proportion_misclassified1_by_monthyear[3],
proportion_misclassified1_by_monthyear[30], #september
proportion_misclassified1_by_monthyear[28],
proportion_misclassified1_by_monthyear[26],
proportion_misclassified1_by_monthyear[6], # December 2003
proportion_misclassified1_by_monthyear[12], # January 2004
proportion_misclassified1_by_monthyear[9],
proportion_misclassified1_by_monthyear[21],
proportion_misclassified1_by_monthyear[1],
proportion_misclassified1_by_monthyear[24],
proportion_misclassified1_by_monthyear[18],
proportion_misclassified1_by_monthyear[15],
proportion_misclassified1_by_monthyear[4], # august 2004
proportion_misclassified1_by_monthyear[31],
proportion_misclassified1_by_monthyear[29],
proportion_misclassified1_by_monthyear[27],
proportion_misclassified1_by_monthyear[7], # December 2004
proportion_misclassified1_by_monthyear[13], # January 2005
proportion_misclassified1_by_monthyear[10], # February 2005
proportion_misclassified1_by_monthyear[22],
proportion_misclassified1_by_monthyear[2],
proportion_misclassified1_by_monthyear[25],
proportion_misclassified1_by_monthyear[19],
proportion_misclassified1_by_monthyear[16],
proportion_misclassified1_by_monthyear[5],
]
We generate plots to visualize our results.
fig = plt.figure(figsize=(15,10))
plt.bar(list(range(32)), proportion_misclassified0_ordered, align='center', color='lightgreen')
plt.xticks(list(range(32)), monthyear_ordered, rotation='vertical')
plt.xlim([-0.5, 31.5])
plt.title('Proportion of Survivals Misclassified', fontsize='xx-large')
plt.show()
fig = plt.figure(figsize=(15,10))
plt.bar(list(range(32)), proportion_misclassified1_ordered, align='center', color='red')
plt.xticks(list(range(32)), monthyear_ordered, rotation='vertical')
plt.xlim([-0.5, 31.5])
plt.title('Proportion of Hospital Deaths Misclassified', fontsize='xx-large')
plt.show()
The proportion of survivals misclassified is much more uniformly distributed than the proportion of deaths misclassified over time. However, we note that we have a smaller sample size for hospital deaths, so we cannot yet conclude anything of statistical significance.
Now, we build a classifier that includes time variables, since we are curious to see whether such a classifier would perform better than a classifier built excluding time variables.
# Load data
df = pd.read_csv("ED_with_time_indicators_2nd.csv")
# Re-order columns
df = df[['Unnamed: 0', 'ADMISSION_DATE', 'day', 'month', 'year', 'admission_datetime', 'weekday', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'X2003', 'X2004', 'X2005', 'sex', 'Age', 'EDLOS', 'first_1_to_4_prio', 'nbr_ward_instances', 'TroponinT', 'Lactate', 'Hemoglobin', 'CRP', 'WBC', 'pH', 'occupancy', 'available_beds', 'ED_occupancy_x', 'IPLOS', 'hospital_death']]
# Separate predictors and outcome (admission_datetime not included)
x = df.iloc[:, 6:-1].values
y = df.iloc[:, -1].values
# Standardization
x = preprocessing.scale(x)
# Split the dataset in two equal parts
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.5, random_state=0)
# Function for computing the accuracy of a given model on the entire test set,
# on class 0, and on class 1
score = lambda model, x_test, y_test: pd.Series([model.score(x_test, y_test),
model.score(x_test[y_test==0], y_test[y_test==0]),
model.score(x_test[y_test==1], y_test[y_test==1])],
index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])
# Calculate weight/priors
class_proportion = np.array([np.mean(y==0), np.mean(y==1)])
#Weighted logistic regression
weighted_logistic = LogReg(penalty = 'l1', C=0.01, class_weight='balanced')
weighted_logistic.fit(x_train, y_train)
weighted_log_scores = score(weighted_logistic, x_test, y_test)
#LDA
lda = LDA(priors = class_proportion)
lda.fit(x_train, y_train)
lda_scores = score(lda, x_test, y_test)
#QDA
qda = QDA(priors = class_proportion)
qda.fit(x_train, y_train)
qda_scores = score(qda, x_test, y_test)
#Decision Tree
tree = DecisionTree(class_weight = "balanced")
tree.fit(x_train, y_train)
tree_scores = score(tree, x_test, y_test)
#Random Forest
rf = RandomForest(class_weight = 'balanced')
rf.fit(x_train, y_train)
rf_scores = score(rf, x_test, y_test)
#SVM
SVM = SVC(C=100, class_weight = 'balanced')
SVM.fit(x_train, y_train)
SVM_scores = score(SVM, x_test, y_test)
#Score Dataframe
score_df = pd.DataFrame({'weighted logistic': weighted_log_scores,
'weighted lda': lda_scores,
'weighted qda': qda_scores,
'weighted tree': tree_scores,
'weighted rf': rf_scores,
'weighted SVM': SVM_scores})
score_df
With the imbalanced dataset, weighted logistic regression and weighted QDA outperform the other models. Next, we rerun all models on a balanced dataset and compare their performance.
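For reference, scikit-learn's `class_weight='balanced'` option, used throughout these models, derives each class weight from the documented formula `n_samples / (n_classes * bincount(y))`; a minimal sketch (the helper name is ours):

```python
import numpy as np

def balanced_class_weights(y):
    """Replicate sklearn's class_weight='balanced' formula:
    n_samples / (n_classes * bincount(y))."""
    y = np.asarray(y)
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts.astype(float))

# Toy example: 8 controls, 2 cases -> the minority class gets 4x the weight
weights = balanced_class_weights([0] * 8 + [1] * 2)  # array([0.625, 2.5])
```

So a rare death (class 1) contributes proportionally more to the loss, which is why the weighted models above do not simply predict "survival" for everyone.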
The dataset is first split in half to form the initial train and test sets. To construct a balanced train set, controls are randomly selected with replacement from the train set to match the number of cases, and this balanced set is used for model fitting. Model performance is then evaluated on the initial (imbalanced) test set. Accuracy on survival, accuracy on death, and overall accuracy are compared across models.
# Create balanced train set by sampling equal controls and cases
# Concatenate x_train and y_train
y_train = y_train.reshape(len(y_train),1)
train = np.concatenate((x_train,y_train),axis =1)
# separate cases and controls in the training set
cases = train[train[:, -1]==1]
controls = train[train[:, -1]==0]
# Sample controls with replacement to match the number of cases
n_control = controls.shape[0]
perm = np.random.randint(low=0, high=n_control, size=n_control)  # control indices, drawn with replacement
controls_perm = controls[perm]
controls_subsample = controls_perm[:cases.shape[0]]
# Form balanced training set with equal cases and controls
train = np.concatenate((controls_subsample, cases), axis = 0)
# Separate predictors and labels
x_train = train[:, :-1]
y_train = train[:, -1]
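The balancing steps above recur in several cells below; under the same logic, they can be wrapped in a single helper (a sketch; the function name is ours):

```python
import numpy as np

def balanced_undersample(x_train, y_train):
    """Keep all cases (label 1) and draw an equal number of
    controls (label 0) with replacement, as described above."""
    cases = x_train[y_train == 1]
    controls = x_train[y_train == 0]
    idx = np.random.randint(0, controls.shape[0], size=cases.shape[0])
    x_bal = np.concatenate((controls[idx], cases), axis=0)
    y_bal = np.concatenate((np.zeros(len(idx)), np.ones(len(cases))))
    return x_bal, y_bal
```

Each later balancing cell then reduces to `x_train, y_train = balanced_undersample(x_train, y_train)`.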
#Weighted logistic regression
weighted_logistic = LogReg(penalty = 'l1', C=1, class_weight='balanced')
weighted_logistic.fit(x_train, y_train)
weighted_log_scores = score(weighted_logistic, x_test, y_test)
#LDA
lda = LDA()
lda.fit(x_train, y_train)
lda_scores = score(lda, x_test, y_test)
#QDA
qda = QDA()
qda.fit(x_train, y_train)
qda_scores = score(qda, x_test, y_test)
#Decision Tree
tree = DecisionTree(class_weight = "balanced")
tree.fit(x_train, y_train)
tree_scores = score(tree, x_test, y_test)
#Random Forest
rf = RandomForest(class_weight = 'balanced')
rf.fit(x_train, y_train)
rf_scores = score(rf, x_test, y_test)
#SVM
SVM = SVC(C=100, class_weight = 'balanced')
SVM.fit(x_train, y_train)
SVM_scores = score(SVM, x_test, y_test)
#Score Dataframe
score_df = pd.DataFrame({'weighted logistic': weighted_log_scores,
'weighted lda': lda_scores,
'weighted qda': qda_scores,
'weighted tree': tree_scores,
'weighted rf': rf_scores,
'weighted SVM': SVM_scores})
score_df
Overall model performance improves markedly with a balanced training set compared to an imbalanced one. Since LDA, logistic regression, and random forest each achieved more than 75% overall accuracy and more than 70% accuracy in each class, we select one of them to tune the hyper-parameters. Here we chose the random forest.
# Set the parameters by cross-validation
tuned_parameters = [{"n_estimators": [20, 40, 80, 160, 320], "max_depth": [2, 4, 8, 16, 32], 'max_features': [2, 4, 8, 16]}]
scores = ['precision', 'recall']
for metric in scores:
    print("# Tuning hyper-parameters for %s" % metric)
    print()
    rf = GridSearchCV(RandomForest(), tuned_parameters, cv=5,
                      scoring='%s_macro' % metric)
    rf.fit(x_train, y_train)
    print("Best parameter set found on development set:")
    print()
    print(rf.best_params_)
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, rf.predict(x_test)
    print(classification_report(y_true, y_pred))
    print()
So the best parameter set is {'max_features': 4, 'n_estimators': 320, 'max_depth': 8}. The accuracies for class 0 and class 1 are 0.75 and 0.82, respectively.
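The per-class accuracy quoted here is simply the class-wise recall that `classification_report` prints; on toy labels (illustrative values, not the study data) it can be computed directly:

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Accuracy restricted to class k == recall of class k
acc_class0 = np.mean(y_pred[y_true == 0] == 0)  # 3 of 4 correct -> 0.75
acc_class1 = np.mean(y_pred[y_true == 1] == 1)  # 3 of 4 correct -> 0.75
```

This is exactly what the `score` helper above computes by subsetting the test set to each class.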
from sklearn.metrics import precision_recall_fscore_support
# Separate predictors and outcome (admission_datetime not included)
x = df.iloc[:, 6:-1].values
y = df.iloc[:, -1].values
# Standardization
x = preprocessing.scale(x)
# Store stats for two classes separately
precision_0 = []
recall_0 = []
precision_1 = []
recall_1 = []
average_accuracy = []
# Store probability for two classes
proba_0 = []
proba_1 = []
for simulation in range(50):
    # Split the dataset in two equal parts
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
    # Create balanced train set by sampling equal controls and cases:
    # concatenate x_train and y_train
    y_train = y_train.reshape(len(y_train), 1)
    train = np.concatenate((x_train, y_train), axis=1)
    # Separate cases and controls in the training set
    cases = train[train[:, -1] == 1]
    controls = train[train[:, -1] == 0]
    # Sample controls with replacement to match the number of cases
    n_control = controls.shape[0]
    perm = np.random.randint(low=0, high=n_control, size=n_control)  # control indices, drawn with replacement
    controls_perm = controls[perm]
    controls_subsample = controls_perm[:cases.shape[0]]
    # Form balanced training set with equal cases and controls
    train = np.concatenate((controls_subsample, cases), axis=0)
    # Separate predictors and labels
    x_train = train[:, :-1]
    y_train = train[:, -1]
    # Fit random forest with the tuned hyper-parameters
    rf = RandomForest(n_estimators=320, max_features=4, max_depth=8)
    rf.fit(x_train, y_train)
    # Compute per-class precision/recall on the held-out test set
    y_true, y_pred = y_test, rf.predict(x_test)
    stats = precision_recall_fscore_support(y_true, y_pred)
    precision_0.append(stats[0][0])
    precision_1.append(stats[0][1])
    recall_0.append(stats[1][0])
    recall_1.append(stats[1][1])
    average_accuracy.append(rf.score(x_test, y_true))
    # Store predicted probabilities for the whole dataset
    proba = pd.DataFrame(rf.predict_proba(x))
    proba_0.append(proba[0])
    proba_1.append(proba[1])
# Plot precision and accuracy of each class from simulations
print("Average overall test accuracy across simulations:", np.mean(average_accuracy))
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
ax1.hist(precision_0, alpha = 0.5)
mean = np.mean(precision_0)
ax1.axvline(x=mean, linewidth=3, color = 'r', label = "Mean:" + str(round(mean,2)))
ax1.set_title('Precision of Class 0')
ax1.set_xlabel('Precision of Class 0')
ax1.set_ylabel('Frequency')
ax1.legend(loc='upper left')
ax2.hist(precision_1, alpha = 0.5)
mean = np.mean(precision_1)
ax2.axvline(x=mean, linewidth=3, color = 'r', label = "Mean:" + str(round(mean,2)))
ax2.set_title('Precision of Class 1')
ax2.set_xlabel('Precision of Class 1')
ax2.set_ylabel('Frequency')
ax2.legend(loc='upper left')
ax3.hist(recall_0, alpha = 0.5)
mean = np.mean(recall_0)
ax3.axvline(x=mean, linewidth=3, color = 'r', label = "Mean:" + str(round(mean,2)))
ax3.set_title('Accuracy of Class 0')
ax3.set_xlabel('Accuracy of Class 0')
ax3.set_ylabel('Frequency')
ax3.legend(loc='upper left')
ax4.hist(recall_1, alpha = 0.5)
mean = np.mean(recall_1)
ax4.axvline(x=mean, linewidth=3, color = 'r', label = "Mean:" + str(round(mean,2)))
ax4.set_title('Accuracy of Class 1')
ax4.set_xlabel('Accuracy of Class 1')
ax4.set_ylabel('Frequency')
ax4.legend(loc='upper left')
fig.savefig("Random_forest_stats")
plt.show()
# Store average probability for each class from 50 simulations.
proba_0 = pd.DataFrame(proba_0)
proba_0_avg = proba_0.mean(axis=0)
proba_1 = pd.DataFrame(proba_1)
proba_1_avg = proba_1.mean(axis=0)
# Add predicted probability to dataset
df['predict_proba_class0'] = proba_0_avg
df['predict_proba_class1'] = proba_1_avg
# Predict label for each patient by comparing average predicted probability of class0 vs. class1
df['predict_label'] = (df['predict_proba_class1'] > df['predict_proba_class0']).astype(int)
# Accuracy of whole dataset based on summary of 50 simulations
print ('Accuracy on class0:')
print (np.mean(df['predict_label'][df['hospital_death'] == 0] == 0))
print ('Accuracy on class1:')
print (np.mean(df['predict_label'][df['hospital_death'] == 1] == 1))
# Store in a .csv file
# df.to_csv("ED_with_prediction.csv", sep=',', index = False)
fig, ax1 = plt.subplots(1, 1, figsize=(8, 4))
bins = np.linspace(0, 1, 50)
# histogram of predicted death probability for patients who survived
ax1.hist(df['predict_proba_class1'][df['hospital_death'] == 0], alpha=0.65, bins=bins, color='g', density=True, label='True Survival')
# histogram of predicted death probability for patients who died
ax1.hist(df['predict_proba_class1'][df['hospital_death'] == 1], alpha=0.35, bins=bins, color='r', density=True, label='True Death')
ax1.axvline(x=0.5, linewidth=2, color = 'b', label = "Threshold")
ax1.set_title('Predicted Probability of Death by Random Forest')
ax1.set_xlabel('Probability of death')
ax1.set_ylabel('Normalized frequency')
ax1.legend(loc='best')
fig.savefig("Predicted probability hist")
plt.show()
Similarly to the weighted logistic regression classifier, the optimal threshold lies at the minimum of the curve traced out by the two probability distributions, and there is no clear separation between the distributions for true survival and true death.
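The threshold argument can be made concrete with a brute-force search: for each candidate cutoff, sum the two error rates and take the minimizer. A sketch on simulated probabilities (Beta distributions stand in for the model's outputs; they are not the study data):

```python
import numpy as np

rng = np.random.RandomState(0)
# Simulated predicted death probabilities: survivors skew low, deaths skew high
p_survivors = rng.beta(2, 5, size=1000)
p_deaths = rng.beta(5, 2, size=1000)

def total_error(t):
    # false-positive rate among survivors + false-negative rate among deaths
    return np.mean(p_survivors >= t) + np.mean(p_deaths < t)

thresholds = np.linspace(0, 1, 101)
best_threshold = thresholds[int(np.argmin([total_error(t) for t in thresholds]))]
```

With these mirrored distributions the minimizer lands near 0.5, but because the distributions overlap substantially, no cutoff separates the classes cleanly, matching the observation here.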
# Associate features with importance
features = ['weekday', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'X2003',
'X2004', 'X2005', 'sex', 'Age', 'EDLOS', 'first_1_to_4_prio', 'nbr_ward_instances', 'TroponinT',
'Lactate', 'Hemoglobin', 'CRP', 'WBC', 'pH', 'occupancy', 'available_beds', 'ED_occupancy_x', 'IPLOS']
df_feature = pd.DataFrame({'feature': features,
'importance': rf.feature_importances_})
# Sort features by importance, ascending (most important rows last)
df_feature = df_feature.sort_values('importance')
# Plot the top 5 important features
fig = plt.figure(figsize=(15, 8))
val = df_feature['importance'][-5:] # feature_importance
N = len(val) # number of bars
ind = np.arange(N) # the y locations for the groups
width = 0.25
ax1 = fig.add_subplot(121)
ax1.barh(ind, val, align='edge',height=0.5)
ax1.set_xlim(0, 0.20)
ax1.set_ylim(-width, len(ind)+width, auto= True)
ax1.set_yticks(ind+width)
ytickNames = ax1.set_yticklabels(df_feature['feature'][-5:])
plt.setp(ytickNames, rotation=0, fontsize=14)
ax1.set_xlabel('Feature Importance', fontsize=14)
ax1.set_title('Feature importance in random forest', fontsize=15)
ax1.grid(True)
# Plot all the time features
val = df_feature['importance'][:16] # feature_importance
N = len(val) # number of bars
ind = np.arange(N) # the y locations for the groups
width = 0.25
ax2 = fig.add_subplot(122)
ax2.barh(ind, val, align='edge',height=0.5)
ax2.set_xlim(0, 0.006)
ax2.set_ylim(-width, len(ind)+width, auto= True)
ax2.set_yticks(ind+width)
ytickNames = ax2.set_yticklabels(df_feature['feature'][:16])
plt.setp(ytickNames, rotation=0, fontsize=14)
ax2.set_xlabel('Feature Importance', fontsize=14)
ax2.set_title('Feature importance in random forest', fontsize=15)
ax2.grid(True)
plt.show()
So age, pH, lactate, IPLOS (in-patient length of stay), and CRP are the five most important features in predicting hospital death (single simulation, number of trees = 320). In the presence of laboratory and demographic features, the time features are the least important group in the random forest. To examine the association between time variables and hospital death, we therefore have to exclude the other predictors: next, we fit a random forest with time features only.
A misclassification is a predicted outcome that differs from the observed outcome. The number of misclassifications in each month is counted, and the proportion misclassified is obtained by dividing by the number of patients admitted in that month. This proportion is then plotted over time for analysis.
# Create datetime object
df['ADMISSION_DATE'] = pd.to_datetime(df['ADMISSION_DATE'])
# Group by calendar month; a Period index sorts chronologically,
# avoiding the hand-ordering needed with alphabetical month-name strings
df['month_year'] = df['ADMISSION_DATE'].dt.to_period('M')
visits_by_monthyear = df.groupby('month_year').size()
misclassified_df = df[df['predict_label'] != df['hospital_death']]
misclassified_by_monthyear = misclassified_df.groupby('month_year').size()
# Proportion misclassified per month (months without misclassifications count as 0)
proportion_misclassified = (misclassified_by_monthyear / visits_by_monthyear).fillna(0)
monthyear_labels = [p.strftime('%B-%Y') for p in proportion_misclassified.index]
n_months = len(proportion_misclassified)
fig = plt.figure(figsize=(15, 1.5))
plt.bar(range(n_months), proportion_misclassified.values, align='center', color='thistle')
plt.xticks(range(n_months), monthyear_labels, rotation='vertical')
plt.xlim([-0.5, n_months - 0.5])
plt.title('Proportion of Hospital Visits Misclassified')
plt.show()
fig.savefig('misclassified_overtime_rf.png')
# Load data
df = pd.read_csv("ED_with_time_indicators_2nd.csv")
df = df[['Unnamed: 0', 'ADMISSION_DATE', 'day', 'month', 'year', 'admission_datetime', 'weekday', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'X2003', 'X2004', 'X2005', 'sex', 'Age', 'EDLOS', 'first_1_to_4_prio', 'nbr_ward_instances', 'TroponinT', 'Lactate', 'Hemoglobin', 'CRP', 'WBC', 'pH', 'occupancy', 'available_beds', 'ED_occupancy_x', 'IPLOS', 'hospital_death']]
df.shape
# Separate predictors and outcome (admission_datetime not included)
x = df.iloc[:, 6:22].values
y = df.iloc[:, -1].values
# Standardization
x = preprocessing.scale(x)
# Split the dataset in two equal parts
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.5, random_state=0)
# Function computing a model's accuracy on the entire test set, on class 0 only,
# and on class 1 only
score = lambda model, x_test, y_test: pd.Series([model.score(x_test, y_test),
model.score(x_test[y_test==0], y_test[y_test==0]),
model.score(x_test[y_test==1], y_test[y_test==1])],
index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])
# Create balanced train set by sampling equal controls and cases
# Concatenate x_train and y_train
y_train = y_train.reshape(len(y_train),1)
train = np.concatenate((x_train,y_train),axis =1)
# separate cases and controls in the training set
cases = train[train[:, -1]==1]
controls = train[train[:, -1]==0]
# Sample controls with replacement to match the number of cases
n_control = controls.shape[0]
perm = np.random.randint(low=0, high=n_control, size=n_control)  # control indices, drawn with replacement
controls_perm = controls[perm]
controls_subsample = controls_perm[:cases.shape[0]]
# Form balanced training set with equal cases and controls
train = np.concatenate((controls_subsample, cases), axis = 0)
# Separate predictors and labels
x_train = train[:, :-1]
y_train = train[:, -1]
#Weighted logistic regression
weighted_logistic = LogReg(penalty = 'l1', C=1, class_weight='balanced')
weighted_logistic.fit(x_train, y_train)
weighted_log_scores = score(weighted_logistic, x_test, y_test)
#LDA
lda = LDA()
lda.fit(x_train, y_train)
lda_scores = score(lda, x_test, y_test)
#QDA
qda = QDA()
qda.fit(x_train, y_train)
qda_scores = score(qda, x_test, y_test)
#Decision Tree
tree = DecisionTree(class_weight = "balanced")
tree.fit(x_train, y_train)
tree_scores = score(tree, x_test, y_test)
#Random Forest
rf = RandomForest(class_weight = 'balanced')
rf.fit(x_train, y_train)
rf_scores = score(rf, x_test, y_test)
#SVM
SVM = SVC(C=100, class_weight = 'balanced')
SVM.fit(x_train, y_train)
SVM_scores = score(SVM, x_test, y_test)
#Score Dataframe
score_df = pd.DataFrame({'weighted logistic': weighted_log_scores,
'weighted lda': lda_scores,
'weighted qda': qda_scores,
'weighted tree': tree_scores,
'weighted rf': rf_scores,
'weighted SVM': SVM_scores})
score_df
# Set the parameters by cross-validation
tuned_parameters = [{"n_estimators": [20, 40, 80, 160, 320], "max_depth": [2, 4, 8, 16, 32], 'max_features': [2, 4, 8, 16]}]
scores = ['precision', 'recall']
for metric in scores:
    print("# Tuning hyper-parameters for %s" % metric)
    print()
    rf = GridSearchCV(RandomForest(), tuned_parameters, cv=5,
                      scoring='%s_macro' % metric)
    rf.fit(x_train, y_train)
    print("Best parameter set found on development set:")
    print()
    print(rf.best_params_)
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, rf.predict(x_test)
    print(classification_report(y_true, y_pred))
    print()
So the best parameter set is {'max_features': 16, 'n_estimators': 20, 'max_depth': 8}. The accuracies for class 0 and class 1 are 0.58 and 0.45, respectively.
from sklearn.metrics import precision_recall_fscore_support
# Separate predictors and outcome (admission_datetime not included)
x = df.iloc[:, 6:22].values
y = df.iloc[:, -1].values
# Standardization
x = preprocessing.scale(x)
# Store stats for two classes separately
precision_0 = []
recall_0 = []
precision_1 = []
recall_1 = []
average_accuracy = []
# Store probability for two classes
proba_0 = []
proba_1 = []
for simulation in range(50):
    # Split the dataset in two equal parts
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
    # Create balanced train set by sampling equal controls and cases:
    # concatenate x_train and y_train
    y_train = y_train.reshape(len(y_train), 1)
    train = np.concatenate((x_train, y_train), axis=1)
    # Separate cases and controls in the training set
    cases = train[train[:, -1] == 1]
    controls = train[train[:, -1] == 0]
    # Sample controls with replacement to match the number of cases
    n_control = controls.shape[0]
    perm = np.random.randint(low=0, high=n_control, size=n_control)  # control indices, drawn with replacement
    controls_perm = controls[perm]
    controls_subsample = controls_perm[:cases.shape[0]]
    # Form balanced training set with equal cases and controls
    train = np.concatenate((controls_subsample, cases), axis=0)
    # Separate predictors and labels
    x_train = train[:, :-1]
    y_train = train[:, -1]
    # Fit random forest with the tuned hyper-parameters
    rf = RandomForest(n_estimators=20, max_features=16, max_depth=8)
    rf.fit(x_train, y_train)
    # Compute per-class precision/recall on the held-out test set
    y_true, y_pred = y_test, rf.predict(x_test)
    stats = precision_recall_fscore_support(y_true, y_pred)
    precision_0.append(stats[0][0])
    precision_1.append(stats[0][1])
    recall_0.append(stats[1][0])
    recall_1.append(stats[1][1])
    average_accuracy.append(rf.score(x_test, y_true))
    # Store predicted probabilities for the whole dataset
    proba = pd.DataFrame(rf.predict_proba(x))
    proba_0.append(proba[0])
    proba_1.append(proba[1])
# Plot precision and accuracy of each class from simulations
print("Average overall test accuracy across simulations:", np.mean(average_accuracy))
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
ax1.hist(precision_0, alpha = 0.5)
mean = np.mean(precision_0)
ax1.axvline(x=mean, linewidth=3, color = 'r', label = "Mean:" + str(round(mean,2)))
ax1.set_title('Precision of Class 0')
ax1.set_xlabel('Precision of Class 0')
ax1.set_ylabel('Frequency')
ax1.legend(loc='upper left')
ax2.hist(precision_1, alpha = 0.5)
mean = np.mean(precision_1)
ax2.axvline(x=mean, linewidth=3, color = 'r', label = "Mean:" + str(round(mean,2)))
ax2.set_title('Precision of Class 1')
ax2.set_xlabel('Precision of Class 1')
ax2.set_ylabel('Frequency')
ax2.legend(loc='upper left')
ax3.hist(recall_0, alpha = 0.5)
mean = np.mean(recall_0)
ax3.axvline(x=mean, linewidth=3, color = 'r', label = "Mean:" + str(round(mean,2)))
ax3.set_title('Accuracy of Class 0')
ax3.set_xlabel('Accuracy of Class 0')
ax3.set_ylabel('Frequency')
ax3.legend(loc='upper left')
ax4.hist(recall_1, alpha = 0.5)
mean = np.mean(recall_1)
ax4.axvline(x=mean, linewidth=3, color = 'r', label = "Mean:" + str(round(mean,2)))
ax4.set_title('Accuracy of Class 1')
ax4.set_xlabel('Accuracy of Class 1')
ax4.set_ylabel('Frequency')
ax4.legend(loc='upper left')
plt.show()
# Store average probability for each class from 50 simulations.
proba_0 = pd.DataFrame(proba_0)
proba_0_avg = proba_0.mean(axis=0)
proba_1 = pd.DataFrame(proba_1)
proba_1_avg = proba_1.mean(axis=0)
# Add predicted probability to dataset
df['predict_proba_class0'] = proba_0_avg
df['predict_proba_class1'] = proba_1_avg
# Predict label for each patient by comparing average predicted probability of class0 vs. class1
df['predict_label'] = (df['predict_proba_class1'] > df['predict_proba_class0']).astype(int)
# plot predicted probability of death for two classes
fig, ax1 = plt.subplots(1, 1, figsize=(8, 4))
# histogram of predicted death probability for patients who survived
ax1.hist(df['predict_proba_class1'][df['hospital_death'] == 0], alpha=0.2, bins=30, color='g', density=True, label='Survival')
# histogram of predicted death probability for patients who died
ax1.hist(df['predict_proba_class1'][df['hospital_death'] == 1], alpha=0.5, bins=30, color='r', density=True, label='Death')
ax1.axvline(x=0.5, linewidth=3, color = 'b', label = "Cutoff")
ax1.set_title('Predicted Probability of Death by Random Forest')
ax1.set_xlabel('Probability of death')
ax1.set_ylabel('Normalized frequency')
ax1.legend(loc='best')
plt.show()
features = ['weekday', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'X2003', 'X2004', 'X2005']
# Plot feature importance
fig = plt.figure(figsize=(25, 5))
ax = fig.add_subplot(111)
N = len(features)
ind = np.arange(N) # the x locations for the groups
width = 0.45 # the width of the bars
## the bars
ax.bar(ind, rf.feature_importances_, width)
# axes and labels
ax.set_xlim(-width, len(ind)+width)
ax.set_ylim(0, 0.25)
ax.set_ylabel('Feature importance', fontsize= 15)
ax.set_title('Importance of time features in Random Forest', fontsize = 25)
ax.set_xticks(ind+width)
xtickNames = ax.set_xticklabels(features)
plt.setp(xtickNames, rotation=90, fontsize=15)
#fig.savefig("Features_rf_timeonly")
plt.show()
# Associate features with importance
features = ['weekday', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'X2003',
'X2004', 'X2005']
df_feature = pd.DataFrame({'feature': features,
'importance': rf.feature_importances_})
# Sort features by importance, ascending (most important rows last)
df_feature = df_feature.sort_values('importance')
# Plot the top 5 important features
fig = plt.figure(figsize=(5, 8))
val = df_feature['importance'] # feature_importance
N = len(val) # number of bars
ind = np.arange(N) # the y locations for the groups
width = 0.25
ax = fig.add_subplot(121)
ax.barh(ind, val, align='edge',height=0.5)
ax.set_xlim(0, 0.20)
ax.set_ylim(-width, len(ind)+width, auto= True)
ax.set_yticks(ind+width)
ytickNames = ax.set_yticklabels(df_feature['feature'])
plt.setp(ytickNames, rotation=0, fontsize=14)
ax.set_xlabel('Feature Importance', fontsize=14)
ax.set_title('Importance of Time Features in Random Forest', fontsize=15)
ax.grid(True)
plt.show()
So age, lactate, CRP, pH, and IPLOS are the five most important features in predicting hospital death (single simulation, number of trees = 320).
from sklearn.metrics import f1_score
#### BASELINE MODEL ####
# Randomly assign 0 or 1 to each patient
# simulate the process for 50 times, calculate f1 score
# Compare f1 score of baseline model to f1 score of random forest
# Load data
df = pd.read_csv("ED_with_time_indicators_2nd.csv")
df = df[['Unnamed: 0', 'ADMISSION_DATE', 'day', 'month', 'year', 'admission_datetime', 'weekday', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'X2003', 'X2004', 'X2005', 'sex', 'Age', 'EDLOS', 'first_1_to_4_prio', 'nbr_ward_instances', 'TroponinT', 'Lactate', 'Hemoglobin', 'CRP', 'WBC', 'pH', 'occupancy', 'available_beds', 'ED_occupancy_x', 'IPLOS', 'hospital_death']]
# Separate predictors and outcome (admission_datetime not included)
x = df.iloc[:, 6:22].values
y = df.iloc[:, -1].values
# Randomly assign 0 or 1 with equal probability
# Simulate 50 times
accuracy_0 = []
accuracy_1 = []
f1_bs = []
for simulation in range(50):
    pred_label = np.random.binomial(1, 0.5, size=df.shape[0])
    accuracy_0.append(np.mean(pred_label[y == 0] == 0))
    accuracy_1.append(np.mean(pred_label[y == 1] == 1))
    f1_bs.append(f1_score(y, pred_label))
print("Average accuracy on class 0:", np.mean(accuracy_0))
print("Average accuracy on class 1:", np.mean(accuracy_1))
print("F1 score:", np.mean(f1_bs))
#### RANDOM FOREST MODEL ####
# Standardization
x = preprocessing.scale(x)
# Store stats for two classes separately
recall_0 = []
recall_1 = []
average_accuracy = []
f1_rf = []
for simulation in range(50):
    # Split the dataset in two equal parts
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
    # Create balanced train set by sampling equal controls and cases:
    # concatenate x_train and y_train
    y_train = y_train.reshape(len(y_train), 1)
    train = np.concatenate((x_train, y_train), axis=1)
    # Separate cases and controls in the training set
    cases = train[train[:, -1] == 1]
    controls = train[train[:, -1] == 0]
    # Sample controls with replacement to match the number of cases
    n_control = controls.shape[0]
    perm = np.random.randint(low=0, high=n_control, size=n_control)  # control indices, drawn with replacement
    controls_perm = controls[perm]
    controls_subsample = controls_perm[:cases.shape[0]]
    # Form balanced training set with equal cases and controls
    train = np.concatenate((controls_subsample, cases), axis=0)
    # Separate predictors and labels
    x_train = train[:, :-1]
    y_train = train[:, -1]
    # Fit random forest with the tuned hyper-parameters
    rf = RandomForest(n_estimators=20, max_features=16, max_depth=8)
    rf.fit(x_train, y_train)
    # Compute stats of prediction for the whole dataset
    y_true, y_pred = y, rf.predict(x)
    stats = precision_recall_fscore_support(y_true, y_pred)
    recall_0.append(stats[1][0])
    recall_1.append(stats[1][1])
    f1_rf.append(f1_score(y, y_pred))
    average_accuracy.append(rf.score(x, y_true))
print("Average accuracy on class 0:", np.mean(recall_0))
print("Average accuracy on class 1:", np.mean(recall_1))
print("F1 score:", np.mean(f1_rf))
The accuracy on both classes is around 50% for the baseline model and the time-only random forest alike. So do we really have any evidence that the random forest using time variables is better at predicting hospital death? Let's run a two-sample t-test comparing the F1 scores.
mean_bs = np.mean(f1_bs)
std_bs = np.std(f1_bs)
mean_rf = np.mean(f1_rf)
std_rf = np.std(f1_rf)
print("Mean and standard deviation of baseline model F1:", mean_bs, std_bs)
print("Mean and standard deviation of time-only random forest F1:", mean_rf, std_rf)
from scipy import stats
stats.ttest_ind(f1_bs, f1_rf)
Based on the t-test, the two F1 scores differ significantly (p < 0.001): the random forest improves the F1 score by roughly 10% over the baseline model.
Despite the fact that people in the US as well as overseas increasingly rely on unscheduled healthcare services [6-8], the availability of such services has reportedly decreased, with the discontinuation of Emergency Department services cited as an important driver [9-10]. The rate at which this happens is unlikely to be constant across geographical areas, potentially aggravating inequalities that are already prevalent in many countries.
The increasing availability of methods for predicting binary outcomes in high-dimensional data, along with abundant data substrates, calls for future initiatives aimed at monitoring healthcare quality in real time, providing patients with up-to-date benchmarks of performance across providers and facilities.
Computationally, an interesting future direction would be to test for anomalies online using more sensitive anomaly detection algorithms. Model performance with and without binary time covariates did not differ significantly in our misclassification analysis comparing weighted logistic regression (without time variables) and random forest (with time variables), which suggests that any effect of entry time on hospital death is very subtle and was not captured by our models. To detect such a subtle effect, future work should explore other time-series anomaly detection algorithms, ideally more sensitive than the martingale tests we tried, capable of capturing small changes in the daily proportion of deaths in our dataset. Another interesting analysis would be to match patients who died to survivors with similar demographic and biological features, and compare only their entry times to decide whether a death is anomalous with respect to time of entry.
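As a first step in that direction, a rolling z-score on the daily death proportion flags days that deviate from their recent baseline. A sketch on synthetic data (the series and the injected anomaly are simulated; a real analysis would substitute the study's daily proportions and likely a more sensitive detector):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
# Synthetic daily death proportions with one injected anomalous day
daily_prop = pd.Series(rng.normal(0.05, 0.01, size=365))
daily_prop.iloc[200] = 0.15

window = 30
# z-score of each day against the preceding 30-day window (shift avoids
# including the day itself in its own baseline)
rolling_mean = daily_prop.rolling(window).mean().shift(1)
rolling_std = daily_prop.rolling(window).std().shift(1)
z = (daily_prop - rolling_mean) / rolling_std
anomalous_days = z[z.abs() > 4].index.tolist()
```

A martingale test or a matched-pairs comparison, as discussed above, would replace this crude thresholding in a more careful analysis.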
1 - Blom M, Landin-Olsson M, Lindsten M, Jonsson F, Ivarsson K. Patients presenting at the emergency department with acute abdominal pain are less likely to be admitted to inpatient wards at times of access block: a registry study. Scand J Trauma Resusc Emerg Med. 2015;23:78.
2 - Guttmann A, Schull MJ, Vermeulen MJ, Stukel TA. Association between waiting times and short term mortality and hospital admission after departure from emergency department: population based cohort study from Ontario, Canada. British Medical Journal. 2011;342:d2983.
3 - Sun BC, Hsia RY, Weiss RE, Zingmond D, Liang LJ, Han W, et al. Effect of emergency department crowding on outcomes of admitted patients. Annals of Emergency Medicine. 2013;61(6):605-11.
4 - Cha WC, Shin SD, Cho JS, Song KJ, Singer AJ, Kwak YH. The association between crowding and mortality in admitted pediatric patients from mixed adult pediatric emergency departments in Korea. Pediatric Emergency Care. 2011;27(12):1136-41.
5 - Sprivulis PC, Da Silva JA, Jacobs IG, Frazer AR, Jelinek GA. The association between hospital overcrowding and mortality among patients admitted via Western Australian emergency departments. The Medical journal of Australia. 2006;184(5):208-12.