Incorporating the correlation among diseases into the model

Our intuition is that coupled classifiers for pairs of highly correlated diseases should outperform independent classifiers for the two diseases. To test this, we focused on textual judgments and on the classes 'Y' and 'U' only ('Q' and 'N' were converted to 'U'). We coded 'Y' as 1 and 'U' as 0 and computed the correlation between every pair of diseases. Figure 3 shows three different pairwise distances between the diseases.
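As an illustration of the three distances, the following is a minimal sketch on 0/1 label vectors, not the code used for Figure 3. The function names, the toy data, and the choice to normalize the Hamming distance by the vector length are assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length 0/1 vectors (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def correlation_distance(x, y):
    """The "1-correlation" distance."""
    return 1.0 - pearson(x, y)

def cosine_distance(x, y):
    """One minus the cosine similarity (NaN if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return float("nan")
    return 1.0 - dot / (nx * ny)

def hamming_distance(x, y):
    """Fraction of positions where the labels differ (normalization is an assumption)."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def pairwise(labels, dist):
    """Distance for every ordered pair of diseases, keyed by (disease_a, disease_b)."""
    names = sorted(labels)
    return {(a, b): dist(labels[a], labels[b]) for a in names for b in names}

# Toy data (hypothetical): one entry per patient, 1 = 'Y', 0 = 'U'.
labels = {
    "Obesity": [1, 1, 0, 0, 1, 0],
    "OSA":     [1, 1, 0, 0, 0, 0],
    "PVD":     [0, 0, 1, 1, 0, 0],
}
corr_dist = pairwise(labels, correlation_distance)
cos_dist = pairwise(labels, cosine_distance)
ham_dist = pairwise(labels, hamming_distance)
```

All three functions return 0 on the diagonal, which is why the caption of Figure 3 mentions replacing diagonal values for visualization.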

**Figure 3:** Heatmaps showing three different pairwise distances between the diseases: from top to bottom "1-correlation", "Cosine distance" and "Hamming distance". The values in the diagonal, originally equal to 0, were replaced in 3(a) and 3(b) by average row values to allow a better visualization of the matrices.
[Figure 3 panels, top to bottom: confMatrix_Correlation.eps, confMatrix_Cosine.eps, confMatrix_Hamming.eps]

We also computed the matrix of conditional probabilities for the 'Y' class over pairs of diseases. A detailed analysis of this matrix revealed interesting cases. For example, Obesity is very likely if the patient has OSA ($P(\mathrm{Obesity}=Y \mid \mathrm{OSA}=Y)=0.86$) and rather unlikely if the patient has PVD ($P(\mathrm{Obesity}=Y \mid \mathrm{PVD}=Y)=0.21$).
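Such a matrix can be estimated by simple counting. The sketch below assumes the same 0/1 coding ('Y' as 1, everything else as 0); the function names and the toy data are hypothetical and do not reproduce the values reported above.

```python
def conditional_prob(labels, target, given):
    """Estimate P(target = Y | given = Y) from 0/1 label vectors.

    Returns NaN when no patient has the conditioning disease.
    """
    target_when_given = [t for t, g in zip(labels[target], labels[given]) if g == 1]
    if not target_when_given:
        return float("nan")
    return sum(target_when_given) / len(target_when_given)

def conditional_matrix(labels):
    """Map (target, given) to P(target = Y | given = Y) for every ordered pair."""
    names = sorted(labels)
    return {(t, g): conditional_prob(labels, t, g) for t in names for g in names}

# Toy data (hypothetical): one entry per patient, 1 = 'Y', 0 = 'U'.
labels = {
    "Obesity": [1, 1, 0, 0, 1, 0],
    "OSA":     [1, 1, 0, 0, 0, 0],
    "PVD":     [0, 0, 1, 1, 0, 0],
}
cond = conditional_matrix(labels)
```

Unlike the distances in Figure 3, this matrix is not symmetric: in the toy data, `cond[("Obesity", "OSA")]` and `cond[("OSA", "Obesity")]` differ.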

With this example in mind, we designed a test case: add to the text of each record a new token HAS_DISEASEX for every DISEASEX that the patient has, except Obesity, and train the classifier for the Obesity class on this new record set. For example, if a patient has OSA, CHF and Obesity, the tokens HAS_OSA and HAS_CHF are added to the text of the record. With these enriched records we then estimate whether the patient has Obesity.
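The enrichment step can be sketched as follows. The function name `enrich_record`, appending the tokens at the end of the text, and the sorted token order are assumptions for this example; the text above does not specify where in the record the tokens are placed.

```python
def enrich_record(text, diseases_present, target):
    """Append a HAS_<DISEASE> token for every disease the patient has, except the target."""
    tokens = [
        "HAS_" + disease.upper().replace(" ", "_")
        for disease in sorted(diseases_present)
        if disease != target
    ]
    if not tokens:
        return text
    return text + " " + " ".join(tokens)

# Hypothetical record text; the patient has OSA, CHF and Obesity.
enriched = enrich_record(
    "pt with sleep apnea",
    {"OSA", "CHF", "Obesity"},
    target="Obesity",
)
```

Excluding the target disease keeps the label being predicted out of the features.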

However, the classifier for Obesity shows no improvement (exactly the same accuracy) when using the enriched records. Features like HAS_OSA, which we expected to clearly improve classification of the 'Y' class for Obesity, are largely ignored by the classifier.

These results concerned us. To test whether the whole idea makes sense, we did another run that also included the token HAS_OBESITY in the records of patients who actually have obesity (i.e., we included the correct answer in the classifier's training data). Not surprisingly, in this case the classifier achieves $99\%$ accuracy (it probably falls short of $100\%$ because of errors on the classes 'Q' and 'N'), and the feature HAS_OBESITY appears as the most informative for the classifier.

After looking at the numbers in more detail, we decided to also include the answers (Y, N, Q, U) as new tokens (Y_DISEASE, U_DISEASE, Q_DISEASE, N_DISEASE) in all the records and for all the diseases except one, and to evaluate the classifier for the excluded disease.
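A minimal sketch of this variant, in which each record receives one token per disease encoding its answer, is shown below. The function name `add_answer_tokens`, the dictionary representation of the answers, and appending the tokens at the end of the text are assumptions for this example.

```python
VALID_ANSWERS = {"Y", "N", "Q", "U"}

def add_answer_tokens(text, answers, target):
    """Append <ANSWER>_<DISEASE> tokens for every disease except the target.

    `answers` maps a disease name to one of 'Y', 'N', 'Q', 'U'.
    """
    tokens = []
    for disease, answer in sorted(answers.items()):
        if disease == target:
            continue  # never leak the answer for the disease being predicted
        if answer not in VALID_ANSWERS:
            raise ValueError("unknown answer %r for %s" % (answer, disease))
        tokens.append(answer + "_" + disease.upper().replace(" ", "_"))
    if not tokens:
        return text
    return text + " " + " ".join(tokens)

# Hypothetical record: answers for three diseases, predicting Obesity.
enriched = add_answer_tokens(
    "note text",
    {"OSA": "Y", "CHF": "U", "Obesity": "Y"},
    target="Obesity",
)
```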

After carrying out these experiments we concluded that adding the answers for other diseases does not help the classification of a given disease. For example, in the textual case it only helps for CAD and Hypercholesterolemia, and the improvement is very small: under 0.02 in average accuracy. Similar results are obtained for intuitive judgments. All the results are shown in Tables 3 and 4.
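The average accuracy and standard deviation over 4-fold cross-validation can be computed along the following lines. This is a sketch under stated assumptions: the fold assignment (contiguous, unshuffled), the use of the sample standard deviation, and the majority-class stand-in classifier are all hypothetical choices for illustration, not the setup used to produce the tables.

```python
import statistics

def k_fold_indices(n, k=4):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(records, labels, train_and_predict, k=4):
    """Return (mean accuracy, sample std. deviation) over k folds; assumes len(records) >= k."""
    accuracies = []
    for test_idx in k_fold_indices(len(records), k):
        held_out = set(test_idx)
        train = [(records[i], labels[i]) for i in range(len(records)) if i not in held_out]
        predicted = train_and_predict(train, [records[i] for i in test_idx])
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def majority_baseline(train, test_records):
    """Stand-in classifier: always predict the most frequent training label."""
    train_labels = [label for _, label in train]
    majority = max(set(train_labels), key=train_labels.count)
    return [majority] * len(test_records)

# Toy data (hypothetical): eight records with 0/1 labels.
records = ["note %d" % i for i in range(8)]
labels = [1, 0, 1, 1, 0, 1, 1, 1]
mean_acc, std_acc = cross_validate(records, labels, majority_baseline)
```

Running the same procedure once on the original records and once on the enriched records gives the two accuracy columns being compared.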

Table 3: Average accuracy (with std. deviation in parentheses) of 4-fold cross-validation runs on textual data with answers for other diseases added (answers added) and with no answers added. Answers for other diseases are added to the text in the form of the special tokens Y_DISEASE, U_DISEASE, Q_DISEASE, N_DISEASE.

Disease	Accuracy (answers added)	Accuracy (no answers added)
Asthma	0.929 (0.023)	0.929 (0.023)
CAD	0.774 (0.048)	0.763 (0.048)
CHF	0.854 (0.028)	0.856 (0.030)
Depression	0.866 (0.024)	0.866 (0.024)
Diabetes	0.805 (0.031)	0.805 (0.032)
Gallstones	0.847 (0.036)	0.847 (0.036)
GERD	0.843 (0.078)	0.843 (0.078)
Gout	0.921 (0.024)	0.921 (0.024)
Hypercholesterolemia	0.721 (0.053)	0.717 (0.053)
Hypertension	0.809 (0.053)	0.809 (0.053)
Hypertriglyceridemia	0.956 (0.016)	0.956 (0.016)
OA	0.859 (0.041)	0.859 (0.041)
Obesity	0.848 (0.044)	0.848 (0.044)
OSA	0.928 (0.017)	0.928 (0.017)
PVD	0.920 (0.012)	0.920 (0.012)
Venous Insufficiency	-	0.972 (0.014)

Table 4: Average accuracy (with std. deviation in parentheses) of 4-fold cross-validation runs on intuitive data with answers for other diseases added (answers added) and with no answers added. Answers for other diseases are added to the text in the form of the special tokens Y_DISEASE, U_DISEASE, Q_DISEASE, N_DISEASE.

Disease	Accuracy (answers added)	Accuracy (no answers added)
Asthma	0.928 (0.016)	0.928 (0.016)
CAD	0.859 (0.036)	0.859 (0.036)
Depression	0.727 (0.048)	0.727 (0.048)
Diabetes	0.864 (0.019)	0.865 (0.018)
Gallstones	0.872 (0.040)	0.872 (0.040)
GERD	0.828 (0.030)	0.828 (0.030)
Gout	0.923 (0.024)	0.923 (0.024)
Hypercholesterolemia	0.738 (0.019)	0.745 (0.021)
Hypertension	0.836 (0.014)	0.836 (0.014)
Hypertriglyceridemia	0.928 (0.021)	0.928 (0.021)
OA	0.829 (0.023)	0.829 (0.023)
Obesity	0.847 (0.009)	0.848 (0.008)
OSA	0.933 (0.023)	0.933 (0.023)
PVD	0.907 (0.017)	0.907 (0.017)
Venous Insufficiency	-	0.894 (0.021)