Chapter 4 Linear Regression on Summary Statistics

With the processing and filtration of our data, as well as its visualization in Tableau, we turned to non-parametric statistical tests as well as linear regressions to attempt to evaluate how our covariates may be linked to the dynamics of the microbiome.

We began with various non-parametric tests, because they avoid the distribution assumptions requisite in linear regressions. While these methods limit the amount of information we can consider within a single test, because we consider just one variable instead of several, they are still potentially useful in identifying traits related to our summary statistics.

Performing these tests for our maximum distance metric, we see that graft source appears to be significant. When we perform these same tests for our last distance metric, we see that none of our variables are significant (Tables 3 and 4).

Table 3. P-values from non-parametric statistical tests for maximum distance as outcome
Covariate	P-value
Chronic Graft vs Host Disease	\(0.35\)
Acute Graft vs Host Disease	\(0.36\)
Gender	\(0.63\)
Vital Status	\(0.14\)
initialNegative	\(0.68\)
Batch	\(0.48\)
Race	\(0.33\)
Hispanic	\(0.15\)
Diagnosis	\(0.91\)
Transplant Type	\(0.28\)
Transplant Response	\(0.70\)
Graft Source	\(1.0*10^{-2}\)
Care Environment	\(0.60\)

Table 4. P-values from non-parametric statistical tests for last distance as outcome
Covariate	P-value
Chronic Graft vs Host Disease	\(0.51\)
Acute Graft vs Host Disease	\(0.70\)
Gender	\(0.61\)
Vital Status	\(0.17\)
initialNegative	\(0.43\)
Batch	\(0.14\)
Race	\(0.43\)
Hispanic	\(0.45\)
Diagnosis	\(0.77\)
Transplant Type	\(0.78\)
Transplant Response	\(0.65\)
Graft Source	\(0.15\)
Care Environment	\(0.78\)

In addition to conducting our non-parametric tests for our summary statistics, we also performed linear regressions to attempt to quantify the significance of our covariates with respect to the dynamics of the microbiome.

Based upon exploratory data analysis, we kept all of our continuous variables untransformed. There were no extremely concerning trends in any of our categorical variables either, so we did not transform them. There was also no need to transform the outcome variable of maximum distance.

We used lattice plots to search for interaction effects across our categorical variables, as well as boxplots to search for connections between categorical and quantitative covariates. Identified interactions were included in our baseline model. Backward selection was then performed using AIC as a scoring criterion.

For maximum distance as the outcome, our final model with the lowest AIC based on backward selection included fixed effects for transplant age, transplant type (Allo, AlloDLI, or Auto), graft source (bone marrow, cord blood, or PBPC), vital status (alive or dead), chronic graft v host disease (0 or 1), days after transplant that ANC is greater than 500, care environment (1, 2, or 3), and total time. An interaction effect was also included between vital status and care environment. This model had an AIC of -34.4, and it identified the following covariates as significant (Table 5):

Table 5. Significant P-values from linear regression with max distance as outcome
Covariate	Estimate	Confidence Interval	P-value
Age at Transplant	\(4.56*10^{-3}\)	\((3.210^{-4}, 8.810^{-3})\)	\(3.6*10^{-2}\)
Graft Source (Cord Blood)	\(-3.91*10^{-1}\)	\((-6.110^{-1}, -1.710^{-1})\)	\(8.9*10^{-4}\)
Vital Status (Dead)	\(-9.14*10^{-2}\)	\((-2.010^{-1}, 1.810^{-2})\)	\(9.8*10^{-2}\)
Vital Status(Dead)/Care Env(3)	\(2.86*10^{-1}\)	\((-2.610^{-2}, 6.010^{-1})\)	\(7.1*10^{-2}\)

Based upon our model, we note that the “maximum distance” traveled in an ordination plot by a patient’s microbiome is expected to go up by roughly \(4.56*10^{-3}\) units for each additional year of age at the time of transplant. In other words, it could be that the older a patient, the more variability in their microbiome.

We also see that a patient who ends up dead is expected to have a maximum distance that is roughly \(9.14*10^{-2}\) units less than that of a patient who is still alive. A possible interpretation of this behavior is that patients who are still alive have microbiome compositions that vary significantly from their sick status pre-transplant.

Similarly, a patient who has a graft source of Cord Blood is expected to have a maximum distance that is roughly \(3.91*10^{-1}\) units less than a patient who has a graft source of bone marrow. It may be the case that patients who receive cord blood stem cell donations have less disruption in their microbiomes than patients who receive bone marrow.

We note that our assumptions for equality of subpopulation standard deviations, as well as for our samples coming from a normally distributed population, are fulfilled by the residual plot and normal QQ-plot for our data.

We repeated the same backward selection process as above, but this time, we used last distance as our outcome variable. Based on backward selection according to AIC, our final model had last distance as the outcome, and fixed effects for race (White, Black, More than One, or Asian), hispanic ethnicity (no, unknown, or yes), graft source (bone marrow, cord blood, or PBPC), vital status (alive or dead), chronic graft v host disease (0 or 1), and total time span. Our model had an AIC of -32.3, and identified the following variables as significant (Table 6):

Table 6. Significant P-values from linear regression with last distance as outcome
Covariate	Estimate	Confidence Interval	P-value
Race (Black)	\(1.59*10^{-1}\)	\((-2.710^{-2}, 3.410^{-1})\)	\(9.3*10^{-2}\)
Graft Source (Cord Blood)	\(-3.01*10^{-1}\)	\((-5.110^{-1}, -9.310^{-2})\)	\(5.4*10^{-3}\)
Vital Status (Dead)	\(-1.09*10^{-1}\)	\((-2.110^{-1}, -8.210^{-3})\)	\(3.5*10^{-2}\)
Total Time Span	\(7.59*10^{-4}\)	\((1.610^{-5}, 1.510^{-3})\)	\(4.6*10^{-2}\)

Interestingly, graft source of cord blood and vital status still appear to be relevant. In particular, we see that according to our model, a patient who is confirmed dead is expected to have a last distance metric that is roughly \(1.09*10^{-1}\) units shorter than that of a patient who is confirmed alive.

Likewise, a patient with a graft source of cord blood is expected to have a last distance value that is roughly \(3.01*10^{-1}\) units shorter than that of a patient with a graft source of bone marrow. It seems that both of these traits are correlated with less change in the long-term for patients’ microbiome compositions. We also note that while transplant age is no longer significant, a race of “black” appears to be significant compared to the baseline of “white.”

A major concern we had after these analyses was that our summary metrics could potentially vary extensively based upon the addition and removal of samples from our dataset. To deal with this possible high sensitivity to data, we decided to make use of a more creative method for modeling the abundances of ASVs in our data: phylogenetic tree decomposition.