Original Article | OPEN ACCESS DOI: 10.23937/2474-3658/1510231

Predictors of Pulmonary Cavitation among Tuberculosis Patients

Oluwafemi O Balogun1,2*, Adetayo Fawole3, Etiosa Osemwinyen4 and Busola Balogun5

1Massachusetts General Hospital, Boston, USA

2Harvard Medical School, Boston, USA

3New York University School of Global Public Health, NY, USA

4Windsor University School of Medicine, St Kitts

5Department of Community Medicine, Federal Medical Center, Abeokuta, Nigeria



Globally, tuberculosis (TB) remains one of the top ten causes of mortality, and the incidence and prevalence of the disease remain high in low- and some middle-income countries. Because of the associated high morbidity and mortality, the World Health Organization (WHO) declared it a public health emergency. Pulmonary cavitation, regarded epidemiologically and clinically as the hallmark of TB, has been associated with increased bacillary burden, high infection transmission, and development of drug resistance. This complication has contributed to the persistence of TB worldwide and is further implicated in the severity of the disease.


This study aimed to understand the sociodemographic and clinical risk factors associated with pulmonary cavitation among individuals with TB.


A retrospective analysis of a public health database with 958 eligible individuals was conducted. In addition to the primary analysis, machine learning was engaged in the development of predictive models. Receiver operating characteristics (ROC), sensitivity, specificity, accuracy, and Cohen's kappa were used as metrics to evaluate the best-performing models.


The primary analysis showed that the crude odds ratio (OR) for the effect of TB multidrug resistance on pulmonary cavitation was 3.10 (95% CI: 2.38-4.05, p-value < 0.001). The unadjusted ORs for the effect of rifampin, isoniazid, and ethambutol resistance, as 'mono-drug resistance', on the formation of a pulmonary cavity were 4.26 (95% CI: 2.90-6.42, p-value < 0.001), 4.50 (95% CI: 3.07-6.76, p-value < 0.001), and 3.90 (95% CI: 2.55-6.18, p-value < 0.001), respectively. Other identified risk factors for pulmonary cavitation among individuals with TB included being male (OR: 1.43, 95% CI: 1.04-1.96, p-value = 0.029), a history of treatment failure (OR: 3.27, 95% CI: 2.50-4.30, p-value < 0.001), being disabled (OR: 5.36, 95% CI: 2.93-10.2, p-value < 0.001), being unemployed (OR: 2.00, 95% CI: 1.39-2.90, p-value < 0.001), and being homeless (OR: 2.01, 95% CI: 1.18-3.56, p-value = 0.010).


Pulmonary cavitation remains a significant driver for TB transmission within hospitals and in the community setting. Therefore, public health and clinical interventions that address risk factors associated with the incidence of pulmonary cavity offer a significant benefit in reducing the spread of TB, and subsequently, decreasing the associated morbidity and mortality.


Keywords: Tuberculosis, Pulmonary cavitation, Multidrug resistance, Machine learning, Transmission


Abbreviations: CART: Classification and Regression Tree; LDA: Linear Discriminant Analysis; LGM: Logistic Regression Model; KNN: K Nearest Neighbor; MDR: Multidrug Resistance; ML: Machine Learning; OR: Odds Ratio; RF: Random Forest; SVM: Support Vector Machine; TB: Tuberculosis


Globally, tuberculosis (TB) remains one of the top ten causes of mortality [1]. Furthermore, the incidence and prevalence of the disease remain high in low- and some middle-income countries. Because of the associated high morbidity and mortality, the World Health Organization (WHO) declared it a public health emergency [1,2]. In addition, pulmonary cavitation, regarded epidemiologically and clinically as the hallmark of TB, has been associated with increased bacillary burden, high infection transmission, and development of drug resistance [3,4]. This complication has contributed to the persistence of TB worldwide and is further implicated in the severity of the disease. To understand the drivers of TB transmission at the population level, it is important to examine one of its significant promoters, lung cavities. Therefore, this study aimed to explore the sociodemographic and clinical risk factors associated with pulmonary cavitation among individuals with TB.


Study design and population

A cross-sectional cohort study was conducted on individuals who were diagnosed with tuberculosis and subsequently received TB treatment in Moldova between January 1, 2009, and December 31, 2010. We enrolled individuals aged 18 to 85 years who had a positive TB culture and drug-susceptibility testing (DST) results recorded in the TB surveillance database. DST was performed on solid culture using the absolute concentration method. Individuals with rifampin, isoniazid, ethambutol, and pyrazinamide susceptibility information were included in the analysis. In addition, mono-drug resistance and multidrug-resistant TB (MDR-TB) were evaluated in the study cohort. MDR-TB was defined as resistance to both rifampin and isoniazid. Individuals with no culture or a negative culture result, treatment failure at the time of study enrolment, or missing drug-susceptibility results were excluded from the study. After potential participants were screened against the inclusion and exclusion criteria (Figure 1), a total of 958 individuals were enrolled in the study.

Figure 1: Inclusion and exclusion algorithm.
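The inclusion and exclusion criteria and the MDR-TB definition above can be sketched as a small filter. This is only an illustration: the record layout (`SurveillanceRecord` and its field names) is hypothetical, since the structure of the surveillance database is not described in the article.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Drugs for which the study required susceptibility information.
REQUIRED_DRUGS = ("rifampin", "isoniazid", "ethambutol", "pyrazinamide")


@dataclass
class SurveillanceRecord:
    """Hypothetical record layout; the real database schema is an assumption."""
    age: int
    culture_positive: Optional[bool]  # None means no culture was performed
    treatment_failure_at_enrolment: bool
    # drug name -> True (resistant), False (susceptible), None (missing result)
    dst: Dict[str, Optional[bool]] = field(default_factory=dict)


def is_eligible(rec: SurveillanceRecord) -> bool:
    """Apply the inclusion and exclusion criteria stated in the text."""
    if not 18 <= rec.age <= 85:
        return False
    if rec.culture_positive is not True:  # no culture or negative culture
        return False
    if rec.treatment_failure_at_enrolment:
        return False
    # Every required drug must have a recorded DST result.
    return all(rec.dst.get(drug) is not None for drug in REQUIRED_DRUGS)


def is_mdr(rec: SurveillanceRecord) -> bool:
    """MDR-TB: resistance to both rifampin and isoniazid."""
    return rec.dst.get("rifampin") is True and rec.dst.get("isoniazid") is True


example = SurveillanceRecord(
    age=47,
    culture_positive=True,
    treatment_failure_at_enrolment=False,
    dst={"rifampin": True, "isoniazid": True,
         "ethambutol": False, "pyrazinamide": False},
)
print(is_eligible(example), is_mdr(example))
```

Keeping eligibility and resistance classification as separate functions mirrors the text, where MDR status is an exposure evaluated only among participants who already satisfied the enrolment criteria.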

Statistical analysis

Descriptive statistics were computed for the sociodemographic and clinical variables captured in the analysis, and the characteristics were compared between study participants with a pulmonary cavity and those without a lung cavity. Categorical characteristics were summarized as absolute numbers and their corresponding proportions. Continuous variables were summarized using the mean and standard deviation (or the median for non-normally distributed data). We further developed three models for the study. The first model represented the univariable associations between the potential risk factors and the prevalence of our primary outcome, pulmonary cavitation. The second model, a multivariable logistic regression model, was developed using variables selected by stepwise regression. The third was a predictive model that used machine learning (ML) methods to assess how accurately the variables included in the study predicted the outcome of interest. For the third model, the dataset was split into a training set (80%) and a validation set (20%). Tenfold cross-validation was used for the six machine learning models built: linear discriminant analysis (LDA), classification and regression tree (CART), random forest (RF), k nearest neighbor (KNN), logistic regression (LGM), and support vector machine (SVM). Receiver operating characteristics (ROC), sensitivity, specificity, accuracy, and Cohen's kappa were used as metrics to identify the best-performing models. All statistical analyses were carried out using RStudio (R version 4.1.1). A p-value less than 0.05 was considered significant.
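The 80/20 split followed by tenfold cross-validation can be sketched as follows. The analysis itself was run in R; this Python sketch only illustrates the resampling scheme. It assumes a simple random shuffle, because the article does not state how the split was drawn (for example, whether it was stratified by outcome), and the function names are illustrative.

```python
import random


def train_validation_split(record_ids, train_fraction=0.8, seed=0):
    """Shuffle the records once and cut them into training and validation sets."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def k_fold_splits(items, k=10):
    """Yield (fit_on, held_out) pairs; each item is held out exactly once."""
    items = list(items)
    folds = [items[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        held_out = folds[i]
        fit_on = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fit_on, held_out


# 958 participants, as in the study; the IDs themselves are placeholders.
train_ids, validation_ids = train_validation_split(range(958), 0.8, seed=1)
print(len(train_ids), len(validation_ids))

for fit_on, held_out in k_fold_splits(train_ids, k=10):
    # A model would be fitted on `fit_on` and scored on `held_out` here.
    pass
```

Cross-validation is applied only within the training portion, so the 20% validation set stays untouched until the final comparison of models.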

Data source

The analyses were conducted on an anonymized, routinely collected public health surveillance dataset. This non-identifiable dataset was made available secondarily through the Boston University School of Public Health. There was no interaction with patients and no collection of human samples.


Study participant characteristics

In total, 958 individuals with culture-proven pulmonary TB diagnosed between January 1, 2009, and December 31, 2010, were captured in the analysis (Table 1). The mean age of the study sample was 51 years (standard deviation, ± 13), and 767 (80%) participants were male. When stratified by the primary outcome (pulmonary cavity vs. no pulmonary cavity), the mean age was similar across the two strata (51 vs. 52 years, respectively). Most of the study participants were between 35 and 65 years of age (76%), unemployed (660; 69%), and sheltered (869; 91%). Many individuals in this study cohort attained education up to the primary level (214; 22%) or the secondary level (566; 59%). With regard to clinical characteristics, most participants had MDR (524; 55%). When MDR was stratified by the primary outcome, a higher percentage of MDR was seen in the cohort with a pulmonary cavity than in those without a lung cavity (67% vs. 39%). A similar trend was observed for rifampin, ethambutol, and isoniazid mono-drug resistance. Few individuals (62; 6%) in this study were positive for the human immunodeficiency virus (HIV).

Table 1: Sociodemographic and clinical characteristics of study participants with culture-positive pulmonary tuberculosis.

Risk factors for pulmonary cavity outcomes

In the univariable logistic regression analysis (Table 2), men were more likely to develop a pulmonary cavity (OR: 1.43, 95% CI: 1.04-1.96). MDR was significantly associated with the outcome of interest (OR: 3.10, 95% CI: 2.38-4.05). Mono-drug resistance to isoniazid (OR: 4.50, 95% CI: 3.07-6.76), rifampin (OR: 4.26, 95% CI: 2.90-6.41), and ethambutol (OR: 3.90, 95% CI: 2.55-6.18) was also associated with higher odds of developing a pulmonary cavity. The outcome of interest (pulmonary cavity) was also more likely when study participants had attained only primary education (OR: 5.36, 95% CI: 2.93-10.2), had no education (OR: 2.00, 95% CI: 1.39-2.90), were homeless (OR: 2.01, 95% CI: 1.18-3.56), or had treatment failure (OR: 3.27, 95% CI: 2.50-4.30).

Table 2: Demographic and clinical risk factors associated with pulmonary cavity (univariable model).
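As a reminder of how a crude odds ratio and its confidence interval arise from a 2x2 table, the sketch below uses the standard Wald (log-scale) interval. The article does not state which interval method was used, so this choice is an assumption, and the cell counts in the example are invented for illustration and are not taken from Table 2.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    a: exposed, outcome present      b: exposed, outcome absent
    c: unexposed, outcome present    d: unexposed, outcome absent
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def odds_ratio_from_proportions(p_with_outcome, p_without_outcome):
    """Odds ratio from the exposure proportions in the two outcome groups."""
    odds_with = p_with_outcome / (1 - p_with_outcome)
    odds_without = p_without_outcome / (1 - p_without_outcome)
    return odds_with / odds_without


# Invented counts, for illustration only.
print(odds_ratio_wald(30, 20, 15, 35))

# Rounded MDR percentages reported in the text: 67% (cavity) vs. 39% (no cavity).
print(odds_ratio_from_proportions(0.67, 0.39))
```

Applied to the rounded MDR percentages reported in the participant characteristics (67% vs. 39%), the second function gives an odds ratio of roughly 3.2, in line with the reported crude OR of 3.10; the small gap is expected because the percentages are rounded.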

In the multivariable logistic regression analysis (Table 3), similar risk factors for pulmonary cavity were observed (attaining only primary education, no education, MDR, isoniazid resistance, homelessness, and treatment failure). However, the statistically significant associations previously observed with rifampin and ethambutol resistance in the univariable model were no longer detected (p = 0.5 and 0.7, respectively). Further stratification of age into a categorical variable showed higher adjusted odds of the outcome among individuals between 50 and 64 years (aOR: 2.33, 95% CI: 0.9-6.17, p = 0.085) and above 65 years (aOR: 4.37, 95% CI: 1.16-17.0, p = 0.031), although only the association for those above 65 years was statistically significant.

Table 3: Demographic and clinical risk factors associated with pulmonary cavity (multivariable model).
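The adjusted odds ratios, confidence intervals, and p-values reported for the age strata can be cross-checked against one another. The sketch below assumes the intervals are Wald-type (symmetric on the log-odds scale), which the article does not state; under that assumption the standard error of the coefficient can be recovered from the interval width and a two-sided p-value recomputed.

```python
import math


def wald_p_from_ci(odds_ratio, lower, upper, z_crit=1.96):
    """Recover the log-odds standard error, z statistic, and two-sided p-value.

    Assumes the interval is exp(beta +/- z_crit * SE), i.e. a Wald interval.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Values as reported in the text for the two age strata.
for label, aor, lo, hi in [("50-64 years", 2.33, 0.9, 6.17),
                           ("above 65 years", 4.37, 1.16, 17.0)]:
    se, z, p = wald_p_from_ci(aor, lo, hi)
    print(f"{label}: SE={se:.3f}, z={z:.2f}, p={p:.3f}")
```

Under this assumption the recomputed p-values come out close to the reported 0.085 and 0.031, which suggests the published estimates are internally consistent; small differences would be expected from rounding of the published values.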

Predictive models

Using accuracy as the evaluation metric, the LDA model (Figure 2) performed better than the other machine learning models. The LDA model accuracy was 67.8% (58.7-72.0), while KNN performed worst (61.5%; CL: 53.3-69.9). Results for the kappa metric were less robust than those for accuracy. As with accuracy, LDA performed better than the other models (kappa: 33.7, 14.7-43.7), while KNN again performed worst (kappa: 22.0, 7.4-37.3). Using ROC as the evaluation metric, LDA was the best-performing model (71.1; 63.6-80.0) and CART the worst (66.2; 60.6-71.2). The ML models with the highest sensitivity and specificity were LDA (58.4; 45.5-68.8) and RF (80.8; 66.7-92.7), respectively. The corresponding worst-performing models (Figure 3) for sensitivity and specificity were RF (48.1; 33.1-59.4) and KNN (73.3; 56.1-85.4), respectively.

Figure 2: Evaluation metrics (Accuracy and Kappa) for the machine learning models utilized in the prediction of pulmonary cavity among study participants.

Figure 3: Evaluation metrics (ROC, Sensitivity, and Specificity) for the machine learning models utilized in the prediction of pulmonary cavity among study participants.
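For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, specificity, and Cohen's kappa follow from a confusion matrix, and how the area under the ROC curve can be computed as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The confusion-matrix counts and scores are invented for illustration and are not the study's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, and Cohen's kappa from four counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # true positives among actual positives
    specificity = tn / (tn + fp)  # true negatives among actual negatives
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "kappa": kappa}


def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented counts: 100 participants with a cavity, 100 without.
metrics = classification_metrics(tp=58, fp=20, fn=42, tn=80)
print(metrics)

# Invented predicted probabilities for three positives and three negatives.
print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

Because kappa subtracts the agreement expected by chance, it is always lower than accuracy for an imperfect classifier; in this invented example an accuracy of 0.69 corresponds to a kappa of 0.38.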


Tuberculosis is prevalent in a substantial number of low- and middle-income countries and is one of the top ten causes of death worldwide [1,5,6]. It is also the leading cause of death among individuals living with HIV and a significant contributor to antimicrobial resistance-related mortality globally. The hallmark of pulmonary TB is pulmonary cavitation, which is associated with a high sputum bacillary load [7]. A high mycobacterial burden among people with active TB disease contributes to the propagation of the disease within communities [4,8]. In addition, lung cavities have been assessed as markers of disease severity [4,9]. In this study, we explored possible risk factors and determinants of pulmonary cavity in our study population.

Our research revealed that men and individuals above 65 years were more likely to develop lung cavities [1]. These findings are similar to those of other studies and reports that observed a higher risk of pulmonary TB in this demographic [1,7,8]. Although some of these studies did not explore the risk of pulmonary cavities, their findings reveal a similar demographic trend. Palaci, et al., in their study of cavitary disease and quantitative sputum bacillary load, likewise observed that being male was significantly associated with developing cavitation [7]. This finding was also observed in the paper by Beynon, et al. on Mycobacterium tuberculosis bacillary burden [8]. Zhang, et al. reported a higher proportion of pulmonary cavitation among males; however, the association was not statistically significant. One possible explanation for the lack of statistical significance is the different reference group used [11]. In our research, females were selected as the reference group. This approach was based on the a priori assumption that men are more likely to have incident TB disease and, by extension, to develop lung cavities in a higher proportion. In addition, participants aged 65 years and above were most at risk of the outcome in our study, whereas Zhang, et al. found that individuals between 45 and 64 years of age were more likely to develop lung cavities. However, some of the other reviewed studies on pulmonary cavitation did not stratify by age to identify the age subgroup most at risk of the outcome of interest.

In addition to age and gender, MDR was also identified as a risk factor for developing lung cavities. Individuals with MDR were two times more likely to develop the outcome of interest than participants without MDR. Several studies have shown MDR to be clinically and statistically associated with worsening lung outcomes among people with active pulmonary TB [11,12]. Although MDR is treated in our study as a predictor (independent variable) of pulmonary cavity, the relationship between these two clinical variables is more likely to be bidirectional. The presence of lung cavities creates a favorable environment for a high bacillary burden, increased mycobacterial replication, and the formation of a quasi-safe haven against the body's innate and adaptive immune responses. These effects facilitate the emergence of TB multidrug resistance [10,12,13]. Likewise, the development of MDR will further promote the expansion of lung cavities because of unresponsiveness or decreased responsiveness to drug therapies [11,14]. Aside from MDR, mono-drug resistance to isoniazid was also associated with the study's outcome after adjusting for other variables and potential confounders.

Furthermore, we observed that attaining education up to the primary level only and receiving no education were significantly associated with a worse pulmonary outcome among individuals with active TB. People with only primary school education were four times more likely to develop the outcome than individuals who received education up to the tertiary level. Socioeconomic status influences access to education and healthcare, which are known risk factors for TB acquisition [15,16]. Inadequate access to or lack of education as a risk factor for TB disease or severity has similarly been observed in other studies [17,18]. Aside from education, individuals who were homeless or had experienced treatment failure were twice as likely to develop a pulmonary cavity. Homeless individuals are more likely to be exposed to unfavorable environmental and living conditions, poor sanitation and hygiene, malnutrition, poor access to healthcare, and contact with high TB bacillary shedders [19,20]. Also important to note is the related effect of homelessness on treatment failure. Studies have explored this role and observed a marked reduction in treatment success among people newly diagnosed with pulmonary TB [21]. Based on these findings, the interaction between homelessness and treatment failure is also likely to contribute substantially to the outcome, independent of the individual demographic and clinical variables.

In contrast to the sociodemographic and clinical variables discussed above and their role in promoting the outcome, we observed a different pattern with HIV status. Individuals with HIV in our study cohort were less likely to develop pulmonary cavities. Although the association was not statistically significant, the effect measures still indicate decreased odds of lung destruction for HIV-positive participants. Our finding is consistent with other studies that explored the role of HIV infection in suppressing tuberculosis-induced pulmonary cavitation [3,22]. Matrix metalloproteinases (MMPs), a group of zinc-containing proteases, have been shown to promote the breakdown and destruction of the extracellular matrix [3,23]. MMPs are also significant in facilitating the recruitment of immune cells and the remodeling of tissue [3]. Because of these functions, their activities have been instrumental in the development of lung cavities in individuals with TB (specifically MMPs 1, 2, 3, 8) [3,24]. However, in HIV-TB co-infection, MMP concentrations are markedly lower than in individuals without the co-infection [3]. The markedly low concentrations of these specific MMPs are likely to have contributed to the decreased risk of pulmonary cavity observed in our study. In addition, outcomes of individuals with HIV-TB co-infection are influenced by the CD4 count. A lower CD4 count signifies a weaker immune response against the mycobacterium. One of the limitations of this study is the lack of information on the CD4 counts of individuals diagnosed with HIV. Research has shown that lung cavities are more prevalent during the earlier phases of HIV infection [25]. During this phase, CD4 counts are relatively normal; hence, cellular immunity remains preserved. Studies also reveal reduced lung tissue destruction as HIV advances (CD4 counts < 200 cells/μL) [3].

The models scored approximately seventy percent or less on the evaluation metrics (accuracy, kappa, ROC, and sensitivity). This indicates a less robust prediction of our primary outcome of interest by the machine learning models (KNN, SVM, CART, LDA, RF, LGM). We believe an assessment score of 85 percent or more in one or more of these models would have signified a more reliable model fit. We hypothesize that the poor model performance is attributable to the lack of documentation and information on co-morbidities, specifically diabetes mellitus (DM). Previous studies have indicated a higher risk of lung cavities in people with DM than in those without DM. Two studies, by Huang, et al. and Chiang, et al., demonstrated that poor glycemic control in a subpopulation of individuals with DM was a significant risk factor for developing destructive lung disease among individuals diagnosed with active TB [26,27]. Another study on the prediction of poor treatment outcomes among individuals with TB revealed a statistically significant association between lung cavitation and DM [28]. The availability of information on DM comorbidity would strengthen the models and produce a more robust prediction. In addition, the absence of other relevant unmeasured variables would also reduce the strength of the models in accurately predicting the outcome. Although the models performed below expectation for accuracy, kappa, ROC, and sensitivity, they scored higher for specificity. Therefore, the models were better at identifying individuals without pulmonary cavities in our study cohort.

In conclusion, it is imperative that TB complications be addressed holistically. Targeting sociodemographic and relevant clinical variables will promote better pulmonary outcomes for individuals and reduce transmission within communities and at the wider population level. In addition, the significant associations identified in our research can be generalized to low- and lower-middle-income regions with a high prevalence and incidence of TB.

Funding Sources

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Authors' Contributions

All authors contributed substantially to the production of the final version of this manuscript.


  1. Global tuberculosis report.
  2. Corbett EL, Watt CJ, Walker N, Maher D, Williams BG, et al. (2003) The growing burden of tuberculosis: Global trends and interactions with the HIV epidemic. Arch Intern Med 163: 1009-1021.
  3. Ong CWM, Elkington PT, Friedland JS (2014) Tuberculosis, pulmonary cavitation, and matrix metalloproteinases. Am J Respir Crit Care Med 190: 9-18.
  4. Perrin FMR, Woodward N, Phillips PPJ, McHugh TD, Nunn AJ, et al. (2010) Radiological cavitation, sputum mycobacterial load and treatment response in pulmonary tuberculosis. Int J Tuberc Lung Dis 14: 1596-1602.
  5. Harries AD, Lin Y, Kumar AMV, Satyanarayana S, Takarinda KC, et al. (2018) What can National TB Control Programmes in low- and middle-income countries do to end tuberculosis by 2030? F1000 Research 7: 1011.
  6. CDC (1990) Tuberculosis in developing countries.
  7. Palaci M, Dietze R, Hadad DJ, Ribeiro FKC, Peres RL, et al. (2007) Cavitary disease and quantitative sputum bacillary load in cases of pulmonary tuberculosis. J Clin Microbiol 45: 4064-4066.
  8. Beynon F, Theron G, Respeito D, Mambuque E, Saavedra B, et al. (2018) Correlation of Xpert MTB/RIF with measures to assess Mycobacterium tuberculosis bacillary burden in high HIV burden areas of Southern Africa. Sci Rep 8: 5201.
  9. Fortún J, Martín-Dávila P, Molina A, Navas E, Hermida JM, et al. (2007) Sputum conversion among patients with pulmonary tuberculosis: Are there implications for removal of respiratory isolation? J Antimicrob Chemother 59: 794-798.
  10. Kaplan G, Post FA, Moreira AL, Wainwright H, Kreiswirth BN, et al. (2003) Mycobacterium tuberculosis growth at the cavity surface: A microenvironment with failed immunity. Infect Immun 71: 7099-7108.
  11. Zhang L, Pang Y, Yu X, Wang Y, Lu J, et al. (2016) Risk factors for pulmonary cavitation in tuberculosis patients from China. Emerg Microbes Infect 5: e110.
  12. Kempker RR, Rabin AS, Nikolaishvili K, Kalandadze I, Gogishvili S, et al. (2012) Additional drug resistance in Mycobacterium tuberculosis isolates from resected cavities among patients with multidrug-resistant or extensively drug-resistant pulmonary tuberculosis. Clin Infect Dis 54: e51-e54.
  13. Vandiviere HM, Loring WE, Melvin I, Willis S (1956) The treated pulmonary lesion and its tubercle bacillus. II. The death and resurrection. Am J Med Sci 232: 30-37.
  14. Vadwai V, Daver G, Udwadia Z, Sadani M, Shetty A, et al. (2011) Clonal population of Mycobacterium tuberculosis strains reside within multiple lung cavities. PLoS One 6: e24770.
  15. Thomson S (2018) Achievement at school and socioeconomic background-an educational perspective. NPJ Sci Learn 3: 5.
  16. Becker G, Newsom E (2003) Socioeconomic status and dissatisfaction with health care among chronically ill African Americans. Am J Public Health 93: 742-748.
  17. Jiamsakul A, Lee M-P, Van Nguyen K, Merati TP, Cuong DD, et al. (2018) Socio-economic statuses and risk of tuberculosis - a case-control study of HIV-infected patients in Asia. Int J Tuberc Lung Dis 22: 179-186.
  18. Gupta S, Shenoy VP, Mukhopadhyay C, Bairy I, Muralidharan S (2011) Role of risk factors and socio-economic status in pulmonary tuberculosis: A search for the root cause in patients in a tertiary care hospital, South India. Trop Med Int Health 16: 74-78.
  19. Dias M, Gaio R, Sousa P, Abranches M, Gomes M, et al. (2017) Tuberculosis among the homeless: Should we change the strategy? Int J Tuberc Lung Dis 21: 327-332.
  20. Lee C-H, Jeong Y-J, Heo EY, Park JS, Lee JS, et al. (2013) Active pulmonary tuberculosis and latent tuberculosis infection among homeless people in Seoul, South Korea: A cross-sectional study. BMC Public Health 13: 720.
  21. Lalis A, Leblois R, Lecompte E, Denys C, Meulen JT, et al. (2012) The impact of human conflict on the genetics of Mastomys natalensis and Lassa virus in West Africa. PLoS One 7: e37068.
  22. Walker NF, Clark SO, Oni T, Andreu N, Tezera L, et al. (2012) Doxycycline and HIV infection suppress tuberculosis-induced matrix metalloproteinases. Am J Respir Crit Care Med 185: 989-997.
  23. Khokha R, Murthy A, Weiss A (2013) Metalloproteinases and their natural inhibitors in inflammation and immunity. Nat Rev Immunol 13: 649-665.
  24. Ugarte-Gil CA, Elkington P, Gilman RH, Coronel J, Tezera LB, et al. (2013) Induced sputum MMP-1, -3 & -8 concentrations during treatment of tuberculosis. PLoS One 8: e61333.
  25. Gallant JE, Ko AH (1996) Cavitary pulmonary lesions in patients infected with human immunodeficiency virus. Clin Infect Dis 22: 671-682.
  26. Chiang C-Y, Lee J-J, Chien S-T, Enarson DA, Chang Y-C, et al. (2014) Glycemic control and radiographic manifestations of tuberculosis in diabetic patients. PLoS One 9: e93397.
  27. Huang L-K, Jiang L-D, Lai Y-C, Wu M-H, Chang S-C (2019) Pulmonary tuberculous cavities in diabetic patients: Glycemic control is still the dominant factor despite the emerging role of metformin. J Chin Med Assoc 82: 628-634.
  28. You N, Pan H, Zeng Y, Lu P, Zhu L, et al. (2021) A risk score for prediction of poor treatment outcomes among tuberculosis patients with diagnosed diabetes mellitus from eastern China. Sci Rep 11: 11219.


Balogun OO, Fawole A, Osemwinyen E, Balogun B (2021) Predictors of Pulmonary Cavitation among Tuberculosis Patients. J Infect Dis Epidemiol 7:231. doi.org/10.23937/2474-3658/1510231