Unveiling novel bladder cancer associations from multicentred primary and secondary care electronic health records by machine learning: a case-control study

Research output: Contribution to journal › Article › peer-review

Abstract

Objective
The rising incidence and mortality of bladder cancer (BC) underscore the importance of identifying associated features. Current reliance on haematuria as the primary indicator of BC proves inadequate. While mining electronic health records (EHRs) offers the potential to identify BC-related signals, traditional data-driven methods struggle with high-dimensional datasets. This study aims to uncover novel BC-associated clinical signals by developing the Parsimony-driven cAtegory-balaNced binary Signal extractor for Primary Care EHRs (PanSPICE), tailored to extremely high-dimensional data linked from multiple centres.
Methods
We collected BC cases and control patients (n = 64,884) linked at patient level from Welsh nationwide databases, yielding 48,261 features in primary care settings. The PanSPICE approach begins with information gain to pre-rank features, then applies Retentive Stickiness Binary Particle Swarm Optimisation (RSBPSO) combined with a C5.0 classification tree to overcome computational barriers in feature selection. A two-layer optimisation treated clinical signals in care processes (POC), diagnoses (DIAG), and medications (MED) separately to prevent feature masking. A tailored fitness function was designed for RSBPSO to simultaneously optimise model performance and feature sparsity. Associations of the selected features were interpreted using logistic regression models adjusted for deprivation indices.
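As a rough illustration of two ideas named in the Methods, the sketch below ranks binary features by information gain and scores a candidate feature subset with a parsimony-aware fitness value. It is not the PanSPICE implementation: the function names, the weighted-sum form of `fitness`, and the weight `alpha` are assumptions chosen for illustration, and the abstract does not state the actual fitness function used by RSBPSO.

```python
import math
from typing import Dict, List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a binary 0/1 label vector."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Reduction in label entropy after splitting on a binary 0/1 feature."""
    n = len(labels)
    present = [y for x, y in zip(feature, labels) if x == 1]
    absent = [y for x, y in zip(feature, labels) if x == 0]
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def rank_features(columns: Dict[str, Sequence[int]],
                  labels: Sequence[int]) -> List[str]:
    """Feature names sorted from highest to lowest information gain."""
    return sorted(columns,
                  key=lambda name: information_gain(columns[name], labels),
                  reverse=True)


def fitness(auc: float, n_selected: int, n_total: int,
            alpha: float = 0.9) -> float:
    """Hypothetical fitness: weighted sum of AUC and feature sparsity.

    alpha is an assumed trade-off weight, not a value from the study.
    """
    sparsity = 1.0 - n_selected / n_total
    return alpha * auc + (1 - alpha) * sparsity


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    labels = [1, 1, 0, 0]
    columns = {"noise": [1, 0, 1, 0], "perfect": [1, 1, 0, 0]}
    print(rank_features(columns, labels))
    print(fitness(auc=0.8, n_selected=10, n_total=100))
```

With a fitness of this shape, two subsets with equal AUC are separated by size, so the smaller subset scores higher. This is one simple way to express the trade-off between performance and sparsity.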
Results
The PanSPICE identified 38 optimal features (AUC (area under the curve) = 0.81, 95 % CI: 0.80–0.82), including urinary tract infections (OR = 2.19, 95 % CI: 2.05–2.14) and inverse associations with stroke (OR = 0.64, 95 % CI: 0.54–0.74) and dementia (OR = 0.25, 95 % CI: 0.17–0.35). Gender stratification revealed female-specific urine glucose testing association (OR = 1.24, 95 % CI: 1.08–1.43). Certain medications, such as trimethoprim, were positively associated with BC, while others, including ramipril and prednisolone, showed protective effects.
Conclusion
The PanSPICE enables efficient high-dimensional EHR analysis, revealing under-recognised potential BC risk profiles and protective comorbidities. Gender-specific differences in BC associations highlight the importance of gender-stratified analyses, while computational advances provide a template for EHR-based clinical discovery. Findings warrant further mechanistic research into neurological protective pathways.
Original language: English
Article number: 104959
Journal: Journal of Biomedical Informatics
Volume: 172
Publication status: Published - 15 Nov 2025

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications

Keywords

  • Bladder cancer
  • Electronic health records
  • Feature selection
  • Machine Learning
  • Parsimony
  • Particle Swarm Optimisation
  • Primary care
  • Sex differences in bladder cancer
