Complex diseases such as cancer have traditionally been studied using either genetic data or images alone. Joint analysis of multiple data modalities could offer additional insight into the biology of such diseases. We propose canonical correlation analysis (CCA) as a preliminary discovery tool for identifying connections across modalities, specifically between gene expression and features describing cell and nucleus shape, texture, and stain intensity in histopathological images.

It is also important to capture the interaction between different types of cells, which is an important indicator of disease status. To that end, it is crucial to quantify and use the spatial distribution of the various cell types within the examined tissue at different scales. We employ Ripley's K-statistic, a feature traditionally used in geographical information systems, which captures the spatial distribution patterns of individual point sets and the interactions between multiple point sets. We propose to improve the histopathology image features by incorporating this descriptor to capture the spatial distribution of cells and the interactions between lymphocytes and epithelial cells.

Applied to 615 breast cancer samples from The Cancer Genome Atlas (TCGA), CCA and Sparse CCA revealed significant correlations of 0.736 (p ≈ 1e-14) and 0.471 (p ≈ 7e-3), respectively, between several image features and the expression of PAM50 genes, which are known to be linked to outcome. Sparse CCA, an extension of CCA based on sparsity, revealed associations with enrichment of pathways implicated in cancer without leveraging prior biological understanding. The utility of Ripley's K-statistic in the context of imaging-genetics is demonstrated on histopathology images from 710 TCGA breast invasive carcinoma (BRCA) patients by its superior correlations with gene expression.
These findings affirm the utility of CCA for joint phenotype-genotype analysis of cancer, and the importance of capturing spatial features at multiple scales.
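
As an illustration of the spatial descriptor discussed above, the following is a minimal sketch of a cross-type Ripley's K estimate between two point sets, for example lymphocyte and epithelial cell centroids. It is not the implementation used in this work: the function name `ripley_cross_k`, the absence of edge correction, and the synthetic coordinates are all assumptions made for illustration. The estimate used here is K(r) = A / (n_a n_b) times the number of cross-type pairs separated by at most r, where A is the area of the observation window.

```python
import numpy as np


def ripley_cross_k(points_a, points_b, radii, area):
    """Naive cross-type Ripley's K estimate (hypothetical helper).

    No edge correction is applied, so values are biased low for radii
    that are large relative to the observation window.
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    radii = np.asarray(radii, dtype=float)
    # Pairwise Euclidean distances between every point in a and every point in b.
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # For each radius, count the cross-type pairs that lie within that distance.
    counts = (dists[None, :, :] <= radii[:, None, None]).sum(axis=(1, 2))
    # Scale by window area over the number of possible cross-type pairs.
    return area * counts / (len(a) * len(b))


if __name__ == "__main__":
    # Synthetic centroids in a unit-square window (assumed data, not TCGA).
    rng = np.random.default_rng(0)
    lymphocytes = rng.random((200, 2))
    epithelial = rng.random((300, 2))
    radii = np.array([0.02, 0.05, 0.10])
    k_hat = ripley_cross_k(lymphocytes, epithelial, radii, area=1.0)
    # Under spatial independence K(r) is close to pi * r**2; larger values
    # suggest attraction between the two cell types at that scale.
    print(k_hat, np.pi * radii**2)
```

Evaluating the statistic over a range of radii gives a multi-scale feature vector per image, which could then be appended to the other image features before running CCA.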
Multimodal data analysis applied to a medical setting