A weakly supervised clustering method for cancer subgroup identification
Please cite this item using this persistent URLhttp://hdl.handle.net/11693/32162
Okan, Öznur Taştan
Each cancer type is a heteregonous disease consisting of subtypes, which may be distinguished at the molecular, histopathological, and clinical level. Identifying the patient subtypes of a cancer type is critically important as the unique molecular characteristics of a particular patient subgroup reveal distinct disease states and opens up possibilities for targeted therapeutic regimens. Traditionally, unsupervised clustering techniques are applied on the genomic data of the tumor samples and the patient clusters are found to be of interest if they can be associated with a clinical outcome variable such as the survival of patients. In lieu of this unsupervised framework, we propose a weakly supervised clustering framework, WS-RFClust, in which the clustering partitions are guided with the clinical outcome of interest. In WS-RFClust a random forest is trained to classify the patients based on a categorical clinical variable of interest. We use the partitions of patients on the tree ensemble to construct a patient similarity matrix, which is then used as input to a clustering algorithm. WS-RFClust inherently uses the nonlinear subspace of the original features that is learned in the classiffication step for clustering. In this study, we demonstrate the effectiveness of WS-RFClust on hand-written digit datasets, which captures salient structural similarities of digit pairs. Finally, we employ WS-RFClust to find breast cancer subtypes using mRNA, protein and microRNA expressions as features. Our results on breast cancer subtype identiffication problem show that WS-RFClust could identify patients more effectively in comparison to the commonly used unsupervised clustering methods.