A weakly supervised clustering method for cancer subgroup identification
Author
Özçelik, Duygu
Advisor
Okan, Öznur Taştan
Date
2016-08
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract
Each cancer type is a heterogeneous disease consisting of subtypes, which may be distinguished at the molecular, histopathological, and clinical levels. Identifying the patient subtypes of a cancer type is critically important, as the unique molecular characteristics of a particular patient subgroup reveal distinct disease states and open up possibilities for targeted therapeutic regimens. Traditionally, unsupervised clustering techniques are applied to the genomic data of the tumor samples, and the resulting patient clusters are considered of interest if they can be associated with a clinical outcome variable such as patient survival. In lieu of this unsupervised framework, we propose a weakly supervised clustering framework, WS-RFClust, in which the clustering partitions are guided by the clinical outcome of interest. In WS-RFClust, a random forest is trained to classify the patients based on a categorical clinical variable of interest. We use the partitions of patients on the tree ensemble to construct a patient similarity matrix, which is then used as input to a clustering algorithm. For clustering, WS-RFClust inherently uses the nonlinear subspace of the original features that is learned in the classification step. In this study, we demonstrate the effectiveness of WS-RFClust on handwritten digit datasets, where it captures salient structural similarities between digit pairs. Finally, we employ WS-RFClust to find breast cancer subtypes using mRNA, protein, and microRNA expression as features. Our results on the breast cancer subtype identification problem show that WS-RFClust could identify patients more effectively than commonly used unsupervised clustering methods.
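The pipeline described in the abstract (train a random forest on a categorical variable, derive a sample similarity matrix from the tree partitions, then cluster on that matrix) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The following are assumptions made for illustration only: the function names rf_similarity and ws_rf_cluster, the similarity definition (fraction of trees in which two samples share a leaf), the use of average-linkage hierarchical clustering, the odd/even weak label, and the scikit-learn digits subset used as stand-in data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier


def rf_similarity(forest, X):
    """Fraction of trees in which each pair of samples falls in the same leaf.

    Assumption: the abstract only says the tree partitions are used to build
    a similarity matrix; leaf co-occurrence is one common way to do that.
    """
    leaves = forest.apply(X)  # shape (n_samples, n_trees), leaf index per tree
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)


def ws_rf_cluster(X, y_weak, n_clusters, n_trees=50, seed=0):
    """Hypothetical helper: weakly supervised clustering via a random forest."""
    # Step 1: train a forest to predict the weak (categorical) label.
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y_weak)

    # Step 2: build the sample-by-sample similarity matrix from the trees.
    sim = rf_similarity(forest, X)

    # Step 3: cluster on the induced distance. Average linkage is an
    # assumption; the abstract does not name the clustering algorithm.
    dist = 1.0 - sim
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return sim, labels


# Stand-in data: a small subset of the scikit-learn digits dataset.
X, digits = load_digits(return_X_y=True)
X, digits = X[:300], digits[:300]

# Assumed weak label: odd vs. even digit, standing in for a coarse
# categorical clinical variable.
y_weak = digits % 2

sim, labels = ws_rf_cluster(X, y_weak, n_clusters=10)
```

Here the forest only ever sees the coarse odd/even label, yet the clustering runs on the similarity it induces, so any finer grouping that emerges comes from how the trees partition the feature space. Comparing labels against the held-back digit identities is one way to inspect that.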
Keywords
Clustering
Weakly supervised clustering
Subspace clustering
Cancer subtype identification
Patient subgroup identification