Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices

Cassio P. de Campos, Paola M.V. Rancoita, Ivo Kwee, Emanuele Zucca, Marco Zaffalon, Francesco Bertoni

Research output: Contribution to journal, Article, Academic, peer-review

Abstract

In the study of complex genetic diseases, identifying subgroups of patients who share similar genetic characteristics is a challenging task that can, for example, improve treatment decisions. One type of genetic lesion frequently investigated in such disorders is a change in DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique for reducing the dimensionality of a data set and clustering data samples while keeping the most relevant information in meaningful components. It can therefore be used to discover subgroups of patients from CN profiles. It is, however, computationally impractical for very high-dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure that compacts high-dimensional data in order to improve the applicability of NMF without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, motivated by biological and clinical relevance rather than simply aiming at well-separated clusters. We evaluate our procedure using four real, independent data sets. In these data sets, our method found accurate subgroups with distinct molecular and clinical features and outperformed standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for discovering subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
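
The abstract does not spell out the compaction procedure, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes a probes-by-patients matrix of non-negative CN values, assumes that compaction means merging runs of consecutive identical probe rows and weighting each merged row by its run length, and swaps in the standard Lee-Seung multiplicative updates as the NMF solver. The names `compact_rows`, `nmf`, and `cluster_patients` are hypothetical.

```python
import math
import random

def compact_rows(V):
    """Merge runs of consecutive identical rows; return (rows, run_lengths)."""
    rows, counts = [], []
    for row in V:
        if rows and rows[-1] == list(row):
            counts[-1] += 1          # same profile as previous probe: extend run
        else:
            rows.append(list(row))   # new distinct profile starts a new run
            counts.append(1)
    return rows, counts

def transpose(A):
    return [list(c) for c in zip(*A)]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in A]

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Multiplicative updates for min ||V - WH||_F^2 with W, H >= 0."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def cluster_patients(V, k, **kwargs):
    """Compact V, factorize it, and label each patient (column)."""
    rows, counts = compact_rows(V)
    # Scaling a merged row by sqrt(run length) makes its squared error count
    # as many times as the number of original rows it replaces.
    scaled = [[math.sqrt(c) * x for x in r] for r, c in zip(rows, counts)]
    W, H = nmf(scaled, k, **kwargs)
    # Remove the scale ambiguity between W and H before comparing components.
    norms = [math.sqrt(sum(W[r][i] ** 2 for r in range(len(W))))
             for i in range(k)]
    n_patients = len(H[0])
    labels = [max(range(k), key=lambda i: H[i][j] * norms[i])
              for j in range(n_patients)]
    return labels, counts

# Toy data: 7 probes x 4 patients; patients 0-1 and 2-3 have disjoint lesions.
V = [[2, 2, 0, 0]] * 3 + [[0, 0, 2, 2]] * 4
labels, counts = cluster_patients(V, k=2)
print(counts)   # run lengths of the merged probe rows: [3, 4]
print(labels)   # patients 0 and 1 should share one label, 2 and 3 the other
```

In this toy case the seven probe rows collapse to two weighted rows, so the factorization runs on a 2 x 4 matrix instead of a 7 x 4 one. Which integer is assigned to which subgroup depends on the random initialization, so only the grouping is meaningful.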

Original language: English
Article number: e79720
Number of pages: 12
Journal: PLoS ONE
Volume: 8
Issue number: 11
DOIs: 10.1371/journal.pone.0079720
Publication status: Published - 20 Nov 2013
Externally published: Yes

Fingerprint

Factorization
DNA
Microarrays
DNA Copy Number Variations
Inborn Genetic Diseases
genetic disorders
Cluster Analysis
genomics
methodology
Datasets
sampling
Therapeutics

Cite this

de Campos, Cassio P. ; Rancoita, Paola M.V. ; Kwee, Ivo ; Zucca, Emanuele ; Zaffalon, Marco ; Bertoni, Francesco. / Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices. In: PLoS ONE. 2013 ; Vol. 8, No. 11.
@article{7e5f908ab06947ef88ec27cdb6d5e4ef,
title = "Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices",
author = "{de Campos}, {Cassio P.} and Rancoita, {Paola M.V.} and Ivo Kwee and Emanuele Zucca and Marco Zaffalon and Francesco Bertoni",
year = "2013",
month = "11",
day = "20",
doi = "10.1371/journal.pone.0079720",
language = "English",
volume = "8",
journal = "PLoS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "11",

}

Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices. / de Campos, Cassio P.; Rancoita, Paola M.V.; Kwee, Ivo; Zucca, Emanuele; Zaffalon, Marco; Bertoni, Francesco.

In: PLoS ONE, Vol. 8, No. 11, e79720, 20.11.2013.

TY - JOUR
T1 - Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices
AU - de Campos, Cassio P.
AU - Rancoita, Paola M.V.
AU - Kwee, Ivo
AU - Zucca, Emanuele
AU - Zaffalon, Marco
AU - Bertoni, Francesco
PY - 2013/11/20
Y1 - 2013/11/20
UR - http://www.scopus.com/inward/record.url?scp=84894283376&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0079720
DO - 10.1371/journal.pone.0079720
M3 - Article
C2 - 24278162
VL - 8
JO - PLoS ONE
JF - PLoS ONE
SN - 1932-6203
IS - 11
M1 - e79720
ER -