GenHap: a novel computational method based on genetic algorithms for haplotype assembly

Andrea Tangherloni (Corresponding author), Simone Spolaor, Leonardo Rundo, Marco S. Nobile, Paolo Cazzaniga, Giancarlo Mauri, Pietro Liò, Ivan Merelli, Daniela Besozzi

Research output: Contribution to journalArticleAcademicpeer-review

5 Citations (Scopus)

Abstract

Background
In order to fully characterize the genome of an individual, the reconstruction of the two distinct copies of each chromosome, called haplotypes, is essential. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications.

Results
To reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we run GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4× faster than HapCol in the case of Roche/454 instances and up to 20× faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets.

Conclusions
Future-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem. Moreover, the optimization approach proposed in GenHap can be extended to the study of allele-specific genomic features, such as expression, methylation and chromatin conformation, by exploiting multi-objective optimization techniques. The source code and the full documentation are available at the following GitHub repository: https://github.com/andrea-tango/GenHap.
Original languageEnglish
Article number172
Number of pages14
JournalBMC Bioinformatics
Volume20
Issue numberSuppl 4
DOIs
Publication statusPublished - 18 Apr 2019
Externally publishedYes

Fingerprint

Haplotype
Computational methods
Computational Methods
Haplotypes
Nucleotides
Genetic algorithms
Polymorphism
Genetic Algorithm
Chromosomes
Sequencing
Single nucleotide Polymorphism
Methylation
Medical applications
Error correction
Multiobjective optimization
Single Nucleotide Polymorphism
Chromatin
Conformations
Computational complexity
Genes

Keywords

  • Combinatorial optimization
  • Future-generation sequencing
  • Genetic algorithms
  • Haplotype assembly
  • Weighted minimum error correction problem
  • Algorithms
  • Time Factors
  • Humans
  • Databases, Genetic
  • Computational Biology/methods
  • Haplotypes/genetics

Cite this

Tangherloni, A., Spolaor, S., Rundo, L., Nobile, M. S., Cazzaniga, P., Mauri, G., ... Besozzi, D. (2019). GenHap: a novel computational method based on genetic algorithms for haplotype assembly. BMC Bioinformatics, 20(Suppl 4), [172]. https://doi.org/10.1186/s12859-019-2691-y
Tangherloni, Andrea ; Spolaor, Simone ; Rundo, Leonardo ; Nobile, Marco S. ; Cazzaniga, Paolo ; Mauri, Giancarlo ; Liò, Pietro ; Merelli, Ivan ; Besozzi, Daniela. / GenHap : a novel computational method based on genetic algorithms for haplotype assembly. In: BMC Bioinformatics. 2019 ; Vol. 20, No. Suppl 4.
@article{f22064947c4a48aabb290b449a5299ca,
title = "GenHap: a novel computational method based on genetic algorithms for haplotype assembly",
abstract = "BackgroundIn order to fully characterize the genome of an individual, the reconstruction of the two distinct copies of each chromosome, called haplotypes, is essential. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications.ResultsTo reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we run GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4× faster than HapCol in the case of Roche/454 instances and up to 20× faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets.ConclusionsFuture-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem. Moreover, the optimization approach proposed in GenHap can be extended to the study of allele-specific genomic features, such as expression, methylation and chromatin conformation, by exploiting multi-objective optimization techniques. The source code and the full documentation are available at the following GitHub repository: https://github.com/andrea-tango/GenHap.",
keywords = "Combinatorial optimization, Future-generation sequencing, Genetic algorithms, Haplotype assembly, Weighted minimum error correction problem, Algorithms, Time Factors, Humans, Databases, Genetic, Computational Biology/methods, Haplotypes/genetics",
author = "Andrea Tangherloni and Simone Spolaor and Leonardo Rundo and Nobile, {Marco S.} and Paolo Cazzaniga and Giancarlo Mauri and Pietro Li{\`o} and Ivan Merelli and Daniela Besozzi",
year = "2019",
month = "4",
day = "18",
doi = "10.1186/s12859-019-2691-y",
language = "English",
volume = "20",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "Suppl 4",

}

Tangherloni, A, Spolaor, S, Rundo, L, Nobile, MS, Cazzaniga, P, Mauri, G, Liò, P, Merelli, I & Besozzi, D 2019, 'GenHap: a novel computational method based on genetic algorithms for haplotype assembly', BMC Bioinformatics, vol. 20, no. Suppl 4, 172. https://doi.org/10.1186/s12859-019-2691-y

GenHap : a novel computational method based on genetic algorithms for haplotype assembly. / Tangherloni, Andrea (Corresponding author); Spolaor, Simone; Rundo, Leonardo; Nobile, Marco S.; Cazzaniga, Paolo; Mauri, Giancarlo; Liò, Pietro; Merelli, Ivan; Besozzi, Daniela.

In: BMC Bioinformatics, Vol. 20, No. Suppl 4, 172, 18.04.2019.

Research output: Contribution to journalArticleAcademicpeer-review

TY - JOUR

T1 - GenHap

T2 - a novel computational method based on genetic algorithms for haplotype assembly

AU - Tangherloni, Andrea

AU - Spolaor, Simone

AU - Rundo, Leonardo

AU - Nobile, Marco S.

AU - Cazzaniga, Paolo

AU - Mauri, Giancarlo

AU - Liò, Pietro

AU - Merelli, Ivan

AU - Besozzi, Daniela

PY - 2019/4/18

Y1 - 2019/4/18

N2 - BackgroundIn order to fully characterize the genome of an individual, the reconstruction of the two distinct copies of each chromosome, called haplotypes, is essential. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications.ResultsTo reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we run GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4× faster than HapCol in the case of Roche/454 instances and up to 20× faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets.ConclusionsFuture-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem. Moreover, the optimization approach proposed in GenHap can be extended to the study of allele-specific genomic features, such as expression, methylation and chromatin conformation, by exploiting multi-objective optimization techniques. The source code and the full documentation are available at the following GitHub repository: https://github.com/andrea-tango/GenHap.

AB - BackgroundIn order to fully characterize the genome of an individual, the reconstruction of the two distinct copies of each chromosome, called haplotypes, is essential. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications.ResultsTo reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we run GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4× faster than HapCol in the case of Roche/454 instances and up to 20× faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets.ConclusionsFuture-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem. Moreover, the optimization approach proposed in GenHap can be extended to the study of allele-specific genomic features, such as expression, methylation and chromatin conformation, by exploiting multi-objective optimization techniques. The source code and the full documentation are available at the following GitHub repository: https://github.com/andrea-tango/GenHap.

KW - Combinatorial optimization

KW - Future-generation sequencing

KW - Genetic algorithms

KW - Haplotype assembly

KW - Weighted minimum error correction problem

KW - Algorithms

KW - Time Factors

KW - Humans

KW - Databases, Genetic

KW - Computational Biology/methods

KW - Haplotypes/genetics

UR - http://www.scopus.com/inward/record.url?scp=85064431423&partnerID=8YFLogxK

U2 - 10.1186/s12859-019-2691-y

DO - 10.1186/s12859-019-2691-y

M3 - Article

C2 - 30999845

AN - SCOPUS:85064431423

VL - 20

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - Suppl 4

M1 - 172

ER -