Efficient representation of DNA data for pattern recognition using failure factor oracles

Loek Cleophas, Derrick G. Kourie, Bruce W. Watson

Research output: Chapter in Book/Report/Conference proceedingConference contributionAcademicpeer-review

Abstract

In indexing of and pattern matching on DNA sequences, representing all factors of a sequence is important. One efficient, compact representation is the factor oracle (FO). At the same time, any classical deterministic finite automata (DFA) can be transformed to a so-called failure one (FDFA), which may use failure transitions to replace multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and directly construct a failure factor oracle (FFO) from a given sequence, in contrast to ex post facto transformation to an FDFA. The algorithm is suitable for long sequences. We empirically compared the resulting FFOs and FOs on number of transitions for many DNA sequences of lengths 4• 512, showing gains of up to 10%in total number of transitions, with failure transitions also taking up less space than symbol transitions. Preliminary results on sequence processing runtimes when using FFOs originally showed these to be multiples of those when using FOs, but partial memoization already leads to drastic improvements. Altogether the results are promising, particularly for the use of FFOs for (repeated) factor detection, where recognition speed may be less important than memory use.

Original languageEnglish
Title of host publicationSAICSIT '13 : Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 7-9 October 2013, East London, South Africa
Place of PublicationNew York
PublisherAssociation for Computing Machinery, Inc
Pages369-377
Number of pages9
ISBN (Print)9781450321129
DOIs
Publication statusPublished - 2013
Event2013 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2013) - Blue Lagoon Hotel & Conference Centre, East London, South Africa
Duration: 7 Oct 20139 Oct 2013
http://www.ict.ru.ac.za/saicsit2013/

Conference

Conference2013 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2013)
Abbreviated titleSAICSIT 2013
CountrySouth Africa
CityEast London
Period7/10/139/10/13
Other“A Connected Society”
Internet address

Fingerprint

DNA sequences
Pattern recognition
DNA
Pattern matching
Finite automata
Data storage equipment
Processing

Keywords

  • Algorithmics
  • Dictionary
  • DNA sequences
  • Pattern matching

Cite this

Cleophas, L., Kourie, D. G., & Watson, B. W. (2013). Efficient representation of DNA data for pattern recognition using failure factor oracles. In SAICSIT '13 : Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 7-9 October 2013, East London, South Africa (pp. 369-377). New York: Association for Computing Machinery, Inc. https://doi.org/10.1145/2513456.2513488
Cleophas, Loek ; Kourie, Derrick G. ; Watson, Bruce W. / Efficient representation of DNA data for pattern recognition using failure factor oracles. SAICSIT '13 : Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 7-9 October 2013, East London, South Africa. New York : Association for Computing Machinery, Inc, 2013. pp. 369-377
@inproceedings{029bbf09649648918d9f2a1a115c8e70,
title = "Efficient representation of DNA data for pattern recognition using failure factor oracles",
abstract = "In indexing of and pattern matching on DNA sequences, representing all factors of a sequence is important. One efficient, compact representation is the factor oracle (FO). At the same time, any classical deterministic finite automata (DFA) can be transformed to a so-called failure one (FDFA), which may use failure transitions to replace multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and directly construct a failure factor oracle (FFO) from a given sequence, in contrast to ex post facto transformation to an FDFA. The algorithm is suitable for long sequences. We empirically compared the resulting FFOs and FOs on number of transitions for many DNA sequences of lengths 4• 512, showing gains of up to 10{\%}in total number of transitions, with failure transitions also taking up less space than symbol transitions. Preliminary results on sequence processing runtimes when using FFOs originally showed these to be multiples of those when using FOs, but partial memoization already leads to drastic improvements. Altogether the results are promising, particularly for the use of FFOs for (repeated) factor detection, where recognition speed may be less important than memory use.",
keywords = "Algorithmics, Dictionary, DNA sequences, Pattern matching",
author = "Loek Cleophas and Kourie, {Derrick G.} and Watson, {Bruce W.}",
year = "2013",
doi = "10.1145/2513456.2513488",
language = "English",
isbn = "9781450321129",
pages = "369--377",
booktitle = "SAICSIT '13 : Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 7-9 October 2013, East London, South Africa",
publisher = "Association for Computing Machinery, Inc",
address = "United States",

}

Cleophas, L, Kourie, DG & Watson, BW 2013, Efficient representation of DNA data for pattern recognition using failure factor oracles. in SAICSIT '13 : Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 7-9 October 2013, East London, South Africa. Association for Computing Machinery, Inc, New York, pp. 369-377, 2013 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2013), East London, South Africa, 7/10/13. https://doi.org/10.1145/2513456.2513488

Efficient representation of DNA data for pattern recognition using failure factor oracles. / Cleophas, Loek; Kourie, Derrick G.; Watson, Bruce W.

SAICSIT '13 : Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 7-9 October 2013, East London, South Africa. New York : Association for Computing Machinery, Inc, 2013. p. 369-377.

Research output: Chapter in Book/Report/Conference proceedingConference contributionAcademicpeer-review

TY - GEN

T1 - Efficient representation of DNA data for pattern recognition using failure factor oracles

AU - Cleophas, Loek

AU - Kourie, Derrick G.

AU - Watson, Bruce W.

PY - 2013

Y1 - 2013

N2 - In indexing of and pattern matching on DNA sequences, representing all factors of a sequence is important. One efficient, compact representation is the factor oracle (FO). At the same time, any classical deterministic finite automata (DFA) can be transformed to a so-called failure one (FDFA), which may use failure transitions to replace multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and directly construct a failure factor oracle (FFO) from a given sequence, in contrast to ex post facto transformation to an FDFA. The algorithm is suitable for long sequences. We empirically compared the resulting FFOs and FOs on number of transitions for many DNA sequences of lengths 4• 512, showing gains of up to 10%in total number of transitions, with failure transitions also taking up less space than symbol transitions. Preliminary results on sequence processing runtimes when using FFOs originally showed these to be multiples of those when using FOs, but partial memoization already leads to drastic improvements. Altogether the results are promising, particularly for the use of FFOs for (repeated) factor detection, where recognition speed may be less important than memory use.

AB - In indexing of and pattern matching on DNA sequences, representing all factors of a sequence is important. One efficient, compact representation is the factor oracle (FO). At the same time, any classical deterministic finite automata (DFA) can be transformed to a so-called failure one (FDFA), which may use failure transitions to replace multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and directly construct a failure factor oracle (FFO) from a given sequence, in contrast to ex post facto transformation to an FDFA. The algorithm is suitable for long sequences. We empirically compared the resulting FFOs and FOs on number of transitions for many DNA sequences of lengths 4• 512, showing gains of up to 10%in total number of transitions, with failure transitions also taking up less space than symbol transitions. Preliminary results on sequence processing runtimes when using FFOs originally showed these to be multiples of those when using FOs, but partial memoization already leads to drastic improvements. Altogether the results are promising, particularly for the use of FFOs for (repeated) factor detection, where recognition speed may be less important than memory use.

KW - Algorithmics

KW - Dictionary

KW - DNA sequences

KW - Pattern matching

UR - http://www.scopus.com/inward/record.url?scp=84886248424&partnerID=8YFLogxK

U2 - 10.1145/2513456.2513488

DO - 10.1145/2513456.2513488

M3 - Conference contribution

AN - SCOPUS:84886248424

SN - 9781450321129

SP - 369

EP - 377

BT - SAICSIT '13 : Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 7-9 October 2013, East London, South Africa

PB - Association for Computing Machinery, Inc

CY - New York

ER -

Cleophas L, Kourie DG, Watson BW. Efficient representation of DNA data for pattern recognition using failure factor oracles. In SAICSIT '13 : Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 7-9 October 2013, East London, South Africa. New York: Association for Computing Machinery, Inc. 2013. p. 369-377 https://doi.org/10.1145/2513456.2513488