Short repetitive sequences, called tandem repeats, are abundant throughout the human genome, in both coding and non-coding regions. Their role is still largely unknown, but at least 20 of these repetitive sequences have been linked to neurodegenerative disorders. The mutational process underlying these disorders is not yet fully understood. Understanding the origin, function and possible usefulness of tandem repeats will require the analysis of large volumes of data from various sources. In this paper we attempt such a large-scale analysis of short repeats. We describe and discuss the steps needed to perform large-scale genomic analysis. We define tandem repeats and compare the results of repeat localization with genome annotations. We show that the degree of repetitiveness differs between the human chromosomes: chromosomes 19 and 17 have more repeats per megabase pair than any other chromosome, and the Y chromosome has the fewest. We also demonstrate that some repeat motifs are much more common than others. Mono- and dinucleotide repeats are the most abundant, with A and AAC the most common motifs, while CG is hardly present in the genome. Repeats with unit length 3 are underrepresented in the genome, and repeats with unit length 9 are extremely rare.
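As an illustration of the kind of measurement summarized above, the following is a minimal sketch of how perfect tandem repeats could be located in a sequence and expressed as a density per megabase pair. It is not the definition or pipeline used in the paper: the function names (`find_tandem_repeats`, `canonical_motif`, `repeats_per_mbp`), the thresholds (`max_unit`, `min_copies`, `min_length`) and the choice to merge a motif with its rotations and reverse complement are all assumptions made for this example.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Smallest string among all rotations of the motif and of its reverse
    complement, so that e.g. AC, CA, GT and TG are counted as one motif.
    (Assumed grouping; the paper may group motifs differently.)"""
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i]
                 for s in (motif, reverse_complement)
                 for i in range(len(s))]
    return min(rotations)


def find_tandem_repeats(seq, max_unit=6, min_copies=3, min_length=6):
    """Return sorted (start, end, motif, copies) tuples for perfect repeats.

    Coordinates are 0-based and end-exclusive. All thresholds are
    illustrative assumptions."""
    seq = seq.upper()
    hits = []
    for unit in range(1, max_unit + 1):
        # A unit of this length followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip units that are repeats of a shorter unit ("AA", "ACAC"):
            # those runs are already reported at the shorter unit length.
            if any(motif == motif[:k] * (unit // k)
                   for k in range(1, unit) if unit % k == 0):
                continue
            length = match.end() - match.start()
            if length < min_length:
                continue
            hits.append((match.start(), match.end(),
                         canonical_motif(motif), length // unit))
    return sorted(hits)


def repeats_per_mbp(n_repeats, sequence_length):
    """Repeat density: number of repeats per 1,000,000 base pairs."""
    return n_repeats * 1_000_000 / sequence_length


if __name__ == "__main__":
    # Toy sequence (not real genomic data): one AC repeat and one A repeat.
    toy = "GATTG" + "AC" * 5 + "GGTCT" + "A" * 8 + "TCG"
    found = find_tandem_repeats(toy)
    for start, end, motif, copies in found:
        print(start, end, motif, copies)
    print(repeats_per_mbp(len(found), len(toy)))
```

Because the regular expression scans left to right without overlapping matches, the reported boundaries of adjacent repeats can shift by a base or two. A genome-scale analysis would also need to handle imperfect repeats and masked or unknown bases, which this sketch ignores.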
|Title of host publication||Proceedings of the 20th International Workshop on Database and Expert Systems Application (DEXA 2009), 31 August - 4 September 2009, Linz, Austria|
|Publisher||IEEE Computer Society|
|Publication status||Published - 2009|