Zips : mining compressing sequential patterns in streams

T.L. Hoang, T.G.K. Calders, J. Yang, F. Mörchen, D. Fradkin

Onderzoeksoutput: Hoofdstuk in Boek/Rapport/CongresprocedureConferentiebijdrageAcademicpeer review

4 Citaten (Scopus)
2 Downloads (Pure)


We propose a streaming algorithm, based on the minimal description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, the MDL-based approach that selects patterns based on their capacity to compress data rather than their frequency, was shown to be remarkably effective for extracting meaningful patterns and solving the redundancy issue in frequent itemset and sequence mining. The existing MDL-based algorithms, however, either start from a seed set of frequent patterns, or require multiple passes through the data. As such, the existing approaches scale poorly and are unsuitable for large datasets. Therefore, our main contribution is the proposal of a new, streaming algorithm, called Zips, that does not require a seed set of patterns and requires only one scan over the data. For Zips, we extended the Lempel-Ziv (LZ) compression algorithm in three ways: first, whereas LZ assigns codes uniformly as it builds up its dictionary while scanning the input, Zips assigns codewords according to the usage of the dictionary words; more heaviliy used words get shorter code-lengths. Secondly, Zips exploits also non-consecutive occurences of dictionary words for compression. And, third, the well-known space-saving algorithm is used to evict unpromising words from the dictionary. Experiments on one synthetic and two real-world large-scale datasets show that our approach extracts meaningful compressing patterns with similar quality to the state-of-the-art multi-pass algorithms proposed for static databases of sequences. Moreover, our approach scales linearly with the size of data streams while all the existing algorithms do not.
Originele taal-2Engels
TitelProceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA'13, Chicago IL, USA, August 11-14, 2013)
RedacteurenD.H. Chau, J. Vreeken, M. Leeuwen, van, C. Faloutsos
Plaats van productieNew York NY
UitgeverijAssociation for Computing Machinery, Inc
ISBN van geprinte versie978-1-4503-2329-1
StatusGepubliceerd - 2013
Evenementconference; ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA'13); 2013-08-11; 2013-08-11 -
Duur: 11 aug. 201311 aug. 2013


Congresconference; ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA'13); 2013-08-11; 2013-08-11
AnderACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA'13)


Duik in de onderzoeksthema's van 'Zips : mining compressing sequential patterns in streams'. Samen vormen ze een unieke vingerafdruk.

Citeer dit