DidaWiki

The biology community is collecting a large amount of raw data, such as the genome sequences of organisms, microarray data, and interaction data such as gene-protein and protein-protein interactions. The volume of this data is increasing rapidly, and the process of understanding it is lagging behind the process of acquiring it. An inevitable first step towards making sense of the data is to study its regularities, focusing on patterns, that is, non-random structures that appear surprisingly often in the input. Once the class of patterns of interest has been chosen, the pattern discovery task is the following: we are given a text T and some constraints, either on the combinatorial structure of the patterns or on their occurrence lists, and we have to find the patterns in T that satisfy the given constraints, also reporting their occurrence lists. The goal of the thesis is to study new classes of patterns that can represent further properties of the repetitions, and to propose novel algorithms for extracting them. We call these patterns unconventional to indicate the unusual combinatorial structure of the patterns we are looking for. Our line of research explores two different kinds of patterns: mask patterns, where each pattern represents a set of string patterns with wildcards, and permutation patterns, where each pattern is a multiset of characters, since the order of the contained symbols does not matter. Finally, we show how our pattern discovery techniques can overcome the limitations of traditional approaches by discussing the problem of discovering mobile elements in the S. cerevisiae genome.
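To make the pattern discovery task concrete, the following minimal sketch illustrates the permutation pattern setting on a toy text. It is not the algorithm proposed in the thesis: it is a naive sliding-window enumeration, and the function name `permutation_patterns`, the fixed window length `k`, and the minimum occurrence count `quorum` are assumptions introduced here purely for illustration.

```python
from collections import defaultdict


def permutation_patterns(text, k, quorum):
    """Naive sketch: report multisets of k characters and where they occur.

    Each window of length k is reduced to its sorted characters, so two
    windows that are permutations of each other map to the same key.
    Only multisets occurring at least `quorum` times are returned,
    together with their occurrence lists (0-based start positions).
    `k` and `quorum` are illustrative constraints, not from the source.
    """
    occurrences = defaultdict(list)
    for start in range(len(text) - k + 1):
        # Sorting discards the order of symbols, leaving only the multiset.
        key = "".join(sorted(text[start:start + k]))
        occurrences[key].append(start)
    # Keep only the patterns satisfying the constraint on the occurrence list.
    return {p: occ for p, occ in occurrences.items() if len(occ) >= quorum}


if __name__ == "__main__":
    result = permutation_patterns("abcxcbaybac", k=3, quorum=3)
    for pattern, positions in sorted(result.items()):
        print(pattern, positions)
```

Here the text plays the role of T, the window length is a constraint on the pattern structure, and the quorum is a constraint on the occurrence list. The windows `abc`, `cba`, and `bac` are all reported under the same multiset, which is what distinguishes a permutation pattern from an ordinary substring pattern.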