bioin.replication.approximate_pattern_matching¶

bioin.replication.approximate_pattern_matching(pattern, text, d)[source]¶

Find all approximate occurences of a pattern in a string. We say that a k-mer Pattern appears as a substring of Text with at most d mismatches if there is some k-mer substring Pattern’ of Text having d or fewer mismatches with Pattern; that is, HammingDistance(Pattern, Pattern’) ≤ d. Our observation that a DnaA box may appear with slight variations leads to the following generalization of the Pattern Matching Problem.

Parameters:	text (str) – a DNA string genome. pattern (str) – a substring of DNA genome. d (int) – at most d mismatches between pattern and a substring in text.
Returns:	List, all starting positions where pattern appears as a substring of text with at most d mismatches.

Examples

Positions that a pattern appears as a substring of text with at most d mismatches fulfilled.

>>> pattern = 'ATTCTGGA'
>>> text = 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT'
>>> d = 3
>>> positions = approximate_pattern_matching(pattern, text, d)
>>> positions
    [6, 7, 26, 27]