bioin.replication.approximate_pattern_matching

bioin.replication.approximate_pattern_matching(pattern, text, d)[source]

Find all approximate occurences of a pattern in a string. We say that a k-mer Pattern appears as a substring of Text with at most d mismatches if there is some k-mer substring Pattern’ of Text having d or fewer mismatches with Pattern; that is, HammingDistance(Pattern, Pattern’) ≤ d. Our observation that a DnaA box may appear with slight variations leads to the following generalization of the Pattern Matching Problem.

Parameters:
  • text (str) – a DNA string genome.
  • pattern (str) – a substring of DNA genome.
  • d (int) – at most d mismatches between pattern and a substring in text.
Returns:

List, all starting positions where pattern appears as a substring of text with at most d mismatches.

Examples

Positions that a pattern appears as a substring of text with at most d mismatches fulfilled.

>>> pattern = 'ATTCTGGA'
>>> text = 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT'
>>> d = 3
>>> positions = approximate_pattern_matching(pattern, text, d)
>>> positions
    [6, 7, 26, 27]