bioin.replication.symbol_array
bioin.replication.symbol_array(genome, symbol)
Calculate the count of symbol in each half-genome-length sliding window of genome.
Background: Analyzing a genome’s half-strands. Although most bacteria have circular genomes, we have thus far assumed that genomes were linear, a reasonable simplifying assumption because the length of the window was much shorter than the length of the genome. This time, because we are sliding a giant window, we should account for windows that “wrap around” the end of Genome. To do so, we define a string ExtendedGenome as Genome + Genome[0:n//2], where n = len(Genome). That is, we copy the first len(Genome)//2 nucleotides of Genome to the end of the string. For example, this genome:
CTGCTTCGCCCGCCGGACCGGCCTCGTGATGGGGT_CTGCTTCGCCCGCCGGA
A DNA string Genome (shown before the underscore) containing 35 nucleotides is extended by its first 17 nucleotides (shown after the underscore) to yield ExtendedGenome. The underscore is only a visual separator; ExtendedGenome itself does not contain it.
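The extension step can be sketched in a few lines of Python. This is only an illustration of the definition above, not the library's own code; the helper name extend_genome is hypothetical.

```python
def extend_genome(genome: str) -> str:
    """Append the first len(genome)//2 characters of genome to its end.

    Hypothetical helper for illustration; not part of bioin.
    """
    half = len(genome) // 2
    return genome + genome[:half]


genome = "CTGCTTCGCCCGCCGGACCGGCCTCGTGATGGGGT"
extended = extend_genome(genome)

# 35 nucleotides plus the first 35 // 2 = 17 nucleotides
print(len(genome), len(extended))
```

Integer division means that an odd-length genome such as this one is extended by the rounded-down half, here 17 nucleotides.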
We keep track of the number of occurrences of a given symbol (for example ‘C’) in each window of ExtendedGenome by using a symbol array. The i-th element of the symbol array is the number of occurrences of the symbol in the window of length len(Genome)//2 starting at position i of ExtendedGenome. For example, with Genome equal to “AAAAGGGG” and symbol equal to “A”, the window length is 8//2 = 4, so array[0] counts the ‘A’s at positions 0 through 3 of ExtendedGenome, i.e. ‘AAAA’, which contains 4 ‘A’s. Next, array[1] covers positions 1 through 4, i.e. ‘AAAG’, which contains 3 ‘A’s, and so forth. Finally, the function returns all the pairs i: array[i] as a dictionary.
ExtendedGenome  A A A A G G G G A A A A
i               0 1 2 3 4 5 6 7
array[i]        4 3 2 1 0 1 2 3
The symbol array for Genome equal to “AAAAGGGG” and symbol equal to “A”.
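A minimal sketch of the computation described above, assuming the straightforward approach of counting the symbol afresh in every window. The name symbol_array_sketch is hypothetical, and the actual implementation in bioin may differ.

```python
def symbol_array_sketch(genome: str, symbol: str) -> dict:
    """Map each start position i to the count of symbol in the
    window of length len(genome)//2 starting at i.

    Illustrative only; not the bioin implementation.
    """
    n = len(genome)
    half = n // 2
    # Extend the genome so windows near the end can wrap around.
    extended = genome + genome[:half]
    # One window per start position in the original genome.
    return {i: extended[i:i + half].count(symbol) for i in range(n)}


print(symbol_array_sketch("AAAAGGGG", "A"))
# {0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 1, 6: 2, 7: 3}
```

Each window is counted independently here, so the work grows with len(genome) times the window length. That is fine for short strings like the example but slow for a full bacterial genome.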
Parameters: - genome (str) – a DNA string as the search space.
- symbol (str) – the single base to query in the search space.
Returns: a dictionary mapping each window start position i to the count of symbol in the window of length len(genome)//2 starting at i.
Examples
The symbol array for genome equal to “AAAAGGGG” and symbol equal to “A”.
>>> genome = 'AAAAGGGG'
>>> symbol = 'A'
>>> position_symbolcount_dict = symbol_array(genome, symbol)
>>> position_symbolcount_dict
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 1, 6: 2, 7: 3}