This site supports a paper published in PLoS Computational Biology on 29 Feb 2008.



Distinguishing between informative and non-informative positions:
Two pairs of aligned motifs are shown. Both pairs have three identical positions and two different ones. However, the identical positions in pair number 1 are non-informative, while the identical positions in pair number 2 are informative. The desired score should distinguish between these two types of similarity and assign a higher similarity score to pair number 2. The nucleotide distribution in each motif is represented schematically (with a sequence logo, where the size of each nucleotide letter is proportional to its probability and the nucleotides are ordered according to their probability).

Problematic aspects of currently used motif similarity functions:
The similarity score of two PFMs decomposes into a sum of similarities of single aligned positions, owing to the position-independence assumption in the model. Here we present the scores that the various similarity functions assign to pairs of positions in DNA motifs, alongside a desired score. The nucleotide distribution in each position is represented schematically (with a sequence logo using probabilities, as in A). As shown here, the Pearson correlation does not reflect the true sequence similarity, and the Jensen-Shannon divergence and Euclidean distance do not distinguish between informative positions and uniform background positions. Clearly, position 1 should get a higher score than position 2, but the Pearson correlation scores for these positions are equal. Position 3 should get the lowest possible score, but the Pearson correlation does not capture this. In both positions 1 and 4 identical distributions are compared, yet position 1 should get a higher score than position 4; all three methods fail to achieve this. Positions 4 and 5 should get similar scores, since position 4 compares two identical distributions and position 5 compares distributions with only small differences; however, the Pearson correlation scores position 5 significantly lower than position 4 because of small deviations from the uniform distribution.
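
The behaviour described above can be reproduced with a minimal Python sketch. The three nucleotide distributions below are illustrative assumptions chosen for this example, not the values shown in the figure, and the function names are hypothetical. The near-uniform distribution stands in for a background position because the Pearson correlation is undefined when a distribution is exactly uniform.

```python
import math

def pearson(p, q):
    """Pearson correlation between two nucleotide distributions (A, C, G, T)."""
    mean_p, mean_q = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    # Division by zero occurs if either distribution is exactly uniform.
    return cov / math.sqrt(var_p * var_q)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 means identical distributions)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def euclidean(p, q):
    """Euclidean distance between two distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Assumed example distributions (not taken from the figure).
informative = [0.97, 0.01, 0.01, 0.01]   # strongly conserved position
near_uniform = [0.28, 0.24, 0.24, 0.24]  # close to a uniform background
shifted = [0.24, 0.28, 0.24, 0.24]       # small deviation from near_uniform

pairs = {
    "identical, informative": (informative, informative),
    "identical, near-uniform": (near_uniform, near_uniform),
    "near-uniform, small differences": (near_uniform, shifted),
}
for name, (p, q) in pairs.items():
    print(f"{name}: pearson={pearson(p, q):.3f} "
          f"js={js_divergence(p, q):.4f} euclid={euclidean(p, q):.4f}")
```

Under these assumed inputs, the Jensen-Shannon divergence and the Euclidean distance are both zero for the informative pair and for the near-uniform pair, so neither measure separates them. The third pair differs only slightly, yet its Pearson correlation is about -0.33, compared with 1 for the identical near-uniform pair.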

The BLiC (Bayesian Likelihood 2-Components) score