Paper (Postscript, PDF) to appear in RECOMB'03.
The availability of whole-genome sequences and high-throughput genomic assays
opens the door for in silico analysis of transcription regulation. This
includes methods for discovering and characterizing the binding sites of
DNA-binding proteins, such as transcription factors. A common representation of
transcription factor binding sites is a position-specific score matrix
(PSSM). This representation makes the strong assumption that binding site
positions are independent of each other. In this work, we explore Bayesian
network representations of binding sites that provide different tradeoffs
between complexity (number of parameters) and the richness of dependencies
between positions. We develop the formal machinery for learning such models
from data and for estimating the statistical significance of putative binding
sites. We then evaluate the ramifications of these richer representations in
characterizing binding site motifs and predicting their genomic locations. We
show that these richer representations improve over the PSSM model in both
tasks.
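To make the independence assumption of the PSSM concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a DNA alphabet, a pseudocount of 1, and a uniform background distribution; the function names `learn_pssm`, `log_likelihood`, and `log_odds` are hypothetical and chosen only for illustration.

```python
import math

ALPHABET = "ACGT"

def learn_pssm(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sites.

    The pseudocount value is an assumption made for this sketch.
    """
    length = len(sites[0])
    pssm = []
    for i in range(length):
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pssm.append({b: counts[b] / total for b in ALPHABET})
    return pssm

def log_likelihood(pssm, seq):
    """Log-probability of seq under the PSSM.

    The score is a plain sum over positions, which is exactly the
    assumption that positions are independent of each other.
    """
    return sum(math.log(col[b]) for col, b in zip(pssm, seq))

def log_odds(pssm, seq, background=0.25):
    """Log-odds against a uniform background (an assumption of this sketch)."""
    return log_likelihood(pssm, seq) - len(seq) * math.log(background)

sites = ["ACGT", "ACGT", "ACGA"]
pssm = learn_pssm(sites)
score_match = log_odds(pssm, "ACGT")
score_other = log_odds(pssm, "TGCA")
```

Because the score decomposes into one term per position, a PSSM cannot express that the nucleotide at one position depends on the nucleotide at another. A Bayesian network representation would replace some of these per-position terms with terms conditioned on other positions, at the cost of more parameters.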
A. Supplement to Section 3.2:
1. Test data performance on aligned binding sites from TRANSFAC
2. Improvement in log-loss/instance on 95 test sets
B. Supplements to Section 6:
2. Test data performance on location data of yeast genes (based on Lee et al., 2002, Supplementary data).
3. Comparison to AlignACE on functional groups of genes (based on Hughes et al., 2000, Supplementary data).
4. Test data performance on functional groups of genes (based on Hughes et al., 2000, Supplementary data).
5. Test data performance on gene expression clusters (based on Tavazoie et al., 2000, Supplementary data).
C. Supplement to Section 4: Comparison of p-value computation procedure