Defining the transcriptome, the repertoire of transcribed regions encoded in the genome, is a challenging experimental task. Current approaches, which rely on sequencing of expressed sequence tags (ESTs) or cDNA libraries, are expensive and labor-intensive. Consequently, we know little about the transcriptome of most sequenced species. Advances in massively parallel sequencing can revolutionize the study of transcriptomes. Here, we present a novel approach for ab initio discovery of the complete transcriptome of the budding yeast, based only on the (unannotated) genome sequence and millions of short reads from a single sequencing run. Using novel algorithms, we automatically construct a highly accurate transcript catalogue that includes most known transcripts and adds 160 novel transcripts and 25 introns. Our results demonstrate that massively parallel sequencing provides an accurate definition of a eukaryotic transcriptome without any prior knowledge. This framework can be applied to poorly understood organisms for which only the genomic sequence is known.


1. School of Computer Science and Engineering, The Hebrew University, Jerusalem, 91904, Israel

2. Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, Massachusetts 02142, USA

3. Department of Molecular Genetics and Biotechnology, Faculty of Medicine, The Hebrew University, Jerusalem 91120, Israel

4. Illumina, Inc., 25861 Industrial Boulevard, Hayward, California 94545, USA

5. Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA

* These authors contributed equally to this work

† Corresponding Authors

Moran Yassour1,2,*, Tommy Kaplan1,3,*, Hunter B. Fraser2, Joshua Z. Levin2, Jenna Pfiffner2, Xian Adiconis2, Gary Schroth4, Shujun Luo4, Irina Khrebtukova4, Andreas Gnirke2, Chad Nusbaum2, Dawn-Anne Thompson2, Nir Friedman1,†, and Aviv Regev2,5,†