Specific Aim 1: High-quality reference genome sequences of loblolly pine and three other conifer species
Effective deployment of new technologies in a hierarchical WGS approach will yield reference sequences based on well-defined milestones. An initial and early deliverable will be 21X WGS sequence and preliminary assemblies (gene-boosted and whole genome) of the loblolly pine genome, based on paired-end Illumina sequences of at least 100 bp from a mix of 500-bp, 5-kbp, and 40-kbp (fosmid-diTag) libraries. In less than two years, a 10×18 hierarchical WGS (180X total read depth), based on 18X read depth (from 500-bp, 5-kbp, and 40-kbp libraries) of many small pools of fosmids, will provide the fundamental data for two types of assemblies: a consensus based on all the data, and a second consensus based on hierarchical analysis of subassemblies of the haploid fosmid pools. Polishing will follow, drawing on longer end reads from a 10X BAC library, deep fosmid-end sequencing, and existing or emerging long-read technologies that are deemed effective for improving assembly quality. A high-resolution (0.1 cM) genetic scaffold based on a new genotyping resource will incorporate all genotypable contigs and validate the contiguity of larger ones. In the later years, comparable reference sequences for sugar pine, slash pine, and Douglas fir will be created. Comparative genomic analysis of these four conifer genomes will provide a solid and rich annotation and further improve assembly quality and contiguity.
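The depth figures in Aim 1 can be checked with a short calculation. The following is a minimal sketch, not part of the project's pipeline. It assumes that "10×18" means roughly 10-fold clone coverage of the genome by fosmid pools, each sequenced to 18X, and that read depth is simply total sequenced bases divided by target size. The function names and the genome size used in the example are hypothetical placeholders.

```python
import math


def total_read_depth(clone_coverage: float, depth_per_pool: float) -> float:
    """Total read depth of a hierarchical design.

    Assumption: the genome is covered `clone_coverage` times by pooled
    fosmids, and each pool is sequenced to `depth_per_pool`.
    """
    return clone_coverage * depth_per_pool


def read_pairs_needed(target_bp: int, depth: float, read_len_bp: int) -> int:
    """Number of paired-end read pairs needed to reach `depth` over `target_bp`.

    Each pair contributes 2 * read_len_bp sequenced bases.
    """
    total_bases = target_bp * depth
    return math.ceil(total_bases / (2 * read_len_bp))


if __name__ == "__main__":
    # 10-fold clone coverage at 18X per pool gives the 180X stated in the text.
    print(total_read_depth(10, 18))

    # Hypothetical 1-Mbp target, used only to illustrate the arithmetic
    # for the initial 21X deliverable with 100-bp paired-end reads.
    placeholder_target_bp = 1_000_000
    print(read_pairs_needed(placeholder_target_bp, 21, 100))
```

Substituting an actual genome-size estimate for the placeholder target gives a rough read-pair budget for each milestone. The calculation ignores read filtering, duplicates, and uneven coverage.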
Specific Aim 2: Transcriptome sequencing for gene discovery, reference building, and aids to genome assembly
We will build transcriptome references using multiple sequencing approaches to maximize evidence-based gene discovery in parallel with the reference genome assembly and annotation, and we will provide full transcript assemblies for functional genomics studies. Initially, RNAs from a large number of loblolly pine organs, stages of development, and tissues exposed to biotic and abiotic stresses will be sequenced using the long reads of Roche/454 GS-FLX Titanium technology. Subsequently, higher-depth RNA-Seq approaches will be employed using the Illumina platform, including the sequencing of various mRNA and noncoding RNA libraries. Data will be used first to add depth and detail to the transcriptome and to catalog transcribed polymorphisms. Transcriptome analysis will profile gene expression differences of biological importance, including changes during the development of reproductive tissues, embryos and seedlings, and wood, as well as changes in response to biotic and abiotic stresses.
Specific Aim 3: Dendrome and TreeGenes databases: Annotation, data integration, and distribution
The transcriptome and genome sequences will be delivered to the community via TreeGenes as sequence becomes available. Collaboration with GDR will provide the primary annotation and integrate a custom web-based tool known as GenSAS from GDR with GBrowse from Dendrome to facilitate community-level annotation. We will apply and expand existing pipelines to deliver a comprehensive SNP resource and distribute it through the existing DiversiTree interface. We will work continuously with existing projects such as Gene Ontology and Plant Ontology to implement conifer-specific ontologies that consistently describe gene products and phenotypes. All pipelines and tools developed in this project will be made freely available to the academic community.
Empowerment: Our goal is to develop the technologies, platforms and bioinformatics infrastructures to rapidly and inexpensively sequence large and complex genomes of coniferous forest trees. This will allow the forestry community to begin sequencing the many genomes of economic and ecological importance without a dependence on centralized genome centers.
Adaptive: We recognize that sequencing technologies are developing rapidly and that we must have the expertise and flexibility to adopt new approaches quickly into our overall sequencing strategy.
Comparative: We recognize the power of comparative genomics approaches in assembling and annotating genome sequences and will use this approach throughout the project.
OpenAccess: We have a policy of sharing all data generated from this project with the research community.