RadishDB:Analysis

Content of radish transcriptome and orthologous group identification

Gene content

The EST sequences generated by the proposed study will provide a wealth of information on gene content in radish. All sequences will be cleaned (trimming of vector and adaptor sequences, and removal of low-quality sequence and contamination), assembled with a modified CAP3 program (1), and clustered to generate a radish gene index or transcript assemblies (TAs) (Quackenbush, 2000; Quackenbush, 2001; Lee, 2005). Based on our experience, we estimate that the project should produce about 30,000 unique sequences, comprising both tentative consensus sequences (TCs) and singletons. All TAs, both TCs and singletons, will be searched using the basic local alignment search tool (BLAST; Altschul, 1990) against the TIGR non-identical amino acid (niaa) database, which is made up of all proteins available from GenBank (http://www.ncbi.nlm.nih.gov), PIR (http://pir.georgetown.edu), SWISS-PROT (http://www.expasy.ch/sprot), and TIGR's CMR database, the Omniome (http://cmr.tigr.org). These searches will enable us to annotate all transcript assemblies, identify potentially novel ones from radish, and discover whether crop and wild radish differ in their transcript assemblies. The same searches will identify likely full-length cDNA sequences and untranslated regions (UTRs) by locating the in-frame ATG in each EST relative to the start codon of the matched protein. From our recent Medicago EST study, we estimate that at least 40% of our sequences will be full-length cDNA; these will constitute an invaluable resource for gene annotation, gene prediction, and functional genomic studies (Urbanek, 2005; Xiao, 2005; Alexandrov, 2006).
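The in-frame ATG test for full-length cDNAs can be illustrated with a minimal Python sketch. It assumes a single forward-frame BLASTX alignment reported with 1-based coordinates; the function name and the toy sequence are hypothetical and are not part of the TIGR pipeline.

```python
def predict_start_codon(est_seq, q_start, s_start):
    """Return the 0-based EST index of the ATG aligned to the protein's
    first residue, or None if the EST is 5'-truncated or has no ATG there.

    q_start: 1-based EST nucleotide where the BLASTX alignment begins.
    s_start: 1-based protein residue where the alignment begins.
    """
    # Walk back 3 nt per unaligned N-terminal residue to reach residue 1.
    expected = (q_start - 1) - 3 * (s_start - 1)
    if expected < 0:
        return None  # the EST does not reach the protein's start codon
    if est_seq[expected:expected + 3].upper() == "ATG":
        return expected
    return None


# Toy EST: 6 nt of 5' UTR followed by a coding region (hypothetical data).
est = "GGCACC" + "ATGGCTAAGGTTCTG"
start = predict_start_codon(est, q_start=10, s_start=2)
five_prime_utr = est[:start] if start is not None else None
```

An EST for which the function returns an index is a candidate full-length cDNA, and everything upstream of that index is a candidate 5' UTR. A real pipeline would also need to handle reverse-strand hits and frameshifts, which this sketch ignores.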

Repetitive elements and transposons

Since radish has an estimated genome size of 573 Mbp (2), repetitive elements such as transposons likely constitute a large part of the radish genome, but transposable elements (TEs) have never been studied in this species. To distinguish transcribed transposon sequences from radish genes, the sequences generated will be searched with BLASTX against a TIGR database of plant TE peptide sequences. This search will identify the TE content of our radish ESTs, including class I elements (retrotransposons, which move via an RNA intermediate) and class II elements (DNA transposons) (Kuhl, 2004). The orientations of matching ESTs will be inspected to determine whether the ESTs were products of directionally cloned transcripts, genomic contamination, or read-through from neighboring retrotransposons (Elrouby, 2001).

Orthologous groups

Orthologous groups will be identified using phylogeny-based approaches (3). First, gene family clusters will be constructed by Markov Clustering (4) using annotated protein sequences from the reference species A. thaliana, poplar, and rice. Genome information from additional species (e.g., A. lyrata, Capsella, and Brassica) will be incorporated as it becomes available. Phylogenetic trees of all family clusters will be constructed as in Shiu et al. (5). The TAs will be mapped to the tri-species gene family trees by identifying the best matches of each TA in the three reference species. Each gene family tree and its associated radish TA mapping information will then be superimposed onto the species tree of Arabidopsis, radish, poplar, and rice to identify orthologous groups based on maximum parsimony.

Note that some TAs may derive from the same gene but fall into different contigs. These partial TAs will map to the same orthologous group as separate entries. We will remedy this problem by taking only the largest cluster of overlapping radish sequences in each orthologous group. In short, in orthologous groups where non-overlapping TAs are found, distance trees will be generated among the TAs in each group based on pairwise nucleotide sequence identity and rooted at the mid-point of the longest path. Clusters will be identified by applying a similarity threshold equal to the average identity between all TAs and the A. thaliana member of the orthologous group. In case the A. thaliana member is not present due to gene loss, the identity ≥ 95% of the identity between all TAs and the A. thaliana member in an orthologous group will be used as a conservative threshold. The largest TA cluster in each orthologous group will be used in section C.

In addition, a few TAs may contain paralogous sequences or splice variants. We will assemble TAs with a 97% identity threshold, chosen to reduce the chance of mis-assembling paralogs into the same TA while maximizing the number of unique sequences, based on the identity distribution of paralogs in A. thaliana. For example, in A. thaliana, 96.98% of paralogs (reciprocal best match pairs) have ≤ 97% coding sequence nucleotide identity (Shiu, unpublished). We therefore expect only about 3% of TAs to contain paralogs. Alternatively spliced variants will be identified as outlined in section B.
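The selection of the largest TA cluster in an orthologous group can be sketched as follows. This is a simplified illustration, not the procedure itself: instead of building and cutting a mid-point-rooted distance tree, it groups TAs by single linkage on pairwise identity, using the average TA-to-A. thaliana identity as the threshold. All identifiers and identity values are hypothetical.

```python
from itertools import combinations


def largest_ta_cluster(ta_ids, pair_identity, identity_to_ath):
    """Group TAs whose pairwise identity meets the threshold (single linkage)
    and return (threshold, members of the largest group)."""
    # Threshold: mean identity between the TAs and the A. thaliana member.
    threshold = sum(identity_to_ath.values()) / len(identity_to_ath)

    parent = {t: t for t in ta_ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ta_ids, 2):
        ident = pair_identity.get(frozenset((a, b)))  # None if not comparable
        if ident is not None and ident >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in ta_ids:
        clusters.setdefault(find(t), []).append(t)
    return threshold, max(clusters.values(), key=len)


# Hypothetical percent identities for one orthologous group.
tas = ["TA1", "TA2", "TA3", "TA4"]
to_ath = {"TA1": 88.0, "TA2": 90.0, "TA3": 86.0, "TA4": 84.0}
pairs = {
    frozenset(("TA1", "TA2")): 95.0,
    frozenset(("TA2", "TA3")): 91.0,
    frozenset(("TA1", "TA3")): 80.0,
    frozenset(("TA3", "TA4")): 70.0,
}
threshold, best = largest_ta_cluster(tas, pairs, to_ath)
```

In this toy group the threshold is 87%, TA1 to TA3 are chained together through TA2, and TA4 is left out, so only the three-member cluster would be carried forward.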

Data mining for the three classes of markers

The Raphanus ESTs will be mined to generate the three general classes of markers, listed in decreasing order of polymorphism and increasing order of transferability across species (see above): (a) SSRs from the 5’ UTR; (b) SSRs from translated regions (EST-SSRs), plus SNPs, CAPS, and dCAPS; and (c) intron-spanning markers. Below we outline how SSRs and exons will be identified and how SNPs and some of the variation in SSRs can be uncovered from the Raphanus EST sequences. Screening for further SSR variation, as well as for intron-length variation, will be left for our future work or for other investigators.

SSR

Transcript assemblies will be screened for simple sequence repeats (SSRs) using the MISA program (Thiel, 2003), which removes poly A/T tracts, identifies microsatellites, and can then design primers for experimental verification of the detected microsatellites using Primer3 (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi). We will conduct an analysis similar to that of Lawson and Zhang (2006) on the radish cDNA sequences generated, to compare the frequency of SSRs among 3’ UTRs, 5’ UTRs, and exons.
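To make the kind of screen MISA performs concrete, the following Python sketch finds perfect di-, tri-, and tetranucleotide repeats with a regular expression. It is an illustration only and does not reproduce MISA: the minimum repeat counts are assumed values, mononucleotide runs are skipped entirely (so poly A/T tracts are never reported), and compound or imperfect repeats are not handled.

```python
import re

# Assumed minimum number of repeats per unit length (illustrative only).
MIN_REPEATS = {2: 6, 3: 5, 4: 5}


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter motif (e.g. 'AA')."""
    n = len(unit)
    return not any(
        n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (0-based start, repeat unit, repeat count) tuples."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-mer followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):
                hits.append((m.start(), unit, len(m.group(0)) // k))
    return sorted(hits)


# Toy sequence: (AG)7 and (GAT)5 embedded in unique flanks (hypothetical).
toy = "TTGC" + "AG" * 7 + "CCA" + "GAT" * 5 + "C"
ssrs = find_ssrs(toy)
```

Each reported hit gives the start position, the repeat unit, and the number of copies, which is the information needed to place an SSR in a 5’ UTR, coding region, or 3’ UTR once the reading frame of the transcript is known.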

Intron-spanning marker

Although EST sequences will not contain intron-spanning variation, we can lay the groundwork for identifying it by identifying exons in radish ESTs. Based on the orthologous groups defined in the previous section, the putative Arabidopsis ortholog of each radish EST and TA will be identified. From the alignment between each EST and the protein of its orthologous gene, we will extract the translated EST sequence. Each translated EST sequence will then be searched against the orthologous gene sequences of the reference species. Where the protein-to-nucleotide alignment is interrupted by a gap longer than the pre-defined threshold for that reference species, the alignment breakpoints are regarded as exon boundaries. The threshold is defined as the length, in base pairs, that is shorter than 99% of the introns in the reference species. This approach is feasible due to the relatively short divergence time between Arabidopsis and Raphanus. Based on an analysis of paralogous Arabidopsis genes, we expect the intron gain/loss rate to be only ~2-5 per 1,000 introns between these two species (6).
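The breakpoint rule can be sketched in a few lines of Python. The sketch assumes that the protein-to-nucleotide search has already produced an ordered list of aligned blocks on the forward strand of the reference gene; the tuple layout, function name, coordinates, and the 60 bp threshold are all hypothetical.

```python
def exon_boundaries(blocks, min_intron_bp):
    """Infer exon boundaries from gaps between consecutive aligned blocks.

    blocks: list of (aa_start, aa_end, gene_start, gene_end) tuples, 1-based
            inclusive, sorted by position along the reference gene.
    Returns (last_aa_of_exon, exon_end_nt, next_exon_start_nt) per boundary.
    """
    boundaries = []
    for left, right in zip(blocks, blocks[1:]):
        # Unaligned reference nucleotides between the two blocks.
        gap = right[2] - left[3] - 1
        if gap > min_intron_bp:
            boundaries.append((left[1], left[3], right[2]))
    return boundaries


# Toy alignment: an 89 bp interruption (putative intron) and a 3 bp indel.
toy_blocks = [(1, 40, 101, 220), (41, 90, 310, 459), (91, 120, 463, 552)]
found = exon_boundaries(toy_blocks, min_intron_bp=60)
```

Only the first interruption exceeds the threshold, so a single exon boundary is reported after residue 40 of the translated EST; the short 3 bp gap is treated as an ordinary indel.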

Sequence variation identification

Sequence variation (SSRs and SNPs) will be identified by comparing different TAs or ESTs. First, we will map all TAs to the annotated genes of Arabidopsis or poplar based on sequence similarity (> 80% identical, over 300 nucleotides aligned). Where multiple TAs map to the same gene in Arabidopsis or poplar and the identities between these TAs are ≥ 90%, the TAs are regarded as potential variants. This threshold is chosen based on the sequence identity distribution of paralogs originating from the most recent whole genome duplication in the Brassica-Raphanus lineage (7). If the differences between two TAs are indels that overlap with introns, the TAs will be regarded as alternatively spliced variants and discarded. The remaining TAs form a number of “orthologous groups” with Arabidopsis, poplar, and rice protein-coding genes as described above. In orthologous groups containing TAs from different species of Raphanus, sequence variation will be identified from alignments of each group. For TAs that do not map to Arabidopsis or poplar genes, single-linkage clusters of TAs will be generated with an identity threshold of 90% and an alignment length threshold of 300 bp; each cluster is regarded as an orthologous group. Differences between libraries will be regarded as distinct variants only if > 80% of the TAs within each library have the same nucleotide. While the between-TA approach can identify rapidly accumulated sequence variation between Raphanus species, the relatively low identity threshold for transcript assembly precludes the identification of more subtle differences between the two libraries. Therefore, we will also map each EST to the reference species, identify ESTs in the same orthologous group but from different libraries, and call variation between species if > 80% of the ESTs within each library have the same indel or substitution.
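The > 80% agreement rule for calling a difference between libraries can be written down directly. The sketch below operates on a single alignment column; the function names and the example base calls are hypothetical, and a gap character such as "-" can stand in for an indel state.

```python
from collections import Counter


def library_consensus(calls, min_fraction=0.8):
    """Return the state shared by more than min_fraction of the reads
    from one library at one alignment column, or None if there is none."""
    usable = [c for c in calls if c not in ("N", None)]  # drop missing calls
    if not usable:
        return None
    state, count = Counter(usable).most_common(1)[0]
    return state if count / len(usable) > min_fraction else None


def is_between_library_variant(calls_a, calls_b, min_fraction=0.8):
    """True only if both libraries have a clear consensus and they differ."""
    a = library_consensus(calls_a, min_fraction)
    b = library_consensus(calls_b, min_fraction)
    return a is not None and b is not None and a != b


# Hypothetical base calls at one column for crop and weedy radish libraries.
crop = list("AAAAAAAAAG")   # 90% A
weedy = list("GGGGG")       # 100% G
mixed = list("AAAGG")       # 60% A, no clear consensus
```

With these inputs, the crop and weedy columns differ consistently and would be called a variant, whereas a comparison involving the mixed column would not. Because the rule is a strict inequality, a column with exactly 80% agreement yields no consensus.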
Sequencing error is the major concern in using EST sequences for uncovering polymorphism or variation, particularly for SNPs. In addition to checking the consistency among reads, sequencing errors will be checked using an established TIGR pipeline that evaluates the quality value of each base of every EST component in a TA, and the frequency of that polymorphic base among these EST components, to identify SNPs.

Gene gain/loss inference and lineage-specific selection

Gain/loss inference

Gene duplications and losses will be identified by the reconciled tree approach, in which the gene family trees constructed in section A are superimposed on the species tree (8). The results will provide information on gene gain and loss events that occurred in the Arabidopsis lineage after its divergence from the Raphanus-Brassica lineage. As mentioned earlier, disjoint TAs of the same gene will bias the analysis, particularly by inflating the estimated number of gene duplications in the radish lineage. Therefore, here we will use only the largest radish TA cluster in each orthologous group, as identified in section A.
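The core of the reconciled tree approach, mapping each gene tree node to the smallest species clade that contains its descendants and flagging a duplication when a node maps to the same clade as one of its children, can be sketched as follows. This is a minimal illustration using nested tuples for trees; it infers duplications only (loss counting is omitted), and the gene names are hypothetical.

```python
def species_clades(species_tree):
    """Return the leaf set (frozenset) of every node in a nested-tuple tree."""
    out = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*[walk(child) for child in node])
        out.append(leaves)
        return leaves

    walk(species_tree)
    return out


def infer_duplications(gene_tree, species_tree, species_of):
    """Return the species clade of each gene-tree node inferred to be a
    duplication under least-common-ancestor reconciliation."""
    clades = species_clades(species_tree)

    def lca(species_set):
        # Smallest species clade containing all the given species.
        return min((c for c in clades if species_set <= c), key=len)

    dups = []

    def walk(node):
        if isinstance(node, str):
            return lca(frozenset([species_of[node]]))
        child_maps = [walk(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        if here in child_maps:  # same mapping as a child: duplication
            dups.append(here)
        return here

    walk(gene_tree)
    return dups


sp_tree = ((("Arabidopsis", "radish"), "poplar"), "rice")
sp_of = {"At1": "Arabidopsis", "Rs1": "radish", "Rs2": "radish", "Pt1": "poplar"}
# Two radish sequences grouping together: one radish-specific duplication.
radish_dups = infer_duplications((("At1", ("Rs1", "Rs2")), "Pt1"), sp_tree, sp_of)
```

The example also shows why disjoint TAs matter: if Rs1 and Rs2 were two non-overlapping fragments of one gene rather than true paralogs, the same spurious radish-lineage duplication would be inferred.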

Lineage-specific selection

The phylogenetic trees generated will also provide the framework for comparing evolutionary rates in the Arabidopsis and Raphanus-Brassica lineages. For each orthologous group tree containing Raphanus, Arabidopsis, and poplar sequences, the number of synonymous (ds) and non-synonymous (dn) substitutions in each branch will be estimated using PAML (9) and RateEstimator (Hanada and Shiu, unpublished). Using the poplar sequence as an outgroup, significant differences in dn/ds between lineages will indicate lineage-specific selection. Genes currently or recently experiencing positive selection will have a dn/ds value significantly greater than one; we will use this criterion to identify positively selected genes in radish. In this framework, we will identify genes that experience common selection pressure across the Brassicaceae species analyzed, as well as genes subject to lineage-specific selection. Since two radish species will be sequenced, we are particularly interested in identifying genes with contrasting selection regimes between species. In the cultivated radish, this will identify candidate domestication genes. Similarly, genes under positive selection in weedy radish are possible contributors to its success as a weed. Finally, to see whether genes in outbred plants experience positive selection at the same frequency as genes in inbred plants, we will determine the sequence polymorphism and variation in Raphanus as outlined in section B to estimate the number of positively selected genes.
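The text does not specify how significance of a dn/ds difference will be assessed. One common option with maximum-likelihood codon models is a likelihood ratio test between a model with a single dn/ds ratio for all branches and a model that allows one lineage its own ratio. The sketch below assumes that the two log-likelihoods have already been obtained from such fits; the numeric values are invented for illustration, and the function handles only the one-degree-of-freedom case.

```python
import math


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (test statistic, p-value). For a chi-square distribution with
    one degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0:
        return stat, 1.0  # alternative model fits no better than the null
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods: one-ratio model vs. a separate radish ratio.
stat, p_value = lrt_one_df(lnl_null=-2345.67, lnl_alt=-2341.12)
lineage_specific = p_value < 0.05
```

A significant result says only that the lineage's dn/ds differs from the background. Concluding positive selection additionally requires the estimated ratio on that branch to be significantly greater than one, as stated above.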

References

  1. Huang, X. and A. Madan, CAP3: A DNA sequence assembly program. Genome Res, 1999. 9(9): p. 868-77.
  2. Johnston, J.S., et al., Evolution of genome size in Brassicaceae. Annals of Botany, 2005. 95(1): p. 229-235.
  3. Shiu, S.-H., M.-C. Shih, and W.H. Li, Transcription factor families have much higher expansion rates in plants than in animals. Plant Physiol, 2005. In press.
  4. Van Dongen, S.M., Graph clustering by flow simulation. 2000, University of Utrecht. p. 169.
  5. Shiu, S.H., et al., Role of positive selection in the retention of duplicate genes in mammalian genomes. Proc Natl Acad Sci U S A, 2006. 103(7): p. 2232-6.
  6. Knowles, D.G. and A. McLysaght, High Rate of Recent Intron Gain and Loss in Simultaneously Duplicated Arabidopsis genes. Mol. Biol. Evol., in press, 2006.
  7. Town, C.D., et al., Comparative Genomics of Brassica oleracea and Arabidopsis thaliana Reveal Gene Loss, Fragmentation, and Dispersal after Polyploidy. Plant Cell, 2006.
  8. Page, R.D. and M.A. Charleston, From gene to organismal phylogeny: reconciled trees and the gene tree/species tree problem. Mol Phylogenet Evol, 1997. 7(2): p. 231-40.
  9. Yang, Z., PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci, 1997. 13(5): p. 555-6.