Analysis:Radish SSR

From RadishDB

Synopsis

  • The goals of the analysis are to:
    1. Identify SSRs in radish EST sequences.
    2. Determine if the SSRs are located within UTRs or coding sequences.

MISA run

  • SSRs were identified by Cedric and Tamika (Jackson State University) using the EST contigs built from the first 5,000 clones.
  • Default parameters were used.
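  • For reference, the kind of perfect-repeat search MISA performs can be sketched in a few lines of Python. This is not MISA itself: the function names are hypothetical, the thresholds are assumed to mirror the values commonly shipped in misa.ini (they were not recorded on this page), and compound SSRs are not handled.

```python
import re

# ASSUMPTION: unit size -> minimum number of repeats. These follow the values
# commonly shipped in misa.ini; the actual run settings are not recorded here.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter unit (e.g. 'AA')."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, unit, n_repeats) tuples, 1-based inclusive."""
    seq = seq.upper()
    hits = []
    for size, min_n in sorted(min_repeats.items()):
        # One unit of `size` bases followed by at least min_n - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_n - 1))
        pos = 0
        while True:
            m = pattern.search(seq, pos)
            if m is None:
                break
            unit = m.group(1)
            if is_primitive(unit):
                hits.append((m.start() + 1, m.end(), unit, len(m.group(0)) // size))
                pos = m.end()
            else:
                # Skip units such as 'AA' and retry one base further along.
                pos = m.start() + 1
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_ssrs("GGG" + "AT" * 7 + "CCC"):
        print(hit)
```

    The primitive-unit check keeps a long poly-A run from also being reported as an (AA)n dinucleotide repeat.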

Define CDS boundaries

Plan

  • Generate radish EST contigs. This has been done by Yongli at TIGR using their in-house assembler.
  • Run a similarity search of the radish contigs against Arabidopsis predicted coding sequences.

Similarity search

  • Add the library name as a prefix to the contigs in each file, then concatenate the files into a single file.
 e.g.
 python ~/codes/FastaManager.py -f prefix -fasta RS2_RS/contigs -prefix RS2_RS
 ...
 cat ./*/contigs.mod.fa > contigs_all
  • BLAST run on calculon, 16 processes. Radish contigs as queries, Arabidopsis peptide sequences as subject database.
 python ~/codes/BlastUtility.py -f batch_blast2 -D ./ -db TAIR7_pep_20070425.mod.fa -fasta contigs_all -stype pep -bdir ~/bin/blast -by 16 -pdir ~/codes -pm "-p blastx -v 0 -b 1 -m 8"
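  • The prefixing step is handled by the in-house FastaManager.py, whose exact output format is not shown here. A minimal sketch of the same idea, assuming the prefix is joined to the sequence ID with an underscore (the separator and both function names are hypothetical):

```python
def add_prefix(lines, prefix, sep="_"):
    """Yield FASTA lines with prefix + sep prepended to every sequence ID."""
    for line in lines:
        if line.startswith(">"):
            # Only header lines change; sequence lines pass through untouched.
            yield ">" + prefix + sep + line[1:]
        else:
            yield line


def merge_libraries(libraries, sep="_"):
    """Yield the prefixed lines of every library in turn.

    `libraries` maps a library name (e.g. "RS2_RS") to an iterable of
    FASTA lines, such as an open file handle.
    """
    for name, lines in libraries.items():
        for line in add_prefix(lines, name, sep):
            yield line


if __name__ == "__main__":
    demo = {
        "RS2_RS": [">contig1\n", "ACGTACGT\n"],
        "RR3_NY": [">contig1\n", "TTGACCA\n"],
    }
    print("".join(merge_libraries(demo)), end="")
```

    Prefixing before concatenation keeps contig names unique across libraries, since each library's assembly may reuse the same contig names.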

First look

  • Many radish contigs have multiple matches to a single Arabidopsis peptide sequence.
  • Quite a few radish sequences hit an Arabidopsis peptide in the middle, with a lot of flanking sequence that is unknown. What criteria should be used to eliminate this kind of sequence?
  • Some assembled contigs appear to have frameshifts. Are these errors?
  • Some do not have a good match to known At peptides. These should go into a UniProt run.
  • Do any At peptides have multiple radish hits?
  • What will the criteria be for specifying putative orthologs between sequences from different libraries?
  • Mis-assembly of radish ESTs? Mistakes in At gene models? Evolutionary divergence?
  • Multiple matches with different orientations.

Second look, classifying potential scenarios

  • Match insignificant: no aligned region has both identity >= 60% and E value <= 1e-3.
  • Straightforward ones: At pep start/stop lies inside the radish contig
 R -------------------------
       -------------------->
       |||||||||||||||||||||
       -------------------->
 A     ---------------------------- 
  • Type 1 var 1: Multiple matches, may or may not be contiguous.
 R -------------------------
       ------> ----->   --->
       ||||||| ||||||   ||||
       ------> ----->   --->
 A     ---------------------------
  • Type 1 var 2: Multiple matches, some in opposite orientation
 R -------------------------
       ------> <-----   --->
       ||||||| ||||||   ||||
       ------> <-----   <---
 A     ---------------------------
  • CDS internal match
 R -------------------------
         |||||||  ||||
 A ---------------------------
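  • The scenarios above could be assigned automatically from the tabular (-m 8) BLAST output. The sketch below is one possible reading of the categories, not the rule set actually used: the function names, the order of the tests, and the projection of the peptide ends onto the contig (3 nt per residue, gaps ignored) are all assumptions. It takes every HSP between one contig and one peptide, plus the two sequence lengths.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "query subject identity length qstart qend sstart send evalue")


def parse_m8_line(line):
    """Parse one line of 12-column BLAST tabular (-m 8) output."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), int(f[3]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]), float(f[10]))


def classify(hits, qlen, slen, min_identity=60.0, max_evalue=1e-3):
    """Classify the HSPs between one contig (length qlen, nt) and one peptide (slen, aa)."""
    good = [h for h in hits if h.identity >= min_identity and h.evalue <= max_evalue]
    if not good:
        return "insignificant"
    # For blastx, a reverse-frame HSP is reported with qstart > qend.
    strands = {"+" if h.qstart <= h.qend else "-" for h in good}
    if len(strands) > 1:
        return "type 1 var 2"
    if len(good) > 1:
        return "type 1 var 1"
    h = good[0]
    lo, hi = sorted((h.qstart, h.qend))
    left_flank, right_flank = lo - 1, qlen - hi
    # Nucleotides needed beyond the HSP to reach the peptide start / stop.
    need_start = 3 * (h.sstart - 1)
    need_stop = 3 * (slen - h.send)
    if "+" in strands:
        start_inside = need_start <= left_flank
        stop_inside = need_stop <= right_flank
    else:
        start_inside = need_start <= right_flank
        stop_inside = need_stop <= left_flank
    return "straightforward" if (start_inside or stop_inside) else "CDS internal"
```

    Here "straightforward" is returned when either projected peptide end falls inside the contig; requiring both ends would be a stricter alternative.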

Processing

  • Get qualified lines: This output is NOT used. Prefiltering not necessary.
 python ~/codes/ParseBlast.py -f get_qualified4 -blast radish_est_contigs_vs_atpep -fasta contigs_all -E 3 -I 50 -L 30 -P 0.01 -Q 1
 Get sequence sizes...
 Parse blast output:
  eT    : 3.0
  idenT : 50.0
  matchL: 30
  lengT : 0.01
 25017 total, 20544 qualified
  • Get sequence sizes
 python ~/codes/FastaManager.py -f get_sizes -fasta TAIR7_pep_20070425.mod.fa
 python ~/codes/FastaManager.py -f get_sizes -fasta contigs_all
  • Consolidate matches
 python ~/project/radish/_script/script_match_cluster.py radish_est_contigs_vs_atpep > run.log
 Total 18840 queries, 14362 has one matching region
 Thresholds:
  idT: 50
  evT: 1e-05
 Lib     #EC     #Atpep
 RR1_CS  2475    2236
 RR2_MS  2399    2020
 RR3_NY  2729    2439
 RR4_PB  2646    2353
 RS1_AR  2208    1966
 RS2_RS  2271    2030
 RS3_RT  2398    2118
  • Mapping SSRs
 ln -s */contigs_*.misa ./
 python ~/project/radish/_script/script_map_misa_location.py radish_est_contigs_vs_atpep.match_cluster RR3_NY.fasta.misa
 ...
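  • The behaviour of script_map_misa_location.py is not documented on this page. As an illustration of the mapping step only, the sketch below labels an SSR relative to a putative CDS interval on the contig. The function name, the region labels, and the idea that the CDS interval has already been derived from the consolidated matches are assumptions.

```python
def locate_ssr(ssr_start, ssr_end, cds_start, cds_end, strand="+"):
    """Label an SSR relative to a CDS on the same contig.

    All coordinates are 1-based and inclusive, with start <= end.
    `strand` is the orientation of the CDS on the contig.
    """
    if ssr_start >= cds_start and ssr_end <= cds_end:
        return "CDS"
    if ssr_end < cds_start:
        side = "left"
    elif ssr_start > cds_end:
        side = "right"
    else:
        # The SSR overlaps one edge of the CDS.
        return "CDS/UTR boundary"
    # On the minus strand the 5' end of the transcript is on the right.
    if strand == "+":
        return "5'UTR" if side == "left" else "3'UTR"
    return "3'UTR" if side == "left" else "5'UTR"


if __name__ == "__main__":
    # Hypothetical contig with a CDS projected to positions 100..400.
    for start, end in [(10, 25), (150, 170), (395, 410), (420, 440)]:
        print(start, end, locate_ssr(start, end, 100, 400))
```

    The UTR labels are only meaningful when the peptide start or stop projects inside the contig; for CDS internal matches the flanking sequence may still be coding.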

Results