Analysis:Radish SSR
From RadishDB
Synopsis
- The goals of the analysis are to:
- Identify SSRs in radish EST sequences.
- Determine if the SSRs are located within UTRs or coding sequences.
MISA run
- SSRs were identified by Cedric and Tamika (Jackson State University) using the EST contigs built from the first 5000 clones.
- MISA was run with default parameters.
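For readers without MISA at hand, the kind of search it performs can be approximated with a short regular-expression scan. This is only a sketch and not the MISA code itself: the thresholds in `MIN_REPEATS` (motif length mapped to minimum repeat count) are an assumption about what "default parameters" means here, and `find_ssrs` and `is_primitive` are hypothetical helper names. Compound SSRs are not handled.

```python
import re

# ASSUMPTION: motif length -> minimum number of repeats. These values are
# an assumed stand-in for the MISA defaults, not taken from this page.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % p == 0 and motif[:p] * (n // p) == motif for p in range(1, n)
    )


def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, end, motif, repeat_count) tuples, 1-based inclusive."""
    min_repeats = min_repeats or MIN_REPEATS
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        pos = 0
        while True:
            m = pattern.search(seq, pos)
            if m is None:
                break
            motif = m.group(1)
            if is_primitive(motif):
                count = (m.end() - m.start()) // k
                hits.append((m.start() + 1, m.end(), motif, count))
                pos = m.end()
            else:
                # e.g. 'AA' reported for k=2: skip one base and keep looking.
                pos = m.start() + 1
    return sorted(hits)
```

For example, `find_ssrs("CC" + "AG" * 6 + "TT")` reports a single dinucleotide repeat spanning positions 3 to 14.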
Define CDS boundaries
Plan
- Generate radish EST contigs. This has been done by Yongli at TIGR using their in-house assembler.
- Similarity search of radish contigs with Arabidopsis predicted coding sequences.
Similarity search
- Add library name prefix to each contig file and concatenate them into a single file.
e.g.
python ~/codes/FastaManager.py -f prefix -fasta RS2_RS/contigs -prefix RS2_RS
...
cat ./*/contigs.mod.fa > contigs_all
- BLAST run on calculon, 16 processes. Radish contigs as queries, Arabidopsis peptide sequences as subject database.
python ~/codes/BlastUtility.py -f batch_blast2 -D ./ -db TAIR7_pep_20070425.mod.fa -fasta contigs_all -stype pep -bdir ~/bin/blast -by 16 -pdir ~/codes -pm "-p blastx -v 0 -b 1 -m 8"
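The prefixing step matters because contig names are only unique within a library, so the merged file needs the library name in every header. The sketch below shows the idea on in-memory FASTA lines; it is not the `FastaManager.py` implementation, and the function names and the `_` separator are assumptions made for illustration.

```python
def add_prefix(fasta_lines, prefix, sep="_"):
    """Yield FASTA lines with `prefix + sep` inserted after '>' in each header.

    ASSUMPTION: the separator is '_'; the real script may join differently.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            yield ">" + prefix + sep + line[1:]
        else:
            yield line


def merge_libraries(libraries):
    """Concatenate several libraries into one list of FASTA lines.

    `libraries` maps a library name (e.g. 'RS2_RS') to an iterable of
    FASTA lines for that library's contigs.
    """
    merged = []
    for prefix, lines in libraries.items():
        merged.extend(l.rstrip("\n") for l in add_prefix(lines, prefix))
    return merged
```

With two libraries that both contain a `contig1`, the merged headers become `>RS2_RS_contig1` and `>RR3_NY_contig1`, so BLAST query names stay unambiguous.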
First look
- Many radish contigs have multiple matches to a single Arabidopsis peptide sequence.
- Quite a few radish sequences hit an Arabidopsis peptide in the middle, with long flanking regions that are unknown. What criteria should be used to eliminate this kind of sequence?
- Some assembled contigs appear to have frameshifts. Errors?
- Some do not have a good match to known At peptides. These should go through a UniProt run.
- Are there any At peptides with multiple radish hits?
- What will the criteria be for specifying putative orthologs between sequences from different libraries?
- Mis-assembly of radish ESTs? Mistakes in At gene models? Evolutionary divergence?
- Multiple matches with different orientations.
Second look, classifying potential scenarios
- Match insignificant: no aligned region has both identity >= 60% and E value <= 1e-3.
- Straightforward ones: At pep start/stop lies inside the radish contig
R -------------------------
-------------------->
|||||||||||||||||||||
-------------------->
A ----------------------------
- Type 1 var 1: Multiple matches, may or may not be contiguous.
R -------------------------
------> -----> --->
||||||| |||||| ||||
------> -----> --->
A ---------------------------
- Type 1 var 2: Multiple matches, some opposite orientation
R -------------------------
------> <----- --->
||||||| |||||| ||||
------> <----- <---
A ---------------------------
- CDS internal match
R -------------------------
||||||| ||||
A ---------------------------
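The scenarios above can be told apart, at least roughly, from the tabular BLAST output (`-m 8`) alone. The sketch below is an illustration and not `script_match_cluster.py`: the label names are made up, it assumes the standard 12-column tabular layout, and it treats `qstart > qend` as a reverse-strand blastx hit. It does not separate the straightforward case from the CDS internal case, since that needs the sequence sizes as well.

```python
from collections import defaultdict


def parse_m8(lines):
    """Yield one dict per HSP from BLAST tabular (-m 8) lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        yield {
            "query": f[0], "subject": f[1], "pident": float(f[2]),
            "qstart": int(f[6]), "qend": int(f[7]),
            "evalue": float(f[10]),
        }


def classify_matches(lines, min_ident=60.0, max_evalue=1e-3):
    """Label each (contig, peptide) pair with a hypothetical scenario name."""
    hsps = defaultdict(list)
    for h in parse_m8(lines):
        hsps[(h["query"], h["subject"])].append(h)

    labels = {}
    for pair, group in hsps.items():
        # Keep only aligned regions that pass both thresholds.
        good = [h for h in group
                if h["pident"] >= min_ident and h["evalue"] <= max_evalue]
        if not good:
            labels[pair] = "insignificant"
        elif len(good) == 1:
            labels[pair] = "single"
        else:
            # Forward hits have qstart <= qend; reverse hits the opposite.
            strands = {h["qstart"] <= h["qend"] for h in good}
            labels[pair] = "multi_same" if len(strands) == 1 else "multi_mixed"
    return labels
```

Here `multi_same` corresponds to Type 1 var 1 and `multi_mixed` to Type 1 var 2; the default thresholds are the ones given for the insignificant-match rule.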
Processing
- Get qualified lines: This output is NOT used. Prefiltering not necessary.
python ~/codes/ParseBlast.py -f get_qualified4 -blast radish_est_contigs_vs_atpep -fasta contigs_all -E 3 -I 50 -L 30 -P 0.01 -Q 1
Get sequence sizes...
Parse blast output:
eT : 3.0
idenT : 50.0
matchL: 30
lengT : 0.01
25017 total, 20544 qualified
- Get sequence sizes
python ~/codes/FastaManager.py -f get_sizes -fasta TAIR7_pep_20070425.mod.fa
python ~/codes/FastaManager.py -f get_sizes -fasta contigs_all
- Consolidate matches
python ~/project/radish/_script/script_match_cluster.py radish_est_contigs_vs_atpep > run.log
Total 18840 queries, 14362 has one matching region
Thresholds: idT: 50 evT: 1e-05
Lib     #EC   #Atpep
RR1_CS  2475  2236
RR2_MS  2399  2020
RR3_NY  2729  2439
RR4_PB  2646  2353
RS1_AR  2208  1966
RS2_RS  2271  2030
RS3_RT  2398  2118
- Mapping SSRs
ln -s */contigs_*.misa ./
python ~/project/radish/_script/script_map_misa_location.py radish_est_contigs_vs_atpep.match_cluster RR3_NY.fasta.misa
...
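The mapping step comes down to comparing each SSR's coordinates on the contig with the region of the contig that aligns to the Arabidopsis peptide. The sketch below illustrates that comparison only; it is not `script_map_misa_location.py`, and the function name, arguments, and labels are assumptions. Flanking regions are called "putative" UTRs because the aligned region may not reach the true start or stop codon.

```python
def locate_ssr(ssr_start, ssr_end, match_start, match_end, forward=True):
    """Classify an SSR relative to the peptide-matching region of a contig.

    All coordinates are positions on the contig. `forward` is False when
    the peptide matches the reverse strand, which swaps the two flanks.
    ASSUMPTION: any overlap with the matched region counts as CDS.
    """
    lo, hi = sorted((match_start, match_end))
    if ssr_end < lo:
        return "putative_5UTR" if forward else "putative_3UTR"
    if ssr_start > hi:
        return "putative_3UTR" if forward else "putative_5UTR"
    return "CDS"
```

For a contig whose matched region spans 101 to 400 on the forward strand, an SSR at 20 to 40 falls in the putative 5' UTR, one at 150 to 170 in the CDS, and one at 450 to 470 in the putative 3' UTR.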
Results
- First try: radish_est5000_ssr_map.xls
- Full dataset: radish_est25000_ssr_map_080529.xls.gz
