(bio)-informatics, data processing and visualization

Sunday, December 20, 2009

sequence trimming using 'cut'

original FASTA file (there is an assumption that sequence string in one single line):

bash-2.03$ less fastassy_original.fa
>3SEQS_D L:60
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>1SEQS_E L:60
ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
>2SEQS_B L:60
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>1SEQS_A L:60
GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG


trim 10 nt from the right side (3') - one step approach:
(original sequences are 60 nt long)
bash-2.03$ cut -c1-50 fastassy_original.fa > fastassy_trim_right.fa
bash-2.03$ less fastassy_trim_right.fa
>3SEQS_D L:60
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGAT
>1SEQS_E L:60
ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGA
>2SEQS_B L:60
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGA
>1SEQS_A L:60
GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTA


trim 10 nt from the left side (5') - two step approach:
bash-2.03$ cp -i fastassy_original.fa fastassy_temporary.fa
bash-2.03$ perl -p -i -e 's/^\>/XXXXXXXXXX\>/' fastassy_temporary.fa
(step 1 - to modify FASTA header only)
bash-2.03$ less fastassy_temporary.fa
XXXXXXXXXX>3SEQS_D L:60
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
XXXXXXXXXX>1SEQS_E L:60
ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
XXXXXXXXXX>2SEQS_B L:60
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
XXXXXXXXXX>1SEQS_A L:60
GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG
bash-2.03$
bash-2.03$ cut -c11-60 fastassy_temporary.fa > fastassy_trim_left.fa
(step 2 - to cut first 10 characters)
bash-2.03$
bash-2.03$ less fastassy_trim_left.fa
>3SEQS_D L:60
AAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>1SEQS_E L:60
CAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
>2SEQS_B L:60
CCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>1SEQS_A L:60
TTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

Saturday, December 19, 2009

non-redundant FASTA dataset using 'sort' and 'uniq'

original file:
>SEQS_A
GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG
>SEQS_B
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>SEQS_C
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>SEQS_D
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>SEQS_E
ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
>SEQS_F
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>SEQS_G
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT

step A - use seqs_processor to generate tab-delimited file:
http://code.google.com/p/atgc-tools/wiki/seqs_processor_and_translator
SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG
SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
SEQS_C 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
SEQS_F 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
SEQS_G 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT

step B - use Unix sort and uniq to generate non-redundant set -
with redundancy count (option -c):
sort -k 3 fastassy.tab | uniq -f 2 -c > fastassy.tab.uniq
or without redundancy count:
sort -k 3 fastassy.tab | uniq -f 2 > fastassy.tab.uniq

with option '-c' we will get:
3 SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
1 SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
2 SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
1 SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

without redundancy count:
SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

step C - back to FASTA format:

cp -i fastassy.tab.uniq fastassy.nr.fasta

to remove leading white-space:
perl -p -i -e 's/^ {1,}//' fastassy.nr.fasta
3 SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
1 SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
2 SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
1 SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

to remove all remaining white-space:
perl -p -i -e 's/ {1,}//' fastassy.nr.fasta
3SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
1SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
2SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
1SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

to restore FASTA '>' sign:
perl -p -i -e 's/^/\>/' fastassy.nr.fasta
to replace first 'tab' with length info:
perl -p -i -e 's/\t/ L\:/' fastassy.nr.fasta
>3SEQS_D L:60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>1SEQS_E L:60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
>2SEQS_B L:60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>1SEQS_A L:60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

to replace remaining 'tab' with a new line:
perl -p -i -e 's/\t/ \n/' fastassy.nr.fasta
>3SEQS_D L:60
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>1SEQS_E L:60
ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
>2SEQS_B L:60
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>1SEQS_A L:60
GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

Done!

Friday, December 18, 2009

bash batch run with echo and sed

TUTORIAL - HOW TO BLAST DNA SEQUENCES IN fastassy.dna FILE
VERSUS SEVERAL DB FILES LOCATED IN ANOTHER DIRECTORY:


[ EXAMPLE OF FILE NAME MODIFICATIONS USING echo AND sed ]

### WORKING DIRECTORY ###
bash-2.03$ pwd
/net/bfs3/solexa_assembly_analysis
bash-2.03$
### BLAST INPUT FILE IN CURRENT WORKING DIRECTORY ###
bash-2.03$ ls -l fastassy.*

-rw-r--r-- 1 akozik rwm 989 Dec 18 09:27 fastassy.dna
bash-2.03$
### BLAST DATABASE FILES IN ANOTHER DIRECTORY ###
bash-2.03$ ls -l /net/bfs3/solexa_assembly_input_lact/090721*.100_125.fasta
-rw-r--r-- 1 akozik other 339278682 Nov 22 07:45 /net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125.fasta
-rw-r--r-- 1 akozik other 475988197 Nov 22 07:46 /net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L2._GC_.100_125.fasta
-rw-r--r-- 1 akozik other 566040760 Nov 22 07:46 /net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L3._GC_.100_125.fasta
-rw-r--r-- 1 akozik other 580446573 Nov 22 07:47 /net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L4._GC_.100_125.fasta
-rw-r--r-- 1 akozik other 638547695 Nov 22 07:47 /net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L5._GC_.100_125.fasta
-rw-r--r-- 1 akozik other 636473338 Nov 22 07:48 /net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L6._GC_.100_125.fasta
-rw-r--r-- 1 akozik other 738071638 Nov 22 07:48 /net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L7._GC_.100_125.fasta
bash-2.03$
### CHECK FULL PATH TO TARGET FILES ###
bash-2.03$ for long_crap in /net/bfs3/solexa_assembly_input_lact/090721*.100_125.fasta; do echo $long_crap; done
/net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125.fasta
/net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L2._GC_.100_125.fasta
/net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L3._GC_.100_125.fasta
/net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L4._GC_.100_125.fasta
/net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L5._GC_.100_125.fasta
/net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L6._GC_.100_125.fasta
/net/bfs3/solexa_assembly_input_lact/090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L7._GC_.100_125.fasta
bash-2.03$
### GET TARGET FILE NAMES - SINGLE LINE SYNTAX ###
bash-2.03$ for long_crap in /net/bfs3/solexa_assembly_input_lact/090721*.100_125.fasta; do short_crap=$(echo $long_crap | sed -e "s/\/.*\///"); echo $short_crap; done
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L2._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L3._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L4._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L5._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L6._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L7._GC_.100_125.fasta
bash-2.03$

### GET TARGET FILE NAMES - SYNTAX IN SEVERAL LINES ###
bash-2.03$ for long_crap in /net/bfs3/solexa_assembly_input_lact/090721*.100_125.fasta;
> do short_crap=$(echo $long_crap | sed -e "s/\/.*\///");
> echo $short_crap; done
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L2._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L3._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L4._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L5._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L6._GC_.100_125.fasta
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L7._GC_.100_125.fasta
bash-2.03$
### GET TRUNCATED FILE NAMES ###
bash-2.03$ for long_crap in /net/bfs3/solexa_assembly_input_lact/090721*.100_125.fasta;
> do short_crap=$(echo $long_crap | sed -e "s/\/.*\///");
> tiny_crap=$(echo $short_crap | sed -e "s/\.fasta//");
> echo $tiny_crap;
> done
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L2._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L3._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L4._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L5._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L6._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L7._GC_.100_125
bash-2.03$
### FINALLY - BATCH BLAST RUN ###
bash-2.03$ for long_crap in /net/bfs3/solexa_assembly_input_lact/090721*.100_125.fasta;
> do short_crap=$(echo $long_crap | sed -e "s/\/.*\///");
> tiny_crap=$(echo $short_crap | sed -e "s/\.fasta//");
> echo $tiny_crap;
> blastall -p blastn -F F -V T -e 1e-10 -b 1000 -v 1000 -i fastassy.dna -d $long_crap -o fastassy.dna.$tiny_crap;
> done
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L2._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L3._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L4._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L5._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L6._GC_.100_125
090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L7._GC_.100_125
bash-2.03$
### BLAST OUTPUT FILES ###
bash-2.03$ ls -l fastassy.*
-rw-r--r-- 1 akozik rwm 989 Dec 18 09:27 fastassy.dna
-rw-r--r-- 1 akozik rwm 9584 Dec 18 10:17 fastassy.dna.090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125
-rw-r--r-- 1 akozik rwm 10100 Dec 18 10:18 fastassy.dna.090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L2._GC_.100_125
-rw-r--r-- 1 akozik rwm 9405 Dec 18 10:19 fastassy.dna.090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L3._GC_.100_125
-rw-r--r-- 1 akozik rwm 10768 Dec 18 10:19 fastassy.dna.090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L4._GC_.100_125
-rw-r--r-- 1 akozik rwm 10156 Dec 18 10:20 fastassy.dna.090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L5._GC_.100_125
-rw-r--r-- 1 akozik rwm 6940 Dec 18 10:21 fastassy.dna.090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L6._GC_.100_125
-rw-r--r-- 1 akozik rwm 6797 Dec 18 10:22 fastassy.dna.090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L7._GC_.100_125
bash-2.03$
### DONE ! ###
bash-2.03$ less fastassy.dna.090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125
BLASTN 2.2.17 [Aug-26-2007]
...
Query= 1LTV1A_235441_266_4 L:300
(300 letters)

Database: 090721_SOLEXA1_PEFC42D3MAAXX_Lact_Genomic.L1._GC_.100_125.fa
sta
2,177,045 sequences; 248,580,403 total letters

Searching..................................................done

Score E
Sequences producing significant alignments: (bits) Value

SBP1_1_48_724_728 | 34843 [ 122 ] 194 1e-48
SBP1_1_87_749_1889 | 24530 [ 125 ] 192 5e-48
SBP1_1_40_892_1154 | 36901 [ 122 ] 168 8e-41
SBP1_1_58_664_973 | 37778 [ 110 ] 159 8e-38
SBP1_1_68_519_348 | 27743 [ 121 ] 92 1e-17

>SBP1_1_48_724_728 | 34843 [ 122 ]
Length = 122

Score = 194 bits (98), Expect = 1e-48
Identities = 113/118 (95%)
Strand = Plus / Plus


Query: 1 atggagcttttaaggaatcgttgaccaatgacattcatgagatgcttgagttatatgagg 60
|||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||
Sbjct: 5 atggagcttttaaggaatcgttgaccaatgacatccatgagatgcttgagttatatgagg 64


Query: 61 caacatatatgagggtgaaaggagaagttgtactagaggaagctcttctttttacaaa 118
|||||| |||||| ||||||||||||||||||||||| |||||||||||||| |||||
Sbjct: 65 caacatttatgagagtgaaaggagaagttgtactagacgaagctcttcttttcacaaa 122

...

Just got started

perl -p -i -e 's/\r//' *.txt