(bio)-informatics, data processing and visualization

Saturday, December 19, 2009

non-redundant FASTA dataset using 'sort' and 'uniq'

original file:
>SEQS_A
GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG
>SEQS_B
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>SEQS_C
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>SEQS_D
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>SEQS_E
ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
>SEQS_F
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>SEQS_G
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT

step A - use seqs_processor to generate a tab-delimited file (columns: ID, length, sequence):
http://code.google.com/p/atgc-tools/wiki/seqs_processor_and_translator
SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG
SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
SEQS_C 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
SEQS_F 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
SEQS_G 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
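
If seqs_processor is not at hand, the same three-column layout can probably be produced with plain awk. This is only a sketch, not how seqs_processor itself works: the file names demo.fasta and demo.tab are made up for the example, and it assumes the ID is the first word of each header line. Sequences wrapped over several lines are joined before the length is counted.

```shell
workdir=$(mktemp -d)

# Small example FASTA file; the second sequence is wrapped over two lines.
cat > "$workdir/demo.fasta" <<'EOF'
>SEQS_A
GGTTGTTGG
>SEQS_B
CTTTG
TGAAT
EOF

# Print "ID<TAB>length<TAB>sequence" for every record.
awk '
  /^>/ { if (id != "") print id "\t" length(seq) "\t" seq
         id = substr($1, 2); seq = ""; next }
       { seq = seq $0 }
  END  { if (id != "") print id "\t" length(seq) "\t" seq }
' "$workdir/demo.fasta" > "$workdir/demo.tab"

cat "$workdir/demo.tab"
```

The output has one line per record, which is what sort and uniq need in step B.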

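step B - use Unix sort and uniq to generate a non-redundant set.
'sort -k 3' orders the lines by the third field (the sequence), so identical sequences end up on adjacent lines; 'uniq -f 2' skips the first two fields (ID and length) when comparing lines and keeps the first line of each group.
with redundancy count (option -c):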
sort -k 3 fastassy.tab | uniq -f 2 -c > fastassy.tab.uniq
or without redundancy count:
sort -k 3 fastassy.tab | uniq -f 2 > fastassy.tab.uniq

with option '-c' we get (first column = number of copies; in the actual file uniq pads it with leading spaces):
3 SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
1 SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
2 SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
1 SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

without redundancy count:
SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG
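
A quick way to check step B is to count the distinct sequences directly and compare with the number of lines that uniq kept. The sketch below runs on a made-up miniature table (demo.tab, with short hypothetical sequences); setting LC_ALL=C and restricting the sort key to column 3 with an explicit tab separator are my additions, meant to keep the ordering independent of the locale.

```shell
workdir=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical miniature of fastassy.tab: ID, length, sequence.
printf 'S1\t4\tGGTT\nS2\t4\tCTTT\nS3\t4\tCTTT\nS4\t4\tAAAT\nS5\t4\tAAAT\nS6\t4\tAAAT\n' \
  > "$workdir/demo.tab"

# Sort on the sequence column only, then collapse identical sequences with counts.
LC_ALL=C sort -t "$tab" -k 3,3 "$workdir/demo.tab" | uniq -f 2 -c > "$workdir/demo.tab.uniq"
cat "$workdir/demo.tab.uniq"

# Cross-check: distinct values in column 3 versus lines kept by uniq.
nr_lines=$(wc -l < "$workdir/demo.tab.uniq" | tr -d ' ')
distinct=$(cut -f 3 "$workdir/demo.tab" | LC_ALL=C sort -u | wc -l | tr -d ' ')
echo "non-redundant lines: $nr_lines, distinct sequences: $distinct"
```

If the two numbers differ, the usual suspects are stray spaces or carriage returns inside the sequence column.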

step C - back to FASTA format (starting from the output with redundancy count; work on a copy):

cp -i fastassy.tab.uniq fastassy.nr.fasta

to remove leading white-space:
perl -p -i -e 's/^ {1,}//' fastassy.nr.fasta
3 SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
1 SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
2 SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
1 SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

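to remove the space between the count and the sequence ID (without the /g flag only the first match on each line is replaced, so nothing else is touched):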
perl -p -i -e 's/ {1,}//' fastassy.nr.fasta
3SEQS_D 60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
1SEQS_E 60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
2SEQS_B 60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
1SEQS_A 60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

to restore FASTA '>' sign:
perl -p -i -e 's/^/\>/' fastassy.nr.fasta
to replace the first tab with the ' L:' length label:
perl -p -i -e 's/\t/ L\:/' fastassy.nr.fasta
>3SEQS_D L:60 AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>1SEQS_E L:60 ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
>2SEQS_B L:60 CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>1SEQS_A L:60 GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG

to replace the remaining tab with a newline (the replacement leaves a trailing space on the header line):
perl -p -i -e 's/\t/ \n/' fastassy.nr.fasta
>3SEQS_D L:60
AAATCCTGTCAAAATGGAAATTTTATATTTAAGAAAAGTAACAAAATGATAATTTTATAT
>1SEQS_E L:60
ATCTATACGACAAAATTGCAGTTTTTTTTGTCCTATACAAAAAGGCGGGACAAAGAATCT
>2SEQS_B L:60
CTTTGTGAATCCGCTATATTTTCTTTCTTTGCCCATTTCGGCGCATATGATTTCTAGAAG
>1SEQS_A L:60
GGTTGTTGGTTTTTTGGTGTTTTTGCATACAAGTTTTAGAAGGTGTTGTATCATTCTCAG
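
The perl edits of step C could also be collapsed into one awk command that reads the 'uniq -c' output and writes the FASTA file in a single pass. This is a sketch of an alternative, not a replacement for the steps above: the demo file names and the short sequences are invented, and it assumes that IDs contain no spaces or tabs.

```shell
workdir=$(mktemp -d)

# Hypothetical 'uniq -c' style input: padded count, space, ID<TAB>length<TAB>sequence.
printf '      3 SEQS_D\t8\tAAATCCTG\n      1 SEQS_E\t8\tATCTATAC\n' \
  > "$workdir/demo.tab.uniq"

# Default awk field splitting treats runs of spaces and tabs alike:
# $1 = count, $2 = ID, $3 = length, $4 = sequence.
awk '{ printf ">%s%s L:%s\n%s\n", $1, $2, $3, $4 }' \
  "$workdir/demo.tab.uniq" > "$workdir/demo.nr.fasta"

cat "$workdir/demo.nr.fasta"
```

The header keeps the same layout as above (count glued to the ID, then the length), without the trailing space.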

Done!
