cadmium-gcat

(bio)-informatics, data processing and visualization

Tuesday, June 11, 2024

Choosing the right parameters - LAST: find & align related regions of sequences

LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.
https://genome.cshlp.org/content/21/3/487.long
https://gitlab.com/mcfrith/last 
https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst 

Example outputs from a self-comparison of a lettuce ribosomal NOR unit with various parameter settings:

lastdb -w1 -W1 LST_01_Ribo_12K.LastDB LST_01_Ribo_12K.Fasta

lastal -k1 -l1  -m10 -j4 -g1.0 -u2 -w0  -D100000 -s2 LST_01_Ribo_12K.LastDB LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.A 



lastal -k1 -l1  -m10 -j4 -g1.0 -u2 -w0    -D1000 -s2 LST_01_Ribo_12K.LastDB LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.B 


 

lastal -k1 -l1  -m10 -j4 -g1.0 -u2 -w0     -D100 -s2 LST_01_Ribo_12K.LastDB LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.C 


 

lastal -k1 -l1 -m100 -j4 -g1.0 -u2 -w0     -D100 -s2 LST_01_Ribo_12K.LastDB LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.D 


 

lastal -k1 -l1 -m100 -j4 -g1.0 -u2 -w0 -W1 -D100 -s2 LST_01_Ribo_12K.LastDB LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.E 


 

lastal -k1 -l1 -m100 -j4 -g1.0 -u0 -w0 -W1 -D100 -s2 LST_01_Ribo_12K.LastDB LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.F 


 

===============================================

lastdb -c -w1 -W1 LST_01_Ribo_12K.LastDB.x LST_01_Ribo_12K.Fasta

lastal -k1 -l1  -m10 -j4 -g1.0 -u2 -w0  -D100000 -s2 LST_01_Ribo_12K.LastDB.x LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.x.A 



lastal -k1 -l1  -m10 -j4 -g1.0 -u2 -w0    -D1000 -s2 LST_01_Ribo_12K.LastDB.x LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.x.B 



lastal -k1 -l1  -m10 -j4 -g1.0 -u2 -w0     -D100 -s2 LST_01_Ribo_12K.LastDB.x LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.x.C 



lastal -k1 -l1 -m100 -j4 -g1.0 -u2 -w0     -D100 -s2 LST_01_Ribo_12K.LastDB.x LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.x.D 



lastal -k1 -l1 -m100 -j4 -g1.0 -u2 -w0 -W1 -D100 -s2 LST_01_Ribo_12K.LastDB.x LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.x.E 



lastal -k1 -l1 -m100 -j4 -g1.0 -u0 -w0 -W1 -D100 -s2 LST_01_Ribo_12K.LastDB.x LST_01_Ribo_12K.Fasta > LST_01_Ribo_12K.vs.Self.TEST.x.F 


 

===============================================

>LST_01 Lsat_Tandem_01 m64069_200209_192322.967
CGCGGGTAGAATCCTTTGCAGACGACTTAAATACGCGACGGGGTATTGTAAGTGGCAGAGTGGCCTTGCTGCCACGATCCACTGAGATTCAGCCCTGCGTCGCTCAGATTCGTCCCTCCCCCCCAAAACAAGCCCCCTCATTTTTCCTTCCATGCATACGGACGAGAGGCTGGCTCCCCGACACTTGGTAAAATTTCAGACATTTTGTGACTTGGCGAAAAAAAAGTTCCAAGTCAACCTAAAAAGTTGCCCTTGTCGTATAATGAGTGATGATAGGCCATGGGGGACTACCACCACTTGGTGCCCAGAAGCATATAATGAGTGGACAAGGCATGGTGTTGGGAATTATGCATCCTCGGGGAAATCAGTGTCTGTTCCTGTCACCAAGCTCTTTATGCAATATGTATATATAGGGGGTACATGGGGACTAATACTACCCTTGGTGCCCGACGAGTGTGTTGGGAAACCTAAGCAAGCGAGGCTGGCAAGGCAGGCTACCCAAGGGAACAAGGCACCTGGCCACACACATGCCCATGAACGGCCAAGGGAACAAGGCACCTTGCCACACACACGCCCATGAACTCCCAAGGGAACGAGGCCCCTTGCCACACACATGCCCATCCATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACATGCCCATCAATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACTTGCCCATCCATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACTTGCCCATGAACGGCCAAGGAAACAAGGCACCTTGCCACACAGCATGCCCATGAACGACCAAGGGACCAAGGCACCTTGCCACACACATGCCCATCCATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACTTGCCCATGAACGGCCAAGGAAGCTAGGCCCCTTGCCACACAGCATGCCCCATGAACGACCAAGGGAACGAGGCACCTTGCCACACACATGCCCATCCATGAACGACCAAGGGAAGAATGCATGTGAGGCTGGCAAGGCTAGGTCATGGCAAGCCACACAGTCAGGATACCTATGGGAACAAGGCATCTTGCCACACACATGCCCATGAACGACCAAGGGAACAAGGCACCTTGCCACACACATGCCCATGAACGACCAAGGAAACGAGGCACCTTGCCACACAGCATGCCCATGAACGACCAAGGGAACGAGGCCCCTTGCCACACACATGCCCATCCATGAACGACCAAGGGAACGAGGCCCCTTGCCACACACAGCCCCATGAATGACCAAGGAAGCTAGGCCCCTTGCCACACACATGCCCATCCATGAACGACCAATGGAACGAGGCCCCTTGCCACACAACATGCCCATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACATGGCCATCCATGAACGACCAAGGGAACATGGCTACTTCCCACACACAGCCCCATGAACGGCCAAGGGAACAGGACTTCTTCCCAAACAGACACCCATGAACGACCAAGTGAACAAGGCTTGCTCCCACACACAGCCCCATGAACGACTAAGGGAACAAGCCTACTTCCTACACAAACACTCATGAACGACCATGAAAACGAGGCATCTTGCCACACACATGCCCTTGAACGAACCAATCCCATCCTCATCGTTCTAAGGAACAAGGGGCCTTGACAAACCATGAGCAGATTTCTAAGCAAGTCAAACGATGAAGGGAGCAAAGTGTTGTAACCGAATCCGTTTAAGTGTTGTAGTTCCTACTTACTACATAGCCACCAGGCGAACTGCTCGTGCTCTTTCGGGTTCTTTCGTTTTGCGACTAATGAATAGCTAGAAGCCTTATCTGCCTACCTATCAGTAGGTGTAGGCGACAGAGACAAAAACCCGAACACGATATATTTCAATTTCAAGGTCTATGTTCACAATGACCCACGCAAAGTTTCAATGACATTCCAACTTGATTTGAGCTGCATATCAACAAGTAGGGAACCGAACGCTTCA
AGGCATGGACAGGGATAGGTCATGGACATTTATGCCCCAAACATGACATTTTTTTTATTTCAACGCCTTTCATTTATAATAAAAAGCATCCCTACAAAATTTCGTGGGAATCCGAATTAATTCGAGCGAGTTATGGACGATTTCGCAAAATTATGCATCCCTGGGCAAATCAATGTCTGTTCCTGTCACCAAGCTCTTTGTGCAATATGTATATATAGGGGGTACCAGGGGACTGCTCTCCATCGGCGCCCTGGGGGGTGTCTAGCGCCCAAGGCTGTCGAGGCTGGCACGCTCGAGGCCATGGCCAAGGCGCCCGGTCTTCACGAAAACGAGGCCATCAACGCCCCCATAGACGCCCCACTCGCGTGTCCCCTCGCCCCGACGTCGTGCGTGTCCAAAAATTATGCATCATCAGGGAAATCCATGTCCGTTCCTGTCACCAAGCTCTTTGTGCAATATGTATATATAGGGGGGTACCATGGGGGTCTCATTCGCATGGGTGCCCGACTTGGGCGATGGGCAGCTCAGAGTCATGAGGCTGTATCGAGCACCAACAAGGCAAGCCAAGCCAAGCCCCACGACCCATGAAAGAGGCCAACACCCCCAGCACGCTGCCTTGATTTCTCGAACGATGGCCTGAGAAATAGCCCCGTTGCCTTGACTTTCCTCGACGCCGTGCGCATGAACTAGCCCCGTTGCCTTGATTTTCCTCGACGCCGTGCGCACAAACTAGCCCCGTTGCCTTGATTTTCCTCGACGCCGTGCGCATATTAAAAATTATGCATAATCAGGGAAATCCATGTCTGTTCCTGTCACCAAGCTCTTTGTGCAATATGTATATATAGGGGGGGAGCCGTGAGTGAGCACGGCAAGGGGTGAACGCGCCAGATGGCCTTTCAGCGCGATCATGGGTGACGGTTGCTTAGTTCCGAGTTGCTTAGATAATACCCCTCCGAGTGGGGGCCTTGGCGGGGTGTGGGGATGCCTCGTCAGGTCGCTGCGACAACAGTCCAGGGTTGTGGTCTAGCCTTGGAAATGTGGGATGAGCTGTCTGTGTGGCAATATGATGACGATGTGTGCTTGCTAATCTTGTTCGACGTGCTTCTGGTGCTAGCTAGCATTTGATAATGTGCCGATGATGGGGAAGTGATAGGCGTTAAAGTTGCATGTGTTGCTTTTTGTGATACTACTACGGTACGCTGGTCTATCCTTGTTAGACGTGCGGTTGGGTGCGTAAGAGCTGCCATTGGTTAAGCACCGATAACGTGGAGGACGAGCACTATCGGGAAGCATTGTCAGACATTCAGGATCATCACGGCGTGTTGAAGTTATCTTGGAAGCACAAAAGATGCCAAGCCTGGCTAGCGTCTTTGTGTGGGCGGGTAGCTTCGTACACTGCATCAACCCAAGACTTGCCTTCTCTGAGGTACGATTGGTGCCCATGCCCAGCTAGTGCTGGTCGTACAGGATCAGAGATAGGCATTAAGAGGTCCCTTTTTGCTATCCCCGGCCCAAGCACATACTACATTGCCCATGCCCAGCTAGCAATGATGTGCTTAGGTCAGCAGCGTCGCCTGTCCCTTGTTGCGCATCGTTTGGTGCATTCTGGGGGTTGTTTAAGCTTGTGGTTGCTTGCATGGCCTTGTGCCAAGTGGGTGGCTGTGAGTTTGATGATCTTCGGGGATGTCTACCCTAAAGGTGCATGAGTGGTGTTTGGTTTGTAACGGGTGGTTGGATGTCTGCTTGAGCAGCAACTTCCATGCGTTCTTACCTCTTCAGTTGTGTTACAAGGCGAATTTGCCTTGAACATTGTGGGTTCCTGTGTTGCATACCTAATTGATGGCATTATGCTGTTTCAACAAAGTTTGCTTTCGTTAAGCATCGCTTGCGGTGCCTACGAACCTTGAAGCTGTCTTTGTGGTCCATGTTGTCAATGCGGATGTGTCGATGGCATGGCCTATGAAGTGTTGCTTGGTCTCTTGGATATGGAAGCTGGTGTG
GGCACGTGGTCAGTCATGATCATGTATTTTGCCCTACATGAGCGTTTCGCTTCTCTAGACGACTGTCTACCTTGCATTGACTTGTTGGTGCAGGGTAGACTGAGTCGAAGAGGAATGCTACCTGGTTGATCCTGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAGTATGAACAAATTCAGACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTGTTTGATGGTATCTGCTACTCGGATAACCGTAGTAATTCTAGAGCTAATACGTGCAACAAACCCCCGACTTCTGGAAGGGATGCATTTATTAGATAAAAGGTCGACGCGGGCTCTGCCCGTTGCTGCGATGATTCATGATAACTCGACGGATCGCACGGCCCTCGTGCCGGCGACGCATCATTCAAATTTCTGCCCTATCAACTTTCGATGGTAGGATAGTGGCCTACTATGGTGGTGACGGGTGACGGAGAATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCTGACACGGGGAGGTAGTGACAATAAATAACAATACCGGGCTCTTTCGAGTCTGGTAATTGGAATGAGTACAATCTAAATCCCTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGACTTTGGGTTGGGTCGGCCGGTCCGCCTTCAGGTGTGCACCGGTTTACTCGTCCCTTCTGTCGGCGATGCGCTCCTGGCCTTAATTGGCCGGGTCGTGCCTCCGGCGCTGTTACTTTGAAGAAATTAGAGTGCTCAAAGCAAGCCTACGCTCTGTATACATTAGCATGGGATAACATCATAGGATTTCGGTCCTATTACGTTGGCCTTCGGGATCGGAGTAATGATTAACAGGGACAGTCGGGGGCATTCGTATTTCATAGTCAGAGGTGAAATTCTTGGATTTATGAAAGACGAACAACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGGCTCGAAGACGATCAGATACCGTCCTAGTCTCAACCATAAACGATGCCGACCAGGGATCAGCGGATGTTGCTTTTAGGACTCCGCTGGCACCTTATGAGAAATCAAAGTTTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTAAGGATTGACAGACTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCTCAGCCTGCTAACTAGCTATGTGGAGGTATCCCTCCACGGCCAGCTTCTTAGAGGGACTATGGCCTTTTAGGCCACGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATGTATTCAACGAGTATATAGCCTTGGCCGACAGGCCCGGGAAATCTTTGAAATTTCATCGTGATGGGGATAGATCATTGCAATTGTTGGTCTTCAACGAGGAATTCCTAGTAAGCGCGAGTCATCAGCTCGCGTTGACTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTCCTACCGATTGAATGGTCCGGTGAAGTGTTAGGATCGCGGCGACGTGGGCGGTTCGCCGCCGGCGACGTCGCGAGAATTCCACTGAACCTTATCATTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTCGAACCCTGCAAGGCAGAACGACCCGTGAACATGTAACCACAACGGGGTGACCGTGATAAGGGCCTCGG
TCCTTATCCCCTAACCCTTCCCGACGTGAGTTCGTGGTGTCTTTTTTGGGGCATCATGGATTCCGTTGGACCATAACAAAACCCCGGCACGGTATGTGCCAAGGAAAACAAAAATGAGAAGGACACTACCTGTTTCGCCCCGTTTGCGGTGTGCGTACAGGTCGTGGCCTCCTTGGAATCACAAACGACTCTCGGCAACGGATATCTCGGCTCACGCATCGATGAAGAACGTAGCAAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTTTTTGAACGCAAGTTGCGCCCGAAGCCATCCGGCTGAGGGCACGCCTGCCTGGGCGTCACGCATCGCGTCGCTCCCCACCATACCTCCCCAACGGGTTGGCATGGTGTTGGGGGCGGATAATGGCCTCCCGTGCTTGTGTTTCGGTTGGCCTAAATAAGAGTTCCCTTCGGCGGACACACGACTAGTGGTGGTTGAATAGACCCTCGTCTTTTGTTGCGTGTCGTGAGCTGTAAGGGTAGCCCTCATCAAAGACCCCATTGTATCGTCTTCGGATGATGCTTCGACCGCGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTTACAAGGATTCCCTTAGTAACGGCGAGCGAACCGGGATCAGCCCAGCTTGAAAATCGGGCGGCCTCGCTGTCCGAATTGTAGTCTGGAGAAGCGTCCTCAGCGGCGGACCGGGCCCAAGTCCCCTGGAAGGGGGCGCCAGAGAGGGTGAGAGCCCCGTCGTGCCCGGACCCTGTCGCACCACGAGGCGCTGTCTGCGAGTCGGGTTGTTTGGGAATGCAGCCCCAATAGGGCGGTAAATTCCGTCCAAGGCTAAATACCGGCGTGAGACCGATAGCAAACAAGTACCGCGAGGGAAAGATGAAAAGGACTTTGAAAAGAGAGTCAAAGAGTGCTTGAAATTGTCGGGAGGGAAGCGAATGGGGGCCGGCGATGCGTCCCGGTCGGATGTGGAACGGGCGTAAGCCGGTCTGCCGATCGACTCGGGGCGTGGACCGGTGCGGATTGGTGCGGCGGCCAAAGCCCGGACTGTTGATAGGCCCGTGGAGATGCCGTCGCGTCGATCGTGGTTGGCAGCGCGCGCCGTCACGGCGTGCCTCGGCACCTGCGCGCTCCCGGCACCGGCCTGCGGGCACCCCATTCGGCCCGTCTTGAAACACGGACCAAGGAGTCTGACATGTGTGCGAGTCAACGGGTGAGTAAACCCGCAAGGCGTAAGGAAGCTGATTGGCGGGATCCCCCTAGCGGGGTGCACCGCCGACCGACCTTGATCTTCTGAGAAGGGTTCGAGTGTGAGCATGCCTGTCGGGACCCGAAAGATGGTGAACTATGCCTGAGCGGGGCGAAGCCAGAGGAAACTCTGGTGGAGGCCCGCAGCGATACTGACGTGCAAATCGTTCGTCTGACTTGGGTATAGGGGCGAAAGACTAATCGAACCGTCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGAGCCCGGGTGCGAGTTCTATCGGGTTAAAGCGAATGATTAGAGGCATCGGGGGCGCAACGCCCTCGACCTATTCTCAAACTTTAAATAGGTAGGACGGTGCGGCTGCTTTGTTGAGCCGTACCACGGAATCGAGAGCTCCAAGTGGGCCATTTTTGGTAAGCAGAACTGGCGATGCGGGATGAACCGGAAGCCGGGTTACGGTGCCAAAACTACGCGCTAACCTAGAACCCACAAAGGGTGTTGGTCGATTAAGACAGCAGGACGGTGGTCATGGAAGTCGAAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATCAACTAGCCCCGAAAATGGATGGCGCTTAAGCGCGTGACCTACACCCGGCCGTCGGGGCAAGTGCCAGGCCCCGATGAGTAGGGAGGGCGCGGCGGTCGCTGCAAAACCTTGGGCGTGAGCCTG
GGCGGAGCGGCCGTCGGTGCGGATCTTGGTGGTAGTAGCAAATATTCAAATGAGAACTTTGAAGGCCGAAGAGGGGAAAGGTTCCATGTGAACGGCACTTGCACATGGGTTAGTCGATCCTAAGAGACGGGGGAAGCCCGTCAGATAGCGCGTTTCGCGCGAGCTTCGAAAGGGAATCGGGTTAAAATTCCTGAACCGGGACGTGGCGGCTGACGGCAACGTTAGGGAGTCCGGAGACGTCGGCGGGGGCCTCGGGAAGAGTTATCTTTTCTGTTTAACAGCCTGCCCACCCTGGAAACGACTCAGTCGGAGGTAGGGTCCAGCGGCTGGAAGAGCACCGCACGTCGCGCGGTGTCCGGTGCGCCCCCGGCGGCCCTTGAAAATCCGGAGGACCGAGTGCCTCCCACGCCCGGTCGTACTCATAACCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGATGGAACAATGTAGGCAAGGGAAGTCGGCAAAATGGATCCGTAACCTCGGGAAAAGGATTGGCTCTGAGGGCTGGGCACGGGGGTCCCAGTCCCGAACCCGTCGGCTGTTGGCGGACTGCTCGAGCTGCTTCCGCGGCGGAGAGCGGGTCGCTGCGTGCCGGCCGGGGGGACGGACTGGGAACGGCTCCTTCGGGGGCCTTCCCCGGGCGTCGAACAGCCAACTCAGAACTGGTACGGACAAGGGGAATCCGACTGTTTAATTAAAACAAAGCATTGCGATGGTCCCTGCGGATGCTAACGCAATGTGATTTCTGCCCAGTGCTCTGAATGTCAAAGTGAAGAAATTCAACAAGCGCGGGTAAACGGCGGGAGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTAACGAGATTCCCACTGTCCCTGTCTACTATCCAGCGAAACCACAGCCAAGGGAACGGGCTTGGCAGAATCAGCGGGGAAAGAAGACCCTGTTGAGCTTGACTCTAGTCCGACTTTGTGAAATGACTTGAGAGGTGTAGTATAAGTGGGAGCCTTCGGGCGAAAGTGAAATACCACTACTTTTAACGTTATTTTACTTATTCCGTGAATCGGAAGCGGGGCAATGCCCCTCTTTTTGGACCCAAGGCCTGCTTCGGCGGGCCGATCCGGGCGGAAGACATTGTCAGGTGGGGAGTTTGGCTGGGGCGGCACATCTGTTAAAAGATAACGCAGGTGTCCTAAGATGAGCTCAACGAGAACAGAAATCTCGTGTGGAACAGAAGGGTAAAAGCTCGTTTGATTCTGATTTTCCAGTACGAATACGAACCGTGAAAGCGTGGCCTAACGATCCTTTAGACCTTCGGAATTTGAAGCTAGAGGTGTCAGAAAAGTTACCACAGGGATAACTGGCTTTGTGGCAGCCAAGCGTTCATAGCGACGTTGCTTTTTGATCCTTCGATGTCGGCTCTTCCTATCATTGTGAAGCAGAATTCACCAAGTGTTGGATTGTTCACCCACCAATAGGGAACGTGAGCTGGGTTTAGACCGTCGTGAGACAGGTTAGTTTTACCCTACTGATGACAGTGTCGCAATAGTAATTCAACCTAGTACGAGAGGAACCGTTGATTCGCACAATTGGTCATCGCGCTTGGTTGAAAAGCCAGTGGCGCGAAGCTACCGTGCGCTGGATTATGACTGAACGCCTCTAAGTCAGAATCCGGGCTAGAAGCGACGCGTGTGCCCGCCGCCTGTTTGCCGACCAGCAGTAGGGGCCTCGGCCCCCAAAGGCACGTGTCGTTGGCTAAGCCTGTGCGACGGATGAGTCGTGCAGGCCGCCATGAAGTATAATTCCCATCAAGCGGCGGGGTAGAATCCTTTGCAGACGACTTAAATACGCGACGGGGTATTGTAAGTGGCAGAGTGGCCTTGCTGCCACGATCCACTGAGATTCAGCCCTGCGTCGCTCAGATTCGTCCCTCCCCCCCAAAACAAGCCCCCTCAT
TTTTCCTTCCATGCATACGGACGAGAGGCTGGCTCCCCGACACTTGGTAAAATTTCAGACATTTTGTGACTTGGCGAAAAAAAAGTTCCAAGTCAACCTAAAAAGTTGCCCTTGTCGTATAATGAGTGATGATAGGCCATGGGGGACTACCACCACTTGGTGCCCAGAAGCATATAATGAGTGGACAAGGCATGGTGTTGGGAATTATGCATCCTCGGGGAAATCAGTGTCTGTTCCTGTCACCAAGCTCTTTATGCAATATGTATATATAGGGGGTACATGGGGACTAATACTACCCTTGGTGCCCGACGAGTGTGTTGGGAAACCTAAGCAAGCGAGGCTGGCAAGGCAGGCTACCCAAGGGAACAAGGCACCTGGCCACACACATGCCCATGAACGGCCAAGGGAACAAGGCACCTTGCCACACACACGCCCATGAACTCCCAAGGGAACGAGGCCCCTTGCCACACACATGCCCATCCATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACATGCCCATCAATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACTTGCCCATCCATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACTTGCCCATGAACGGCCAAGGAAACAAGGCACCTTGCCACACAGCATGCCCATGAACGACCAAGGGACCAAGGCACCTTGCCACACACATGCCCATCCATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACTTGCCCATGAACGGCCAAGGAAGCTAGGCCCCTTGCCACACAGCATGCCCATGAACGACCAAGGGAACGAGGCACCTTGCCACACACATGCCCATCCATGAACGACCAAGGGAAGAATGCATGTGAGGCTGGCAAGGCTAGGTCATGGCAAGCCACACAGTCAGGATACCTATGGGAACAAGGCATCTTGCCACACACATGCCCATGAACGACCAAGGGAACAAGGCACCTTGCCACACACATGCCCATGAACGACCAAGGAAACGAGGCACCTTGCCACACAGCATGCCCAATGAACGACCAAGGGAACGAGGCCCCTTGCCACACACATGCCCATCCATGAACGACCAAGGGAACGAGGCCCCTTGCCACACACAGCCCCATGAATGACCAAGGAAGCTAGGCCCCTTGCCACACACATGCCCATCCATGAACGACCAATGGAACGAGGCCCCTTGCCACACAACATGCCCATGAACGACCAAGGAAGCTAGGCCCCTTGCCACACACATGGCCATCCATGAACGACCAAGGGAACATGGCTACTTCCCACACACAGCCCCATGAACGGCCAAGGGAACAGGACTTCTTCCCAAACAGACACCCATGAACGACCAAGTGAACAAGGCTTGCTCCCACACACAGCCCCATGAACGACTAAGGGAACAAGCCTACTTCCTACACAAACACTCATGAACGACCATGAAAACGAGGCATCTTGCCACACACATGCCCTTGAACGAACCAATCCCATCCTCATCGTTCTAAGGAACAAGGGGCCTTGACAAACCATGAGCAGATTTCTAAGCAAGTCAAACGATGAAGGGAGCAAAGTGTTGTAACCGAATCCGTTTAAGTGTTGTAGTTCCTACTTACTACATAGCCACCAGGCGAACTGCTCGTGCTCTTTCGGGTTCTTTCGTTTTGCGACTAATGAATAGCTAGAAGCCTTATCTGCCTACCTATCAGTAGGTGTAGGCGACAGAGACAAAAACCCCGAACACGATATATTTCAATTTCAAGGTCTATGTTCACAATGACCCACGCAAAGTTTCAATGACATTCCAACTTGATTTGAGCTGCATATCAACAAGTAGGGAACCGAACGCTTCAAGGCATGGACAGGGATAGGTCATGGACATTTATGCCCCAAACATGACATTTTTTTTATTTCAACGCCTTTCATTTATAATAAAAAGCATCCCTACAAAATTTCGTGGGAATCCGAATTAATTCGAGCGAGTTATGGACGA

 



Wednesday, December 22, 2021

sum of numbers in column

paste -sd+ numbers_in_column | bc
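
An awk variant of the same sum, as a minimal sketch. The sample file content and the temporary directory are made up for illustration. Unlike the paste/bc form, a blank line is simply counted as 0 here rather than leaving an empty operand in the expression passed to bc.

```shell
# Hypothetical sample input: one number per line, including a blank line.
tmpdir=$(mktemp -d)
printf '10\n20\n\n30.5\n' > "$tmpdir/numbers_in_column"

# Add up column 1; the blank line contributes 0 to the sum.
total=$(awk '{ sum += $1 } END { print sum }' "$tmpdir/numbers_in_column")
echo "$total"
```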

 

Tuesday, August 10, 2021

MUMmer on large genomes memo

nucmer -l 1000 -g 1000 --maxmatch --nosimplify --prefix=test1 GenomeX.fa GenomeY.fa

nucmer -l 1000 -g 1000 --prefix=test2 GenomeX.fa GenomeY.fa
 
nucmer -l  300 -g 1000 --prefix=test2 GenomeX.fa GenomeY.fa

-l Minimum length of a maximal exact match (default 20)
-g Maximum gap between two adjacent matches in a cluster (default 90)
--maxmatch Use all anchor matches regardless of their uniqueness
--[no]simplify Simplify alignments by removing shadowed clusters. Turn this option off if aligning a sequence to itself to look for repeats (default --simplify)

Although MUMmer was not specifically designed to identify repeats, it does have a few methods of identifying exact and exact tandem repeats. In addition to these methods, the nucmer alignment script can be used to align a sequence (or set of sequences) to itself. By ignoring all of the hits that have the same coordinates in both inputs, one can generate a list of inexact repeats. When using this method of repeat detection, be sure to set the --maxmatch and --nosimplify options to ensure the correct results.
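
The "ignore hits with the same coordinates in both inputs" step can be sketched as an awk filter over a coordinate table. This is only an illustration: the rows below are invented, the file names are hypothetical, and the column layout (S1 E1 S2 E2 LEN1 LEN2 %IDY REF QRY) is an assumption about what `show-coords -T -H` prints, so check it against your own output first.

```shell
tmpdir=$(mktemp -d)
coords="$tmpdir/self.coords"

# Hypothetical table, assumed to come from something like:
#   nucmer --maxmatch --nosimplify --prefix=self Genome.fa Genome.fa
#   show-coords -T -H self.delta > self.coords
# Assumed columns: S1 E1 S2 E2 LEN1 LEN2 %IDY REF QRY
printf '1\t5000\t1\t5000\t5000\t5000\t100.00\tchr1\tchr1\n'     >  "$coords"
printf '100\t900\t3100\t3900\t801\t801\t97.50\tchr1\tchr1\n'    >> "$coords"
printf '3100\t3900\t100\t900\t801\t801\t97.50\tchr1\tchr1\n'    >> "$coords"

# Drop the trivial self-hit: same sequence name and same start/end in both inputs.
repeats=$(awk -F'\t' '!($1 == $3 && $2 == $4 && $8 == $9)' "$coords")
echo "$repeats"
```

Each remaining repeat pair shows up twice in this sample (A vs B and B vs A), so a further de-duplication step may be wanted.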

mummerplot test2.delta -t png -p test2.plot.D    # first try with default layout

mummerplot -l test2.delta -t png -p test2.plot.L    # second try with all hits on main diagonal

mummerplot "test2.delta" --filter --png --large --prefix "test2-Plot" --title "test2"

To get an SVG plot, edit the out.gp file: find and change the following lines:

set terminal svg size 1600,1600 font "Helvetica-Bold,24 bold"
set terminal svg size 1200,1200 font "Helvetica-Bold,16 bold"
 
set output "test2.X.svg"
.....
set style line 1  lt 1 lw 3 pt 6 ps 0.5
set style line 2  lt 3 lw 3 pt 6 ps 0.5
set style line 3  lt 2 lw 3 pt 6 ps 0.5
set grid layerdefault linewidth 6

set style line 1  lt 1 lw 3 pt 6 ps 0.3
set style line 2  lt 3 lw 3 pt 6 ps 0.3
set style line 3  lt 2 lw 3 pt 6 ps 0.3
set grid layerdefault linewidth 3


and then re-run gnuplot
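
The edit-and-re-run step could be scripted roughly as follows. The sample out.gp content is hypothetical (a real file written by mummerplot contains more lines), and out.svg.gp is a name made up for this sketch; the svg terminal and output lines are the ones from the memo above.

```shell
tmpdir=$(mktemp -d)
gp="$tmpdir/out.gp"
svg_gp="$tmpdir/out.svg.gp"

# Hypothetical stand-in for the gnuplot script produced by mummerplot.
printf 'set terminal png tiny size 800,800\nset output "out.png"\nset grid\n' > "$gp"

# Swap the terminal and output lines, writing an edited copy next to the original.
sed -e 's/^set terminal .*/set terminal svg size 1200,1200 font "Helvetica-Bold,16 bold"/' \
    -e 's/^set output .*/set output "test2.X.svg"/' \
    "$gp" > "$svg_gp"

cat "$svg_gp"
# Then re-run gnuplot on the edited copy:
#   gnuplot out.svg.gp
```

The line style and grid settings can be changed the same way with additional -e expressions.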

http://mummer.sourceforge.net/manual/


Monday, May 17, 2021

BLAST-N Plus run parameters and tab-delimited output



==================================================

makeblastdb -in Genome_DNA.Fasta -out Genome_DNA.bdb -dbtype nucl -input_type fasta -max_file_sz 2GB -hash_index -parse_seqids 

makeblastdb \
    -in Genome_DNA.Fasta \
    -out Genome_DNA.bdb \
    -dbtype nucl \
    -input_type fasta \
    -max_file_sz 2GB \
    -hash_index \
    -parse_seqids

==================================================

blastn -task blastn -query Query_Seqs.Fasta -db Genome_DNA.bdb -out Query_Seqs.vs.Genome_DNA.blastplus.out.m0 -outfmt 0 -evalue 1e-120 -dbsize 1000000 -dust no -word_size 24 -xdrop_ungap 50 -xdrop_gap 500 -xdrop_gap_final 1000 -max_target_seqs 240 -line_length 100 -num_threads 6

blastn \
    -task blastn \
    -query Query_Seqs.Fasta \
    -db Genome_DNA.bdb \
    -out Query_Seqs.vs.Genome_DNA.blastplus.out.m0 \
    -outfmt 0 \
    -evalue 1e-120 \
    -dbsize 1000000 \
    -dust no \
    -word_size 24 \
    -xdrop_ungap 50 \
    -xdrop_gap 500 \
    -xdrop_gap_final 1000 \
    -max_target_seqs 240 \
    -line_length 100 \
    -num_threads 6

==================================================

blastn -task blastn -query Query_Seqs.Fasta -db Genome_DNA.bdb -out Query_Seqs.vs.Genome_DNA.blastplus.out.m7 -outfmt '7 qseqid sseqid evalue pident score length nident mismatch gaps frames qstart qend sstart send qcovhsp qlen slen' -evalue 1e-120 -dbsize 1000000 -dust no -word_size 24 -xdrop_ungap 50 -xdrop_gap 500 -xdrop_gap_final 1000 -max_target_seqs 240 -num_threads 6 

blastn \
    -task blastn \
    -query Query_Seqs.Fasta \
    -db Genome_DNA.bdb \
    -out Query_Seqs.vs.Genome_DNA.blastplus.out.m7 \
    -outfmt '7 qseqid sseqid evalue pident score length nident mismatch gaps frames qstart qend sstart send qcovhsp qlen slen' \
    -evalue 1e-120 \
    -dbsize 1000000 \
    -dust no \
    -word_size 24 \
    -xdrop_ungap 50 \
    -xdrop_gap 500 \
    -xdrop_gap_final 1000 \
    -max_target_seqs 240 \
    -num_threads 6   

# Fields: 

query id - 1 [0] (qseqid)
subject id - 2 [1] (sseqid)
evalue - 3 [2] (evalue)
% identity - 4 [3] (pident)
score - 5 [4] (score)
alignment length - 6 [5] (length)
identical - 7 [6] (nident)
mismatches - 8 [7] (mismatch)
gaps - 9 [8] (gaps)
query/sbjct frames - 10 [9] (frames)
q. start - 11 [10] (qstart)
q. end - 12 [11] (qend)
s. start - 13 [12] (sstart)
s. end - 14 [13] (send)
% query coverage per hsp - 15 [14] (qcovhsp)
query length - 16 [15] (qlen)
subject length - 17 [16] (slen)
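
With the field positions listed above, the tab-delimited output can be filtered by column number. A minimal sketch: the rows and file name are invented for illustration, and the thresholds (95% identity, 80% query coverage per HSP) are arbitrary examples, not recommended values.

```shell
tmpdir=$(mktemp -d)
hits="$tmpdir/hits.m7"

# Hypothetical rows in the 17-column layout; spaces are converted to tabs.
tr ' ' '\t' > "$hits" <<'EOF'
# hypothetical_comment_line
q1 s1 0.0 99.50 1200 1210 1204 4 2 1/1 1 1210 5001 6208 95 1270 250000
q1 s2 1e-130 88.20 400 520 459 50 11 1/-1 300 815 9000 8485 41 1270 180000
q2 s1 0.0 96.00 900 1000 960 30 10 1/1 1 995 100 1094 100 995 250000
EOF

# Skip comment lines; keep hits with pident (col 4) >= 95 and qcovhsp (col 15) >= 80.
good=$(awk -F'\t' -v OFS='\t' \
    '!/^#/ && $4 >= 95 && $15 >= 80 { print $1, $2, $4, $15 }' "$hits")
echo "$good"
```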

==================================================


Tuesday, May 4, 2021

BLAST-N of low quality long sequences

blastall -p blastn -V T -F F -e 1e-20 -y 50 -X 75 -Z 500 -b 240 -v 240 -d DATABASE -i INPUT -o OUTPUT

Where:

-y X  dropoff value for ungapped extensions in bits (0.0 invokes default behavior)

      blastn 20, megablast 10, all others 7 [Real]


-X X  dropoff value for gapped alignment (in bits) (zero invokes default behavior)

      blastn 30, megablast 20, tblastx 0, all others 15 [Integer]


-Z X  dropoff value for final gapped alignment in bits (0.0 invokes default behavior)

      blastn/megablast 100, tblastx 0, all others 25 [Integer]


Wednesday, April 25, 2012

batch extraction of Pfam HMM domains

Batch run for Pfam HMM domains: one model per search against a large FASTA file of protein sequences:

Pfam-A.hmm - file with Pfam HMM models
Pfam-A.names.IDs - file with Pfam model names


head Pfam-A.names.IDs
1-cysPrx_C
120_Rick_ant
14-3-3
2-Hacid_dh
2-Hacid_dh_C
2-oxoacid_dh
2-ph_phosp
2CSK_N
2C_adapt
2Fe-2S_Ferredox
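
The memo does not say how Pfam-A.names.IDs was made. One possible way, assuming the HMMER3 text format in which every model carries a NAME line, is to pull the second field of those lines. The two-model sample file below is a hypothetical stand-in for Pfam-A.hmm, trimmed to the lines that matter here.

```shell
tmpdir=$(mktemp -d)
hmm="$tmpdir/Pfam-A.sample.hmm"

# Hypothetical, heavily trimmed stand-in for Pfam-A.hmm.
cat > "$hmm" <<'EOF'
HMMER3/f
NAME  1-cysPrx_C
//
HMMER3/f
NAME  120_Rick_ant
//
EOF

# Print the model name from every NAME line, one per line.
names=$(awk '$1 == "NAME" { print $2 }' "$hmm")
echo "$names"
```

On the real file the output would be redirected, for example to Pfam-A.names.IDs.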


mkdir _hmm_files_

while read -r line; do hmmfetch Pfam-A.hmm "$line" > "_hmm_files_/$line.hmm"; echo "$line"; done < Pfam-A.names.IDs


mkdir _hmm_out_e20_

for hmm_path in _hmm_files_/*.hmm; do hmm_name=$(basename "$hmm_path"); hmmsearch -E 1e-20 "$hmm_path" Lsat_CDS_BGI_V4_Prot.aa > "_hmm_out_e20_/$hmm_name.vs.Lsat_CDS_BGI_V4_Prot.e20"; done &


ls -l _hmm_files_ | head
-rw-r--r--+ 1 akozik akozik  118664 Apr 25 12:37 120_Rick_ant.hmm
-rw-r--r--+ 1 akozik akozik  109882 Apr 25 12:37 14-3-3.hmm
-rw-r--r--+ 1 akozik akozik   19555 Apr 25 12:37 1-cysPrx_C.hmm
-rw-r--r--+ 1 akozik akozik   71642 Apr 25 12:37 2_5_RNA_ligase2.hmm
-rw-r--r--+ 1 akozik akozik   18154 Apr 25 12:37 2C_adapt.hmm
-rw-r--r--+ 1 akozik akozik   68415 Apr 25 12:37 2CSK_N.hmm
-rw-r--r--+ 1 akozik akozik   16791 Apr 25 12:37 2Fe-2S_Ferredox.hmm
-rw-r--r--+ 1 akozik akozik   83202 Apr 25 12:37 2-Hacid_dh_C.hmm
-rw-r--r--+ 1 akozik akozik   62452 Apr 25 12:37 2-Hacid_dh.hmm

ls -l _hmm_out_e20_ | head
-rw-r--r--+ 1 akozik akozik    1870 Apr 25 13:29 120_Rick_ant.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20
-rw-r--r--+ 1 akozik akozik   32467 Apr 25 13:29 14-3-3.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20
-rw-r--r--+ 1 akozik akozik    1879 Apr 25 13:29 1-cysPrx_C.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20
-rw-r--r--+ 1 akozik akozik    1879 Apr 25 13:29 2_5_RNA_ligase2.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20
-rw-r--r--+ 1 akozik akozik    1850 Apr 25 13:29 2C_adapt.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20
-rw-r--r--+ 1 akozik akozik    1871 Apr 25 13:29 2CSK_N.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20
-rw-r--r--+ 1 akozik akozik    1881 Apr 25 13:29 2Fe-2S_Ferredox.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20
-rw-r--r--+ 1 akozik akozik   30616 Apr 25 13:29 2-Hacid_dh_C.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20
-rw-r--r--+ 1 akozik akozik   25730 Apr 25 13:29 2-Hacid_dh.hmm.vs.Lsat_CDS_BGI_V4_Prot.e20

Wednesday, October 26, 2011

rsync memo

rsync -avz username@remote_hostname:/path/to/data/*.fastq ./