(bio)-informatics, data processing and visualization

Wednesday, December 22, 2021

sum of numbers in column

cat numbers_in_column | paste -sd+ - | bc

 

Tuesday, August 10, 2021

MUMmer on large genomes memo

nucmer -l 1000 -g 1000 --maxmatch --nosimplify --prefix=test1 GenomeX.fa GenomeY.fa

nucmer -l 1000 -g 1000 --prefix=test2 GenomeX.fa GenomeY.fa
 
nucmer -l  300 -g 1000 --prefix=test2 GenomeX.fa GenomeY.fa

-l Minimum length of an maximal exact match (default 20)
-g Maximum gap between two adjacent matches in a cluster (default 90)
--maxmatch Use all anchor matches regardless of their uniqueness
--[no]simplify Simplify alignments by removing shadowed clusters. Turn this option off if aligning a sequence to itself to look for repeats (default --simplify)

Although MUMmer was not specifically designed to identify repeats, it does has a few methods of identifying exact and exact tandem repeats. In addition to these methods, the nucmer alignment script can be used to align a sequence (or set of sequences) to itself. By ignoring all of the hits that have the same coordinates in both inputs, one can generate a list of inexact repeats. When using this method of repeat detection, be sure to set the --maxmatch and --nosimplify options to ensure the correct results.

mummerplot test2.delta -t png -p test2.plot.D (first try with default layout)

mummerplot -l test2.delta -t png -p test2.plot.L (second try with all hits on main diagonal)

mummerplot "test2.delta" --filter --png --large --prefix "test2-Plot" --title "test2"

To get SVG plot you have to edit out.gp file, find and change following lines:

set terminal svg size 1600,1600 font "Helvetica-Bold,24 bold"
set terminal svg size 1200,1200 font "Helvetica-Bold,16 bold"
 
set output "test2.X.svg"
.....
set style line 1  lt 1 lw 3 pt 6 ps 0.5
set style line 2  lt 3 lw 3 pt 6 ps 0.5
set style line 3  lt 2 lw 3 pt 6 ps 0.5
set grid layerdefault linewidth 6

set style line 1  lt 1 lw 3 pt 6 ps 0.3
set style line 2  lt 3 lw 3 pt 6 ps 0.3
set style line 3  lt 2 lw 3 pt 6 ps 0.3
set grid layerdefault linewidth 3


and then re-run gnuplot

http://mummer.sourceforge.net/manual/


Monday, May 17, 2021

BLAST-N Plus run parameters and tab-delimited output



==================================================

makeblastdb -in Genome_DNA.Fasta -out Genome_DNA.bdb -dbtype nucl -input_type fasta -max_file_sz 2GB -hash_index -parse_seqids 

makeblastdb \
    -in Genome_DNA.Fasta \
    -out Genome_DNA.bdb \
    -dbtype nucl \
    -input_type fasta \
    -max_file_sz 2GB \
    -hash_index \
    -parse_seqids

==================================================

blastn -task blastn -query Query_Seqs.Fasta -db Genome_DNA.bdb -out Query_Seqs.vs.Genome_DNA.blastplus.out.m0 -outfmt 0 -evalue 1e-120 -dbsize 1000000 -dust no -word_size 24 -xdrop_ungap 50 -xdrop_gap 500 -xdrop_gap_final 1000 -max_target_seqs 240 -line_length 100 -num_threads 6

blastn \
    -task blastn \
    -query Query_Seqs.Fasta \
    -db Genome_DNA.bdb \
    -out Query_Seqs.vs.Genome_DNA.blastplus.out.m0 \
    -outfmt 0 \
    -evalue 1e-120 \
    -dbsize 1000000 \
    -dust no \
    -word_size 24 \
    -xdrop_ungap 50 \
    -xdrop_gap 500 \
    -xdrop_gap_final 1000 \
    -max_target_seqs 240 \
    -line_length 100 \
    -num_threads 6

==================================================

blastn -task blastn -query Query_Seqs.Fasta -db Genome_DNA.bdb -out Query_Seqs.vs.Genome_DNA.blastplus.out.m7 -outfmt '7 qseqid sseqid evalue pident score length nident mismatch gaps frames qstart qend sstart send qcovhsp qlen slen' -evalue 1e-120 -dbsize 1000000 -dust no -word_size 24 -xdrop_ungap 50 -xdrop_gap 500 -xdrop_gap_final 1000 -max_target_seqs 240 -num_threads 6 

blastn \
    -task blastn \
    -query Query_Seqs.Fasta \
    -db Genome_DNA.bdb \
    -out Query_Seqs.vs.Genome_DNA.blastplus.out.m7 \
    -outfmt '7 qseqid sseqid evalue pident score length nident mismatch gaps frames qstart qend sstart send qcovhsp qlen slen' \
    -evalue 1e-120 \
    -dbsize 1000000 \
    -dust no \
    -word_size 24 \
    -xdrop_ungap 50 \
    -xdrop_gap 500 \
    -xdrop_gap_final 1000 \
    -max_target_seqs 240 \
    -num_threads 6   

# Fields: 

query id - 1 [0] (qseqid)
subject id - 2 [1] (sseqid)
evalue - 3 [2] (evalue)
% identity - 4 [3] (pident)
score - 5 [4] (score)
alignment length - 6 [5] (length)
identical - 7 [6] (nident)
mismatches - 8 [7] (mismatch)
gaps - 9 [8] (gaps)
query/sbjct frames - 10 [9] (frames)
q. start - 11 [10] (qstart)
q. end - 12 [11] (qend)
s. start - 13 [12] (sstart)
s. end - 14 [13] (send)
% query coverage per hsp - 15 [14] (qcovhsp)
query length - 16 [15] (qlen)
subject length - 17 [16] (slen)

==================================================


Tuesday, May 4, 2021

BLAST-N of low quality long sequences

blastall -p blastn -V T -F F -e 1e-20 -y 50 -X 75 -Z 500 -b 240 -v 240 -d DATABASE -i INPUT -o OUTPUT

Where:

-y X  dropoff value for ungapped extensions in bits (0.0 invokes default behavior)

      blastn 20, megablast 10, all others 7 [Real]


-X X  dropoff value for gapped alignment (in bits) (zero invokes default behavior)

      blastn 30, megablast 20, tblastx 0, all others 15 [Integer]


-Z X  dropoff value for final gapped alignment in bits (0.0 invokes default behavior)

      blastn/megablast 100, tblastx 0, all others 25 [Integer]