Wednesday, August 28, 2013

Blast, updated

I have found a way to parallelise BLAST across a FASTA file and multiple databases at the same time. I often want to BLAST a .fasta file of sequences against a heap of databases, so this gives quite a speed-up.

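# BLAST_PATH and blast_type (e.g. blastn) are assumed to be set already.
# Note: no space after '=' in a shell assignment.
DBs="/path/to/TriFLDB \
/path/to/harvardTC_Ta12"
in_fd=/path/to/input_file.fasta
out_fmt='\"6\"'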

parallel --gnu "cat ${in_fd} | parallel --gnu --nice 19 --block 100k --recstart '>' --pipe ${BLAST_PATH}/${blast_type} -outfmt "\""${out_fmt}"\"" -query - -db {1} | awk '{n=split(\"{1}\",prnt, \"/\"); print \$0,prnt[n] }'" ::: ${DBs} 

I have included an awk step in the pipeline run by the outer parallel; it appends a column with the name of the database. You can remove it if you wish, but you will then not know which database a given hit came from.
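To see what the awk step does on its own, here is a minimal sketch run against a single made-up hit line instead of real BLAST output. The hit line and the variable names (db, hit, labelled) are my own placeholders. Unlike the command above, which joins the database name to the line with awk's default separator (a single space), this variant sets OFS to a tab so the extra field matches the tab-delimited layout of -outfmt 6; treat that as an optional tweak.

```shell
# Hypothetical database path, same shape as the entries in DBs.
db="/path/to/TriFLDB"

# A made-up tab-separated line standing in for one BLAST -outfmt 6 hit.
hit=$(printf 'query1\tsubject1\t98.50')

# Split the path on "/" and append its last component as a new tab-separated field.
labelled=$(printf '%s\n' "$hit" |
  awk -v db="$db" 'BEGIN { OFS = "\t" } { n = split(db, parts, "/"); print $0, parts[n] }')

printf '%s\n' "$labelled"
```

Passing the path in with -v keeps the awk program in plain single quotes, which avoids the escaped double quotes needed when the path is substituted directly into the program text.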

This code works by using parallel to call an instance of parallel for each database to be analysed. The inner parallel then parallelises over the FASTA file, as per my previous post on this subject.
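The nesting can be tried without BLAST installed. The sketch below keeps the same outer/inner structure but swaps blast for grep -c, which simply counts header lines in each chunk; the database names dbA and dbB, the tiny FASTA file and the temporary directory are all invented for illustration, and it assumes GNU parallel is on the PATH.

```shell
# Hypothetical input: a three-record FASTA file in a temporary directory.
tmp=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$tmp/in.fasta"

result=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # Outer parallel: one job per database name, substituted for {1}.
  # Inner parallel: --pipe cuts stdin into blocks that begin at '>' and
  # hands each block to the stand-in command (grep -c seq).
  # awk then sums the per-block counts and labels them with the database name.
  result=$(parallel "cat $tmp/in.fasta | parallel --pipe --recstart '>' --block 100k grep -c seq | awk -v db={1} '{ s += \$1 } END { print db, s }'" ::: dbA dbB | sort)
  printf '%s\n' "$result"
else
  echo "GNU parallel not found; skipping" >&2
fi

rm -rf "$tmp"
```

Each output line pairs a database name with the number of records the inner jobs saw, so it is easy to check that every database received the whole file. Replacing grep -c seq with the real blast call and the awk summing step with the labelling step gives back the shape of the command above.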
Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.