Wednesday, September 18, 2013

knitR and LaTeX for bioinformatics

knitR is a great way of dynamically generating reports, and of ensuring that they're always based on up-to-date data. Having the reporting closely tied to the data and code that generated it is one of the tenets of reproducible research. In bioinformatics, one often finds oneself repeatedly generating figures, graphs, etc., for multiple labels (e.g. probes or genes). knitR is a great fit for this.

I had some difficulty finding the best way of iterating and generating multiple reports, and the code below is the best way I have found so far.

The following example code iterates over each label in a data.frame, and generates a separate .pdf for each. In a real scenario, each label could be a gene, and we could pull out whatever data/tables/figures we wanted.

Note that you may need to install some R libraries for this code to work:

  • ggplot2
  • xtable
  • knitr
Below is example code to do this.

First, a control R-script: adjust the path in the knit2pdf command, put it in a file, make it executable, and run it.

The actual knitR file is essentially a LaTeX file with blobs of R in it (called "chunks" in knitR) that are evaluated, with their output included in the final document as desired. I saw the xtable technique on this blog.
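As a minimal sketch of what such a knitR file can look like (the file name report.Rnw, the data.frame df, and its columns label, x and y are illustrative assumptions, not the original code):

```latex
% report.Rnw -- illustrative sketch, not the original file.
% Assumes a data.frame `df` and a variable `current_label` exist in the
% environment knitR is run from.
\documentclass{article}
\begin{document}

<<setup, include=FALSE>>=
library(ggplot2)
library(xtable)
sub <- df[df$label == current_label, ]
@

\section*{Report for \Sexpr{current_label}}

<<scatter, echo=FALSE, fig.height=4>>=
ggplot(sub, aes(x = x, y = y)) + geom_point()
@

% results='asis' lets xtable emit raw LaTeX for the table
<<summary-table, echo=FALSE, results='asis'>>=
print(xtable(head(sub)), comment = FALSE)
@

\end{document}
```

A control script along the lines of `library(knitr); for (lbl in unique(df$label)) { current_label <- lbl; knit2pdf('report.Rnw', output = paste0(lbl, '.tex')) }` would then render one .pdf per label (knit2pdf writes the .tex and compiles it with pdflatex).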

Wednesday, August 28, 2013

Blast, updated

I have found a way of parallelising blast across a fasta file and multiple databases at the same time. I often want to blast a .fasta file of sequences against a heap of databases, so this represents quite a speed increase.

# BLAST_PATH and blast_type (e.g. blastx) must already be set
DBs="/path/to/TriFLDB \
/path/to/harvardTC_Ta12"
in_fd=/path/to/input_file.fasta
out_fmt='\"6\"'

parallel --gnu "cat ${in_fd} | parallel --gnu --nice 19 --block 100k --recstart '>' --pipe ${BLAST_PATH}/${blast_type} -outfmt "\""${out_fmt}"\"" -query - -db {1} | awk '{n=split(\"{1}\",prnt, \"/\"); print \$0,prnt[n] }'" ::: ${DBs}

I have included an awk in the pipe within parallel that appends a column with the name of the database. You can remove that if you wish, but you will then not know which database a given hit came from.

This code works by using parallel to call an instance of parallel for each database to be analysed. The sub-parallel then parallelises over the fasta file, as per my previous post on this subject.

Thursday, June 27, 2013

I have been working with blast for aligning sequences. This is a fairly computationally intensive exercise that is well worth parallelising. GNU parallel is a great tool for this, but I found it rather unintuitive to use. I also didn't realise at first that my initial implementation was not actually parallelising blast at all: it was still using only one processor core (traps for the unwary!). I found an implementation that worked, though, and it's below:

cat $in_fd | parallel --gnu --max-procs ${max_cores} --block 100k --recstart '>' --pipe blastx -evalue 0.01 -outfmt 6 -db ${blast_DB_loc} -query - > output_file.dat

# ${in_fd} is a fasta-formatted input file
# ${blast_DB_loc} is the database to blast against
# ${max_cores} is the number of cores one wishes to use

Please note that parallelisation like this is only appropriate when the order of results does not matter and the analysis of one segment does not rely on the output of the analysis of another segment.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.