Wednesday, September 18, 2013

knitR and LaTeX for bioinformatics

knitR is a great way of dynamically generating reports, and ensuring that they're always based on up-to-date data. Having the reporting closely tied to the data and code that generated it is one of the tenets of reproducible research. In bioinformatics, one often finds oneself repeatedly generating figures, graphs, etc, for multiple labels (eg. probes, genes, etc). knitR is a great fit for this.

I had some difficulty finding the best way of iterating and generating multiple reports, and the code below is the best way I have found so far.

The following example code iterates over each label in a data.frame, and generates a separate .pdf for each. In a real scenario, each label could be a gene, and we could pull out whatever data/tables/figures we wanted.

Note that you may need to install some R libraries for this code to work:

  • ggplot2
  • xtable
  • knitr
Below is example code to do this. A control R-script. Adjust the path in the knit2pdf command, put this in a file, make it executable, and run it:
#!/usr/bin/Rscript
setwd('~')
library(knitr)
# note that, because the seq only has a single string, R automatically
# replicates is across every row in the dataframe (traps for the unwary)
data = data.frame(labs=c(1,1,2,2,3,3,4,4,5,5),dat=matrix(rnorm(20), nrow=10), seq=c(
"AAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCC")
)
for (i in 1:5){
these_dat <- data[data$labs==i,]
knit2pdf("~/code/one-off_scripts/knitR_demo.Rnw", output=paste('report_', i, '.tex', sep=""))
}
view raw gistfile1.r hosted with ❤ by GitHub
The actual knitR file is below. This is essentially a LaTeX file with blobs (called "chunks" in knitR) of R in it that are evaluated and their output included in the final document, as desired. I saw the xtable technique on this blog.
\documentclass{article}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{geometry}
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
\setcounter{secnumdepth}{2}
\setcounter{tocdepth}{2}
\usepackage{url}
\usepackage{booktabs}
\usepackage[unicode=true,pdfusetitle,
bookmarks=true,bookmarksnumbered=true,bookmarksopen=true,bookmarksopenlevel=2,
breaklinks=false,pdfborder={0 0 1},backref=false,colorlinks=false]
{hyperref}
\hypersetup{
pdfstartview={XYZ null null 1}}
\usepackage{breakurl}
\begin{document}
<<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
library(knitr)
library(reshape)
library(ggplot2)
library(stats)
library(xtable)
# set global chunk options
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(replace.assign=TRUE,width=90)
@
\title{Demo report: \Sexpr{as.character(i)}}
\author{Angus Wallace}
\maketitle
A basic report to demo knitR
Text can go here. This is standard \LaTeX. Below this, we will show a bit of data about each sub-component of the data. We can list the identifier like this: \Sexpr{as.character(i)} (R code within the curly brackets will be evaluated and the output printed inline).
% below is a block of R code that will be evaluated, and its output included in the document
<<print_data, results='asis', fig.width=5, fig.height=4, echo=FALSE, cache=FALSE>>=
# display a table for this label
strCaption <- paste0("all the data for this label")
print(xtable(these_dat[,1:3], digits=2, caption=strCaption, label="Test_table"),
size="footnotesize", #Change size; useful for bigger tables
include.rownames=FALSE, #Don't print rownames
include.colnames=FALSE, #We create them ourselves
caption.placement="top",
hline.after=NULL, #We don't need hline; we use booktabs
add.to.row = list(pos = list(-1,
nrow(these_dat[,1:3])),
command = c(paste("\\toprule \n",
"Label & Data1(\\%) & Data2 \\\\\n",
"\\midrule \n"),
"\\bottomrule \n")
))
seq<- gsub('(.{90})', '\\1\\\n\\2',as.character(these_dat[1,4]))
@
<<prepare_figure, echo=FALSE, cache=FALSE>>=
#invisible ensures that nothing unwanted goes to the document
invisible(p.tmp <- ggplot(data[,1:3], aes(x=dat.1,y=dat.2)) + geom_point() + labs(title = expression("microarray data, this instance shown in red")) + geom_point(data=these_dat[,1:3], aes(x=dat.1,y=dat.2, colour='red', size=3)))
@
<<disp_figure, results='asis', fig.width=5, fig.height=4, echo=FALSE, cache=FALSE>>=
invisible(print(p.tmp))
@
This label has the sequence as follows:\newline
\begin{verbatim}
> ParentID \Sexpr{i} length \Sexpr{nchar(seq)}
\Sexpr{seq}
\end{verbatim}
\end{document}
view raw gistfile1.txt hosted with ❤ by GitHub
Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.