Tuesday, October 2, 2012

fastplot

fastplot is a program for the rapid and highly-efficient creation of QQ and Manhattan plots of GWAS results. Because of its extremely low memory (RAM) requirements, it is suitable for visualising large GWAS data sets, such as those imputed using the 1,000 Genome Project reference sets, in approximately 10 minutes, while requiring less than 50 MB RAM for the calculation stages and approximately 800 MB RAM for the final figure rendering. This was done using the -macImputed option, which is the fastest way of processing 1000G data, if they are formatted correctly (ie. they're the output of minimac). If your data are formatted differently, contact me and I will add an optimised switch to suit your data.


fastplot creates both QQ and Manhattan plots from GWAS results. The data are all manipulated on disk on-the-fly: only a small subset of the data are RAM-resident at any time. Because of this, the memory requirements for calculation are very low (below 50 MB). The program requires access to the freely-available GNU toolset (bash, sort, awk, sed) and GNUplot (version 4.4 or higher allows better plots, but fastplot will work with version 4.0 as well). Rendering the figures generally requires more memory (13 million results requires approximately 800 MB), although this is still very modest and easily allocable on most computers produced in recent years. It is also fast: QQ and Manhattan plots are both created, and an estimate for lambda returned, within approximately 10 minutes.

Please note that fastplot is beta software. There may be bugs so I strongly suggest that for your first few uses you backup your data first!

Use:


The program requires that the user pass it two arguments:
-in=input_file.dat -refdat=reference_file.bim
input_file contains columns of results (see table 1) and reference_file position information in the standard plink bim file format (see table 2)
If a user's data differ from these formats, the operation of fastplot can be adjusted using the following switches
-in=filename (specify a filename of input data)
-cwd             (work in the current directory)
-workdir=dir          (specify a directory in which to work)
-pvalueCol=num (specify the column number of pvals in the datafile [default=11])
-SNPCol=n (specify the column number of SNP names in the datafile [default=2])
-refdat=path (specify the location of reference data)
-refChromoCol=n (specify the chromasome column in the reference datafile [default=1])
-refSNPCol=n (specify the SNP column in the reference datafile [default=2])
-refBPCol=n (specify the base-pair column in the reference datafile [default=4])
-nolabels (specify that the datafile has no column labels)
-macImputed (specify that the input data are produced by the minimac imputation program. A separate reference data file is not needed, but can be supplied for faster execution)

To use the program, please cut and paste the text (please ignore the "</plot></plot></http:>" at the end -- blogger insists on putting that there), below, into a text editor (vi, notepad, emacs, gedit, etc) and save it as fastplot somewhere in your path. You will need to make the file executable before you try and run it (eg. chmod +x /path/to/fastplot)

#!/bin/bash

# Angus Wallace, 2012
# wallace.angus@gmail.com

#   This program is free software: you can redistribute it and/or modify   #
#   it under the terms of the GNU General Public License as published by   #
#   the Free Software Foundation, either version 3 of the License, or      #
#   (at your option) any later version.                                    #
#                                                                          #
#   This program is distributed in the hope that it will be useful,        #
#   but WITHOUT ANY WARRANTY; without even the implied warranty of         #
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the          #
#   GNU General Public License for more details.                           #
#                                                                          #
#   You should have received a copy of the GNU General Public License      #
#   along with this program.  If not, see .  #

# version history
# v0.1 Angus Wallace Aug 2012
# v1.0 Angus Wallace Sep 2012


# a program to create Manhattan and QQ plots for huge GWAS data sets

### ############## ###
### check software ###
### ############## ###
if [ `which gnuplot | wc -l` -eq 0 ]; then echo "ERROR: gnuplot not found. Please install."; exit 1; fi
if [ `gnuplot -V | grep -o -E "[0-9]\.[0-9]" | awk '{ if($1 < 4.4) print 1}' | wc -m` -eq 1 ]; then #check gnuplot version number
   echo "ERROR: please install gnuplot version 4.4 or newer."; exit 1; fi
if [ `which awk | wc -l` -eq 0 ]; then echo "ERROR: awk not found. Please install."; exit 1; fi
if [ `which sort | wc -l` -eq 0 ]; then echo "ERROR: sort not found. Please install."; exit 1; fi
if [ `which sed | wc -l` -eq 0 ]; then echo "ERROR: sed not found. Please install."; exit 1; fi

### ################# ###
### process arguments ###
### ################# ###

if [ $# -eq 0 ]; then echo -e "\nCreate QQ and Manhattan plots for sanity-checking GWAS results. Written by Angus Wallace (wallace.angus@gmail.com), \n Use (switches are case-sensitive):\n  -in=filename    (specify a filename of input data)\n  -cwd            (work in the current directory)\n  -workdir=dir    (specify a directory in which to work)\n  -pvalueCol=num  (specify the column number of pvals in the datafile [default=11])\n  -SNPCol=n       (specify the column number of SNP names in the datafile [default=2])\n  -refdat=path    (specify the location of reference data)\n  -refChromoCol=n (specify the chromasome column in the reference datafile [default=1])\n  -refSNPCol=n    (specify the SNP column in the reference datafile [default=2])\n  -refBPCol=n     (specify the base-pair column in the reference datafile [default=4])\n  -nolabels       (specify that the datafile has no column labels)\n  -macImputed     (specify that the input data are produced by the minimac imputation program.  A separate reference data file is not needed, but can be supplied for faster execution)\n\nIf the input data filename has a full path, and neither -cwd or -workdir are used, then the path to the input file will be the working directory. If no paths are specified the current directory will be used.\n\nExamples:\nRun fastplot on the output of Merlin for a particular trait, using the standard reference data set:\n  fastplot -in=trait_data.out -refdat=/scratch/GWAS/GeneralRelease/Release6/GWAS.bim\nRun fastplot on the output of minimac imputed 1000Genome data:\n  fastplot -in=subcort.dat -macImputed\nRun fastplot on some data with columns arranged differently to the defaults:\n  fastplot -in=trait_data.out -refdat=/scratch/GWAS/GeneralRelease/Release6/GWAS.bim -pvalueCol=3 -SNPCol=2 -nolabels refChromoCol=3\n\nThis program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program.  If not, see http://www.gnu.org/licenses.\n"; exit 1; fi

col_lab=1 #assume the data file has labels unless told otherwise
imputed=0 #assume the data are not imputed unless told otherwise
subsample=0
sort_mem=50M
for arg in "$@" ; do
   case $arg in
      -in=*) # input file name
         in_dat=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -cwd) #work in the current directory
         workdir=`pwd`
         ;;
      -workdir=*) #specify a directory in which to work
         workdir=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -pvalueCol=*) #specify which column has the pvalues
         pvalueCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -SNPCol=*) #specify which column has the SNP names
         SNPCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -refdat=*) #specify reference data
         ref_dat=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -refChromoCol=*) #specify reference data chromasome col
         refChromoCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -refSNPCol=*) #specify reference data SNP col
         refSNPCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -refBPCol=*) #specify reference data's base-pair col
         refBPCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -nolabels) #line one has column labels
         col_lab=0
         ;;
      -macImputed) #line one has column labels
         imputed=1
         ;;
      -sortmem=*) # specify how much RAM is allocated to sort commands
         sort_mem=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
#      -subsample) #line one has column labels
#         subsample=1
#         ;;
   esac
done  

# if there's no path in the filename, then use the cwd
if [ `echo ${workdir} | wc -m` -eq 1 ]; then 
   workdir=`echo ${in_dat} | grep -E -o ".*/"`
   if [ `echo ${workdir} | wc -m` -eq 1 ]; then 
      workdir=`pwd`; fi; fi
cd $workdir

if [ `echo ${in_dat} | wc -m` -eq 1 ]; then 
   echo "ERROR: no input file specified."
   exit 1
elif [ `ls ${workdir}/${in_dat} | wc -m` -eq 0 ]; then 
   echo "ERROR: input file not found."
   exit 1
fi
 


### ######## ###
### defaults ###
### ######## ###

if [ `echo ${pvalueCol} | wc -m` -eq 1 ]; then pvalueCol=11; fi
if [ `echo ${SNPCol} | wc -m` -eq 1 ]; then SNPCol=2; fi
if [ `echo ${refChromoCol} | wc -m` -eq 1 ]; then refChromoCol=1; fi 
if [ `echo ${refSNPCol} | wc -m` -eq 1 ]; then refSNPCol=2; fi
if [ `echo ${refBPCol} | wc -m` -eq 1 ]; then refBPCol=4; fi

if [ `echo ${ref_dat} | wc -m` -eq 1 ]; then 
   if [ $imputed -eq 0 ]; then
      echo "ERROR: no reference file specified. try -refdat=/scratch/GWAS/GeneralRelease/Release6/GWAS.bim"
      exit 1; 
   else #then the data were inputed with minimac
      # create reference data from input data SNP names
      awk " \$${pvalueCol} != \"-\" {sub(\".*:\",\"\",\$2); print \$1, \$1\":\"\$2, \$2}" $in_dat | sort  -S ${sort_mem} -g > fastplot_refdata.dat
# possible way of specifying different columns for imputed data.
#      awk " \$${pvalueCol} != \"-\" {\$0= \$${SNPCol}\" \"\$${SNPCol}\" \"\$${SNPCol}; sub(\":.*\",\"\",\$1);sub(\".*:\",\"\",\$3); print \$0}" $in_dat | sort  -S ${sort_mem} -g > fastplot_refdata.dat
      ref_dat=${workdir}/fastplot_refdata.dat
      refBPCol=3
   fi
fi

### ####################################### ###
### create the Manhattan data and plot them ###
### ####################################### ###

# from ref_dat, create a list of cumulative SNP counts across chromasomes
rm -f fastplot_chr_len.dat
for i in `cat ${ref_dat} | awk '{print $1}' | sort -S ${sort_mem} | uniq | sort -g | uniq`; do
   awk "BEGIN {max = 0} {if (\$1==${i} && \$${refBPCol}>max) max=\$${refBPCol}} END {print max}" ${ref_dat} >> fastplot_chr_len.dat
done

echo "1 0" > abs_BP.dat
awk 'BEGIN{sum=0}{sum += $1; print (NR+1), sum}' fastplot_chr_len.dat | head -n 21 >> abs_BP.dat

chr_len=abs_BP.dat

if [ $imputed -eq 0 ]; then #are we dealing with the imputed output of mac?
   # prune input data to pvals column, remove first line, sort
   if [ $col_lab -eq 1 ]; then
      awk "NR > 1 && \$${pvalueCol} != \"-\" {print \$${SNPCol}, \$${pvalueCol} } " ${in_dat} | sort  -S ${sort_mem} > ${workdir}/manh_in_sorted.dat
   else
      awk "\$${pvalueCol} != \"-\" {print \$${SNPCol}, \$${pvalueCol} } " ${in_dat} | sort  -S ${sort_mem} > ${workdir}/manh_in_sorted.dat
   fi
   # get the base-pairs of the SNPs, translate to absolute BPs (not relative to chr)
   awk "{print \$${refChromoCol}, \$${refSNPCol}, \$${refBPCol} }" ${ref_dat} > ${workdir}/ref_dat.dat
   join ${workdir}/ref_dat.dat $chr_len | awk {'print $2, ($3 + $4), ($1 % 2)+1'} | sort  -S ${sort_mem} > ${workdir}/step_1_all.dat

   #merge the two files crop to pval/BP and take the -log10
   join ${workdir}/manh_in_sorted.dat ${workdir}/step_1_all.dat | awk {'print $3, -log($2) / log(10), $4'} > ${workdir}/all_dat.dat
   ### clean up ###
   rm ${workdir}/step_1_all.dat ${workdir}/manh_in_sorted.dat 
else 
   # then we're using minimac (13M) data. Do as much as possible in one line for speed.
   awk "NR==FNR{a[\$1]=\$2;next;}{  {sub(\$1,a[\$1]\" \"\$1) sub(\".*:\",\"\",\$3)}; print \$1+\$3, (-log(\$12 )/2.302585), (\$2 % 2)+1 }" abs_BP.dat ${in_dat} > all_dat.dat
#   fi
fi

# create the plot
if [ `gnuplot -V | grep -o -E "[0-9]\.[0-9]" | awk '{ if($1 < 4.4) print 1}' | wc -m` -ge 1 ]; then
   echo -e "\nWARNING: gnuplot 4.4 or newer is needed for prettier Manhattan plot output, colour-coded by chromosome\n"
gnuplot -persist < ${workdir}/qq_in_sorted.dat

# how many data are there?
dat_len=`cat ${workdir}/qq_in_sorted.dat | wc -l`

# generate chi_squared data of the same length
seq ${dat_len} -1 1 | awk {'print -log($1/'$dat_len')/ 2.302585'}  > ${workdir}/chi2_temp.dat # ${workdir}/ints.dat

# combine into the data to be plotted
paste ${workdir}/chi2_temp.dat ${workdir}/qq_in_sorted.dat > ${workdir}/chi2_plot_data.dat
paste ${workdir}/chi2_temp.dat ${workdir}/chi2_temp.dat > ${workdir}/x_eq_y.dat

# create the QQ plot
gnuplot -persist < ${workdir}/lambda_val.txt

echo "QQ lambda estimate: " `cat ${workdir}/lambda_val.txt`


### clean up ###
rm chi2_plot_data.dat x_eq_y.dat chi2_temp.dat qq_in_sorted.dat


This post was written by Angus Wallace and first appeared at guesstimatedapproximations.blogspot.com.au

Thursday, February 23, 2012

Political ecology in Australia (Labor's woes)

In the book Dune, by Frank Herbert, one of the main protagonists observes that the biggest threat to most creatures is one of their own kind. This is for the reason that they are in direct competition and that "they share the same basic requirements" -- they eat the same food, seek the same shelter, compete for the same mates, etc.
I thought of this quote recently, watching the antics of the Australian Labor party. Their ideological enemy, the Liberal party, are sitting on amusedly watching Labor tear itself apart. Why are members of Labor so busy fighting each other when their foe sits clearly across the chamber?
I think the reason is highlighted in the above quote. Labor has its niche, and the Liberal party has its niche -- members of each party tend not to change allegiance, and so remain in their party's "territory" in the political "landscape" as though they were fenced within it. Because of this, individuals within each party are effectively competing with others in the same party for positions of power and influence within that party's territory -- in exactly the same way as animals in an ecosystem compete for food and procreation. The fact that they are trashing their "ecosystem" to compete for the resources it contains has an unhappy similarity to humans fighting over resources and reminds me of Solomon's edict to "cut it in two."

This post was written by Angus Wallace and first appeared at guesstimatedapproximations.blogspot.com.au

Wednesday, February 1, 2012

Energy efficiency at home

I am running a half-day workshop this coming Saturday at Northey Street City Farm on how to save energy at home. It covers

  • how to estimate how much power devices are using
  • how to find the best places to invest in energy efficiency at your place
  • how to use the garden to reduce cooling costs
  • some technical understanding of electricity and its supply
  • how to examine behaviours seeking simple changes to save energy and money
  • how to cook to reduce energy use

Here is a link to the manual I have developed. I will refine this in the future and keep this site up-to-date. It is covered by a CC license (details in the pdf). Any comments or suggestions are welcome and will be considered for incorporation.

This post was written by Angus Wallace and first appeared at guesstimatedapproximations.blogspot.com.au

Tuesday, January 24, 2012

Winsorisation in R

I wrote a function to hard-winsorise each column of a data.frame to 3 standard deviations. I couldn't find anything that did this neatly.
It doesn't do any checking of the data, you need to do that yourself. This is licensed under GPL3 or later. Please link back here if cross-posting it elsewhere.

winsor_clip_data <- function (x, std = 3, na.rm = TRUE)
{
clip_vec <- function(dat, min, max){
# hard clip dat to the rangs max, min
dat[dat > max] <- max
dat[dat < min] <- min
return(dat)
}

sds<-as.matrix(apply(x, 2, sd, na.rm=TRUE))
means<-as.matrix(apply(x, 2, mean, na.rm=TRUE))
mins<-means-3*sds
maxs<- means+3*sds
output<- mapply(clip_vec, x, mins, maxs)

return(output)
}

Summary stats in R

I've been looking for an easy way of creating a data.frame of summary statistics in R, and haven't been able to find anything. The summary() function seems to output a list, and it isn't easily malleable into a data.frame. This makes it hard to add other stats to the list, or to query it from other functions. I've written a simple function that uses boxplot() plus a few other bits to make a nice data.frame.

It doesn't do any checking of the data, you need to do that yourself. This is licensed under GPL3 or later. Please link back here if cross-posting it elsewhere.

summary_stats <- function(these_data, output_dir) {

num_NAs=as.data.frame(t(colSums(is.na(these_data))))
rownames(num_NAs)<-"NA count"

means<-as.data.frame(t(colMeans(these_data, na.rm=TRUE)))
rownames(means)<-"means"

num_dat=as.data.frame(t(rep(nrow(these_data),ncol(these_data))))
rownames(num_dat)<-"num data"
names(num_dat)<-names(these_data)

stats<-boxplot(these_data,plot=FALSE)
stats<-as.data.frame(stats$stats[1:5,])
names(stats)<-names(these_data)
rownames(stats)<-c("minimum (excl outliers)","lower quartile","median", "upper quartile", "maximum (excl outliers)")

output<-as.data.frame(rbind(
num_NAs,
num_dat,
means,
stats
))
return(output)
}
Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.