Wednesday, June 4, 2014

filter strings with awk, extended regular expressions

I've been wanting to filter out a DNA sequence within a line of text using awk, and have been unsuccessful. I was trying:

gawk '{gsub(/[ACTG]{10,}/,""); print}'

which I expected to work, but it did not. I've found that gawk (GNU awk) has an extra switch that enables this syntax (interval expressions in extended regular expressions):

gawk --re-interval '{gsub(/[ACTG]{10,}/,""); print}'

This works.
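
For example, a quick sanity check with a made-up line (note that gawk 4.0 and later enable interval expressions by default, so the switch is only needed on older versions):

echo "read1 ACTGACTGACTGACTG ok" | gawk --re-interval '{gsub(/[ACTG]{10,}/,""); print}'
# prints: read1  ok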

Sunday, April 20, 2014

Rainwater modeling

I have spent a fair amount of time thinking about water. My family and I live in Adelaide, where water is hard to come by. The energy intensity of Adelaide's water supply is about 0.8 - 1.8 kWh/kL [1], while Adelaide's new desalination plant is expected to use 3.5 - 5.5 kWh/kL [2]. Since the average household uses about 191 kL per year, the energy associated with this is roughly 150 - 1050 kWh per year (191 kL x 0.8 kWh/kL at the low end; 191 kL x 5.5 kWh/kL if it all came from desalination) -- a significant amount of energy.
Thus, we want to use less of this expensive resource, and more of the free water that falls on our roof and currently goes to waste. For this reason, we have installed 35 kL of rainwater tanks, and they are currently collecting from the back part of the house, the pergola and the garage. As part of my plans for developing this system, I wanted to understand what sort of performance we can expect from this system when it is completed. In other words, do we have enough storage so that we will not run out of water?

How much storage is required?

I went to the Australian Bureau of Meteorology and downloaded all the monthly rainfall data for Adelaide [3]. Then I built a simple model that estimates our monthly average consumption. The model pretends that we installed our tanks when rainfall records began (in 1884). I then ran the model through all the data to see how our system would have performed historically, giving us 129 years of real data to test against. (A sketch of the month-by-month balance the model computes appears after the assumptions below.)
Assumptions:
  1. We collect water from every roof on our land (not currently the case, but a near- to medium-term goal)
  2. We collect 85% of the water that falls on these surfaces (the rest splashes, evaporates, etc)
  3. There are many assumptions about how much we use, but the modelled consumption is broadly consistent with the bills we get from our utility (although our actual summer usage may be higher than the model suggests).
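
At its heart the model is a simple month-by-month water balance. The sketch below shows the same logic as an awk one-liner. The roof area, collection efficiency, tank capacity and monthly consumption here are stand-in figures (the spreadsheet varies consumption month by month), and monthly_rainfall.dat is assumed to hold one monthly rainfall total (in mm) per line:

awk -v area=200 -v eff=0.85 -v cap=35000 -v use=14000 '
   { capture = $1 * area * eff            # litres captured: 1 mm on 1 m2 = 1 L
     store += capture - use               # add capture, subtract the month's use
     if (store > cap) store = cap         # the tank is full: the excess overflows
     if (store < 0) { dry++; store = 0 }  # the tank is empty: we ran out this month
   }
   END { print dry " dry months out of " NR }' monthly_rainfall.dat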

Results

With our system as is (when the above assumptions are met), we would have run out of water in 100 out of 129 years.
If we capture grey water from the kitchen sink and washing machine, and use that instead of potable water for watering the garden, then we would have run out of water in 41 out of 129 years.
If, in addition to the grey water, we move to a dry toilet (eg. composting), then we would have run out of water in only three of the 129 years.
If we increase our water storage to 45 kL, but retain the flushing toilet, we would have run out of water in 17 out of 129 years.
All my modelling was performed in a simple spreadsheet that can be downloaded here [4] (it was built in LibreOffice [5], a free office suite) -- you can play with the numbers yourself. Please let me know in the comments or by email if you find any problems.


Modelled rainwater capture, storage and consumption, month by month, for an average year (rainfall averaged monthly since records began in 1884)

Conclusions

If the goal is not to run out of water, it is much more effective to reduce consumption than to increase storage. Capturing water from a larger roof area is also much more effective than increasing storage. Grey water is very powerful here, because we can use water twice: once in the home (washing, etc) and then again in the garden. This helps provided the grey water offsets water we would otherwise have used in the garden.

Other considerations

I have also been reading about the energy cost of pumping domestic rainwater. CSIRO have published a paper on this [6], which finds that a pump that directly pressurises the pipes is a very energy-expensive way to run things, with energy consumption approaching that of desalination.
To mitigate this, I am hoping to run the garden on gravity-fed water wherever possible. Also, instead of a system where a pump directly pressurises our existing water pipes in the house, I'm hoping to create a header tank up high and gravity-feed the house [EDIT: I have decided not to do this, and have instead installed a variable-speed pump. I will discuss this in a future post]. The problem is that, because we live in an urban setting, Council regulations limit how elevated such a tank can be. I will post more on this in a future post.

References

[1] Climate Change and Water: International Perspectives on Mitigation..., edited by Carol Howe, Joel B. Smith and Jim Henderson
[2] http://www.sawater.com.au/NR/rdonlyres/2AF55919-F858-4AB2-93E3-534E62E6DC73/0/DesalEISChapter6.pdf
[3] http://www.bom.gov.au/climate/data/
[4] https://www.dropbox.com/s/od108as4l3v4ogu/water%20calculations.ods
[5] http://www.libreoffice.org
[6] https://publications.csiro.au/rpr/download?pid=csiro:EP114797&dsid=DS4

This post was written by Angus Wallace and first appeared at guesstimatedapproximations.blogspot.com.au

Wednesday, September 18, 2013

knitR and LaTeX for bioinformatics

knitR is a great way of dynamically generating reports, and ensuring that they're always based on up-to-date data. Having the reporting closely tied to the data and code that generated it is one of the tenets of reproducible research. In bioinformatics, one often finds oneself repeatedly generating figures, graphs, etc, for multiple labels (eg. probes, genes, etc). knitR is a great fit for this.

I had some difficulty finding the best way of iterating and generating multiple reports, and the code below is the best way I have found so far.

The following example code iterates over each label in a data.frame, and generates a separate .pdf for each. In a real scenario, each label could be a gene, and we could pull out whatever data/tables/figures we wanted.

Note that you may need to install some R libraries for this code to work:

  • ggplot2
  • xtable
  • knitr
Below is example code to do this, in two parts. First, a control R script: adjust the path in the knit2pdf command, put it in a file, make it executable, and run it. Second, the knitR (.Rnw) file itself: this is essentially a LaTeX file with blobs (called "chunks" in knitR) of R in it that are evaluated and their output included in the final document, as desired. I saw the xtable technique on this blog.
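
Here is a minimal sketch of the two files described, assuming the .Rnw file below is saved at /path/to/label_report.Rnw and that the data live in a data.frame with a 'label' column (the data here are made up). First, the control script:

#!/usr/bin/env Rscript
# control script: one knit2pdf call per label
library(knitr)

# made-up example data: twenty values for each of two labels
dat <- data.frame(label = rep(c("geneA", "geneB"), each = 20),
                  value = rnorm(40))

for (current_label in unique(dat$label)) {
   # the chunks in the .Rnw file can see 'dat' and 'current_label'
   knit2pdf("/path/to/label_report.Rnw",   # adjust this path
            output = paste0(current_label, "_report.tex"))
}

And a matching minimal .Rnw file: LaTeX with two chunks, one emitting a table via xtable, one a ggplot2 figure:

\documentclass{article}
\begin{document}

\section*{Report for \Sexpr{current_label}}

<<label_table, results='asis', echo=FALSE>>=
library(xtable)
# emit a LaTeX table of the first few rows for this label
print(xtable(head(dat[dat$label == current_label, ])))
@

<<label_figure, echo=FALSE, fig.width=5, fig.height=4>>=
library(ggplot2)
# histogram of the values for this label
ggplot(dat[dat$label == current_label, ], aes(x = value)) +
   geom_histogram(bins = 10)
@

\end{document}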

Wednesday, August 28, 2013

Blast, updated

I have found a way of parallelising blast across a fasta file and multiple databases at the same time. I often wish to blast a .fasta file of sequences against a heap of databases, so this represents quite a speed increase.

# the databases to blast against (whitespace-separated)
DBs="/path/to/TriFLDB \
/path/to/harvardTC_Ta12"
in_fd=/path/to/input_file.fasta
out_fmt='\"6\"'
# also assumed by the command below: where the blast binaries live, and
# which blast program to run (set these to suit your system)
BLAST_PATH=/path/to/ncbi-blast/bin
blast_type=blastx

parallel --gnu "cat ${in_fd} | parallel --gnu --nice 19 --block 100k --recstart '>' --pipe ${BLAST_PATH}/${blast_type} -outfmt "\""${out_fmt}"\"" -query - -db {1} | awk '{n=split(\"{1}\",prnt, \"/\"); print \$0,prnt[n] }'" ::: ${DBs} 

I have included an awk in the pipe within parallel that appends a column with the name of the database. You can remove that if you wish, but you will then not know which database a given hit came from.

This code works by using parallel to call an instance of parallel for each database to be analysed. The sub-parallel then parallelises over the fasta file, as per my previous post on this subject.
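
To see the nesting pattern without blast's quoting noise, here is a stripped-down sketch: the outer parallel starts one job per database name (db1 and db2 are stand-ins), and each job runs an inner parallel that splits the fasta stream (${in_fd}, as above) into 100k blocks -- here just counting the sequences in each block instead of blasting them:

parallel --gnu "echo {1}: ; cat ${in_fd} | parallel --gnu --block 100k --recstart '>' --pipe grep -c '>'" ::: db1 db2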

Thursday, June 27, 2013

I have been working with blast for aligning sequences. This is a fairly computationally intensive exercise that is well worth parallelising. GNU parallel is a great tool for this, but I found it rather unintuitive to use. I also didn't realise at first that my initial implementation was not actually parallelising blast -- it was still using only one processor core (traps for the unwary!). I found an implementation that works, though, and it's below:

cat $in_fd | parallel --gnu --max-procs ${max_cores} --block 100k --recstart '>' --pipe blastx -evalue 0.01 -outfmt 6 -db ${blast_DB_loc} -query - > output_file.dat

# ${in_fd} is a fasta-formatted input file
# ${blast_DB_loc} is the database to blast against
# ${max_cores} is the number of cores one wishes to use

Please note that parallelisation like this is only appropriate when the order of results does not matter and the analysis of one segment does not rely on the output of the analysis of another segment.

Tuesday, October 2, 2012

fastplot

fastplot is a program for the rapid and highly-efficient creation of QQ and Manhattan plots of GWAS results. Because of its very low memory (RAM) requirements, it is suitable for visualising large GWAS data sets, such as those imputed using the 1000 Genomes Project reference sets: it processes such a set in approximately 10 minutes, while requiring less than 50 MB of RAM for the calculation stages and approximately 800 MB of RAM for the final figure rendering. Those timings use the -macImputed option, which is the fastest way of processing 1000G data if they are formatted correctly (ie. they are the output of minimac). If your data are formatted differently, contact me and I will add an optimised switch to suit your data.


fastplot creates both QQ and Manhattan plots from GWAS results. The data are all manipulated on disk on-the-fly: only a small subset of the data are RAM-resident at any time. Because of this, the memory requirements for calculation are very low (below 50 MB). The program requires access to the freely-available GNU toolset (bash, sort, awk, sed) and GNUplot (version 4.4 or higher allows better plots, but fastplot will work with version 4.0 as well). Rendering the figures generally requires more memory (13 million results requires approximately 800 MB), although this is still very modest and easily allocable on most computers produced in recent years. It is also fast: QQ and Manhattan plots are both created, and an estimate for lambda returned, within approximately 10 minutes.

Please note that fastplot is beta software. There may be bugs, so I strongly suggest that you back up your data before your first few uses!

Use:


The program requires that the user pass it two arguments:
-in=input_file.dat -refdat=reference_file.bim
input_file contains columns of results (by default, with SNP names in column 2 and p-values in column 11), and reference_file contains position information in the standard plink .bim format (chromosome, SNP name, genetic distance, base-pair position, allele 1, allele 2)
If a user's data differ from these formats, the operation of fastplot can be adjusted using the following switches:
-in=filename    (specify a filename of input data)
-cwd            (work in the current directory)
-workdir=dir    (specify a directory in which to work)
-pvalueCol=num  (specify the column number of pvals in the datafile [default=11])
-SNPCol=n       (specify the column number of SNP names in the datafile [default=2])
-refdat=path    (specify the location of reference data)
-refChromoCol=n (specify the chromosome column in the reference datafile [default=1])
-refSNPCol=n    (specify the SNP column in the reference datafile [default=2])
-refBPCol=n     (specify the base-pair column in the reference datafile [default=4])
-nolabels       (specify that the datafile has no column labels)
-macImputed     (specify that the input data are produced by the minimac imputation program. A separate reference data file is not needed, but can be supplied for faster execution)
-sortmem=size   (specify how much RAM to allocate to sort commands [default=50M])
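
For example, to run fastplot on minimac-imputed 1000 Genomes output (this invocation also appears in the script's built-in help text):

fastplot -in=subcort.dat -macImputed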

To use the program, please cut and paste the text below into a text editor (vi, notepad, emacs, gedit, etc) and save it as fastplot somewhere in your path. You will need to make the file executable before you run it (eg. chmod +x /path/to/fastplot)

#!/bin/bash

# Angus Wallace, 2012
# wallace.angus@gmail.com

#   This program is free software: you can redistribute it and/or modify   #
#   it under the terms of the GNU General Public License as published by   #
#   the Free Software Foundation, either version 3 of the License, or      #
#   (at your option) any later version.                                    #
#                                                                          #
#   This program is distributed in the hope that it will be useful,        #
#   but WITHOUT ANY WARRANTY; without even the implied warranty of         #
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the          #
#   GNU General Public License for more details.                           #
#                                                                          #
#   You should have received a copy of the GNU General Public License      #
#   along with this program.  If not, see http://www.gnu.org/licenses/. #

# version history
# v0.1 Angus Wallace Aug 2012
# v1.0 Angus Wallace Sep 2012


# a program to create Manhattan and QQ plots for huge GWAS data sets

### ############## ###
### check software ###
### ############## ###
if [ `which gnuplot | wc -l` -eq 0 ]; then echo "ERROR: gnuplot not found. Please install."; exit 1; fi
if [ `gnuplot -V | grep -o -E "[0-9]+\.[0-9]+" | head -n 1 | awk '{ if($1 < 4.0) print 1}' | wc -m` -ge 1 ]; then #check the gnuplot version (4.0 is the minimum; 4.4+ gives prettier plots)
   echo "ERROR: please install gnuplot version 4.0 or newer."; exit 1; fi
if [ `which awk | wc -l` -eq 0 ]; then echo "ERROR: awk not found. Please install."; exit 1; fi
if [ `which sort | wc -l` -eq 0 ]; then echo "ERROR: sort not found. Please install."; exit 1; fi
if [ `which sed | wc -l` -eq 0 ]; then echo "ERROR: sed not found. Please install."; exit 1; fi

### ################# ###
### process arguments ###
### ################# ###

if [ $# -eq 0 ]; then echo -e "\nCreate QQ and Manhattan plots for sanity-checking GWAS results. Written by Angus Wallace (wallace.angus@gmail.com), \n Use (switches are case-sensitive):\n  -in=filename    (specify a filename of input data)\n  -cwd            (work in the current directory)\n  -workdir=dir    (specify a directory in which to work)\n  -pvalueCol=num  (specify the column number of pvals in the datafile [default=11])\n  -SNPCol=n       (specify the column number of SNP names in the datafile [default=2])\n  -refdat=path    (specify the location of reference data)\n  -refChromoCol=n (specify the chromosome column in the reference datafile [default=1])\n  -refSNPCol=n    (specify the SNP column in the reference datafile [default=2])\n  -refBPCol=n     (specify the base-pair column in the reference datafile [default=4])\n  -nolabels       (specify that the datafile has no column labels)\n  -macImputed     (specify that the input data are produced by the minimac imputation program.  A separate reference data file is not needed, but can be supplied for faster execution)\n  -sortmem=size   (specify how much RAM to allocate to sort commands [default=50M])\n\nIf the input data filename has a full path, and neither -cwd nor -workdir are used, then the path to the input file will be the working directory. If no paths are specified the current directory will be used.\n\nExamples:\nRun fastplot on the output of Merlin for a particular trait, using the standard reference data set:\n  fastplot -in=trait_data.out -refdat=/scratch/GWAS/GeneralRelease/Release6/GWAS.bim\nRun fastplot on the output of minimac imputed 1000Genome data:\n  fastplot -in=subcort.dat -macImputed\nRun fastplot on some data with columns arranged differently to the defaults:\n  fastplot -in=trait_data.out -refdat=/scratch/GWAS/GeneralRelease/Release6/GWAS.bim -pvalueCol=3 -SNPCol=2 -nolabels -refChromoCol=3\n\nThis program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program.  If not, see http://www.gnu.org/licenses.\n"; exit 1; fi

col_lab=1 #assume the data file has labels unless told otherwise
imputed=0 #assume the data are not imputed unless told otherwise
subsample=0
sort_mem=50M
for arg in "$@" ; do
   case $arg in
      -in=*) # input file name
         in_dat=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -cwd) #work in the current directory
         workdir=`pwd`
         ;;
      -workdir=*) #specify a directory in which to work
         workdir=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -pvalueCol=*) #specify which column has the pvalues
         pvalueCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -SNPCol=*) #specify which column has the SNP names
         SNPCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -refdat=*) #specify reference data
         ref_dat=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -refChromoCol=*) #specify reference data chromosome col
         refChromoCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -refSNPCol=*) #specify reference data SNP col
         refSNPCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -refBPCol=*) #specify reference data's base-pair col
         refBPCol=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
      -nolabels) #the datafile has no column labels
         col_lab=0
         ;;
      -macImputed) #the input data were imputed with minimac
         imputed=1
         ;;
      -sortmem=*) # specify how much RAM is allocated to sort commands
         sort_mem=`echo $arg | sed 's/[-a-zA-Z0-9]*=//'`
         ;;
#      -subsample) #subsample the data (disabled)
#         subsample=1
#         ;;
   esac
done  

# if there's no path in the filename, then use the cwd
if [ `echo ${workdir} | wc -m` -eq 1 ]; then 
   workdir=`echo ${in_dat} | grep -E -o ".*/"`
   if [ `echo ${workdir} | wc -m` -eq 1 ]; then 
      workdir=`pwd`; fi; fi
cd $workdir

if [ `echo ${in_dat} | wc -m` -eq 1 ]; then 
   echo "ERROR: no input file specified."
   exit 1
elif [ `ls ${workdir}/${in_dat} | wc -m` -eq 0 ]; then 
   echo "ERROR: input file not found."
   exit 1
fi
 


### ######## ###
### defaults ###
### ######## ###

if [ `echo ${pvalueCol} | wc -m` -eq 1 ]; then pvalueCol=11; fi
if [ `echo ${SNPCol} | wc -m` -eq 1 ]; then SNPCol=2; fi
if [ `echo ${refChromoCol} | wc -m` -eq 1 ]; then refChromoCol=1; fi 
if [ `echo ${refSNPCol} | wc -m` -eq 1 ]; then refSNPCol=2; fi
if [ `echo ${refBPCol} | wc -m` -eq 1 ]; then refBPCol=4; fi

if [ `echo ${ref_dat} | wc -m` -eq 1 ]; then 
   if [ $imputed -eq 0 ]; then
      echo "ERROR: no reference file specified. try -refdat=/scratch/GWAS/GeneralRelease/Release6/GWAS.bim"
      exit 1; 
   else #then the data were imputed with minimac
      # create reference data from input data SNP names
      awk " \$${pvalueCol} != \"-\" {sub(\".*:\",\"\",\$2); print \$1, \$1\":\"\$2, \$2}" $in_dat | sort  -S ${sort_mem} -g > fastplot_refdata.dat
# possible way of specifying different columns for imputed data.
#      awk " \$${pvalueCol} != \"-\" {\$0= \$${SNPCol}\" \"\$${SNPCol}\" \"\$${SNPCol}; sub(\":.*\",\"\",\$1);sub(\".*:\",\"\",\$3); print \$0}" $in_dat | sort  -S ${sort_mem} -g > fastplot_refdata.dat
      ref_dat=${workdir}/fastplot_refdata.dat
      refBPCol=3
   fi
fi

### ####################################### ###
### create the Manhattan data and plot them ###
### ####################################### ###

# from ref_dat, find the length of each chromosome (its largest base-pair position)
rm -f fastplot_chr_len.dat
for i in `cat ${ref_dat} | awk '{print $1}' | sort -S ${sort_mem} | uniq | sort -g | uniq`; do
   awk "BEGIN {max = 0} {if (\$1==${i} && \$${refBPCol}>max) max=\$${refBPCol}} END {print max}" ${ref_dat} >> fastplot_chr_len.dat
done

echo "1 0" > abs_BP.dat
awk 'BEGIN{sum=0}{sum += $1; print (NR+1), sum}' fastplot_chr_len.dat | head -n 21 >> abs_BP.dat

chr_len=abs_BP.dat

if [ $imputed -eq 0 ]; then #the data are not minimac-imputed: look positions up in the reference data
   # prune input data to pvals column, remove first line, sort
   if [ $col_lab -eq 1 ]; then
      awk "NR > 1 && \$${pvalueCol} != \"-\" {print \$${SNPCol}, \$${pvalueCol} } " ${in_dat} | sort  -S ${sort_mem} > ${workdir}/manh_in_sorted.dat
   else
      awk "\$${pvalueCol} != \"-\" {print \$${SNPCol}, \$${pvalueCol} } " ${in_dat} | sort  -S ${sort_mem} > ${workdir}/manh_in_sorted.dat
   fi
   # get the base-pairs of the SNPs, translate to absolute BPs (not relative to chr)
   awk "{print \$${refChromoCol}, \$${refSNPCol}, \$${refBPCol} }" ${ref_dat} > ${workdir}/ref_dat.dat
   join ${workdir}/ref_dat.dat $chr_len | awk {'print $2, ($3 + $4), ($1 % 2)+1'} | sort  -S ${sort_mem} > ${workdir}/step_1_all.dat

   #merge the two files crop to pval/BP and take the -log10
   join ${workdir}/manh_in_sorted.dat ${workdir}/step_1_all.dat | awk {'print $3, -log($2) / log(10), $4'} > ${workdir}/all_dat.dat
   ### clean up ###
   rm ${workdir}/step_1_all.dat ${workdir}/manh_in_sorted.dat 
else 
   # then we're using minimac (13M) data. Do as much as possible in one line for speed.
   awk "NR==FNR{a[\$1]=\$2;next;}{  {sub(\$1,a[\$1]\" \"\$1) sub(\".*:\",\"\",\$3)}; print \$1+\$3, (-log(\$12 )/2.302585), (\$2 % 2)+1 }" abs_BP.dat ${in_dat} > all_dat.dat
#   fi
fi

# create the Manhattan plot. The gnuplot commands here are a minimal
# sketch -- adjust the styling to taste
if [ `gnuplot -V | grep -o -E "[0-9]\.[0-9]" | awk '{ if($1 < 4.4) print 1}' | wc -m` -ge 1 ]; then
   echo -e "\nWARNING: gnuplot 4.4 or newer is needed for prettier Manhattan plot output, colour-coded by chromosome\n"
   manh_style="using 1:2 with dots notitle"
else
   manh_style="using 1:2:3 with dots linecolor variable notitle"
fi
gnuplot -persist <<EOF
set terminal png size 1200,600
set output "${workdir}/manhattan.png"
set xlabel "absolute base-pair position"
set ylabel "-log10(p)"
plot "${workdir}/all_dat.dat" ${manh_style}
EOF

# sort the -log10(p) values in ascending order for the QQ plot
awk '{print $2}' ${workdir}/all_dat.dat | sort -S ${sort_mem} -g > ${workdir}/qq_in_sorted.dat

# how many data are there?
dat_len=`cat ${workdir}/qq_in_sorted.dat | wc -l`

# generate chi_squared data of the same length
seq ${dat_len} -1 1 | awk {'print -log($1/'$dat_len')/ 2.302585'}  > ${workdir}/chi2_temp.dat # ${workdir}/ints.dat

# combine into the data to be plotted
paste ${workdir}/chi2_temp.dat ${workdir}/qq_in_sorted.dat > ${workdir}/chi2_plot_data.dat
paste ${workdir}/chi2_temp.dat ${workdir}/chi2_temp.dat > ${workdir}/x_eq_y.dat

# create the QQ plot and estimate lambda. The heredoc below is a minimal
# sketch: it fits a straight line through the QQ points and takes the
# slope as the lambda estimate
gnuplot -persist <<EOF
set terminal png size 600,600
set output "${workdir}/qq.png"
set xlabel "expected -log10(p)"
set ylabel "observed -log10(p)"
f(x) = lambda * x
fit f(x) "${workdir}/chi2_plot_data.dat" using 1:2 via lambda
set print "${workdir}/lambda_val.txt"
print lambda
plot "${workdir}/chi2_plot_data.dat" using 1:2 with dots notitle, \
     "${workdir}/x_eq_y.dat" using 1:2 with lines notitle
EOF

echo "QQ lambda estimate: " `cat ${workdir}/lambda_val.txt`


### clean up ###
rm chi2_plot_data.dat x_eq_y.dat chi2_temp.dat qq_in_sorted.dat


This post was written by Angus Wallace and first appeared at guesstimatedapproximations.blogspot.com.au

Thursday, February 23, 2012

Political ecology in Australia (Labor's woes)

In the book Dune, by Frank Herbert, one of the main protagonists observes that the biggest threat to most creatures is one of their own kind. This is for the reason that they are in direct competition and that "they share the same basic requirements" -- they eat the same food, seek the same shelter, compete for the same mates, etc.
I thought of this quote recently, watching the antics of the Australian Labor party. Their ideological enemy, the Liberal party, sit by, amusedly watching Labor tear itself apart. Why are members of Labor so busy fighting each other when their foe sits plainly across the chamber?
I think the reason is highlighted in the above quote. Labor has its niche, and the Liberal party has its niche -- members of each party tend not to change allegiance, and so remain in their party's "territory" in the political "landscape" as though they were fenced within it. Because of this, individuals within each party are effectively competing with others in the same party for positions of power and influence within that party's territory -- in exactly the same way as animals in an ecosystem compete for food and procreation. The fact that they are trashing their "ecosystem" to compete for the resources it contains has an unhappy similarity to humans fighting over resources and reminds me of Solomon's edict to "cut it in two."

This post was written by Angus Wallace and first appeared at guesstimatedapproximations.blogspot.com.au
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.