Tuesday, January 24, 2012

Winsorisation in R

I couldn't find anything that neatly hard-winsorises a data.frame, so I wrote a function that clips each column to within 3 standard deviations of its mean. Note that it returns a matrix rather than a data.frame.
It doesn't do any checking of the data; you need to do that yourself. This is licensed under GPL3 or later. Please link back here if cross-posting it elsewhere.

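winsor_clip_data <- function (x, std = 3, na.rm = TRUE)
{
  # x: data.frame of numeric columns
  # std: number of standard deviations either side of the mean to clip at
  # na.rm: whether to ignore NAs when computing the means and standard deviations
  clip_vec <- function(dat, min, max){
    # hard clip dat to the range [min, max]
    dat[dat > max] <- max
    dat[dat < min] <- min
    return(dat)
  }

  # per-column statistics and the clipping limits derived from them
  sds <- apply(x, 2, sd, na.rm = na.rm)
  means <- apply(x, 2, mean, na.rm = na.rm)
  mins <- means - std * sds
  maxs <- means + std * sds
  # clip each column to its own limits; mapply simplifies the result to a matrix
  output <- mapply(clip_vec, x, mins, maxs)

  return(output)
}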
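
The same clipping step can also be written with base R's vectorised pmin() and pmax(). This is only a minimal sketch, assuming every column is numeric; the helper name clip_to_sd and the toy data are made up for illustration and are not part of the function above.

```r
# Hypothetical helper: clip one numeric vector to mean +/- k standard deviations.
clip_to_sd <- function(v, k = 3, na.rm = TRUE) {
  centre <- mean(v, na.rm = na.rm)
  spread <- sd(v, na.rm = na.rm)
  # pmax() raises values below the lower limit, pmin() lowers values above the upper limit
  pmin(pmax(v, centre - k * spread), centre + k * spread)
}

# Made-up data: column a is 19 zeros plus one extreme value, column b has no outliers.
toy <- data.frame(a = c(rep(0, 19), 100), b = 1:20)

# Apply the helper column by column and keep the result as a data.frame.
clipped <- as.data.frame(lapply(toy, clip_to_sd))

print(tail(clipped, 3))
```

Here the extreme value in column a is pulled down to the upper limit (mean plus 3 standard deviations), while column b is left unchanged. The toy column is deliberately 20 rows long: with ten or fewer observations no single value can lie more than 3 sample standard deviations from the mean, so nothing would be clipped.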

Summary stats in R

I've been looking for an easy way to create a data.frame of summary statistics in R and haven't been able to find one. The summary() function returns a table object that isn't easily turned into a data.frame, which makes it hard to add other stats to it or to query it from other functions. I've written a simple function that uses boxplot() plus a few other bits to make a nice data.frame.

It doesn't do any checking of the data; you need to do that yourself. This is licensed under GPL3 or later. Please link back here if cross-posting it elsewhere.

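summary_stats <- function(these_data, output_dir) {
  # these_data: data.frame of numeric columns
  # output_dir: not used by this function

  # number of missing values in each column
  num_NAs <- as.data.frame(t(colSums(is.na(these_data))))
  rownames(num_NAs) <- "NA count"

  # column means, ignoring missing values
  means <- as.data.frame(t(colMeans(these_data, na.rm = TRUE)))
  rownames(means) <- "means"

  # number of rows in each column (missing values included)
  num_dat <- as.data.frame(t(rep(nrow(these_data), ncol(these_data))))
  rownames(num_dat) <- "num data"
  names(num_dat) <- names(these_data)

  # the five box-and-whisker statistics per column, computed without drawing a plot
  stats <- boxplot(these_data, plot = FALSE)
  stats <- as.data.frame(stats$stats[1:5, ])
  names(stats) <- names(these_data)
  rownames(stats) <- c("minimum (excl outliers)", "lower quartile", "median", "upper quartile", "maximum (excl outliers)")

  # stack everything into one data.frame, one row per statistic
  output <- rbind(num_NAs, num_dat, means, stats)
  return(output)
}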
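
To see why boxplot() is a convenient starting point, here is a small sketch on made-up data. It assumes numeric columns only; the toy data, the row labels and the extra "sd" row are illustrative choices of mine, not part of the function above.

```r
# Made-up data: column a has an obvious outlier, column b has a missing value.
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, NA, 30, 40, 50))

# summary() on a data.frame gives back a "table" object, which is awkward to extend.
summary_class <- class(summary(toy))

# boxplot(plot = FALSE) returns the five box statistics as a numeric matrix,
# one column per variable; points beyond the whiskers are listed separately in $out.
bp <- boxplot(toy, plot = FALSE)
five <- as.data.frame(bp$stats)
names(five) <- names(toy)
rownames(five) <- c("lower whisker", "lower hinge", "median",
                    "upper hinge", "upper whisker")

# Because the result is an ordinary data.frame, adding another statistic is one rbind().
sds <- as.data.frame(t(sapply(toy, sd, na.rm = TRUE)))
rownames(sds) <- "sd"
extended <- rbind(five, sds)

print(extended)
```

In this toy case the value 100 ends up in bp$out rather than in the "upper whisker" row, which is what the "excl outliers" row labels in the function refer to.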
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.