Remove outliers

The goal is to remove outliers (by variable) by marking them as NA and keeping a record of which were outliers.

Data

First, lets create a sample data set

set.seed(20130828)
rawData <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y= c(runif(1000), rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))

Here you can browse it interactively:

library(rCharts)
library(data.table)
## Add the index
d <- data.table(cbind("row" = 1:nrow(rawData), rawData))
t1 <- dTable(d, sPaginationType=  'full_numbers', iDisplayLength=10, sScrollX='100%')
t1$print("chart1", include_assets=TRUE, cdn=TRUE)

Notice for example that the first observation in variable X is a NA. Meaning that we will be dealing with original NAs and new NAs.

Find outliers

We will mark an outlier any observation outside 3 sd. The next function finds the cells of the matrix that are considered as outliers.

findOutlier <- function(data, cutoff=3) {
    ## Calculate the sd
    sds <- apply(data, 2, sd, na.rm=TRUE)
    ## Identify the cells with value greater than cutoff * sd (column wise)
    result <- mapply(function(d, s) { 
        which(d > cutoff * s)
        },
        data, sds
    )
    result
}

outliers <- findOutlier(rawData)
outliers
## $X
## [1] 1003 1008 1010 1011 1013 1017 1018
## 
## $Y
##  [1] 1001 1002 1003 1004 1005 1006 1007 1008 1010 1011 1012 1013 1014 1015
## [15] 1017 1018 1020
## 
## $Z
##  [1]  14  43  92 104 107 136 151 158 211 215 223 274 332 367 400 427 454
## [18] 475 544 574 579 594 657 675 766 799 803 805 865 877 884 910 922 968
## [35] 978 981

Remove outliers

Next we can remove the ouliers.

removeOutlier <- function(data, outliers) {
    result <- mapply(function(d, o) {
        res <- d
        res[o] <- NA
        return(res)
    }, data, outliers)
    return(as.data.frame(result))
}

dataFilt <- removeOutlier(rawData, outliers)

Here is how the data looks after the filtering step. Use the information from the outliers to find the data entries that were filtered. For example, in page 101 (when showing 10 entries per page) you can see entries 1,001 to 1,010.

## Add the index
d2 <- data.table(cbind("row" = 1:nrow(dataFilt), dataFilt))
t2 <- dTable(d2, sPaginationType=  'full_numbers', iDisplayLength=10, sScrollX='100%')
t2$print("chart2")

Iterate

If you want to, you can iterate the procedure. However, note that the standard deviations of the filtered data will be smaller than in the original data set, thus potentially finding many more outliers.

outliers2 <- findOutlier(dataFilt)
outliers2
## $X
## integer(0)
## 
## $Y
##   [1]   2   6  20  26  27  28  37  45  46  51  53  54  55  56  59  62  68
##  [18]  75  89  97 108 129 131 142 147 150 162 168 175 187 201 202 203 207
##  [35] 222 223 237 245 248 250 255 261 272 277 287 306 307 313 332 346 351
##  [52] 353 354 372 375 381 388 409 412 424 443 446 459 477 489 496 499 505
##  [69] 514 526 545 552 562 566 567 568 591 597 600 601 603 604 626 631 646
##  [86] 658 681 683 689 720 729 732 739 755 762 771 772 780 783 792 804 817
## [103] 825 837 854 879 887 894 913 918 922 931 938 953 955 959 973 978 981
## [120] 984 988 990
## 
## $Z
##  [1]   4 297 328 334 368 392 436 614 632 682 809 888 914 941
dataFilt2 <- removeOutlier(dataFilt, outliers2)

Here is the result after two iterations.

## Add the index
d3 <- data.table(cbind("row" = 1:nrow(dataFilt2), dataFilt2))
t3 <- dTable(d3, sPaginationType=  'full_numbers', iDisplayLength=10, sScrollX='100%')
t3$print("chart3")

Reproducibility

Sys.time()
## [1] "2013-08-30 15:18:39 EDT"
proc.time()
##    user  system elapsed 
##   4.327   0.074   4.403
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.8.8     rCharts_0.3.51       knitrBootstrap_0.8.0
## [4] markdown_0.6.3       knitr_1.4.1         
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.3    evaluate_0.4.7  formatR_0.9     grid_3.0.1     
##  [5] lattice_0.20-15 plyr_1.8        RJSONIO_1.0-3   stringr_0.6.2  
##  [9] tools_3.0.1     whisker_0.3-2   yaml_2.1.7

This report written by L. Collado Torres and was generated using knitrBootstrap.

Showing multiple tables was implemented and fixed by Ramnath Vaidyanathan in this issue. Thank you!