Statistics LibreTexts

B.3: Common pitfalls in R scripting

    Patient: Doc, it hurts when I do this.

    Doctor: Don’t do that.

    For readers who want to dig deeper, this section explains why R scripts sometimes do not work, and how to solve these problems.

    Advice

    Use the Source, Luke!

    The most effective way to know what is going on is to look at the source of the R function of interest.

    The simplest way to access the source is to type the function name without parentheses. If the function is buried deeper, try methods() and getAnywhere().

    In some cases, functions are actually not R code but C or even Fortran. Download the R source, open it, and find out. This last method (downloading the source) works well for simpler cases too.
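
    As a minimal sketch of the lookup steps described above (sd(), mean() and sum() are used only as convenient examples from base R):

    ```r
    # Typing a function name without parentheses prints its source.
    sd

    # For a generic such as mean(), this shows only the dispatcher,
    # so list the methods and then fetch the one you need.
    methods(mean)
    getAnywhere("mean.default")

    # Functions implemented in C show a call such as .Primitive() or .Internal()
    # instead of R code; for these you need the downloaded R source.
    sum
    ```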

    Keep it simple

    Try not to use any external packages, any complicated plots, any custom functions and even some basic functions (like subset()) without absolute need. This increases reproducibility and makes your life easier.

    Analogously, it is better to avoid running R through any external system. Even the macOS R shell can cause problems (remember the history issues?). RStudio is a great piece of software, but it is prone to the same problem.

    Learn to love errors and warnings

    They help! If the code issues an error or a warning, it is a symptom of something wrong. It is much worse when the code does not issue anything but produces unreliable results.

    However, warnings are sometimes really annoying, especially if you know what is going on and why you get them. On macOS it is even worse because they are colored in red. So use the suppressWarnings() function, but again, only when you know what you are doing. Think of it as a headache pill: useful but potentially dangerous.
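
    A small sketch of a case where suppression is reasonable, assuming you already know that some strings cannot be converted (the vector x is invented for illustration):

    ```r
    x <- c("1", "two", "3")

    # as.numeric() warns that "two" was turned into NA; here that is expected.
    nums <- suppressWarnings(as.numeric(x))
    nums

    # Keep the suppression narrow: wrap one call, not a whole script.
    ```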

    Subselect by names, not numbers

    Selecting columns by numbers (like trees[, 2:3]) is convenient but dangerous if the object has changed from the original one. It is always better to take the longer approach and select by names, like

    Code \(\PageIndex{1}\) (Python):

    trees[, c("Height", "Volume")]
    

    When you select by name, be aware of two things. First, selecting a single nonexistent name with [[ or $ silently returns NULL, and assigning to it creates a new column. Selecting the same name with [ raises an error instead:

    Code \(\PageIndex{2}\) (Python):

    trees[, "aaa"]
    

    Code \(\PageIndex{3}\) (Python):

    trees[["aaa"]]
    trees$aaa
    

    (See also “A Case of Identity” below.)

    Second, negative selection works only with numbers:

    Code \(\PageIndex{4}\) (Python):

    trees[, -c("Height", "Volume")]
    

    Code \(\PageIndex{5}\) (Python):

    trees[, -which(names(trees) %in% c("Height", "Volume"))]
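
    One more caveat about the -which() idiom, shown here as a sketch: if no name matches (the misspelled "Heigth" below is a deliberate, hypothetical typo), which() returns an empty vector and the negative selection returns no columns at all. A logical index does not have this problem.

    ```r
    # Hypothetical typo: no column of trees is called "Heigth".
    drop_cols <- c("Heigth")

    # which() finds nothing, and a negated empty index selects nothing.
    bad <- trees[, -which(names(trees) %in% drop_cols), drop = FALSE]
    ncol(bad)    # 0

    # A logical index keeps every column that is not listed.
    good <- trees[, !(names(trees) %in% drop_cols), drop = FALSE]
    ncol(good)   # 3
    ```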
    

    About reserved words, again

    Try to avoid naming your objects with reserved words (?Reserved). Be especially careful with T, F, and return: they are not protected, so R lets you overwrite them. If you assign anything else to them, the consequences could be unpredictable. This is, by the way, another good reason to write TRUE instead of T and FALSE instead of F (you cannot assign anything to TRUE and FALSE).

    It is also a really bad idea to assign anything to .Last.value. However, using the default .Last.value (it is not a function, see ?.Last.value) could be a fruitful idea.

    If you modified internal data and want to restore it, use something like data(trees).
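
    The danger of overwriting T can be sketched as follows (the variable names overwritten, restored and res are invented for this illustration):

    ```r
    T <- 0                         # legal: T is an ordinary variable
    overwritten <- !isTRUE(T)      # T is no longer TRUE
    rm(T)                          # remove our copy; the base definition is visible again
    restored <- isTRUE(T)

    # TRUE itself is reserved, so this assignment fails with an error.
    res <- try(TRUE <- 0, silent = TRUE)
    inherits(res, "try-error")
    ```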

    The Case-book of Advanced R user

    The Adventure of the Factor String

    Before version 4.0.0, R converted text strings into factors by default. This is useful for building contrasts but causes problems in many other applications.

    To avoid this behavior in read.table(), use the as.is=TRUE option, and in data frame operations, use stringsAsFactors=FALSE (or the global option of the same name). Also, always check the structure of your objects with str().
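
    Because the default depends on the R version, a cautious script can state its choice explicitly. A minimal sketch (df_chr and df_fac are illustrative names):

    ```r
    df_chr <- data.frame(name = c("a", "b"), stringsAsFactors = FALSE)
    df_fac <- data.frame(name = c("a", "b"), stringsAsFactors = TRUE)

    str(df_chr)   # name is a character column
    str(df_fac)   # name is a factor with 2 levels
    ```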

    A Case of Were-objects

    When an R object undergoes automatic changes, sooner or later you will see that it changes its type, mode or structure and therefore escapes your control. Typically, this happens when you make an object smaller:

    Code \(\PageIndex{6}\) (Python):

    mode(trees)
    trees2 <- trees[, 2:3]
    mode(trees2)
    trees1 <- trees2[, 2]
    mode(trees1)
    

    Data frames and matrices normally drop dimensions after reduction. To prevent this, use the drop=FALSE argument, as in [, , drop=FALSE]. There is even a function mat.or.vec(); please check how it works.

    Factors, on the other hand, do not drop unused levels after reduction. To drop them, use [, drop=TRUE].
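
    Both uses of drop can be sketched on small objects (the factor f and the names one_col, kept and dropped are made up for the example):

    ```r
    # drop = FALSE keeps a one-column selection as a data frame.
    one_col <- trees[, 2, drop = FALSE]
    class(one_col)

    # A factor keeps all its levels after subsetting...
    f <- factor(c("a", "b", "c"))
    kept <- levels(f[1:2])
    kept

    # ...unless drop = TRUE (or droplevels()) is used.
    dropped <- levels(f[1:2, drop = TRUE])
    dropped
    ```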

    Empty zombie objects appear when you apply a malformed selection condition:

    Code \(\PageIndex{7}\) (Python):

    trees.new <- trees[trees[, 1] < 0, ]
    str(trees.new)
    

    To avoid such situations (there are more pitfalls of this kind), try to use str() (or Str() from asmisc.r) every time you create a new object.
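
    In a script, the same check can be automated. A minimal sketch, assuming that an empty result is a mistake in your analysis:

    ```r
    trees.new <- trees[trees[, 1] < 0, ]   # no tree has a negative girth

    # The object exists and has all its columns, but no rows.
    if (nrow(trees.new) == 0) {
      message("Selection returned no rows, check the condition")
    }
    ```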

    A Case of Missing Compare

    If missing data are present, comparisons should be thought through carefully:

    Code \(\PageIndex{8}\) (Python):

    aa <- c(1, NA, 3)
    aa[aa != 1] # bad idea
    aa[aa != 1 & !is.na(aa)] # good idea
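
    The reason is that a comparison with NA yields NA, and indexing with NA yields NA again. An alternative sketch wraps the condition in which(), which drops the NA positions (kept is an illustrative name):

    ```r
    aa <- c(1, NA, 3)
    aa != 1                      # FALSE NA TRUE

    # which() returns only the positions where the condition is TRUE.
    kept <- aa[which(aa != 1)]
    kept                         # 3
    ```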
    

    A Case of Outlaw Parameters

    Consider the following:

    Code \(\PageIndex{9}\) (Python):

    mean(trees[, 1])
    mean(trees[, 1], .2)
    mean(trees[, 1], t=.2)
    mean(trees[, 1], tr=.2)
    mean(trees[, 1], tri=.2)
    mean(trees[, 1], trim=.2)
    mean(trees[, 1], trimm=.2) # why?!
    mean(trees[, 1], anyweirdoption=1) # what?!
    

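    The problem is that R frequently ignores illegal parameters. Abbreviations such as t= or tr= work because R matches argument names partially, but the misspelled trimm= and the invented anyweirdoption= are silently swallowed by the ... argument of mean(). In some cases, this makes debugging difficult.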

    However, not all functions are equal:

    Code \(\PageIndex{10}\) (Python):

    IQR(trees[, 1])
    IQR(trees[, 1], t=8)
    IQR(trees[, 1], type=8)
    IQR(trees[, 1], types=8)
    
    

    Code \(\PageIndex{11}\) (Python):

    IQR(trees[, 1], anyweirdoption=1)
    

    And some functions are even more weird:

    Code \(\PageIndex{12}\) (Python):

    bb <- boxplot(1:20, plot=FALSE)
    bxp(bb, horiz=T) # plots OK
    boxplot(1:20, horiz=T) # does not plot horizontally!
    boxplot(1:20, horizontal=T) # this is what you need
    

    The general reason for all these different behaviors is that the functions above are internally different. The first case is especially harmful because R does not react to your misprints. Be careful.
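
    If you write your own functions, you can make them stricter. The following is only a sketch: strict_mean() is a hypothetical wrapper, not part of R, which refuses anything that ends up in its ... argument.

    ```r
    # Hypothetical wrapper: stops on arguments that mean() would silently ignore.
    strict_mean <- function(x, trim = 0, ...) {
      extra <- list(...)
      if (length(extra) > 0) {
        stop("unknown argument(s): ", paste(names(extra), collapse = ", "))
      }
      mean(x, trim = trim)
    }

    # Partial matching still works: tr= is matched to trim=.
    ok <- strict_mean(trees[, 1], tr = 0.2)

    # The misprint is now reported instead of being ignored.
    res <- try(strict_mean(trees[, 1], trimm = 0.2), silent = TRUE)
    inherits(res, "try-error")
    ```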

    A Case of Identity

    Similar in its consequences is the case when something is selected from a list but the name is mistyped:

    Code \(\PageIndex{13}\) (Python):

    prop.test(3, 23)
    pval <- prop.test(3, 23)$pvalue
    pval
    pval <- prop.test(3, 23)$p.value # correct identity!
    pval
    

    This is not a bug but a feature of lists and data frames. For example, it allows you to grow them seamlessly. However, mistyped names do not raise any errors, which can be a problem when you debug.
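
    One way to catch such misprints is to check the name before extracting. In this sketch, get_strict() is a hypothetical helper written for the illustration:

    ```r
    res <- prop.test(3, 23)
    is.null(res$pvalue)          # the mistyped name silently gives NULL
    "p.value" %in% names(res)    # the correct name is present

    # Hypothetical helper: extract a list element or stop if it is missing.
    get_strict <- function(lst, name) {
      if (!(name %in% names(lst))) {
        stop("no element named '", name, "'")
      }
      lst[[name]]
    }

    pval <- get_strict(res, "p.value")
    bad <- try(get_strict(res, "pvalue"), silent = TRUE)
    ```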

    The Adventure of the Floating Point

    This is well known to all computer scientists but could be new to inexperienced users:

    Code \(\PageIndex{14}\) (Python):

    aa <- sqrt(2)
    aa * aa == 2
    aa * aa - 2
    

    What is going on? Elementary, my dear reader. Computers store numbers in binary with a finite number of digits, so most real numbers, including the square root of 2, are represented only approximately.

    Instead of exact comparison, use “near exact” all.equal() which is aware of this situation:

    Code \(\PageIndex{15}\) (Python):

    all.equal(aa * aa, 2)
    all.equal(aa * aa - 2, 0)
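
    Note that all.equal() returns TRUE or a character description of the difference, never FALSE. Inside a condition, it is therefore safer to wrap it in isTRUE(), as in this sketch (same and differ are illustrative names):

    ```r
    aa <- sqrt(2)

    same <- isTRUE(all.equal(aa * aa, 2))
    same                          # TRUE

    # For unequal values all.equal() returns a message, so isTRUE() gives FALSE.
    differ <- isTRUE(all.equal(1, 1.1))
    differ                        # FALSE

    if (same) message("aa * aa equals 2 within tolerance")
    ```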
    

    A Case of Twin Files

    Do this small exercise, preferably on two computers, one under Windows and another under Linux:

    Code \(\PageIndex{16}\) (Python):

    pdf("Ex.pdf")
    plot(1)
    dev.off()
    pdf("ex.pdf")
    plot(1:3)
    dev.off()
    

    On Linux, there are two files, each with the proper number of points, but on Windows, there is only one file, named Ex.pdf but containing three points! It is even worse on macOS, because a typical installation behaves like Windows, but there are other variants too.

    Do not use uppercase in file names. And do not use any other symbols (including spaces) except lowercase ASCII letters, underscore, 0–9 numbers, and dot for extension. This will help to make your work portable.
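
    The naming rule above can be checked automatically. This is only a sketch: is_portable_name() is a hypothetical helper, and the file names are invented.

    ```r
    # Hypothetical helper: TRUE for names made of lowercase ASCII letters,
    # digits and underscores, with an optional extension after one dot.
    is_portable_name <- function(x) {
      grepl("^[a-z0-9_]+(\\.[a-z0-9]+)?$", x)
    }

    files <- c("ex.pdf", "Ex.pdf", "my plot.pdf", "plot_01.pdf")
    portable <- is_portable_name(files)
    portable

    # Names that differ only by case will clash on case-insensitive systems.
    clash <- any(duplicated(tolower(files)))
    clash
    ```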

    A Case of Bad Grammar

    The style of your scripts could be a matter of taste, but not always. Consider the following:

    Code \(\PageIndex{17}\) (Python):

    aa<-3
    

    This could be interpreted as either

    Code \(\PageIndex{18}\) (Python):

    aa <- 3
    

    or

    Code \(\PageIndex{19}\) (Python):

    aa < -3
    

    Always keep spaces around assignments. Spaces after commas are not so important but they will help to read your script.
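
    The ambiguity exists only for the human reader. R itself always reads the version without spaces as an assignment, which a sketch with quote() can confirm (op_nospace and op_spaced are illustrative names):

    ```r
    # quote() returns the parsed expression without evaluating it;
    # its first element is the operator that R recognized.
    op_nospace <- as.character(quote(aa<-3)[[1]])
    op_nospace                    # "<-"

    op_spaced <- as.character(quote(aa < -3)[[1]])
    op_spaced                     # "<"
    ```

    So a comparison with a negative number that is typed without spaces silently overwrites the object.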

    A Case of Double Dipping

    Double comparisons do not work! Combine two comparisons with a logical operator instead.

    Code \(\PageIndex{20}\) (Python):

    aa <- 3
    0 < aa < 10
    

    Code \(\PageIndex{21}\) (Python):

    aa <- 3
    aa > 0 & aa < 10
    

    A Case of Factor Join

    In older versions of R (before 4.1.0), there is no c() for factors: the result is not a factor but a vector of numerical codes. This is connected with the nature of factors.

    However, if you really want to concatenate factors and return result as a factor, ?c help page recommends:

    Code \(\PageIndex{22}\) (Python):

    c(factor(LETTERS[1:3]), factor(letters[1:3]))
    c.factor <- function(..., recursive=TRUE) unlist(list(...), recursive=recursive)
    c(factor(LETTERS[1:3]), factor(letters[1:3]))
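
    An alternative sketch that does not depend on the R version and does not redefine c() is to go through character vectors (f1, f2 and joined are illustrative names):

    ```r
    f1 <- factor(LETTERS[1:3])
    f2 <- factor(letters[1:3])

    # Convert to character, concatenate, and build a new factor.
    joined <- factor(c(as.character(f1), as.character(f2)))
    joined
    ```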
    

    A Case of Bad Font

    Here is a particularly nasty error:

    Code \(\PageIndex{23}\) (Python):

    ll <- seq(0, 1, 1ength=10)
    

    Unfortunately, this is a well-known problem. It is always better to use a good monospaced font with visually distinct characters. Also avoid the lowercase “l” in names, just in case. Use “j” instead; it is much easier to spot.

    By the way, the error message reveals the problem because it stops printing exactly where something is wrong.