Skip to main content
Statistics LibreTexts

4.5: How to create your own functions

  • Page ID
    3565
  • Shapiro-Wilk test is probably the fastest way to check normality but its output is not immediately understandable. It is also not easy to apply for whole data frames. Let us create the function which overcomes these problems:

    Code \(\PageIndex{1}\) (Python):

    Normality <- function(a) {ifelse(shapiro.test(a)$p.value < 0.05, "NOT NORMAL", "NORMAL")}
    

    (We used here the fact that in R, test output is usually a list and each component is possible to extract using $-name approach described in previous chapter. How to know what to extract? Save test output into object and run str(obj)!)

    Collection asmisc.r contains slightly more advanced version of the Normality() which takes into account that Shapiro-Wilks test is not so reliable for small size (\(< 25\)) samples.

    To make this Normality() function work, you need to copy the above text into R console, or into the separate file (preferably with *.r extension), and then load it with source() command. Next step is to call the function:

    Code \(\PageIndex{2}\) (Python):

    salary <- c(21, 19, 27, 11, 102, 25, 21)
    Normality <- function(a) {ifelse(shapiro.test(a)$p.value < 0.05, "NOT NORMAL", "NORMAL")}
    Normality(salary) # asmisc.r
    sapply(trees, Normality)
    sapply(log(trees+0.01), Normality)
    

    (Note that logarithmic conversion could change the normality. Check yourself if square root does the same.)

    This function not only runs Shapiro-Wilks test several times but also outputs an easily readable result. Most important is the third row which uses p-value extracted from the test results. Extraction procedure is based on the knowledge of the internal structure of shapiro.test() output.

    How to know the structure of shapiro.test() output object without going into help?

    In many cases, “stationary”, named function is not necessary as user need some piece of code which runs only once (but runs in relatively complicated way). Here helps the anonymous function. It is especially useful within functions of apply() family. This is how to calculate mode simultaneously in multiple columns:

    Code \(\PageIndex{3}\) (Python):

    x <- c(174, 162, 188, 192, 165, 168, 172.5)
    sapply(chickwts,function(.x) names(table(.x))[which.max(table(.x))])
    

    (Here we followed the agreement that in the anonymous functions, argument names must start with a dot.)

    Even more useful—simultaneous confidence intervals:

    Code \(\PageIndex{4}\) (Python):

    old.options <- options(warn=-1)
    x <- c(174, 162, 188, 192, 165, 168, 172.5)
    sapply(trees, function(.x) wilcox.test(.x, conf.int=TRUE)$conf.int)
    options(old.options)
    

    (Here we suppressed multiple “ties” warnings. Do not do it yourself without a strong reason!)

    File betula.txt in the open data repository contains measurements of several birch morphological characters. Are there any characters which could be analyzed with parametric methods?

    Please make the user function CV() which calculates coefficient of variation (CV) and apply it to betula data. Which character is most variable? (File betula_c.txt contains explanation of variables.)

    In the first chapter, we used dact.txt data to illustrate situation when it is really hard to say anything about data without statistical analysis. Now, time came to make this analysis. Provide as many relevant description characteristics as possible, calculate the appropriate confidence interval and plot this data.

    In the open repository, file nymphaeaceae.txt contains counts of flower parts taken from two members of water lily family (Nymphaeaceae), Nuphar lutea (yellow water lily) and Nymphaea candida (white water lily). Using plots and, more importantly, confidence intervals, find which characters (except SEPALS) distinguish these species most. Provide the answer in the form “if the character is ..., then species is ..”..