Skip to main content
Statistics LibreTexts

1.2: Getting started in R

  • Page ID
    33205
  • \( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

    You will need to download the statistical software package called R and an enhanced interface to R called RStudio (RStudio Team 2022). They are open source and free to download and use (and will always be that way). This means that the skills you learn now can follow you the rest of your life. R is becoming the primary language of statistics and is being adopted across academia, government, and businesses to help manage and learn from the growing volume of data being obtained. Hopefully you will get a sense of some of the power of R in this book.

    The next pages will walk you through the process of getting the software downloaded and provide you with an initial experience using RStudio to do things that should look familiar even though the interface will be a new experience. Do not expect to master R quickly – it takes years (sorry!) even if you know the statistical methods being used. We will try to keep all your interactions with R code in a similar code format and that should help you in learning how to use R as we move through various methods. We will also often provide you with example code. Everyone that learns R starts with copying other people’s code and then making changes for specific applications – so expect to go back to examples from the text and focus on learning how to modify that code to work for your particular data set. Only really experienced R users “know” functions without having to check other resources. After we complete this basic introduction, Chapter 2 begins doing more sophisticated things with R, allowing us to compare quantitative responses from two groups, make some graphical displays, do hypothesis testing and create confidence intervals in a couple of different ways.

    You will have two3 downloading activities to complete before you can do anything more than read this book4. First, you need to download R. It is the engine that will do all the computing for us, but you will only interact with it once. Go to http://cran.rstudio.com and click on the “Download R for…” button that corresponds to your operating system. On the next page, click on “base” and then it will take you to a screen to download the most current version of R that is compiled for your operating system, something like “Download R 4.2.1 for Windows”. Click on that link and then open the file you downloaded. You will need to select your preferred language (choose English so your instructor can help you), then hit “Next” until it starts to unpack and install the program (all the base settings will be fine). After you hit “Finish” you will not do anything further with R directly.

    Second, you need to download RStudio. It is an enhanced interface that will make interacting with R less frustrating and allow you to directly create reports that include the code and output. To download RStudio, go near the bottom of https://www.rstudio.com/products/rstudio/download/ and select the correct version under “Installers for Supported Platforms” for your operating system. Download and then install RStudio using the installer. From this point forward, you should only open RStudio; it provides your interface with R. Note that both R and RStudio are updated frequently (up to four times a year) and if you downloaded either more than a few months previously, you should download the up-to-date versions, especially if something you are trying to do is not working. Sometimes code will not work in older versions of R and sometimes old code won’t work in new versions of R.5

    Initial RStudio layout.
    Figure 1.2: Initial RStudio layout.

    To get started, we can complete some basic tasks in R using the RStudio interface. When you open RStudio, you will see a screen like Figure 1.2. The added annotation in this and the following screen-grabs is there to help you get initially oriented to the software interface. R is command-line software – meaning that in some way or another you have to create code and get it evaluated, either by entering and execute it at a command prompt or by using the RStudio interface to run the code that is stored in a file. RStudio makes the management and execution of that code more efficient than the basic version of R. In RStudio, the lower left panel is called the “console” window and is where you can type R code directly into R or where you will see the code you run and (most importantly!) where the results of your executed commands will show up. The most basic interaction with R is available once you get the cursor active at the command prompt “>” by clicking in that panel (look for a blinking vertical line). The upper left panel is for writing, saving, and running your R code either in .R script files or .Rmd (markdown) files, discussed below. Once you have code available in this window, the “Run” button will execute the code for the line that your cursor is on or for any text that you have highlighted with your mouse. The “data management” or environment panel is in the upper right, providing information on what data sets have been loaded. It also contains the “Import Dataset” button that provides the easiest way for you to read a data set into R so you can analyze it. The lower right panel contains information on the “Packages” (additional code we will download and install to add functionality to R) that are available and is where you will see plots that you make and requests for “Help” on specific functions.

    As a first interaction with R we can use it as a calculator. To do this, click near the command prompt (>) in the lower left “console” panel, type 3+4, and then hit enter. It should look like this:

    > 3 + 4
    [1] 7

    You can do more interesting calculations, like finding the mean of the numbers -3, 5, 7, and 8 by adding them up and dividing by 4:

    > (-3 + 5 + 7 + 8)/4
    [1] 4.25

    Note that the parentheses help R to figure out your desired order of operations. If you drop that grouping, you get a very different (and wrong!) result:

    > -3 + 5 + 7 + 8/4
    [1] 11

    We could estimate the standard deviation similarly using the formula you might remember from introductory statistics, but that will only work in very limited situations. To use the real power of R this semester, we need to work with data sets that store the observations for our subjects in variables. Basically, we need to store observations in named vectors (one dimensional arrays) that contain a list of the observations. To create a vector containing the four numbers and assign it to a variable named variable1, we need to create a vector using the concatenate function c which means “combine the items” that follow, if they are inside parentheses and have commas separating the values, as follows:

    > c(-3, 5, 7, 8)
    [1] -3 5 7 8

    To get this vector stored in a variable called variable1 we need to use the assignment operator, <- (read as “is defined to contain”) that assigns the information on the right into the variable that you are creating on the left.

    > variable1 <- c(-3, 5, 7, 8)

    In R, the assignment operator, <-, is created by typing a “less than” symbol < followed by a “minus” sign (-) without a space between them. If you ever want to see what numbers are residing in an object in R, just type its name and hit enter. You can see how that variable contains the same information that was initially generated by c(-3, 5, 7, 8) but is easier to access since we just need the text for the variable name representing that vector.

    > variable1
    [1] -3 5 7 8

    With the data stored in a variable, we can use functions such as mean and sd to find the mean and standard deviation of the observations contained in variable1:

    > mean(variable1)
    [1] 4.25
    > sd(variable1)
    [1] 4.99166

    When dealing with real data, we will often have information about more than one variable. We could enter all observations by hand for each variable but this is prone to error and onerous for all but the smallest data sets. If you are to ever utilize the power of statistics in the evolving data-centered world, data management has to be accomplished in a more sophisticated way. While you can manage data sets quite effectively in R, it is often easiest to start with your data set in something like Microsoft Excel or OpenOffice’s Calc. You want to make sure that observations are in the rows and the names of variables are in first row of the columns and that there is no “extra stuff” in the spreadsheet. If you have missing observations, they should be represented with blank cells. The file should be saved as a “.csv” file (stands for comma-separated values although Excel calls it “CSV (Comma Delimited)”), which basically strips off some of the junk that Excel adds to the necessary information in the file. Excel will tell you that this is a bad idea, but it actually creates a more stable archival format and one that R can use directly.6

    The following code to read in the data set relies on an R package called readr (Wickham, Hester, and Bryan 2022). Packages in R provide additional functions and data sets that are not available in the initial download of R or RStudio. To get access to the packages, first “install” (basically download) and then “load” the package. To install an R package, go to the Packages tab in the lower right panel of RStudio. Click on the Install button and then type in the name of the package in the box (here type in readr). RStudio will try to auto-complete the package name you are typing which should help you make sure you got it typed correctly. If you are working in a .Rmd file, a highlighted message may show up on the top of the file to suggest packages to install that are not present – look for this to help make sure you have the needed packages installed. This will be the first of many times that we will mention that R is case sensitive – in other words, Readr is different from readr in R syntax and this sort of thing applies to everything you do in R. You should only need to install each R package once on a given computer. If you ever see a message that R can’t find a package, make sure it appears in the list in the Packages tab. If it doesn’t, repeat the previous steps to install it.

    Important: R is case sensitive! Readr is not the same as readr!

    After installing the package, we need to load it to make it active in a given work session. Go to the command prompt and type (or copy and paste) library(readr) or require(readr):

    > library(readr)

    With a data set converted to a CSV file and readr installed and loaded, we need to read the data set into the active workspace. There are two ways to do this, either using the point-and-click GUI in RStudio (click the “Import Dataset” button in the upper right “Environment” panel as indicated in Figure 1.2) or modifying the read_csv function to find the file of interest. To practice this, you can download an Excel (.xls) file from http://www.math.montana.edu/courses/s217/documents/treadmill.xls that contains observations on 31 males that volunteered for a study on methods for measuring fitness (Westfall and Young 1993). In the spreadsheet, you will find a data set that starts and ends with the following information (only results for Subjects 1, 2, 30, and 31 shown here):

    Sub- ject Tread- MillOx TreadMill- MaxPulse RunTime RunPulse Rest Pulse BodyWeight Age
    1 60.05 186 8.63 170 48 81.87 38
    2 59.57 172 8.17 166 40 68.15 42
    30 39.2 172 12.88 168 44 91.63 54
    31 37.39 192 14.03 186 56 87.66 45

    The variables contain information on the subject number (Subject), subjects’ maximum treadmill oxygen consumption (TreadMillOx, in ml per kg per minute, also called maximum VO2) and maximum pulse rate (TreadMillMaxPulse, in beats per minute), time to run 1.5 miles (Run Time, in minutes), maximum pulse during 1.5 mile run (RunPulse, in beats per minute), resting pulse rate (RestPulse, beats per minute), Body Weight (BodyWeight, in kg), and Age (in years). Open the file in Excel or equivalent software and then save it as a .csv file in a location you can find on your computer. Then go to RStudio and click on File, then Import Dataset, then From Text (readr)…7 Click “Import” and find your file. R will store the data set as an object with the same name as the .csv file. You could use another name as well, but it is often easiest just to keep the data set name in R related to the original file name. You should see some text appear in the console (lower left panel) like in Figure 1.3. The text that is created will look something like the following – if you had stored the file in a drive labeled D:, it would be:

    treadmill <- read_csv("D:/treadmill.csv")

    What is put inside the " " will depend on the location and name of your saved .csv file. A version of the data set in what looks like a spreadsheet will appear in the upper left window due to the second line of code (View(treadmill)).

    RStudio with initial data set loaded.
    Figure 1.3: RStudio with initial data set loaded.

    Just directly typing (or using) a line of code like this is actually the other way that we can read in files. If you choose to use the text-only interface, then you need to tell R where to look in your computer to find the data file. read_csv is a function that takes a path as an argument. To use it, specify the path to your data file, put quotes around it, and put it as the input to read_csv(...). For some examples later in the book, you will be able to copy a command like this from the text and read data sets and other code directly from the website, assuming you are connected to the internet.

    To verify that you read the data set in correctly, it is always good to check its contents. We can view the first and last rows in the data set using the head and tail functions on the data set, which show the following results for the treadmill data. Note that you will sometimes need to resize the console window in RStudio to get all the columns to display in a single row which can be performed by dragging the gray bars that separate the panels.

    > head(treadmill)
    # A tibble: 6 x 8
      Subject TreadMillOx TreadMillMaxPulse RunTime RunPulse RestPulse BodyWeight   Age
        <int>       <dbl>             <int>   <dbl>    <int>     <int>      <dbl> <int>
    1       1       60.05               186    8.63      170        48      81.87    38
    2       2       59.57               172    8.17      166        40      68.15    42
    3       3       54.62               155    8.92      146        48      70.87    50
    4       4       54.30               168    8.65      156        45      85.84    44
    5       5       51.85               170   10.33      166        50      83.12    54
    6       6       50.55               155    9.93      148        49      59.08    57
    
    > tail(treadmill)
    # A tibble: 6 x 8
      Subject TreadMillOx TreadMillMaxPulse RunTime RunPulse RestPulse BodyWeight   Age
        <int>       <dbl>             <int>   <dbl>    <int>     <int>      <dbl> <int>
    1      26       44.61               182   11.37      178        62      89.47    44
    2      27       40.84               172   10.95      168        57      69.63    51
    3      28       39.44               176   13.08      174        63      81.42    44
    4      29       39.41               176   12.63      174        58      73.37    57
    5      30       39.20               172   12.88      168        44      91.63    54
    6      31       37.39               192   14.03      186        56      87.66    45

    When you load an installed package with library, you may see a warning message about versions of the package and versions of R – this is usually something you can ignore. Other warning messages could be more ominous for proceeding but before getting too concerned, there are couple of basic things to check. First, double check that the package is installed (see previous steps). Second, check for typographical errors in your code – especially for mis-spellings or unintended capitalization. If you are still having issues, try repeating the installation process. Then click on the “Update” button to check for potentially newer versions of packages. If all that fails, try the cloud version of RStudio discussed before and repeat the steps there.

    To help you go from basic to intermediate R usage and especially to help with more complicated problems, you will want to learn how to manage and save your R code. The best way to do this is using the upper left panel in RStudio. If you just want to manage code, then you can use what are called R Scripts, which are files that have a file extension of “.R”. To start a new “.R” file to store your code, click on File, then New File, then R Script. This will create a blank page to enter and edit code – then save the file as something like “MyFileName.R” in your preferred location. Saving your code will mean that you can return to where you were working last by simply re-running the saved script file. With code in the script window, you can place the cursor on a line of code or highlight a chunk of code and hit the “Run” button8 on the upper part of the panel. It will appear in the console with results just like what you would obtain if you typed it after the command prompt and hit enter for each line. Figure 1.4 shows the screen with the code used in this section in the upper left panel, saved in a file called “Ch1.R”, with the results of highlighting and executing the first section of code using the “Run” button.

    RStudio with highlighted code run.
    Figure 1.4: RStudio with highlighted code run.

    This page titled 1.2: Getting started in R is shared under a CC BY-NC 4.0 license and was authored, remixed, and/or curated by via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.