4.6: Useful Things to Know about Variables
In Chapter 3 I talked a lot about variables, how they’re assigned and some of the things you can do with them, but there’s a lot of additional complexities. That’s not a surprise of course. However, some of those issues are worth drawing your attention to now. So that’s the goal of this section; to cover a few extra topics. As a consequence, this section is basically a bunch of things that I want to briefly mention, but don’t really fit in anywhere else. In short, I’ll talk about several different issues in this section, which are only loosely connected to one another.
Special values
The first thing I want to mention are some of the “special” values that you might see R produce. Most likely you’ll see them in situations where you were expecting a number, but there are quite a few other ways you can encounter them. These values are
Inf
,
NaN
,
NA
and
NULL
. These values can crop up in various different places, and so it’s important to understand what they mean.
-
Infinity
(
Inf
). The easiest of the special values to explain isInf
, since it corresponds to a value that is infinitely large. You can also have-Inf
. The easiest way to getInf
is to divide a positive number by 0:
1 / 0
## [1] Inf
In most real world data analysis situations, if you’re ending up with infinite numbers in your data, then something has gone awry. Hopefully you’ll never have to see them.
-
Not a Number
(
NaN
). The special value ofNaN
is short for “not a number”, and it’s basically a reserved keyword that means “there isn’t a mathematically defined number for this”. If you can remember your high school maths, remember that it is conventional to say that 0/0 doesn’t have a proper answer: mathematicians would say that 0/0 is undefined . R says that it’s not a number:
0 / 0
## [1] NaN
Nevertheless, it’s still treated as a “numeric” value. To oversimplify,
NaN
corresponds to cases where you asked a proper numerical question that genuinely has
no meaningful answer
.
-
Not available
(
NA
).NA
indicates that the value that is “supposed” to be stored here is missing. To understand what this means, it helps to recognise that theNA
value is something that you’re most likely to see when analysing data from real world experiments. Sometimes you get equipment failures, or you lose some of the data, or whatever. The point is that some of the information that you were “expecting” to get from your study is just plain missing. Note the difference betweenNA
andNaN
. ForNaN
, we really do know what’s supposed to be stored; it’s just that it happens to correspond to something like 0/0 that doesn’t make any sense at all. In contrast,NA
indicates that we actually don’t know what was supposed to be there. The information is missing . -
No value
(
NULL
). TheNULL
value takes this “absence” concept even further. It basically asserts that the variable genuinely has no value whatsoever. This is quite different to bothNaN
andNA
. ForNaN
we actually know what the value is, because it’s something insane like 0/0. ForNA
, we believe that there is supposed to be a value “out there”, but a dog ate our homework and so we don’t quite know what it is. But forNULL
we strongly believe that there is no value at all .
Assigning names to vector elements
One thing that is sometimes a little unsatisfying about the way that R prints out a vector is that the elements come out unlabelled. Here’s what I mean. Suppose I’ve got data reporting the quarterly profits for some company. If I just create a no-frills vector, I have to rely on memory to know which element corresponds to which event. That is:
profit <- c( 3.1, 0.1, -1.4, 1.1 )
profit
## [1] 3.1 0.1 -1.4 1.1
You can probably guess that the first element corresponds to the first quarter, the second element to the second quarter, and so on, but that’s only because I’ve told you the back story and because this happens to be a very simple example. In general, it can be quite difficult. This is where it can be helpful to assign
names
to each of the elements. Here’s how you do it:
names(profit) <- c("Q1","Q2","Q3","Q4")
profit
## Q1 Q2 Q3 Q4
## 3.1 0.1 -1.4 1.1
This is a slightly odd looking command, admittedly, but it’s not too difficult to follow. All we’re doing is assigning a vector of labels (character strings) to
names(profit)
. You can always delete the names again by using the command
names(profit) <- NULL
. It’s also worth noting that you don’t have to do this as a two stage process. You can get the same result with this command:
profit <- c( "Q1" = 3.1, "Q2" = 0.1, "Q3" = -1.4, "Q4" = 1.1 )
profit
## Q1 Q2 Q3 Q4
## 3.1 0.1 -1.4 1.1
The important things to notice are that (a) this does make things much easier to read, but (b) the names at the top aren’t the “real” data. The
value
of
profit[1]
is still
3.1
; all I’ve done is added a
name
to
profit[1]
as well. Nevertheless, names aren’t purely cosmetic, since R allows you to pull out particular elements of the vector by referring to their names:
profit["Q1"]
## Q1
## 3.1
And if I ever need to pull out the names themselves, then I just type
names(profit)
.
Variable classes
As we’ve seen, R allows you to store different kinds of data. In particular, the variables we’ve defined so far have either been character data (text), numeric data, or logical data. 56 It’s important that we remember what kind of information each variable stores (and even more important that R remembers) since different kinds of variables allow you to do different things to them. For instance, if your variables have numerical information in them, then it’s okay to multiply them together:
x <- 5 # x is numeric
y <- 4 # y is numeric
x * y
## [1] 20
But if they contain character data, multiplication makes no sense whatsoever, and R will complain if you try to do it:
x <- "apples" # x is character
y <- "oranges" # y is character
x * y
## Error in x * y: non-numeric argument to binary operator
Even R is smart enough to know you can’t multiply
"apples"
by
"oranges"
. It knows this because the quote marks are indicators that the variable is supposed to be treated as text, not as a number.
This is quite useful, but notice that it means that R makes a big distinction between
5
and
"5"
. Without quote marks, R treats
5
as the number five, and will allow you to do calculations with it. With the quote marks, R treats
"5"
as the textual character five, and doesn’t recognise it as a number any more than it recognises
"p"
or
"five"
as numbers. As a consequence, there’s a big difference between typing
x <- 5
and typing
x <- "5"
. In the former, we’re storing the number
5
; in the latter, we’re storing the character
"5"
. Thus, if we try to do multiplication with the character versions, R gets stroppy:
x <- "5" # x is character
y <- "4" # y is character
x * y
## Error in x * y: non-numeric argument to binary operator
Okay, let’s suppose that I’ve forgotten what kind of data I stored in the variable
x
(which happens depressingly often). R provides a function that will let us find out. Or, more precisely, it provides
three
functions:
class()
,
mode()
and
typeof()
. Why the heck does it provide three functions, you might be wondering? Basically, because R actually keeps track of three different kinds of information about a variable:
-
The
class
of a variable is a “high level” classification, and it captures psychologically (or statistically) meaningful distinctions. For instance
"2011-09-12"
and"my birthday"
are both text strings, but there’s an important difference between the two: one of them is a date. So it would be nice if we could get R to recognise that"2011-09-12"
is a date, and allow us to do things like add or subtract from it. The class of a variable is what R uses to keep track of things like that. Because the class of a variable is critical for determining what R can or can’t do with it, theclass()
function is very handy. -
The
mode
of a variable refers to the format of the information that the variable stores. It tells you whether R has stored text data or numeric data, for instance, which is kind of useful, but it only makes these “simple” distinctions. It can be useful to know about, but it’s not the main thing we care about. So I’m not going to use the
mode()
function very much. 57 -
The
type
of a variable is a very low level classification. We won’t use it in this book, but (for those of you that care about these details) this is where you can see the distinction between integer data, double precision numeric, etc. Almost none of you actually will care about this, so I’m not even going to bother demonstrating the
typeof()
function.
For purposes, it’s the
class()
of the variable that we care most about. Later on, I’ll talk a bit about how you can convince R to “coerce” a variable to change from one class to another (Section 7.10). That’s a useful skill for real world data analysis, but it’s not something that we need right now. In the meantime, the following examples illustrate the use of the
class()
function:
x <- "hello world" # x is text
class(x)
## [1] "character"
x <- TRUE # x is logical
class(x)
## [1] "logical"
x <- 100 # x is a number
class(x)
## [1] "numeric"
Exciting, no?