7.13: Summary
Obviously, there’s no real coherence to this chapter. It’s just a grab bag of topics and tricks that can be handy to know about, so the best wrap up I can give here is just to repeat this list:
- Section 7.1. Tabulating data.
- Section 7.2. Transforming or recoding a variable.
- Section 7.3. Some useful mathematical functions.
- Section 7.4. Extracting a subset of a vector.
- Section 7.5. Extracting a subset of a data frame.
- Section 7.6. Sorting, flipping or merging data sets.
- Section 7.7. Reshaping a data frame.
- Section 7.8. Manipulating text.
- Section 7.9. Opening data from different file types.
- Section 7.10. Coercing data from one type to another.
- Section 7.11. Other important data types.
- Section 7.12. Miscellaneous topics.
There are a number of books out there that extend this discussion. A couple of my favourites are Spector (2008) “Data Manipulation with R” and Teetor (2011) “R Cookbook”.
References
Spector, P. 2008. Data Manipulation with R . New York, NY: Springer.
Teetor, P. 2011. R Cookbook . Sebastopol, CA: O’Reilly.
- The quote comes from Home is the Hangman , published in 1975.
-
As usual, you can assign this output to a variable. If you type
speaker.freq <- table(speaker)
at the command prompt R will store the table as a variable. If you then typeclass(speaker.freq)
you’ll see that the output is actually of classtable
. The key thing to note about a table object is that it’s basically a matrix (see Section 7.11.1. -
It’s worth noting that there’s also a more powerful function called
recode()
function in thecar
package that I won’t discuss in this book but is worth looking into if you’re looking for a bit more flexibility. -
If you’ve read further into the book, and are re-reading this section, then a good example of this would be someone choosing to do an ANOVA using
age.group3
as the grouping variable, instead of running a regression usingage
as a predictor. There are sometimes good reasons for do this: for instance, if the relationship betweenage
and your outcome variable is highly non-linear, and you aren’t comfortable with trying to run non-linear regression! However, unless you really do have a good rationale for doing this, it’s best not to. It tends to introduce all sorts of other problems (e.g., the data will probably violate the normality assumption), and you can lose a lot of power. - The real answer is 0: $10 for a sandwich is a total ripoff so I should go next door and buy noodles.
- Again, I doubt that’s the right “real world” answer. I suspect that most sandwich shops won’t allow you to pay off your debts to them in sandwiches. But you get the idea.
-
Actually, that’s a bit of a lie: the
log()
function is more flexible than that, and can be used to calculate logarithms in any base. Thelog()
function has abase
argument that you can specify, which has a default value of e. Thuslog10(1000)
is actually equivalent tolog(x = 1000, base = 10)
. -
It’s also worth checking out the
match()
function - It also works on data frames if you ever feel the need to import all of your variables from the data frame into the workspace. This can be useful at times, though it’s not a good idea if you have large data sets or if you’re working with multiple data sets at once. In particular, if you do this, never forget that you now have two copies of all your variables, one in the workspace and another in the data frame.
-
You can do this yourself using the
make.names()
function. In fact, this is itself a handy thing to know about. For example, if you want to convert the names of the variables in thespeech.by.char
list into valid R variable names, you could use a command like this:names(speech.by.char) <- make.names(names(speech.by.char))
. However, I won’t go into details here. -
Conveniently, if you type
rownames(df) <- NULL
R will renumber all the rows from scratch. For thedf
data frame, the labels that currently run from 7 to 10 will be changed to go from 1 to 4. -
Actually, you can make the
subset()
function behave this way by using the optionaldrop
argument, but by defaultsubset()
does not drop, which is probably more sensible and more intuitive to novice users. - Specifically, recursive indexing, a handy tool in some contexts but not something that I want to discuss here.
-
Remember,
print()
is generic: see Section 4.11. -
Note for advanced users: both of these functions are just wrappers to the
matrix()
function, which is pretty flexible in terms of the ability to convert vectors into matrices. Also, while I’m on this topic, I’ll briefly mention the fact that if you’re a Matlab user and looking for an equivalent of Matlab’srepmat()
function, I’d suggest checking out thematlab
package which contains R versions of a lot of handy Matlab functions. -
The function you need for that is called
as.data.frame()
. -
In truth, I suspect that most of the cases when you can sensibly flip a data frame occur when all of the original variables are measurements of the same type (e.g., all variables are response times), and if so you could easily have chosen to encode your data as a matrix instead of as a data frame. But since people do sometimes prefer to work with data frames, I’ve written the
tFrame()
function for the sake of convenience. I don’t really think it’s something that is needed very often. -
This limitation is deliberate, by the way: if you’re getting to the point where you want to do something more complicated, you should probably start learning how to use
reshape()
,cast()
andmelt()
or some of other the more advanced tools. ThewideToLong()
andlongToWide()
functions are included only to help you out when you’re first starting to use R. -
To be honest, it does bother me a little that the default value of
sep
is a space. Normally when I want to paste strings together I don’t want any separator character, so I’d prefer it if the default weresep=""
. To that end, it’s worth noting that there’s also apaste0()
function, which is identical topaste()
except that it always assumes thatsep=""
. Type?paste
for more information about this. -
Note that you can capture the output from
cat()
if you want to, but you have to be sneaky and use thecapture.output()
function. For example, the commandx <- capture.output(cat(hw,ng))
would work just fine. - Sigh. For advanced users: R actually supports two different ways of specifying regular expressions. One is the POSIX standard, the other is to use Perl-style regular expressions. The default is generally POSIX. If you understand regular expressions, that probably made sense to you. If not, don’t worry. It’s not important.
- I thank Amy Perfors for this example.
- If you’re lucky.
-
You can also use the
matrix()
command itself, but I think the “binding” approach is a little more intuitive. -
This has some interesting implications for how matrix algebra is implemented in R (which I’ll admit I initially found odd), but that’s a little beyond the scope of this book. However, since there will be a small proportion of readers that do care, I’ll quickly outline the basic thing you need to get used to: when multiplying a matrix by a vector (or one-dimensional array) using the
\%*\%
operator R will attempt to interpret the vector (or 1D array) as either a row-vector or column-vector, depending on whichever one makes the multiplication work. That is, suppose M is the 2×3 matrix, and v is a 1×3 row vector. It is impossible to multiply Mv, since the dimensions don’t conform, but you can multiply by the corresponding column vector, Mvt. So, if I setv <- M[2,]
and then try to calculateM \%*\% v
, which you’d think would fail, it actually works because R treats the one dimensional array as if it were a column vector for the purposes of matrix multiplication. Note that if both objects are one dimensional arrays/vectors, this leads to ambiguity since vvt (inner product) and vtv (outer product) yield different answers. In this situation, the\%*\%
operator returns the inner product not the outer product. To understand all the details, check out the help documentation. -
I should note that if you type
class(xtab.3d)
you’ll discover that this is a"table"
object rather than an"array"
object. However, this labelling is only skin deep. The underlying data structure here is actually an array. Advanced users may wish to check this using the commandclass(unclass(xtab.3d))
, but it’s not important for our purposes. All I really want to do in this section is show you what the output looks like when you encounter a 3D array. - Date objects are coded as the number of days that have passed since January 1, 1970.
-
For advanced users: type
?double
for more information. -
Or at least, that’s the default. If all your numbers are integers (whole numbers), then you can explicitly tell R to store them as integers by adding an
L
suffix at the end of the number. That is, an assignment likex <- 2L
tells R to assignx
a value of 2, and to store it as an integer rather than as a binary expansion. Type?integer
for more details. -
For advanced users: that’s a little over simplistic in two respects. First, it’s a terribly imprecise way of talking about scoping. Second, it might give you the impression that all the variables in question are actually loaded into memory. That’s not quite true, since that would be very wasteful of memory. Instead R has a “lazy loading” mechanism, in which what R actually does is create a “promise” to load those objects if they’re actually needed. For details, check out the
delayedAssign()
function.