7.9: Reading Unusual Data Files
In this section I’m going to switch topics (again!) and turn to the question of how you can load data from a range of different sources. Throughout this book I’ve assumed that your data are stored as an .Rdata file or as a “properly” formatted CSV file. And if so, then the basic tools that I discussed in Section 4.5 should be quite sufficient. However, in real life that’s not a terribly plausible assumption to make, so I’d better talk about some of the other possibilities that you might run into.
Loading data from text files
The first thing I should point out is that if your data are saved as a text file but aren’t quite in the proper CSV format, then there’s still a pretty good chance that the read.csv() function (or equivalently, read.table()) will be able to open it. You just need to specify a few more of the optional arguments to the function. If you type ?read.csv you’ll see that the read.csv() function actually has several arguments that you can specify. Obviously you need to specify the file that you want it to load, but the others all have sensible default values. Nevertheless, you will sometimes need to change them. The ones that I’ve often found myself needing to change are:
- header. A lot of the time when you’re storing data as a CSV file, the first row actually contains the column names and not data. If that’s not true, you need to set header = FALSE.
- sep. As the name “comma separated value” indicates, the values in a row of a CSV file are usually separated by commas. This isn’t universal, however. In Europe the decimal point is typically written as “,” instead of “.”, and as a consequence it would be somewhat awkward to use “,” as the separator. Therefore it is not unusual to use “;” over there. At other times, I’ve seen a TAB character used. To handle these cases, we’d need to set sep = ";" or sep = "\t".
- quote. It’s conventional in CSV files to include a quoting character for textual data. As you can see by looking at the booksales.csv file, this is usually a double quote character, ". But sometimes there is no quoting character at all, or you might see a single quote mark ' used instead. In those cases you’d need to specify quote = "" or quote = "'".
- skip. It’s actually very common to receive CSV files in which the first few rows have nothing to do with the actual data. Instead, they provide a human readable summary of where the data came from, or maybe they include some technical info that doesn’t relate to the data. To tell R to ignore the first (say) three lines, you’d need to set skip = 3.
- na.strings. Often you’ll get given data with missing values. For one reason or another, some entries in the table are missing. The data file needs to include a “special” string to indicate that the entry is missing. By default R assumes that this string is NA, but there’s no universal agreement on what to use in this situation. If the file uses ??? instead, then you’ll need to set na.strings = "???".
It’s kind of nice to be able to have all these options that you can tinker with. For instance, have a look at the data file shown in Figure 7.1. This file contains almost the same data as the last file (except it doesn’t have a header), and it uses a bunch of wacky features that you don’t normally see in CSV files. In fact, it just so happens that I’m going to have to change all five of those arguments listed above in order to load this file. Here’s how I would do it:
data <- read.csv( file = "./rbook-master/data/booksales2.csv", # specify the name of the file
header = FALSE, # variable names in the file?
skip = 8, # ignore the first 8 lines
quote = "*", # what indicates text data?
sep = "\t", # what separates different entries?
na.strings = "NFI" ) # what is the code for missing data?
If I now have a look at the data I’ve loaded, I see that this is what I’ve got:
head( data )
## V1 V2 V3 V4
## 1 January 31 0 high
## 2 February 28 100 high
## 3 March 31 200 low
## 4 April 30 50 out
## 5 May 31 NA out
## 6 June 30 0 high
Because I told R to expect * to be used as the quoting character instead of ", to look for tabs (which we write like this: \t) instead of commas, and to skip the first 8 lines of the file, it’s basically loaded the right data. However, since booksales2.csv doesn’t contain the column names, R has made them up. Showing the kind of imagination I expect from insentient software, R decided to call them V1, V2, V3 and V4. Finally, because I told it that the file uses “NFI” to denote missing data, R correctly figures out that the sales data for May are actually missing.
In real life you’ll rarely see data this stupidly formatted.
Loading data from SPSS (and other statistics packages)
The commands listed above are the main ones we’ll need for data files in this book. But in real life we have many more possibilities. For example, you might want to read data files in from other statistics programs. Since SPSS is probably the most widely used statistics package in psychology, it’s worth briefly showing how to open SPSS data files (file extension .sav). It’s surprisingly easy. The extract below should illustrate how to do so:
library( foreign ) # load the package
X <- read.spss( "./rbook-master/data/datafile.sav" ) # create a list containing the data
X <- as.data.frame( X ) # convert to data frame
If you wanted to import from an SPSS file to a data frame directly, instead of importing a list and then converting the list to a data frame, you can do that too:
X <- read.spss( file = "datafile.sav", to.data.frame = TRUE )
And that’s pretty much it, at least as far as SPSS goes. As far as other statistical software goes, the foreign package provides a wealth of possibilities. To open SAS files, check out the read.ssd() and read.xport() functions. To open data from Minitab, the read.mtp() function is what you’re looking for. For Stata, the read.dta() function is what you want. For Systat, the read.systat() function is what you’re after.
Loading Excel files
A different problem is posed by Excel files. Despite years of yelling at people for sending data to me encoded in a proprietary data format, I get sent a lot of Excel files. In general R does a pretty good job of opening them, but it’s a bit finicky because Microsoft don’t seem to be terribly fond of people using non-Microsoft products, and go to some lengths to make it tricky. If you get an Excel file, my suggestion would be to open it up in Excel (or better yet, OpenOffice, since that’s free software) and then save the spreadsheet as a CSV file. Once you’ve got the data in that format, you can open it using read.csv(). However, if for some reason you’re desperate to open the .xls or .xlsx file directly, then you can use the read.xls() function in the gdata package:
library( gdata ) # load the package
X <- read.xls( "datafile.xlsx" ) # create a data frame
This usually works. And if it doesn’t, you’re probably justified in “suggesting” to the person who sent you the file that they should send you a nice clean CSV file instead.
Loading Matlab (& Octave) files
A lot of scientific labs use Matlab as their default platform for scientific computing, or Octave as a free alternative. Opening Matlab data files (file extension .mat) is slightly more complicated, and if it wasn’t for the fact that Matlab is so very widespread and is an extremely good platform, I wouldn’t mention it. However, since Matlab is so widely used, I think it’s worth discussing briefly how to get Matlab and R to play nicely together. The way to do this is to install the R.matlab package (don’t forget to install the dependencies too). Once you’ve installed and loaded the package, you have access to the readMat() function. As any Matlab user will know, the .mat files that Matlab produces are workspace files, very much like the .Rdata files that R produces. So you can’t import a .mat file as a data frame. However, you can import it as a list. So, when we do this:
library( R.matlab ) # load the package
data <- readMat( "matlabfile.mat" ) # read the data file to a list
The data object that gets created will be a list, containing one variable for every variable stored in the Matlab file. It’s fairly straightforward, though there are some subtleties that I’m ignoring. In particular, note that if you don’t have the Rcompression package, you can’t open Matlab files above the version 6 format. So, if like me you’ve got a recent version of Matlab, and don’t have the Rcompression package, you’ll need to save your files using the -v6 flag, otherwise R can’t open them.
Oh, and Octave users? The foreign package contains a read.octave() command. Just this once, the world makes life easier for you folks than it does for all those cashed-up swanky Matlab bastards.
Saving other kinds of data
Given that I talked extensively about how to load data from non-R files, it might be worth briefly mentioning that R is also pretty good at writing data into other file formats besides its own native ones. I won’t discuss them in this book, but the write.csv() function can write CSV files, and the write.foreign() function (in the foreign package) can write SPSS, Stata and SAS files. There are also a lot of low level commands that you can use to write very specific information to a file, so if you really, really needed to you could create your own write.obscurefiletype() function, but that’s also a long way beyond the scope of this book. For now, all that I want you to recognise is that this capability is there if you need it.
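To make that concrete, here’s a quick sketch of writing the data frame from the booksales example back out in two formats; the output filenames here are made up for illustration:

```r
library( foreign )                      # needed for write.foreign()

# write the data frame as a plain CSV file
# (row.names = FALSE stops R adding a column of row labels)
write.csv( x = data, file = "booksales_out.csv", row.names = FALSE )

# write an SPSS version: a raw data file, plus a syntax
# file telling SPSS how to read it back in
write.foreign( df = data, datafile = "booksales_out.txt",
               codefile = "booksales_out.sps", package = "SPSS" )
```

Note that write.foreign() doesn’t produce a single .sav file: it produces the data as text plus a code file for the target program, which is a little less convenient but works with all three packages it supports.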
Are we done yet?
Of course not. If I’ve learned nothing else about R it’s that you’re never bloody done. This listing doesn’t even come close to exhausting the possibilities. Databases are supported by the RODBC, DBI, and RMySQL packages, among others. You can open webpages using the RCurl package. Reading and writing JSON objects is supported through the rjson package. And so on. In a sense, the right question is not so much “can R do this?” so much as “whereabouts in the wilds of CRAN is the damn package that does it?”
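To give a flavour of just one of these, here’s a minimal sketch of round-tripping an R list through JSON text with the rjson package (the list contents are made up):

```r
library( rjson )    # load the package

# convert an R list to a JSON string...
json_string <- toJSON( list( month = "May", sales = 100 ) )

# ... and convert JSON text back into an R list
restored <- fromJSON( json_string )
```

The same two functions cover most simple use cases: toJSON() for writing and fromJSON() for reading, whether the JSON comes from a file, a string, or a web service.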