GNU/Linux Desktop Survival Guide by Graham Williams
To quote Brian Ripley, one of the S and R gurus, "The first step is to get enough RAM."
Larger data sets are often an issue with R. R will load the data, but loading will be relatively slow, and extracting subsets also tends to be slow. Selecting and subsetting the required data from a database, or through other means (e.g., using Python), and then loading only that subset into R will generally be faster.
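As a sketch of the database route, assuming the data sits in an SQLite file called weather.db with a table observations (both names are illustrative), the RSQLite package can pull just the required rows and columns into R:

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "weather.db")
# Only the selected columns and matching rows are transferred into R.
ds <- dbGetQuery(con, "SELECT date, location, rainfall
                       FROM observations
                       WHERE rainfall > 0")
dbDisconnect(con)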
On MS Windows you may need to increase the memory available to R using the command-line flag --max-mem-size.
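For example, when starting R from a Command Prompt or a desktop shortcut (the value shown is illustrative only):

Rgui.exe --max-mem-size=1000M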
A suggested process is to work with a subset of the data loaded in memory, choosing a subset small enough to make this viable. Explore the data, explore the choice of models, and prototype the final analysis using this smaller data set. For the final full analysis you may need to allow R to run overnight on a machine with enough RAM. One simple way to obtain such a subset is sketched below.
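Assuming the full data has already been read into a data frame ds, a random sample of the rows gives a working subset (the 10,000 rows here is purely illustrative):

set.seed(42)                           # make the sample repeatable
small <- ds[sample(nrow(ds), 10000), ] # random subset of 10,000 rows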
A data frame of 150,000 rows and some 55 columns will be about the maximum for 500MB of RAM.
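The arithmetic behind this rule of thumb: 55 numeric columns at 8 bytes each over 150,000 rows comes to roughly 63MB for a single copy of the data, and R routinely makes several copies of an object in the course of an analysis:

150000 * 55 * 8 / 2^20   # about 63MB per copy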
Also, note the difference between data frames and arrays/matrices. For example, rbind'ing data frames is much more expensive than rbind'ing arrays/matrices. On the other hand, an array/matrix must have all of its elements of the same data type, while the columns of a data frame can each be of a different data type. A number of functions accept either a data frame or a matrix (e.g., rpart), and in such cases it is best, where possible, to supply a matrix. The coercion back to a data frame can always be done afterwards.
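A quick way to see the cost difference, using arbitrary illustrative sizes, is to time repeated rbind calls on a matrix and on the equivalent data frame:

m <- matrix(runif(1e5), ncol=10)            # 10,000 x 10 numeric matrix
d <- as.data.frame(m)                       # same data as a data frame
system.time(for (i in 1:100) rbind(m, m))   # matrices: cheap
system.time(for (i in 1:100) rbind(d, d))   # data frames: much slower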
To convert a data frame to a matrix:
m <- as.matrix(df)
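Note that as.matrix coerces everything to a single type, so a data frame containing any character or factor columns becomes a character matrix. The coercion back afterwards, using the same illustrative names, is simply:

df <- as.data.frame(m)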