> avgs <- numeric (8)
> for (i in 1:8)
+     avgs[i] <- mean (state.x77[,i])   # The "+" is R's continuation character; don't type it
> avgs
[1]  4246.4200  4435.8000     1.1700    70.8786     7.3780    53.1080   104.4600 70735.8800

This is comparatively slow, much more so in large datasets. R is bad at looping. A more vectorized way to do this is to use the apply() function. In this example, apply extracts each column as a vector, one at a time, and passes it to the median() function.
> apply (state.x77, 2, median)
Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost       Area
    2838.5       4519       0.95     70.675       6.85      53.25      114.5      54277

The 2 means "go by column" -- a 1 would have meant "go by row." Of course, if we had used a 1, we would have computed 50 medians, one for each row. If we had had a three-dimensional array we could have used a 3 there. The third argument specifies the function to be applied to each column. We can use any function that makes sense there. We can use our own function or even pass in a function that we write on the spot. If your function returns a vector of constant length, R will stick the vectors together into a matrix. However, if your function returns vectors of different lengths, R will have to create a list (see more details below).
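To make the margin argument concrete, here is a small sketch on a made-up 2 x 3 x 4 array. The name arr and its contents are my own illustrative assumptions, not part of the state.x77 example. It tries margins 1 and 3, and shows what comes back when the function returns a fixed-length vector versus vectors of differing lengths.

```r
# A made-up 2 x 3 x 4 array holding 1..24 (illustrative data only)
arr <- array(1:24, dim = c(2, 3, 4))

# Margin 1: one sum per slice along the first dimension (each slice is 3 x 4)
row_sums <- apply(arr, 1, sum)

# Margin 3: one sum per slice along the third dimension (each slice is 2 x 3)
slab_sums <- apply(arr, 3, sum)

# range() always returns two numbers, so the results are bound into a
# matrix with one column per slice
ranges <- apply(arr, 3, range)

# This function returns a different number of values for different slices,
# so apply() has to fall back to a list
big_evens <- apply(arr, 3, function(x) x[x %% 2 == 0 & x > 4])

print(row_sums)
print(slab_sums)
print(dim(ranges))
print(class(big_evens))
```

The last two lines are the ones to look at: the dimensions of ranges show the "stuck together" matrix, and the class of big_evens shows the list fallback.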
The special cases of mean and sum have been taken care of already with the built-in colMeans, colSums, rowMeans, and rowSums functions. These are highly efficient and worth using.
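As a quick check that the built-in functions really do the same job, this sketch compares colMeans() with the equivalent apply() call on state.x77. The variable names are mine, chosen for illustration.

```r
# state.x77 ships with R in the datasets package
via_apply    <- apply(state.x77, 2, mean)
via_colMeans <- colMeans(state.x77)

# Same numbers either way, compared up to floating-point tolerance
print(all.equal(via_apply, via_colMeans))

# The row versions give one value per row, so 50 values here
print(length(rowSums(state.x77)))
```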
In this example, we construct a function "on the fly" and pass it to apply. This particular function computes the median and maximum of each column of state.x77.
> apply (state.x77, 2, function(x) c(median (x), max(x)))
     Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
[1,]     2838.5   4519       0.95   70.675   6.85   53.25 114.5  54277
[2,]    21198.0   6315       2.80   73.600  15.10   67.30 188.0 566432

If you pass additional arguments to apply, those arguments get passed down to the function you're having apply call. So if you wanted to calculate the mean of each column after trimming the highest and lowest 10%, you could do this:
> apply (state.x77, 2, mean, trim=.1)
 Population      Income  Illiteracy    Life Exp      Murder     HS Grad       Frost        Area
 3384.27500  4430.07500     1.09750    70.91775     7.29750    53.33750   106.80000 56575.72500

This is particularly handy for passing the na.rm=T argument to functions like max.
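Since state.x77 has no missing values, here is a sketch of the na.rm case on a small made-up matrix (m and its contents are assumptions for illustration only).

```r
# A made-up 3 x 2 matrix with one missing value (illustrative data only)
m <- matrix(c(3, NA, 7,
              1,  4, 2), nrow = 3)

# Without na.rm, any column containing an NA comes back as NA
plain <- apply(m, 2, max)

# na.rm = TRUE is handed through apply() to max()
cleaned <- apply(m, 2, max, na.rm = TRUE)

print(plain)
print(cleaned)
```

I spell out TRUE here rather than T because T is an ordinary variable that can be reassigned, while TRUE cannot.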
> system.time (for (j in 1:20000) colMeans (state.x77))
> system.time (for (j in 1:20000) apply (state.x77, 2, mean))
> system.time (for (j in 1:20000) for (i in 1:8) mean (state.x77[,i]))

I ran these three timings expecting the last one to be reported as the slowest. Actually, though, the middle one was. I'm not sure what the story is here.
> a <- matrix (c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)
> a
     [,1] [,2] [,3]
[1,]    5    1    4
[2,]    2    2    5
[3,]    7    8    6
> apply (a, 1, min)
[1] 1 2 6

So apply() works on each row, one at a time, to tell me the smallest number in each row. What if I want the index of the smallest number in each row? That is, I want the answer to the question "in which column can the minimum value be found"? That sounds easy, too: we'll use the which() function, which returns the indices within a vector for which the vector holds the value TRUE.
> which (c(F, F, T, F, T, T, F))   # Example of "which" : where are the Trues?
[1] 3 5 6
#
# For each row, find the column in which that row has its smallest value.
#
> apply (a, 1, function(x) which(x == min(x)))
[[1]]
[1] 2

[[2]]
[1] 1 2

[[3]]
[1] 3

What has happened here is that there's a tie in the second row. Our function returns a single index for rows 1 and 3, but two indices for row 2, and R doesn't know how to arrange those in a vector or matrix, so apply() returns a list. The [[1]] tells us that the first element of the list has no name.
If we needed to do this we might impose a rule like "if there's a tie pick out the first one."
> apply (a, 1, function(x) which(x == min(x))[1])
[1] 2 1 3
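Base R also provides which.min(), which returns the index of the first minimum in a vector, so the "pick out the first one" rule can be written without an anonymous function. A minimal sketch, re-creating the matrix a from above (first_min and by_hand are names I made up for the comparison):

```r
a <- matrix(c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)

# which.min() reports only the first position of the minimum,
# so ties are resolved in favor of the leftmost column
first_min <- apply(a, 1, which.min)

# The hand-written version of the same rule
by_hand <- apply(a, 1, function(x) which(x == min(x))[1])

print(first_min)
print(all(first_min == by_hand))
```

There is a matching which.max() for the largest value.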
The lapply() function works on any list, not just a rectangular one. (The "l" in "lapply" stands for "list.") In that way it's more general than apply(), although it knows nothing about the rows or columns of matrices or higher-dimensional arrays. You don't need to specify the "direction" as you do with apply(); just pass the function. However, lapply() always returns a list. Usually I want a vector, and that's what sapply() tries to give me. The "s" in "sapply" stands for "simplify."

Here's an example using the barley data frame from the lattice package. My question is, how many levels of each variable are there? We can count them by seeing how many unique entries there are, so length(unique(x)) will do the trick.
> library (lattice)   # Make this data available
> dim (barley)        # Barley has 120 rows
[1] 120   4
> lapply (barley, function(x) length(unique(x)))   # Returns a list
$yield
[1] 114

$variety
[1] 10

$year
[1] 2

$site
[1] 6

> sapply (barley, function(x) length(unique(x)))   # Simplifies output to a vector
  yield variety    year    site
    114      10       2       6
> apply (barley, 2, function(x) length(unique(x)))   # Also works on data frames (but not non-data frame lists).
  yield variety    year    site
    114      10       2       6
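The barley data frame is rectangular, so it doesn't show off what lapply() and sapply() can do that apply() can't. Here is a sketch on a made-up list whose elements have different lengths and types (the list ragged and its element names are assumptions for illustration). It also shows that sapply() only simplifies when it can.

```r
# A made-up, non-rectangular list (illustrative data only)
ragged <- list(nums  = 1:3,
               chars = c("x", "y", "x", "x"),
               flags = c(TRUE, TRUE))

# lapply() always gives back a list, one element per input element
n_unique_list <- lapply(ragged, function(x) length(unique(x)))

# sapply() simplifies the same result to a named vector
n_unique_vec <- sapply(ragged, function(x) length(unique(x)))
print(n_unique_vec)

# Here the results differ in length, so sapply() cannot simplify
# and quietly returns a list instead
uniques <- sapply(ragged, unique)
print(class(uniques))
```

If you need to be sure of the return type, it may be safer to call lapply() and simplify the result yourself.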
> tapply (barley$yield, barley$site, mean)
   Grand Rapids          Duluth University Farm          Morris       Crookston          Waseca
       24.93167        27.99667        32.66667            35.4           37.42        48.10833

tapply() returns a vector with one element for each unique value of barley$site. The element for Grand Rapids, for example, gives the average of all the elements of barley$yield for which barley$site == "Grand Rapids". I have found tapply() to be incredibly useful. If you want to cross-tabulate by more than one variable, construct a list of your tabulating variables and pass that to tapply(). Here we break yields down by year and site.
> tapply (barley$yield, list (barley$year, barley$site), mean)
     Grand Rapids   Duluth University Farm   Morris Crookston   Waseca
1932     20.81000 25.70000        29.50667 41.51333     31.18 41.87000
1931     29.05334 30.29333        35.82667 29.28667     43.66 54.34667

We've learned something: 1931 was a much better year, except in Morris. (There's some suspicion that Morris was in fact incorrectly recorded in this well-known data set.) 1932 appears before 1931 in the table because that's how the levels of "year" were set up in R. (If this bothers you see Reordering the levels of a factor.) Years appear in the rows because they came first in the list. Of course a three- or higher-way table can be made in this way as well.
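For a three-way table, just put three variables in the list. This sketch uses the mtcars data set that ships with R as a stand-in (my choice, since it has three convenient grouping variables); the list names cyl, gear and am are only there to label the dimensions.

```r
# mtcars ships with R; it is used here only as a stand-in data set
three_way <- tapply(mtcars$mpg,
                    list(cyl = mtcars$cyl, gear = mtcars$gear, am = mtcars$am),
                    mean)

# The result is a three-dimensional array:
# cylinders x gears x transmission type
print(dim(three_way))

# Index it by level name; this is the mean mpg for
# 4-cylinder, 4-gear cars with a manual transmission (am == 1)
print(three_way["4", "4", "1"])
```

Combinations that never occur in the data come back as NA rather than being dropped.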