    > avgs <- numeric (8)
    > for (i in 1:8)
    +     avgs[i] <- mean (state.x77[,i]) # The "+" is R's continuation character; don't type it
    > avgs
    [1] 4246.4200 4435.8000 1.1700 70.8786 7.3780 53.1080 104.4600 70735.8800

This is comparatively slow, much more so in large datasets. R is bad at looping. A more vectorized way to do this is to use the `apply()` function.

    > apply (state.x77, 2, median)
    Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost       Area
        2838.5       4519       0.95     70.675       6.85      53.25      114.5      54277

The 2 means "go by column" -- a 1 would have meant "go by row." Of course, if we had used a 1, we would have computed 50 averages, one for each row. If we had had a three-dimensional array we could have used a 3 there. The third argument specifies the function to be applied to each column. We can use any function that makes sense there. We can use our own function or even pass in a function that we write on the spot. If your function returns a vector of constant length, R will stick the vectors together into a matrix. However, if your function returns vectors of different lengths, R will have to create a list (see more details below).
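To make the last few points concrete, here is a minimal sketch using small made-up arrays (`arr` and `m` are illustrative names, not part of the example above). It shows a margin of 3 on a three-dimensional array, and how the shape of the result depends on whether the function returns vectors of equal length.

```r
# A small 2 x 3 x 4 array, invented for illustration
arr <- array(1:24, dim = c(2, 3, 4))

# Margin 3: apply sum() to each of the four 2 x 3 slices
slice_sums <- apply(arr, 3, sum)

# A small 2 x 3 matrix, also invented for illustration
m <- matrix(1:6, nrow = 2, ncol = 3)

# range() always returns two numbers, so the results are bound into a matrix
ranges <- apply(m, 2, range)

# This function returns vectors of different lengths, so the result is a list
big_values <- apply(m, 2, function(x) x[x > 1])

slice_sums
ranges
big_values
```

Here `ranges` comes back as a matrix with one column per column of `m`, while `big_values` comes back as a list because the first column yields one value and the others yield two.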

The special cases of mean and sum have been taken care of already with the built-in `colMeans`, `colSums`, `rowMeans`, and `rowSums` functions. These are highly efficient and worth using.
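As a quick check, the sketch below compares `colMeans` against the equivalent `apply()` call on `state.x77`. The variable names are just for illustration, and the row totals are computed only to show the call, since adding unlike columns is not meaningful.

```r
# state.x77 ships with R in the datasets package
fast_means  <- colMeans(state.x77)
apply_means <- apply(state.x77, 2, mean)

# Both approaches should agree up to floating-point tolerance
all.equal(fast_means, apply_means)

# One total per row (that is, per state); shown only to illustrate the call
row_totals <- rowSums(state.x77)
length(row_totals)
```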

In this example, we construct a function "on the fly" and pass it to apply. This particular function computes the median and maximum of each column of `state.x77`.

    > apply (state.x77, 2, function(x) c(median (x), max(x)))
         Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
    [1,]     2838.5   4519       0.95   70.675   6.85   53.25 114.5  54277
    [2,]    21198.0   6315       2.80   73.600  15.10   67.30 188.0 566432

If you pass additional arguments to apply, those arguments get passed down to the function you're having apply call. So if you wanted to calculate the mean of each column after trimming the highest and lowest 10%, you could do this:

    > apply (state.x77, 2, mean, trim=.1)
     Population      Income  Illiteracy    Life Exp      Murder     HS Grad       Frost        Area
     3384.27500  4430.07500     1.09750    70.91775     7.29750    53.33750   106.80000 56575.72500

This is particularly handy for passing the

    > system.time (for (j in 1:20000) colMeans (state.x77))
    > system.time (for (j in 1:20000) apply (state.x77, 2, mean))
    > system.time (for (j in 1:20000) for (i in 1:8) mean (state.x77[,i]))

I ran these expecting the last one to be reported as the slowest. Actually, though, the middle one was. I'm not sure what the story is here.
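If you want to repeat the comparison on your own machine, a sketch like the following collects the elapsed times side by side. The repetition count `reps` and the name `timings` are assumptions made for this illustration, and the ordering you see may well differ by machine and R version.

```r
reps <- 2000  # assumed repetition count; raise it for steadier timings

# Elapsed seconds for each of the three approaches
timings <- c(
  colMeans = system.time(for (j in 1:reps) colMeans(state.x77))[["elapsed"]],
  apply    = system.time(for (j in 1:reps) apply(state.x77, 2, mean))[["elapsed"]],
  loop     = system.time(for (j in 1:reps) for (i in 1:8) mean(state.x77[, i]))[["elapsed"]]
)

# Fastest first
sort(timings)
```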

    > a <- matrix (c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)
    > a
         [,1] [,2] [,3]
    [1,]    5    1    4
    [2,]    2    2    5
    [3,]    7    8    6
    > apply (a, 1, min)
    [1] 1 2 6

So

    > which (c(F, F, T, F, T, T, F)) # Example of "which" : where are the Trues?
    [1] 3 5 6
    #
    # For each row, find the column in which that row has its smallest value.
    #
    > apply (a, 1, function(x) which(x == min(x)))
    [[1]]
    [1] 2

    [[2]]
    [1] 1 2

    [[3]]
    [1] 3

What has happened here is that there's a tie in the second row.

If we needed to do this we might impose a rule like "if there's a tie pick out the first one."

    > apply (a, 1, function(x) which(x == min(x))[1])
    [1] 2 1 3
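As an alternative sketch, base R's `which.min()` returns the position of the first minimum, so it applies the same "pick the first one" rule without the explicit `[1]`. The result names below are just for illustration.

```r
# Same matrix as in the example above
a <- matrix(c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)

# which.min() reports only the first position of the minimum,
# so the tie in the second row resolves to column 1
first_min_col <- apply(a, 1, which.min)

# The explicit version from the text, for comparison
explicit <- apply(a, 1, function(x) which(x == min(x))[1])

first_min_col
identical(as.numeric(first_min_col), as.numeric(explicit))
```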

The `lapply()` function works on any list, not just a rectangular one. (The "l" in "lapply" stands for "list.") In that way it's more general than `apply()`, although it does not work on matrices or higher-dimensional arrays. You don't need to specify the "direction" as you do with `apply()`; just pass the function. **However, `lapply()` always returns a list.** Usually I want a vector, and that's what `sapply()` gives you.

    > library (lattice) # Make this data available
    > dim (barley) # Barley has 120 rows
    [1] 120 4
    > lapply (barley, function(x) length(unique(x))) # returns a list
    $yield:
    [1] 114

    $variety:
    [1] 10

    $year:
    [1] 2

    $site:
    [1] 6

    > sapply (barley, function(x) length(unique(x))) # Simplifies output to a vector
      yield variety    year    site
        114      10       2       6
    > apply (barley, 2, function(x) length(unique(x))) # Also works on data frames (but not non-data frame lists).
      yield variety    year    site
        114      10       2       6
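The same contrast shows up on a list that is not rectangular, where `apply()` is not an option. The sketch below uses a made-up list called `ragged`. It also uses `vapply()`, a base R relative of `sapply()` that is not discussed above; it takes a template for each result (here `numeric(1)`) and stops with an error if a result does not match.

```r
# A non-rectangular list, invented for illustration
ragged <- list(a = 1:3, b = c(2.5, 4), c = 10)

# lapply() always returns a list, one element per input element
len_list <- lapply(ragged, length)

# sapply() simplifies the same result to a named vector
len_vec <- sapply(ragged, length)

# vapply() also returns a vector, but checks each result against numeric(1)
means <- vapply(ragged, mean, numeric(1))

len_list
len_vec
means
```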

    > tapply (barley$yield, barley$site, mean)
       Grand Rapids          Duluth University Farm          Morris       Crookston          Waseca
           24.93167        27.99667        32.66667            35.4           37.42        48.10833

    > tapply (barley$yield, list (barley$year, barley$site), mean)
         Grand Rapids   Duluth University Farm   Morris Crookston   Waseca
    1932     20.81000 25.70000        29.50667 41.51333     31.18 41.87000
    1931     29.05334 30.29333        35.82667 29.28667     43.66 54.34667

We've learned something: 1931 was a much better year, except in Morris. (There's some suspicion that Morris was in fact incorrectly recorded in this well-known data set.) 1932 appears before 1931 in the table because that's how the levels of "year" were set up in R. (If this bothers you see Reordering the levels of a factor.) Years appear in the rows because they came first in the list. Of course a three- or higher-way table can be made in this way as well.
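As a minimal sketch of a three-way table, the example below swaps in the built-in `mtcars` data (not the barley data used above) so that it runs without loading any package. The grouping variables come first, second, and third in the list, and so index the rows, columns, and layers of the result. The name `three_way` is just for illustration.

```r
# Mean miles per gallon by cylinders (rows), gears (columns),
# and transmission type (layers: 0 = automatic, 1 = manual)
three_way <- tapply(mtcars$mpg,
                    list(mtcars$cyl, mtcars$gear, mtcars$am),
                    mean)

# A 3 x 3 x 2 array; combinations with no cars are NA
dim(three_way)

# Pick out one cell by its level names
three_way["4", "4", "1"]
```

Combinations that never occur in the data, such as eight-cylinder cars with four gears, show up as `NA` rather than being dropped.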