# The Apply Family of Functions

## What "Apply" does


"Apply" functions keep you from having to write loops to perform some operation on every row or every column of a matrix or data frame, or on every element in a list. For example, the built-in data set state.x77 contains eight columns of data describing the 50 U.S. states in 1977. If you wanted the average of each of the eight columns, you could do this:
```
> avgs <- numeric (8)
> for (i in 1:8)
+     avgs[i] <- mean (state.x77[,i])   # The "+" is R's continuation character; don't type it
> avgs
[1]  4246.4200  4435.8000     1.1700    70.8786     7.3780    53.1080   104.4600 70735.8800
```
This works, but writing a loop for every such task gets tedious, and loops in interpreted R code can be slow, especially on large data sets. A more compact way to do this is to use the apply() function. In this example (which computes medians rather than means), apply() extracts each column as a vector, one at a time, and passes it to the median() function.
```
> apply (state.x77, 2, median)
Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
    2838.5   4519       0.95   70.675   6.85   53.25 114.5 54277
```
The 2 means "go by column"; a 1 would have meant "go by row." Of course, if we had used a 1, we would have computed 50 medians, one for each row. If we had had a three-dimensional array we could have used a 3 there. The third argument specifies the function to be applied to each column. We can use any function that makes sense there: a built-in function, one of our own, or even one that we write on the spot. If your function returns a vector of constant length, R will stick the vectors together into a matrix, with one column per call. However, if your function returns vectors of different lengths, R will have to create a list (see more details below).
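
To make the margin argument concrete, here is a small sketch on a made-up three-dimensional array (the array `arr` and its contents are my own invention, not part of the state.x77 example). A 3 runs the function over each slice along the third dimension, and a vector of margins such as c(1, 2) keeps more than one dimension.

```r
# A small 2 x 3 x 4 array filled with 1..24 (made-up data for illustration)
arr <- array(1:24, dim = c(2, 3, 4))

# MARGIN = 3: apply sum() to each of the four 2 x 3 slices
slice_sums <- apply(arr, 3, sum)
print(slice_sums)       # 21 57 93 129

# MARGIN = c(1, 2): sum over the third dimension, keeping a 2 x 3 matrix
cell_sums <- apply(arr, c(1, 2), sum)
print(dim(cell_sums))   # 2 3
```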

The special cases of mean and sum have been taken care of already with the built-in colMeans, colSums, rowMeans, and rowSums functions. These are highly efficient and worth using.
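
As a quick check, this sketch compares apply() with colMeans() on state.x77. The variable names are mine. The row totals are not meaningful for this data set (the columns have different units); they are only there to show the shape of the result.

```r
# state.x77 ships with R in the datasets package
via_apply    <- apply(state.x77, 2, mean)
via_colMeans <- colMeans(state.x77)

# Same numbers and same names either way
print(all.equal(via_apply, via_colMeans))

# The row versions work the same way: one result per state
row_totals <- rowSums(state.x77)
print(length(row_totals))   # 50
```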

In this example, we construct a function "on the fly" and pass it to apply. This particular function computes the median and maximum of each column of state.x77.

```
> apply (state.x77, 2, function(x) c(median (x), max(x)))
     Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
[1,]     2838.5   4519       0.95   70.675   6.85   53.25 114.5  54277
[2,]    21198.0   6315       2.80   73.600  15.10   67.30 188.0 566432
```
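
If you would rather not remember which row is which, one option is to have the function return a named vector. This is a sketch using a helper I have called `med_max` (a hypothetical name, not something built in); apply() uses the names of the returned vector as the row names of the result.

```r
# Hypothetical helper: name the pieces so the result is self-describing
med_max <- function(x) c(median = median(x), max = max(x))

result <- apply(state.x77, 2, med_max)
print(result)

# The names of the returned vector become the row names of the matrix
print(rownames(result))   # "median" "max"
```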
If you pass additional arguments to apply, those arguments get passed down to the function you're having apply call. So if you wanted to calculate the mean of each column after trimming the highest and lowest 10%, you could do this:
```
> apply (state.x77, 2, mean, trim=.1)
 Population      Income  Illiteracy    Life Exp      Murder     HS Grad       Frost        Area
 3384.27500  4430.07500     1.09750    70.91775     7.29750    53.33750   106.80000 56575.72500
```
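This is particularly handy for passing the na.rm=TRUE argument to functions like max, so that missing values are ignored. (na.rm=T also works, but TRUE is safer because T is an ordinary variable that can be overwritten.)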
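
Here is a minimal sketch of that, using a small made-up matrix `m` with one missing value (state.x77 itself has no missing values, so I am not using it here).

```r
# A made-up matrix with one missing value in the second column
m <- matrix(c(1, 2, 3, 4, NA, 6), nrow = 3)

with_na    <- apply(m, 2, max)                 # second column comes back NA
without_na <- apply(m, 2, max, na.rm = TRUE)   # NA is dropped before taking the max

print(with_na)
print(without_na)   # 3 6
```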

#### Does apply() loop?

Yes. Clearly something has to loop. apply() itself is written in R and uses an ordinary for loop internally, whereas lapply() and sapply() do their looping in compiled C code, and specialized functions like colMeans() do all of their work in compiled code. So the main benefit of apply() is compact, readable code rather than raw speed. For big jobs the difference between compiled and interpreted looping can be substantial. Note: After writing this I got curious about the extent to which apply() increases speed. I used commands like this:
```
> system.time (for (j in 1:20000) colMeans (state.x77))
> system.time (for (j in 1:20000) apply (state.x77, 2, mean))
> system.time (for (j in 1:20000) for (i in 1:8) mean (state.x77[,i]))
```
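expecting the last one to be reported as the slowest. Actually, though, the middle one was, presumably because apply() does a fair amount of bookkeeping in interpreted R code (checking dimensions, rearranging its input, assembling the result) on every call.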

#### Sometimes you expect apply() to return a vector but you get a list

I include this topic because it has bedeviled me in the past. Suppose I have this matrix a, and I want to find the smallest number in each row. This is easy:
```
> a <- matrix (c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)
> a
     [,1] [,2] [,3]
[1,]    5    1    4
[2,]    2    2    5
[3,]    7    8    6
> apply (a, 1, min)
[1] 1 2 6
```
So apply() works on each row, one at a time, to tell me the smallest number in each row. What if I want the index of the smallest number in each row? That is, I want the answer to the question "in which column can the minimum value be found"? That sounds easy, too: we'll use the which() function, which returns the indices within a vector for which the vector holds the value TRUE.
```
> which (c(F, F, T, F, T, T, F))   # Example of "which" : where are the TRUEs?
[1] 3 5 6
#
# For each row, find the column in which that row has its smallest value.
#
> apply (a, 1, function(x) which(x == min(x)))
[[1]]
[1] 2

[[2]]
[1] 1 2

[[3]]
[1] 3
```
What has happened here is that there's a tie in the second row. Our function returns a single value for rows 1 and 3, but two values for row 2. R can't arrange results of different lengths into a vector or matrix, so it makes a list. The [[1]] tells us that the first element of the list has no name.

If we needed to do this we might impose a rule like "if there's a tie pick out the first one."

```
> apply (a, 1, function(x) which(x == min(x))[1])
[1] 2 1 3
```
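
The "pick out the first one" rule can also be written with the built-in which.min() function, which returns the position of the first minimum. This is an alternative I am suggesting rather than part of the original example; the sketch below rebuilds the matrix a so that it runs on its own.

```r
a <- matrix(c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)

# which.min() returns the index of the first minimum, so ties are broken for us
first_min <- apply(a, 1, which.min)
print(first_min)   # 2 1 3

# Same answer as keeping only the first element of which()
same <- apply(a, 1, function(x) which(x == min(x))[1])
print(identical(first_min, same))
```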

## Lapply and sapply: avoiding loops on lists and data frames

The regular apply() function can be used on a data frame, because apply() first converts the data frame to a matrix (with as.matrix()). When every column holds the same kind of data, it does what you expect, whether you go by column or by row. But remember: every element of a matrix must have the same kind of data, so unless every column of the data frame has the same kind of data, R will end up converting all of the values to a common format (like character) before your function ever sees them.
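
A small sketch of that conversion, using a made-up two-column data frame `df` (my own example data). Going column by column with sapply() keeps each column's type; going row by row with apply() hands the function character data.

```r
# A small made-up data frame with one numeric and one character column
df <- data.frame(n = c(1.5, 2.5), s = c("a", "b"), stringsAsFactors = FALSE)

# Column by column, each column keeps its own type
col_classes <- sapply(df, class)
print(col_classes)

# Row by row through apply(), everything has been converted to character
row_classes <- apply(df, 1, function(r) class(r))
print(row_classes)
```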

The lapply() function works on any list, not just a rectangular one. (The "l" in "lapply" stands for "list.") In that way it's more general than apply(), although it has no notion of rows or columns: given a matrix or higher-dimensional array, it treats it as a plain vector of individual elements. You don't need to specify the "direction" as you do with apply(); just pass the function. However, lapply() always returns a list. Usually I want a vector, and that's what sapply() tries to do. The "s" in "sapply" stands for "simplify." Here's an example using the barley data frame from the lattice package. My question is, how many levels of each variable are there? We can count the number by seeing how many unique entries there are: so length(unique(x)) will do the trick.

```
> library (lattice)                                # Make this data available
> dim (barley)                                     # Barley has 120 rows
[1] 120   4
> lapply (barley, function(x) length(unique(x)))   # returns a list
$yield:
[1] 114

$variety:
[1] 10

$year:
[1] 2

$site:
[1] 6

> sapply (barley, function(x) length(unique(x)))   # Simplifies output to a vector
yield variety year site
  114      10    2    6
> apply (barley, 2, function(x) length(unique(x))) # Also works on data frames (but not non-data frame lists).
yield variety year site
  114      10    2    6
```
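
The same pair of functions works on a list that is not rectangular at all. This sketch uses a made-up list `pieces` (my own example) and also shows what "tries to" means: when the results cannot be lined up into a vector, sapply() quietly gives you a list back.

```r
# A made-up list whose elements have different lengths
pieces <- list(small = 1:3, big = 1:10, pair = c(2.5, 3.5))

as_list   <- lapply(pieces, length)   # always a list
as_vector <- sapply(pieces, length)   # simplified to a named vector

print(as_list)
print(as_vector)

# The results here have different lengths, so sapply() cannot simplify them
ragged <- sapply(pieces, function(x) x[x > 2])
print(is.list(ragged))   # TRUE
```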

## Tapply: avoiding loops when applying a function to subsets

tapply() is a very powerful function that lets you break a vector into pieces, and then apply some function to each of the pieces. (For you Excel users, tapply() produces things that correspond to Excel's pivot tables.) It's sort of like sapply(), except that with sapply() the pieces are always elements of a list. With tapply() you get to specify how the breakdown is done. For example, suppose I want to find the average yield of barley at each site in the last example.
```
> tapply (barley$yield, barley$site, mean)
Grand Rapids   Duluth University Farm Morris Crookston   Waseca
    24.93167 27.99667        32.66667   35.4     37.42 48.10833
```
tapply() returns a vector with one element for each unique value of barley\$site. The element for Grand Rapids, for example, gives the average of all the elements of barley\$yield for which barley\$site == "Grand Rapids". I have found tapply() to be incredibly useful. If you want to cross-tabulate by more than one variable, construct a list of your tabulating variables and pass that to tapply(). Here we break yields down by year and site.
```
> tapply (barley$yield, list (barley$year, barley$site), mean)
     Grand Rapids   Duluth University Farm   Morris Crookston   Waseca
1932     20.81000 25.70000        29.50667 41.51333     31.18 41.87000
1931     29.05334 30.29333        35.82667 29.28667     43.66 54.34667
```
We've learned something: 1931 was a much better year, except in Morris. (There's some suspicion that Morris was in fact incorrectly recorded in this well-known data set.) 1932 appears before 1931 in the table because that's how the levels of "year" were set up in R. (If this bothers you, see "Reordering the levels of a factor.") Years appear in the rows because they came first in the list. Of course a three- or higher-way table can be made in this way as well.
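
To see how tapply() relates to sapply(), here is a sketch that needs no extra packages. It uses the built-in state.x77 and state.region objects, plus the built-in warpbreaks data frame for the two-way case; the choice of these data sets and the variable names are mine. Splitting the vector by hand and running sapply() over the pieces gives the same numbers as tapply().

```r
# Built-in data: income per state and the census region of each state
income <- state.x77[, "Income"]

by_region <- tapply(income, state.region, mean)
print(by_region)

# The same breakdown done by hand: split the vector, then sapply over the pieces
by_hand <- sapply(split(income, state.region), mean)
print(all.equal(as.vector(by_region), as.vector(by_hand)))

# Two grouping variables give a two-way table; the first one goes in the rows
two_way <- tapply(warpbreaks$breaks,
                  list(warpbreaks$wool, warpbreaks$tension), mean)
print(two_way)
```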