Matrices and Data Frames

Matrices

A matrix is a two-dimensional data structure. All the elements of a matrix must be of the same type (numeric, logical, character, complex). You can create a matrix with the matrix() command:
> matrix (1:12, nrow = 4, ncol = 3) # Use the integers 1 through 12 in four rows, three columns
     [,1] [,2] [,3] 
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
Actually, you don't need to specify both nrow and ncol because, given one, R can deduce the other from the length of the data. Notice that the data goes in column-by-column unless you specify byrow=T (T and F are shorthand for TRUE and FALSE):
> matrix (1:12, nrow = 4, byrow = T)
     [,1] [,2] [,3] 
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
Let's save this as our matrix to work with.
> x <- matrix (1:12, nrow = 4, byrow = T)
> dim(x) # What are the dimensions of x?
[1] 4 3 # Answer: four rows by three columns. Notice this is a vector of length 2.
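As a small sketch of the two points above (R deducing the missing dimension, and the column-by-column fill order), the following uses only base R. The variable names by_col and by_row are made up for illustration.

```r
# Only ncol is given; R works out that 12 values in 3 columns need 4 rows.
by_col <- matrix(1:12, ncol = 3)
dim(by_col)          # 4 3

# The same data filled row-by-row instead of column-by-column.
by_row <- matrix(1:12, ncol = 3, byrow = TRUE)

by_col[2, 1]         # 2: the second value went down the first column
by_row[2, 1]         # 4: the second row starts with the fourth value

# nrow() and ncol() return the two dimensions separately.
nrow(by_row)         # 4
ncol(by_row)         # 3
```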

Subscripting

With a vector, we subscript with a single index. It seems natural that we should use two subscripts for a matrix. Separate them with a comma. For example:

> x[3,2] # Give me the item in the third row, second column.
[1] 8
> x[1:3, c(1,3)] # Give me rows 1 through 3, columns 1 and 3
     [,1] [,2] 
[1,]    1    3
[2,]    4    6
[3,]    7    9 # The result is a 3x2 matrix.
If you omit a subscript, you get the whole row or column:
> x[3,]
[1]  7  8  9
> x[,2]
[1]  2  5  8 11
Note these results are vectors, not matrices with one row/column. If we ask for two columns, then of course we get a matrix:
> x[,c(1,3)]
     [,1] [,2] 
[1,]    1    3
[2,]    4    6
[3,]    7    9
[4,]   10   12
Actually, if you really want to, you can force the result of asking for one row or column to continue to be a matrix, using the drop=F argument. It doesn't come up much, so this is just for completeness.
> x[2,,drop=F]
     [,1] [,2] [,3] 
[1,]    4    5    6 # This is a 1x3 matrix...
> x[,2,drop=F]
     [,1] 
[1,]    2
[2,]    5
[3,]    8
[4,]   11           # ...this is a 4x1 matrix.
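One place where drop=F can matter is inside a function that expects a matrix no matter how many columns were selected. The sketch below assumes a hypothetical helper, sum_selected(), which is not part of R; it only illustrates the idea.

```r
x <- matrix(1:12, nrow = 4, byrow = TRUE)

# Hypothetical helper: column sums of the selected columns.
# colSums() needs a matrix, so drop = FALSE keeps the selection
# two-dimensional even when 'cols' names a single column.
sum_selected <- function(m, cols) {
  colSums(m[, cols, drop = FALSE])
}

sum_selected(x, c(1, 3))   # 22 30
sum_selected(x, 2)         # 26

dim(x[, 2])                # NULL: a plain vector has no dimensions
dim(x[, 2, drop = FALSE])  # 4 1
```

Without drop = FALSE, the single-column call would hand colSums() a plain vector, which it rejects.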

Logical Subscripting

As with a vector, we can use logical vectors to select certain rows or columns. Normally we would select rows by using a logical vector with one entry for each row, and similarly for columns. So, for example, consider the expression x[,2] > 5:

> x[,2] > 5
[1] FALSE FALSE  TRUE  TRUE
This has four entries, one for each row. If we wanted only the rows for which the second column is > 5, we could do that simply:
> x[x[,2] > 5,]   # Give me just those rows, and all columns
     [,1] [,2] [,3]
[1,]    7    8    9
[2,]   10   11   12
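Conditions can also be combined before subscripting. The following is a minimal sketch using the same x; the name keep is made up for illustration.

```r
x <- matrix(1:12, nrow = 4, byrow = TRUE)

# One TRUE/FALSE per row: second column above 2 AND third column below 12.
keep <- x[, 2] > 2 & x[, 3] < 12
keep                     # FALSE TRUE TRUE FALSE

x[keep, ]                # rows 2 and 3, all columns

# which() converts the logical vector into row numbers.
which(keep)              # 2 3

# A logical vector can select columns too:
# keep the columns whose first-row entry is odd.
x[, x[1, ] %% 2 == 1]    # columns 1 and 3
```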

Logical and Character Matrices

Here's an example of a logical matrix:

> x > 5    
      [,1]  [,2]  [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE  TRUE
[3,]  TRUE  TRUE  TRUE
[4,]  TRUE  TRUE  TRUE
> x[x>5]
[1]  7 10  8 11  6  9 12 # This extracts the values > 5. It gives a vector, not a matrix, and note
                         # that the extraction goes column-by-column.
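A logical matrix is useful for more than extraction. As a rough sketch (the names big and y are made up), it can be counted and used on the left-hand side of an assignment:

```r
x <- matrix(1:12, nrow = 4, byrow = TRUE)

big <- x > 5          # a 4x3 logical matrix
sum(big)              # 7: each TRUE counts as 1
colSums(big)          # 2 2 3: count of TRUEs in each column

# Assigning through a logical matrix changes only the selected cells.
y <- x                # work on a copy so x is left untouched
y[big] <- 0L
y[1, ]                # 1 2 3 (nothing above 5 in row 1)
y[2, ]                # 4 5 0
```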
Here's a character matrix. It uses the built-in variable "letters" that contains the twenty-six letters in order.
> matrix (letters[1:12], nrow = 4, byrow = T)
     [,1] [,2] [,3] 
[1,] "a"  "b"  "c" 
[2,] "d"  "e"  "f" 
[3,] "g"  "h"  "i" 
[4,] "j"  "k"  "l" # You can tell they're characters by the quotes

Handy matrix functions

Some matrix functions that seem to come up a lot are t(), which transposes your matrix; %*%, which does matrix multiplication; and solve(), which inverts a matrix and solves linear systems. For example:

> t(x)                   # Give me x-transpose
     [,1] [,2] [,3] [,4] 
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> t(x) %*% x             # Here's x-transpose times x
     [,1] [,2] [,3] 
[1,]  166  188  210
[2,]  188  214  240
[3,]  210  240  270
> solve (t(x) %*% x)     # Can we invert this matrix?
Error in solve.default(t(x) %*% x) : 
  Lapack routine dgesv: system is exactly singular
Here x is not of full rank, so neither is x-transpose x, and we can't invert it.
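One way to check the rank claim, sketched here with base R's qr() function, is to look at the rank reported by a QR decomposition. This is a numerical estimate rather than an exact algebraic test.

```r
x <- matrix(1:12, nrow = 4, byrow = TRUE)

# qr() computes a QR decomposition; its $rank component is the
# numerical rank estimated during that decomposition.
qr(x)$rank                        # 2, although x has 3 columns

# Each column of x differs from the next by a constant, so
# column 1 - 2 * column 2 + column 3 is zero in every row.
x[, 1] - 2 * x[, 2] + x[, 3]      # 0 0 0 0

# crossprod(x) computes t(x) %*% x in a single call.
all.equal(crossprod(x), t(x) %*% x)   # TRUE
```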

Solving Linear Systems

Here's an example of solving a system of linear equations. Suppose we have the system

14 c1 + 5 c2 + 5 c3 + 2 c4 = 2
 8 c1 + 3 c2 + 4 c3 + 4 c4 = 2
16 c1 + 7 c2 + 3 c3 + 7 c4 = 3
 6 c1 + 6 c2 + 1 c3 + 9 c4 = 3
If we create the matrix of this system (call it mat) and the right-hand-side vector (call it res), so that the system reads mat %*% c = res for the unknown vector c = (c1, c2, c3, c4), then we can find c either by inverting the matrix with solve() and matrix-multiplying by res, or by calling solve() with both mat and res as arguments. Note that matrix() fills column-by-column, so the coefficients are typed in one column at a time:
> res <- c(2,2,3,3)
> mat <- matrix (c(14, 8, 16, 6, 5, 3, 7, 6, 5, 4, 3, 1, 2, 4, 7, 9), ncol=4)
> solve (mat)
           [,1]  [,2]       [,3]        [,4]
[1,] -0.1511111  0.06  0.2288889 -0.17111111
[2,]  0.5688889 -0.52 -0.3911111  0.40888889
[3,]  0.1733333  0.24 -0.3066667  0.09333333
[4,] -0.2977778  0.28  0.1422222 -0.05777778
> solve (mat) %*% res
             [,1]
[1,] -0.008888889
[2,]  0.151111111      # Note: result is a 4x1 matrix
[3,]  0.186666667
[4,]  0.217777778            

> solve (mat, c(2, 2, 3, 3)) # Result is a vector of length 4
[1] -0.008888889  0.151111111  0.186666667  0.217777778
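A quick way to check the answer is to multiply it back through the matrix. The sketch below reuses mat and res from above; the name sol is made up for illustration.

```r
res <- c(2, 2, 3, 3)
mat <- matrix(c(14, 8, 16, 6, 5, 3, 7, 6, 5, 4, 3, 1, 2, 4, 7, 9), ncol = 4)

sol <- solve(mat, res)

# Multiplying back should reproduce res, up to rounding error.
# drop() turns the 4x1 matrix product into a plain vector.
all.equal(drop(mat %*% sol), res)          # TRUE

# The two routes from the text agree, again up to rounding error.
all.equal(drop(solve(mat) %*% res), sol)   # TRUE
```

Calling solve() with both arguments avoids forming the inverse explicitly, which is generally the cheaper and numerically safer route when only the solution is needed.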

Row and column names

One final handy thing is that the rows and/or columns of your matrix can have names. For example, we might set the columns of x to have the names of colors, and the rows to be people's names:

> dimnames(x) <- list (c("Bob", "Dave", "Mary", "Sandy"), c("Blue", "White", "Red"))
Note that dimnames() is a function that expects a list (see lists). The first item in the list is the vector of row names; the second is the vector of column names; either (or both) can be omitted by replacing the vector with the reserved word NULL. Now what does x look like?
> x
      Blue White Red 
  Bob    1     2   3
 Dave    4     5   6
 Mary    7     8   9
Sandy   10    11  12 # Same contents, it's just there are now row and column names.
We can now extract by name, rather than by number. This is handy because the numbers are subject to change, if for example we delete some rows or columns.
> x["Bob","Blue"]      # The top-left element
[1] 1
> x[,"Red"]            # The "Red" column. Note the result is a vector with names.
 Bob Dave Mary Sandy 
   3    6    9    12
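The point about numbers changing can be seen in a short sketch. It uses rownames() and colnames(), which are base R functions that set one set of names at a time; the name y is made up for illustration.

```r
x <- matrix(1:12, nrow = 4, byrow = TRUE)

# rownames<- and colnames<- are an alternative to dimnames<-.
rownames(x) <- c("Bob", "Dave", "Mary", "Sandy")
colnames(x) <- c("Blue", "White", "Red")

# Remove Dave's row: the numeric positions shift, the names do not.
y <- x[-2, ]
y[2, "Red"]          # 9: row 2 is now Mary
y["Mary", "Red"]     # 9: the same cell, found by name

# Names can select several rows or columns at once, in any order.
x[c("Bob", "Sandy"), c("Red", "Blue")]
```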

Data Frames

A data frame combines features of matrices and lists. In fact we can think of a data frame as a rectangular list, that is, a list in which all items have the same length. The items of the list serve as the columns of the data frame, so every item within a particular column has to be of the same type. However, different columns can be of different types. For example, consider the built-in data frame called "PlantGrowth":
> PlantGrowth
   weight group
1    4.17  ctrl
2    5.58  ctrl
3    5.18  ctrl
4    6.11  ctrl
5    4.50  ctrl
 :     :    :
30   5.26  trt2
This looks rectangular, but it's not a matrix, since the second column isn't numeric like the first. The names of the list are the column headers: every data frame must have column names. (In contrast, a matrix doesn't have to have names.) A data frame must also have row names, although often, as here, they're just ascending integers. Since a data frame is a list, you can get at the column names with the names() function; since it also behaves like a matrix, you can get at them with the dimnames() function we used above as well.

In general (as here) a row of a data frame will contain data of different types (numbers, characters, and so on), which cannot be combined into a single vector. So in contrast to the matrix case, if you extract a single row from a data frame you get a data frame:

> PlantGrowth[3,]
  weight group
3   5.18  ctrl # This is not a vector, it's a 1x2 data frame -- and note the row name is "3", not "1".
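To see these properties on something smaller, here is a minimal sketch that builds a data frame by hand with data.frame(). The data frame pets and its contents are invented for illustration.

```r
# A made-up data frame: one numeric column, one character column.
pets <- data.frame(weight = c(4.2, 30.5, 0.1),
                   kind   = c("cat", "dog", "fish"),
                   stringsAsFactors = FALSE)

names(pets)              # "weight" "kind"   (list-style)
dimnames(pets)[[2]]      # "weight" "kind"   (matrix-style)
dim(pets)                # 3 2

row2 <- pets[2, ]        # one row: still a data frame
class(row2)              # "data.frame"
rownames(row2)           # "2": the original row name is kept

class(pets$weight)       # "numeric"
class(pets$kind)         # "character"
```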
Some matrix operations don't carry over cleanly to a data frame. For example, you can't truly transpose one, because then you'd have columns with different types of things in them. When you try, R does what it can: it converts everything to character and then does the transposition, returning a character matrix:
> t(PlantGrowth)
       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   ...
weight "4.17" "5.58" "5.18" "6.11" "4.50" "4.61" ...
group  "ctrl" "ctrl" "ctrl" "ctrl" "ctrl" "ctrl" ...

Data frames are handy because real-life data frequently comes in this form: it's very often rectangular, with each row representing one case and each column representing one variable observed on those cases. Since a data frame behaves like both a list and a matrix, we can use either matrix-type extraction or list-type extraction. For example, all four of these produce the same result:

> PlantGrowth[,1]             # (Matrix type) Give me column 1
 [1] 4.17 5.58 5.18 6.11 ...  
> PlantGrowth[[1]]            # (List type) Give me item 1
 [1] 4.17 5.58 5.18 6.11 ... 
> PlantGrowth[,"weight"]      # (Matrix type) Give me the column named "weight"
 [1] 4.17 5.58 5.18 6.11 ... 
> PlantGrowth$weight          # (List type) Give me the item named "weight"
 [1] 4.17 5.58 5.18 6.11 ... 
For list extraction with $, you only have to give enough of the name to make it unambiguous. Here PlantGrowth$w would be enough to get the information you wanted. Of course, if there were also a column named weather, you'd have to specify at least wei to be unambiguous.
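The equivalence of the four forms, and the behaviour of abbreviated names, can be checked directly. This is a sketch using the built-in PlantGrowth data; the names a, b, c3 and d are made up.

```r
a  <- PlantGrowth[, 1]
b  <- PlantGrowth[[1]]
c3 <- PlantGrowth[, "weight"]
d  <- PlantGrowth$weight

identical(a, b) && identical(b, c3) && identical(c3, d)   # TRUE

# $ accepts an unambiguous prefix of the name...
identical(PlantGrowth$w, PlantGrowth$weight)              # TRUE

# ...but [[ ]] matches names exactly by default, so a prefix finds nothing.
is.null(PlantGrowth[["w"]])                               # TRUE

# Single brackets with one index select list-style and keep the data frame.
class(PlantGrowth["weight"])                              # "data.frame"
```

Abbreviated names are convenient at the console, but spelling the name out in full is the safer habit in saved scripts, since a column added later could make the prefix ambiguous.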

Higher-Dimensional Arrays

Data frames must be two-dimensional (rows and columns). Occasionally, though, we run into a three- or higher-dimensional array. Normally this would be the output from the table() function. An array like that requires one subscript for every dimension. Here's a slightly odd example of a three-dimensional array:
> mytable <- table (PlantGrowth$weight > 4, PlantGrowth$weight > 5, PlantGrowth$group)
> mytable
, ,  = ctrl

       
        FALSE TRUE
  FALSE     0    0
  TRUE      4    6

, ,  = trt1

       
        FALSE TRUE
  FALSE     2    0
  TRUE      6    2

, ,  = trt2

       
        FALSE TRUE
  FALSE     0    0
  TRUE      1    9

Since weight > 4 was the first argument to the table function, it appears in the rows. weight > 5 appears in the columns, and group appears in the "layers." We need three subscripts to extract things from this array. Here's how we get only the first layer.
> mytable[,,"ctrl"]   # Drops the extra dimension, returns a 2x2 matrix

       
        FALSE TRUE
  FALSE     0    0
  TRUE      4    6
     
> mytable[,,"ctrl", drop=F]         # Doesn't drop: returns a 2x2x1 array.

, ,  = ctrl

       
        FALSE TRUE
  FALSE     0    0
  TRUE      4    6
        
> mytable["TRUE",,"ctrl"]       # Drops two dimensions, returns a vector of length 2
FALSE  TRUE 
    4     6 
  
> mytable["TRUE",,"ctrl", drop=F] # Returns a 1x2x1 array.
, ,  = ctrl

      
       FALSE TRUE
  TRUE     4    6
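
Arrays can also be built by hand with the base R function array(). The sketch below uses a made-up 2 x 3 x 2 array called arr to show the same subscripting rules, then checks the dimensions of the table from the text.

```r
# A made-up 2 x 3 x 2 array holding 1..12. Like a matrix, it is
# filled with the first subscript varying fastest.
arr <- array(1:12, dim = c(2, 3, 2))

dim(arr)                          # 2 3 2
arr[2, 3, 1]                      # 6
arr[1, 1, 2]                      # 7: first cell of the second layer

dim(arr[, , 1])                   # 2 3   (a matrix)
dim(arr[, , 1, drop = FALSE])     # 2 3 1 (still a three-dimensional array)
arr[2, , 2]                       # 8 10 12 (a vector)

# The table from the text is itself a three-dimensional array.
mytable <- table(PlantGrowth$weight > 4, PlantGrowth$weight > 5, PlantGrowth$group)
dim(mytable)                      # 2 2 3
mytable["TRUE", "TRUE", "trt2"]   # 9
```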
