Missing Values in R

Missing Values

A missing value is one whose value is unknown. Missing values are represented in R by the NA symbol. NA is a special value whose properties are different from other values. NA is one of the very few reserved words in R: you cannot give anything this name. (Because R is case-sensitive, na and Na are okay to use, although I don't recommend them.) Missing values are often legitimate: values really are missing in real life. NAs can arise when you read in an Excel spreadsheet with empty cells, for example. You will also see NA when you try certain operations that are illegal or don't make sense. Here are some examples of operations that produce NAs.

> var (8)                                  # Variance of one number
[1] NA
> as.numeric (c("1", "2", "three", "4"))   # Illegal conversion
[1]  1  2 NA  4
Warning message:
NAs introduced by coercion
> c(1, 2, 3)[4]                            # Vector subscript out of range                   
[1] NA
> NA - 1                                   # Most operations on NAs produce NAs
[1] NA

> a <- data.frame (a = 1:3, b = 2:4)
> a[4,]                                    # Data frame row subscript out of range
    a  b 
NA NA NA                                   # The first NA there is the row name
> a[,4]                                    # Specifying a non-existent column just produces an error
Error in `[.data.frame`(a, , 4) : undefined columns selected
#
# Here's one that's particularly irksome
#
> a[1,2] <- NA                             # Suppose you have an NA in your dataframe...
> a[a$b < 4,]                              # ...and you try to index on that column
    a  b 
NA NA NA                                   # You get one of these NA rows for each NA in that column.
 2  2  3

Note that specifying an out-of-range row number for a data frame is not an error: you just get a row full of NAs. Interestingly, an out-of-range row index on a matrix gets a different response altogether: that is an error. It's also an error to specify an out-of-range column index for either a matrix or a data frame.
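
The difference between the two structures can be checked with a short sketch. The objects m and df are made up for illustration, and tryCatch() is used only so that the errors don't stop the script.

```r
m  <- matrix(1:6, nrow = 3)          # a 3 x 2 matrix (illustrative)
df <- data.frame(a = 1:3, b = 2:4)   # a 3 x 2 data frame (illustrative)

# Out-of-range row of a data frame: no error, just a row of NAs
df_row <- df[4, ]

# Out-of-range row of a matrix: an error, caught here and replaced by a marker
m_row <- tryCatch(m[4, ], error = function(e) "error")

# Out-of-range column: an error for both the matrix and the data frame
m_col  <- tryCatch(m[, 3],  error = function(e) "error")
df_col <- tryCatch(df[, 3], error = function(e) "error")
```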

Operations on Missing Values

Almost every operation performed on an NA produces an NA. For example:

> x <- c(1, 2, NA, 4)                      # Set up a numeric vector
> x                                        # There's an NA in there
[1]  1  2 NA  4              
> x + 1                                    # NA + 1 = NA
[1]  2  3 NA  5
> sum(x)                                   # This produces NA because we can't add NAs
[1] NA
> length(x)                                # This is okay
[1] 4

By the way, the default mode of NA is logical. That generally won't affect us.
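
If you ever do need an NA of a particular type, R provides typed constants. This is a small sketch of how they behave; the vector x is just an illustration.

```r
class(NA)              # "logical"
class(NA_integer_)     # "integer"
class(NA_real_)        # "numeric"
class(NA_character_)   # "character"

# A plain (logical) NA is coerced to the type of the vector it joins
x <- c(1, 2, NA)
class(x)               # "numeric"
```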

Detecting NAs

You can't find missing values by looking at x == NA. Like most other functions, the == operator returns NA when either argument is NA. The is.na() function will find missing values for you: it returns a logical vector the same length as its argument, with TRUE for missing values and FALSE for non-missing ones. It's fairly common to want to know the index of the missing values, and the which() function will do this for you. For example:

> x                # Here's my vector
[1]  1  2 NA  4    
> is.na(x)         # Is it NA?
[1] FALSE FALSE  TRUE FALSE   # Answer: no, no, yes, no.
> which (is.na(x)) # Which one is NA?
[1] 3              # Answer: the third one
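
Here is the same idea as a script you can paste in, a minimal sketch that also counts the missing values and drops them. The variable names are illustrative; anyNA() requires R 3.1.0 or later.

```r
x <- c(1, 2, NA, 4)

eq          <- x == NA        # NA in every position: useless for detection
miss        <- is.na(x)       # FALSE FALSE TRUE FALSE
where       <- which(miss)    # 3
n_missing   <- sum(miss)      # 1, because TRUE counts as 1
has_missing <- anyNA(x)       # TRUE
x_present   <- x[!miss]       # 1 2 4
```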

To find all the rows in a data frame with at least one NA, try this:

> unique (unlist (lapply (your.data.frame, function (x) which (is.na (x)))))

lapply() applies the function to each column and returns a list whose i-th element is a vector containing the indices of the elements which have missing values in column i. unlist() turns that list into a vector and unique() gets rid of the duplicates. To learn more about lapply(), see the apply family of functions.
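
To see that one-liner at work, here is a sketch on a small made-up data frame. The second approach, complete.cases() from the stats package, is an alternative that is not part of the text above; it returns the same rows, in sorted order.

```r
df <- data.frame(c1 = c(1, NA, 3, 4),
                 c2 = c("a", "b", NA, "d"))   # illustrative data

# The approach described above: collect NA positions column by column
idx <- unique(unlist(lapply(df, function(x) which(is.na(x)))))

# Alternative: complete.cases() is TRUE for rows with no NAs at all
idx2 <- which(!complete.cases(df))

df[idx2, ]   # the rows that contain at least one NA
```

Note that the lapply() version returns the indices in column order, not sorted, so sort() the result if order matters.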

Ways to Exclude Missing Values

Math functions generally have a way to exclude missing values in their calculations. mean(), median(), colSums(), var(), sd(), min() and max() all take the na.rm argument. When this is TRUE, missing values are omitted. The default is FALSE, meaning that each of these functions returns NA if any input number is NA. Note that cor() and its relatives don't work that way: with those you need to supply the use= argument. This is to permit more complicated handling of missing values than simply omitting them.
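
A short sketch of the difference between na.rm= and use=, with made-up vectors x and y. The "complete.obs" setting is one of several values that use= accepts; it keeps only the pairs in which both values are present.

```r
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 11)

mean(x)                    # NA: the default is na.rm = FALSE
mean(x, na.rm = TRUE)      # 3, the mean of 1, 2, 4 and 5
max(x, na.rm = TRUE)       # 5

r_default  <- cor(x, y)                         # NA
r_complete <- cor(x, y, use = "complete.obs")   # uses pairs 1, 2 and 5 only
```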

R's modeling functions accept an na.action argument that tells the function what to do when it encounters an NA. This causes the modeling function to call one of the missing value filter functions. These functions replace the original data set by a new data set in which the NAs have been altered. The default setting is na.omit, which excludes all rows with any missing values. An alternative is na.action=na.fail, which just stops when it encounters any missing values. This is useful if you didn't know you had any. The filter functions are:

na.fail: Stop if any missing values are encountered.
na.omit: Drop any rows with missing values anywhere in them and forget them forever.
na.exclude: Drop rows with missing values, but keep track of where they were (so that when you make predictions, for example, you end up with a vector whose length is that of the original response).
na.pass: Take no action.

A couple of other packages supply more alternatives:

na.tree.replace (library (tree)): For discrete variables, add a new category called "NA" to replace the missing values.
na.gam.replace (library (gam)): Operate on discrete variables like na.tree.replace(); for numerics, replace NAs with the mean of the non-missing entries.

Here are examples of these functions at work. You can call them directly, as I will do here, but they are also commonly used as values for the na.action= argument to the modeling functions.

#
# Set up a data frame, make a couple of elements NA.
#
> a <- data.frame (c1 = 1:8, c2 = factor (c("a", "b", "a", "c", "b", "c", "a", "b")))
> a[4,1] <- a[6,2] <- NA    # This repeated assignment is legal and does what you expect.
> a
  c1  c2 
1  1   a
2  2   b
3  3   a
4 NA   c
5  5   b
6  6 <NA>              # Note the slightly different display of missings inside factors
7  7   a
8  8   b
> levels(a$c2)              # Note the levels of c2 are "a," "b" and "c." NA is not a level.
[1] "a" "b" "c"

> na.fail (a)               # Fails if NAs are present
Error in na.fail.default(a) : missing values in object 

> na.exclude (a)            # Omits rows with NAs in them
  c1 c2 
1  1  a
2  2  b
3  3  a
5  5  b
7  7  a
8  8  b

> na.gam.replace (a)        # Replace missing in c1 with the mean of the non-missings;
        c1 c2               # Add a new level to c2
1 1.000000  a
2 2.000000  b
3 3.000000  a
4 4.571429  c
5 5.000000  b
6 6.000000 NA
7 7.000000  a
8 8.000000  b
> levels (na.gam.replace(a)$c2)  # There's now a fourth level in that column
[1] "a"  "b"  "c"  "NA"
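
To show the same filters in their more common role as na.action= values, here is a minimal sketch using lm(). The data frame d is invented for illustration. The point is the difference between na.omit and na.exclude: the fitted coefficients are the same, but na.exclude pads residuals and predictions back out to the original length.

```r
d <- data.frame(x = 1:6,
                y = c(1.1, 1.9, NA, 4.2, 5.1, 5.8))   # illustrative data

fit_omit    <- lm(y ~ x, data = d, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = d, na.action = na.exclude)

length(residuals(fit_omit))      # 5: the dropped row is forgotten
length(residuals(fit_exclude))   # 6: an NA is put back where row 3 was

# na.fail refuses to fit at all when NAs are present
failed <- tryCatch({ lm(y ~ x, data = d, na.action = na.fail); FALSE },
                   error = function(e) TRUE)
```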

Special Case 1: Missing Values in Factor Vectors

We noted above that a missing value in a factor variable is displayed as <NA> rather than just NA. Again, missing values do not have a level, but you can change a missing value to one of the existing levels. (You can't create a new level on the fly, though; see the discussion of factors.)
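
A small sketch of both behaviors, using made-up factors f and g. The last step uses base R's addNA(), which is not mentioned above; it is one way to make NA a level of its own if you want that.

```r
f <- factor(c("a", "b", NA, "c"))
levels(f)        # "a" "b" "c": NA is not a level
f[3] <- "a"      # fine, because "a" is an existing level

g <- factor(c("a", "b", NA))
# Assigning a value that is not a level gives a warning, not a new level;
# the warning is caught here so that the script keeps running quietly
warned <- tryCatch({ g[3] <- "z"; FALSE }, warning = function(w) TRUE)

g_na <- addNA(g)   # a copy of g in which NA is an extra level
levels(g_na)       # "a" "b" NA
```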

Special Case 2: Missing Values in Character Vectors

Character vectors can have missing values. They display as NA in the usual way. This really isn't a special case at all.
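
The one thing to watch for is the difference between a missing value and the two-letter string "NA". A quick sketch, with an illustrative vector s:

```r
s <- c("apple", NA, "NA")

miss  <- is.na(s)      # FALSE TRUE FALSE: the string "NA" is not missing
where <- which(miss)   # 2
```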

Special Case 3: NaNs

In addition to NA, R has a special value NaN for "not a number." 0/0 is an example of a calculation that will produce a NaN. NaNs print as NaN, but generally act like NAs. (For example, a computation done on a NaN produces a NaN; if you try to extract the NaNth element of a vector, you get NA.) One more special value is Inf. If you need them, the is.nan() and is.infinite() functions find things that are NaN or infinite and not NA.
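
Here is a sketch of how the detection functions treat each special value, using an illustrative vector v. Note that is.na() is TRUE for NaN as well as NA, while is.nan() is TRUE only for NaN.

```r
v <- c(1, NA, 0/0, 1/0, -1/0)   # 1, NA, NaN, Inf, -Inf

na_flags  <- is.na(v)         # FALSE  TRUE  TRUE FALSE FALSE
nan_flags <- is.nan(v)        # FALSE FALSE  TRUE FALSE FALSE
inf_flags <- is.infinite(v)   # FALSE FALSE FALSE  TRUE  TRUE
fin_flags <- is.finite(v)     #  TRUE FALSE FALSE FALSE FALSE

c(1, 2, 3)[NaN]               # NA, as described above
```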
