A missing value is one whose value is unknown. Missing values are represented in R by the NA symbol. NA is a special value whose properties are different from other values. NA is one of the very few reserved words in R: you cannot give anything this name. (Because R is case-sensitive, na and Na are okay to use, although I don't recommend them.) Missing values are often legitimate: values really are missing in real life. NAs can arise when you read in an Excel spreadsheet with empty cells, for example. You will also see NA when you try certain operations that are illegal or don't make sense. Here are some examples of operations that produce NAs.
> var (8)                                   # Variance of one number
[1] NA
> as.numeric (c("1", "2", "three", "4"))    # Illegal conversion
[1]  1  2 NA  4
Warning message:
NAs introduced by coercion
> c(1, 2, 3)[4]                             # Vector subscript out of range
[1] NA
> NA - 1                                    # Most operations on NAs produce NAs
[1] NA
> a <- data.frame (a = 1:3, b = 2:4)
> a[4,]                                     # Data frame row subscript out of range
    a  b
NA NA NA                                    # The first NA there is the row name
> a[,4]                                     # Specifying a non-existent column just produces an error
Error in `[.data.frame`(a, , 4) : undefined columns selected
#
# Here's one that's particularly irksome
#
> a[1,2] <- NA                              # Suppose you have an NA in your data frame...
> a[a$b < 4,]                               # ...and you try to index on that column
    a  b
NA NA NA                                    # You get one of these NA rows for each NA in that column.
2   2  3

Note that if you specify a row number out of range for a data frame, that's not an error. You just get a row full of NAs. Interestingly, if you specify a row index of a matrix that's too big, you get a different response altogether: that is an error. It's also an error to specify too big an index for a column of either a matrix or a data frame.
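The contrast between data frames and matrices described above can be checked directly. The following is a small sketch, not part of the original examples: the objects m and d are made up for illustration, and tryCatch() is used only so that the errors are captured as messages instead of stopping the script.

```r
# Hypothetical objects: a 3 x 2 matrix and a data frame of the same shape
m <- matrix(1:6, nrow = 3)
d <- data.frame(a = 1:3, b = 4:6)

# Out-of-range row of a data frame: no error, just a row of NAs
d_row <- d[4, ]

# Out-of-range row of a matrix: an error, captured here as its message
m_row <- tryCatch(m[4, ], error = function(e) conditionMessage(e))

# Out-of-range column: an error for both kinds of object
m_col <- tryCatch(m[, 3], error = function(e) conditionMessage(e))
d_col <- tryCatch(d[, 3], error = function(e) conditionMessage(e))

print(d_row)
print(m_row)
print(m_col)
print(d_col)
```

Only d_row comes back as data; the other three results are error messages.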
> x <- c(1, 2, NA, 4)    # Set up a numeric vector
> x                      # There's an NA in there
[1]  1  2 NA  4
> x + 1                  # NA + 1 = NA
[1]  2  3 NA  5
> sum(x)                 # This produces NA because we can't add NAs
[1] NA
> length(x)              # This is okay
[1] 4
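If the NA result from sum(x) above is not what you want, one common workaround is the na.rm argument that many of R's summary functions accept. The text above does not cover na.rm, so treat this as a supplementary sketch that reuses the same vector x.

```r
x <- c(1, 2, NA, 4)

# Drop the NAs before summing or averaging
total <- sum(x, na.rm = TRUE)       # 1 + 2 + 4
average <- mean(x, na.rm = TRUE)    # mean of the three non-missing values

# Count the non-missing values; length(x) would still count the NA
n_present <- sum(!is.na(x))

print(total)
print(average)
print(n_present)
```

Whether silently dropping the NAs is appropriate depends on why the values are missing, so it is worth checking how many were removed.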
By the way, the default mode of NA is logical. That generally won't affect us.
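As a quick illustration of why the logical mode rarely matters, the sketch below shows that NA takes on the type of the vector it sits in. The typed constant NA_character_ is a standard R name, but it is not mentioned in the text above.

```r
# A bare NA is logical
bare <- typeof(NA)

# Inside a vector, NA is coerced to the type of its neighbours
in_numeric <- typeof(c(1.5, NA))
in_character <- typeof(c("a", NA))

# Typed missing-value constants exist for the cases where the type matters
typed <- typeof(NA_character_)

print(c(bare, in_numeric, in_character, typed))
```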
> x                      # Here's my vector
[1]  1  2 NA  4
> is.na(x)               # Is it NA?
[1] FALSE FALSE  TRUE FALSE    # Answer: no, no, yes, no.
> which (is.na(x))       # Which one is NA?
[1] 3                    # Answer: the third one

To find all the rows in a data frame with at least one NA, try this:
> unique (unlist (lapply (your.data.frame, function (x) which (is.na (x)))))

lapply() applies the function to each column and returns a list whose i-th element is a vector containing the indices of the elements which have missing values in column i. unlist() turns that list into a vector and unique() gets rid of the duplicates. To learn more about lapply(), see the apply family of functions.
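To see the lapply() approach at work, here is a small sketch on a made-up data frame (df is hypothetical). It also compares the result with complete.cases(), a base R function not mentioned in the text that offers another route to the same row numbers.

```r
# Hypothetical data frame with NAs in two different rows
df <- data.frame(c1 = c(1, NA, 3, 4),
                 c2 = c("a", "b", NA, "d"))

# Approach from the text: NA positions per column, flattened and de-duplicated.
# sort() is added here only so the rows come back in increasing order.
rows_lapply <- sort(unique(unlist(lapply(df, function(x) which(is.na(x))))))

# Alternative: complete.cases() is TRUE for rows that contain no NAs
rows_cc <- which(!complete.cases(df))

print(rows_lapply)
print(rows_cc)
```

Both approaches report rows 2 and 3 for this data frame.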
R's modeling functions accept an na.action argument that tells the function what to do when it encounters an NA. This causes the modeling function to call one of the missing value filter functions. These functions replace the original data set by a new data set in which the NAs have been altered. The default setting is na.omit, which excludes all rows with any missing values. An alternative is na.action=na.fail, which just stops when it encounters any missing values. This is useful if you didn't know you had any. The filter functions are:
A couple of other packages supply more alternatives:
#
# Set up a data frame, make a couple of elements NA.
#
> a <- data.frame (c1 = 1:8, c2 = factor (c("a", "b", "a", "c", "b", "c", "a", "b")))
> a[4,1] <- a[6,2] <- NA    # This repeated assignment is legal and does what you expect.
> a
  c1   c2
1  1    a
2  2    b
3  3    a
4 NA    c
5  5    b
6  6 <NA>                   # Note the slightly different display of missings inside factors
7  7    a
8  8    b
> levels(a$c2)              # Note the levels of c2 are "a," "b" and "c." NA is not a level.
[1] "a" "b" "c"
> na.fail (a)               # Fails if NAs are present
Error in na.fail.default(a) : missing values in object
> na.exclude (a)            # Omits rows with NAs in them
  c1 c2
1  1  a
2  2  b
3  3  a
5  5  b
7  7  a
8  8  b
> library (gam)             # na.gam.replace() comes from the gam package
> na.gam.replace (a)        # Replace missing in c1 with the mean of the non-missings;
        c1 c2               # Add a new level to c2
1 1.000000  a
2 2.000000  b
3 3.000000  a
4 4.571429  c
5 5.000000  b
6 6.000000 NA
7 7.000000  a
8 8.000000  b
> levels (na.gam.replace(a)$c2)    # There's now a fourth level in that column
[1] "a" "b" "c" "NA"
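The examples above call the filter functions directly. As a rough sketch of how the na.action argument behaves inside a modeling function, the code below fits the same regression with lm() under three settings. The data frame d and its values are invented for illustration.

```r
# Hypothetical data: one missing response value in row 3
d <- data.frame(x = 1:6, y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1))

# na.omit drops row 3 before fitting
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)

# na.exclude also drops row 3 for fitting, but residuals() and fitted()
# are padded with NA so they line up with the original rows
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# na.fail refuses to fit at all when NAs are present
fit_fail <- tryCatch(lm(y ~ x, data = d, na.action = na.fail),
                     error = function(e) "lm stopped: missing values in data")

n_omit <- length(residuals(fit_omit))
n_excl <- length(residuals(fit_excl))

print(n_omit)
print(n_excl)
print(fit_fail)
```

The two successful fits have identical coefficients. They differ only in whether the residuals keep a placeholder for the dropped row, which matters if you later attach them to the original data frame.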