A "factor" is a vector whose elements can take on one of a specific set of values. For example, "Sex" will usually take on only the values "M" or "F", whereas "Name" will generally have lots of possibilities. The set of values that the elements of a factor can take is called its levels. If you want to add a new level to a factor, you can do that, but you can't just change elements to have new values that aren't already levels. Here's an example. I'll start by creating a factor whose values are "a", "b", and "c". The factor() function will do this, and it will generate the levels automatically.
> a <- factor (c("a", "b", "c", "b", "c", "b", "a", "c", "c")) # create the factor
> a                       # Print the new variable
[1] a b c b c b a c c     # You can tell it's not a character vector: no quotes
Levels: a b c             # Also the levels print out
> levels(a)               # You can get the set of levels separately
[1] "a" "b" "c"
#
# What if I try to change an element to a new value, like "d"?
#
> a[3] <- "d"
Warning messages:
In `[<-.factor`(`*tmp*`, 3, value = "d") :
  invalid factor level, NAs generated
#
# The warning message tells you that some NAs have been generated.
#
> a
[1] a    b    <NA> b    c    b    a    c    c
Levels: a b c
#
# However it's okay to set elements to values that are already levels:
#
> a[3] <- "a"
> a
[1] a b a b c b a c c
Levels: a b c
#
# It's also easy to change levels. Here I'll change the "a"'s to "AA". Notice that
# I don't change the values themselves, just the levels.
#
> levels(a)
[1] "a" "b" "c"
> levels(a)[1] <- "AA"
> a
[1] AA b  AA b  c  b  AA c  c
Levels: AA b c
#
# The general way to convert a factor to character is with as.character():
#
> as.character(a)
[1] "AA" "b"  "AA" "b"  "c"  "b"  "AA" "c"  "c"
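I said at the top that you can add a new level if you want one. Here's a minimal sketch of that step, assuming a fresh copy of the same data (the name newfac is made up for this illustration):

```r
# newfac is a hypothetical stand-in for the factor used above
newfac <- factor(c("a", "b", "c", "b", "c", "b", "a", "c", "c"))

# Extend the set of levels first; the existing elements are unchanged
levels(newfac) <- c(levels(newfac), "d")

# Now "d" is a legal value, so this assignment produces no warning
newfac[3] <- "d"

levels(newfac)         # the levels now include "d"
as.character(newfac)   # the third element is "d"
```

The order matters: the level has to exist before any element is set to it.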
By default the levels are the unique data values sorted alphabetically. This turns out to matter in some statistical models. You can reorder the levels if you want.
> a <- factor (c(1, 2, 3, 2, 3, 2, 1), levels=1:4, labels=c("Small", "Medium", "Large", "Huge")) # Create it
#
# In this example, the "levels=1:4" is required. Otherwise the mismatch between the four labels
# and the three distinct values will get you in trouble. Of course the values in "levels"
# need to match the values in the data.
#
> a
[1] Small  Medium Large  Medium Large  Medium Small
Levels: Small Medium Large Huge
#
# Notice how the levels (including "Huge") print out.
#
# Take a look at the table of a. The "Huge" level is remembered.
#
> table (a)
 Small Medium  Large   Huge
     2      3      2      0

I usually want this, but you can get rid of unused levels when subscripting by using the drop=TRUE argument:
> table (a[,drop=TRUE])
 Small Medium  Large
     2      3      2
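Another way to discard unused levels is the droplevels() function in base R. This is only a sketch; the names sizes and trimmed are invented for it, and the factor is rebuilt so the example stands by itself:

```r
# sizes is a hypothetical copy of the factor from the example above
sizes <- factor(c(1, 2, 3, 2, 3, 2, 1), levels = 1:4,
                labels = c("Small", "Medium", "Large", "Huge"))

# Drop every level that has no observations
trimmed <- droplevels(sizes)

levels(trimmed)   # "Huge" no longer appears
table(trimmed)    # so the table has three columns, not four
```

The original sizes keeps all four of its levels; droplevels() returns a new factor rather than modifying its argument.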
> a[3] <- NA                 # Make one entry NA
> a                          # Sure enough
[1] Small  Medium <NA>   Medium Large  Medium Small
Levels: Small Medium Large Huge
#
# It's missing from table()...
#
> table (a)
a
 Small Medium  Large   Huge
     2      3      1      0
#
# ... unless we ask explicitly.
#
> table (a, exclude=NULL)
a
 Small Medium  Large   Huge   <NA>
     2      3      1      0      1
> sum (is.na (a))            # How many NAs are there in this vector?
[1] 1                        # Answer: 1
> Steve$G <- as.numeric(levels(Steve$G)[Steve$G]) # or <- as.numeric(as.character(Steve$G))

Remember that, internally, Steve$G is stored as integer codes. So using Steve$G as a subscript into the vector of levels is certainly possible.
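To see why indexing the levels by the factor works, here is a small sketch with a made-up factor g (this is not the Steve$G column, whose contents aren't shown here):

```r
# g is a hypothetical factor whose levels happen to look like numbers
g <- factor(c("10", "2.5", "10", "7"))

levels(g)       # the distinct values, stored as character strings
as.integer(g)   # the internal codes: each element's position in levels(g)

# Used as a subscript, the factor acts like those integer codes,
# so this picks out the level string for each element...
levels(g)[g]

# ...and those strings can be turned into the numbers they represent
as.numeric(levels(g)[g])
```

Both forms give the same answer; the levels() version only has to convert each distinct level once, which may matter for long vectors.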
Watch out! If a numeric column gets converted to a factor, the levels will be what you want, but the internal representation (the numbers 1, 2, and so on that R uses to keep track of things) will generally not be. The reason is that by default level 1 gets assigned to the first value in alphabetical order, level 2 to the second value, and so on. So suppose that your values are 8, 25, 111, and "Missing". When that gets imported, it will be recognized as character data. Then it will be converted to a factor, with levels in the alphabetical order of those character strings. Of course the alphabetic sorting scheme is different from the numeric one. In this example, "111" would come first alphabetically. Here's another example:
> factor (c(1, 3, 17, 4, "NA", 5))    # Create and display a factor variable. The whole vector
[1] 1  3  17 4  NA 5                  # is converted to character before being factorized.
Levels: 1 17 3 4 5 NA
#
# Suppose we save that as "fac" and try to convert to numeric
#
> fac <- factor (c(1, 3, 17, 4, "NA", 5))
> as.numeric (fac)
[1] 1 3 2 4 6 5                       # NOT what we want
> as.numeric (as.character(fac))      # Much better
[1]  1  3 17  4 NA  5
Warning message:
NAs introduced by coercion

In that example, the character string "17" comes between "1" and "2" (just as "Ag" comes between "A" and "B") and so the "17" gets level 2. The as.numeric() function converts the factor into its level numbers. That's almost certainly not what you wanted.
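The 8, 25, 111, "Missing" case described earlier behaves the same way. In this sketch the vector name raw is made up, and spelling out levels= is just one possible remedy, not the only one:

```r
# raw is a hypothetical column as it might look after being imported as character
raw <- c("8", "25", "111", "Missing", "25")

f <- factor(raw)
levels(f)        # alphabetical order, so "111" comes first
as.integer(f)    # the codes follow that alphabetical order, not numeric size

# One option: give the level order explicitly so the codes follow numeric size
f2 <- factor(raw, levels = c("8", "25", "111", "Missing"))
as.integer(f2)
```

Even with f2 the codes are only ranks (1, 2, 3, ...), not the values themselves, so as.numeric(as.character()) is still the way to get the numbers back.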
# Convert every factor column of the data frame a to character
> for (i in seq_len (ncol (a))) if (is.factor (a[,i])) a[,i] <- as.character (a[,i])
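The same conversion can be sketched without an explicit loop. The data frame df below is invented for the illustration, and this is just one way to write it:

```r
# df is a hypothetical data frame with one integer column and one factor column
df <- data.frame(id = 1:3, grp = factor(c("x", "y", "x")))

# Replace each factor column by its character version and leave the others alone.
# Assigning into df[] (rather than df) keeps the result a data frame.
df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)

sapply(df, class)   # grp is now "character"; id is untouched
```

Using is.factor() rather than comparing class() to "factor" also catches ordered factors, whose class vector has two entries.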
Reordering the levels of a factor
This question arises in some models. The first level is set to be the baseline in the usual "treatment contrasts" setup (see the discussion of contrasts.) Sometimes it's desirable to have a different level be the baseline. To do that, convert the vector to character, then call factor() passing the new levels in the levels= argument. The result will look like the original; only the ordering of the levels will have changed. For example:
> a <- factor (c("a", "b", "c", "b", "c", "b", "a", "c", "c"))
> a                       # Print a
[1] a b c b c b a c c
Levels: a b c
#
# The table is produced in order of the levels, which is, by default, alphabetical
#
> table (a)
a
a b c
2 3 4
#
# Convert a to character, then back to factor with a new vector of levels
#
> a <- factor (as.character(a), levels=c("c", "a", "b"))
> a                       # a is unchanged, but note the levels
[1] a b c b c b a c c
Levels: c a b
> table (a)               # The table is unchanged, too, but it's in a different order.
a
c a b
4 2 3

See also the relevel() and reorder() functions.
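When only the baseline needs to change, relevel() is a shorter route. Here is a minimal sketch, using fresh names (let and let2, both made up) so as not to disturb a:

```r
# let is a hypothetical copy of the original factor, with alphabetical levels
let <- factor(c("a", "b", "c", "b", "c", "b", "a", "c", "c"))

# Move "c" to the front; the remaining levels keep their existing order
let2 <- relevel(let, ref = "c")

levels(let2)   # "c" is now the first (baseline) level
table(let2)    # same counts as before, listed in the new order
```

Unlike the factor() approach, relevel() only moves one level to the front, so it won't do if you need to specify the complete ordering.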
In practice I don't make much use of ordered factors.