A "factor" is a vector whose elements can take on one of a specific set of values. For example, "Sex" will usually take on only the values "M"
or "F," whereas "Name" will generally have lots of possibilities. The values that the elements of a factor can take are called its **levels**.
If you want to add a new level to a factor, you can do that, but you can't just change elements to have new values that aren't already levels.
Here's an example. I'll start by creating a factor whose values are "a", "b", and "c." The `factor()` function will do this, and it will generate the labels automatically.

    > a <- factor (c("a", "b", "c", "b", "c", "b", "a", "c", "c"))  # create the factor
    > a                                   # Print the new variable
    [1] a b c b c b a c c                 # You can tell it's not a character vector: no quotes
    Levels: a b c                         # Also the levels print out
    > levels(a)                           # You can get the set of levels separately
    [1] "a" "b" "c"
    #
    # What if I try to change an element to a new value, like "d"?
    #
    > a[3] <- "d"
    Warning messages:
    In `[<-.factor`(`*tmp*`, 3, value = "d") :
      invalid factor level, NAs generated
    #
    # The warning message tells you that some NAs have been generated.
    #
    > a
    [1] a    b    <NA> b    c    b    a    c    c
    Levels: a b c
    #
    # However it's okay to set elements to values that are already levels:
    #
    > a[3] <- "a"
    > a
    [1] a b a b c b a c c
    Levels: a b c
    #
    # It's also easy to change levels. Here I'll change the "a"'s to "AA". Notice that
    # I don't change the values themselves, just the levels.
    #
    > levels(a)
    [1] "a" "b" "c"
    > levels(a)[1] <- "AA"
    > a
    [1] AA b  AA b  c  b  AA c  c
    Levels: AA b c
    #
    # The general way to convert a factor to character is with as.character():
    #
    > as.character(a)
    [1] "AA" "b"  "AA" "b"  "c"  "b"  "AA" "c"  "c"
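The opening paragraph says that a new level can be added to a factor, but the transcript only shows the failure case. Here is a minimal sketch of one way to do it, assuming R (I have not checked it in S-Plus): extend `levels()` first, then assign the new value.

```r
# Start from the same factor as in the example
a <- factor(c("a", "b", "c", "b", "c", "b", "a", "c", "c"))

# Step 1: append "d" to the set of levels (existing elements are untouched)
levels(a) <- c(levels(a), "d")

# Step 2: the assignment is now legal and produces no warning
a[3] <- "d"

levels(a)        # the levels now include "d"
as.character(a)  # element 3 is "d" rather than NA
```

The order of the two steps matters: assigning "d" before it is a level produces an NA, as shown earlier.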

By default the levels are the unique data values **sorted alphabetically**. This turns out to matter in some statistical models. You can reorder the levels if you want.

    > a <- factor (c(1, 2, 3, 2, 3, 2, 1), levels=1:4, labels=c("Small", "Medium", "Large", "Huge"))  # Create it
    #
    # In this example, the "levels=1:4" is required. Otherwise the mismatch between the fact that
    # there are four labels but only three values will get you in trouble. Of course the values in "levels"
    # need to match the values in the data.
    #
    > a
    [1] Small  Medium Large  Medium Large  Medium Small
    Levels: Small Medium Large Huge
    #
    # Notice how the levels (including "Huge") print out.
    #
    # Take a look at the table of a. The "Huge" level is remembered.
    #
    > table (a)
     Small Medium  Large   Huge
         2      3      2      0

I usually want this, but you can get rid of unused levels when subscripting by using the `drop=T` argument:

    > table (a[,drop=T])
     Small Medium  Large
         2      3      2
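The `drop=T` subscript is not the only way to discard unused levels. The sketch below compares it with two alternatives, `droplevels()` and re-applying `factor()`. It assumes a reasonably recent version of R; `droplevels()` may not be available in S-Plus.

```r
# Same factor as above: four levels, but "Huge" never occurs in the data
a <- factor(c(1, 2, 3, 2, 3, 2, 1), levels = 1:4,
            labels = c("Small", "Medium", "Large", "Huge"))

b1 <- a[, drop = TRUE]   # the subscripting form used above
b2 <- droplevels(a)      # base R helper for the same job
b3 <- factor(a)          # re-factoring keeps only the levels that occur

levels(b1)
levels(b2)
levels(b3)
```

All three keep the remaining levels in their original order, so "Small" still comes before "Medium".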

    > a[3] <- NA                  # Make one entry NA
    > a                           # Sure enough
    [1] Small  Medium NA     Medium Large  Medium Small
    Levels:
    [1] "Small"  "Medium" "Large"  "Huge"
    #
    # It's missing from table()...
    #
    > table (a)
    a
     Small Medium  Large   Huge
         2      3      1      0
    #
    # ... unless we ask explicitly.
    #
    > table (a, exclude=NULL)
    a
     Small Medium  Large   Huge     NA
         2      3      1      0      1
    > sum (is.na (a))             # How many NAs are there in this vector?
    [1] 1                         # Answer: 1
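In R, `table()` also accepts a `useNA` argument, which some people find easier to read than `exclude=NULL`. This is a small sketch under the assumption that you are running R rather than S-Plus, where the argument may not exist.

```r
# Rebuild the factor and blank out one entry, as in the transcript above
a <- factor(c(1, 2, 3, 2, 3, 2, 1), levels = 1:4,
            labels = c("Small", "Medium", "Large", "Huge"))
a[3] <- NA

# useNA = "ifany" adds an NA column only when the data contain NAs
counts <- table(a, useNA = "ifany")
counts

# Counting the NAs directly
n_missing <- sum(is.na(a))
n_missing
```

With `useNA = "always"` the NA column appears even when its count is zero, which is handy if several tables need to line up.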

    > Steve$G <- as.numeric(levels(Steve$G)[Steve$G])    # or <- as.numeric(as.character(Steve$G))

Remember that, internally,

Watch out! If a numeric variable gets converted to a factor, the **levels** will be what you want, but the internal representation (the numbers 1, 2, and so on, which S-Plus uses to keep track of things) will generally **not** be. The reason is that by default level 1 gets assigned to the first value **in alphabetical order**, level 2 to the second value, and so on. So suppose that your values are 8, 25, 111, and "Missing". When that column gets imported, it will be recognized as character data. Then it will be converted to a factor whose levels are those character strings, sorted alphabetically. Of course the alphabetic sorting scheme is different from the numeric one: in this example, "111" would come first alphabetically. Here's another example:

    > factor (c(1, 3, 17, 4, "NA", 5))    # Create and display a factor variable. The whole vector
    [1] 1  3  17 4  NA 5                  # is converted to character before being factorized.
    Levels: 1 17 3 4 5 NA
    #
    # Suppose we save that as "fac" and try to convert to numeric
    #
    > fac <- factor (c(1, 3, 17, 4, "NA", 5))
    > as.numeric (fac)
    [1] 1 3 2 4 6 5                       # NOT what we want
    > as.numeric (as.character(fac))      # Much better
    [1]  1  3 17  4 NA  5
    Warning message:
    NAs introduced by coercion

In that example, the character string "17" comes between "1" and "2" (just as "Ag" comes between "A" and "B") and so the "17" gets level 2. The

    > # Convert every factor column of the data frame a to character
    > for (i in 1:ncol (a)) if (is.factor (a[,i])) a[,i] <- as.character(a[,i])

Here's an interesting fact. Remember how you can refer to columns of a data frame either in matrix style or in list style? When you use the matrix-style notation, S-Plus will often factorize your character variables automatically. That's not true for list-style notation, so list-style is often what you want. Here's an example:

    > a <- factor (c("a", "b", "c", "b", "c", "b", "a", "c", "c"))
    > a                                   # Print a
    [1] a b c b c b a c c
    # The table is produced in order of the levels, which is, by default, alphabetical
    > table (a)
    a
    a b c
    2 3 4
    #
    # Convert a to character, then back to factor with a new vector of levels
    #
    > a <- factor (as.character(a), levels=c("c", "a", "b"))
    > a                                   # a is unchanged, but note the levels
    [1] a b c b c b a c c
    Levels: c a b
    > table (a)                           # The table is unchanged, too, but it's in a different order.
    a
    c a b
    4 2 3
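The round trip through `as.character()` is not strictly necessary in R, since `factor()` accepts an existing factor directly. The sketch below, which assumes R and uses a hypothetical second variable `b`, also shows that reordering changes the internal integer codes even though the values stay the same.

```r
a <- factor(c("a", "b", "c", "b", "c", "b", "a", "c", "c"))

# Passing the factor itself to factor() with a new levels vector
b <- factor(a, levels = c("c", "a", "b"))

levels(b)                                # the new order of the levels
all(as.character(a) == as.character(b))  # the values themselves are the same

as.integer(a)   # internal codes under the default (alphabetical) order
as.integer(b)   # internal codes under the new order
```

This is the same point made earlier about converting factors to numeric: the codes depend on the level order, so they should never be treated as the data.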

In practice I don't make much use of ordered factors.
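For readers who have not met them, an ordered factor is a factor whose levels have a declared ranking, so comparisons between elements make sense. This is a minimal sketch in R; the variable name `sizes` and its data are made up for illustration.

```r
# The order of the levels argument defines the ranking: Small < Medium < Large
sizes <- factor(c("Small", "Large", "Medium", "Small"),
                levels = c("Small", "Medium", "Large"),
                ordered = TRUE)

is.ordered(sizes)   # TRUE for ordered factors, FALSE for ordinary ones
sizes < "Large"     # element-wise comparison against a level
max(sizes)          # the highest-ranked level present in the data
```

On an ordinary (unordered) factor, the comparison and `max()` are not meaningful, which is the practical difference between the two kinds.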