Factor Variables

# Factor Variables

#### What factor variables are

A "factor" is a vector whose elements can take on one of a specific set of values. For example, "Sex" will usually take on only the values "M" or "F", whereas "Name" will generally have lots of possibilities. The set of values that the elements of a factor can take is called its levels. If you want to add a new level to a factor, you can do that, but you can't just change elements to have new values that aren't already levels. Here's an example. I'll start by creating a factor whose values are "a", "b", and "c". The factor() function will do this, and it will generate the levels automatically.

> a <- factor (c("a", "b", "c", "b", "c", "b", "a", "c", "c")) # create the factor
> a                      # Print the new variable
[1] a b c b c b a c c    # You can tell it's not a character vector: no quotes
Levels: a b c            # Also the levels print out
> levels(a)              # You can get the set of levels separately
[1] "a" "b" "c"
#
# What if I try to change an element to a new value, like "d"?
> a[3] <- "d"
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "d") :
invalid factor level, NAs generated
#
# The warning message tells you that some NAs have been generated.
#
> a
[1] a    b    <NA> b    c    b    a    c    c
Levels: a b c
#
# However it's okay to set elements to values that are already levels:
#
> a[3] <- "a"
> a
[1] a b a b c b a c c
Levels: a b c
#
# It's also easy to change levels. Here I'll change the "a"'s to "AA". Notice that
# I don't change the values themselves, just the levels.
#
> levels(a)
[1] "a" "b" "c"
> levels(a)[1] <- "AA"
> a
[1] AA b  AA b  c  b  AA c  c
Levels: AA b c
#
# The general way to convert a factor to character is with as.character():
#
> as.character(a)
[1] "AA" "b"  "AA" "b"  "c"  "b"  "AA" "c"  "c"
#

By default the levels are the unique data values sorted alphabetically. This turns out to matter in some statistical models. You can reorder the levels if you want.
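
The opening paragraph says you can add a new level to a factor, but the transcript above only shows the failing case. Here's a minimal sketch of one way to do it, assuming the same factor as above: extend levels() first, then assign. This is just one approach, not the only one.

```r
# Start from the same factor used in the transcript above.
a <- factor(c("a", "b", "c", "b", "c", "b", "a", "c", "c"))

# Register "d" as a legal level first; the existing elements are untouched.
levels(a) <- c(levels(a), "d")

# Now the assignment succeeds without a warning and without producing an NA.
a[3] <- "d"

print(a)
print(levels(a))
```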

#### Internal Storage and Extra Levels

Factor variables are stored, internally, as numeric variables together with their levels. The actual values of the numeric variable are 1, 2, and so on. Not every level has to appear in the vector. In this example I create a factor variable with four levels, even though I only actually have data in three of them.
> a <- factor (c(1, 2, 3, 2, 3, 2, 1), levels=1:4, labels=c("Small", "Medium", "Large", "Huge")) # Create it
#
# In this example, the "levels=1:4" is required. Otherwise factor() finds only three distinct
# values in the data but four labels, and stops with an error. Of course the values in "levels"
# need to match the values in the data.
#
> a
[1] Small  Medium Large  Medium Large  Medium Small
Levels: Small Medium Large Huge
#
# Notice how the levels (including "Huge") print out.
#
# Take a look at the table of a. The "Huge" level is remembered.
#
> table (a)
a
 Small Medium  Large   Huge
     2      3      2      0
I usually want this, but you can get rid of unused levels when subscripting by using the drop=TRUE argument:
> table (a[,drop=TRUE])
 Small Medium  Large
     2      3      2
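
To make the "numeric codes plus levels" description concrete, here's a small sketch that pulls the two pieces apart, using the same factor as above. It also uses droplevels(), a base R function that discards unused levels, as an alternative to subscripting with drop=TRUE. The names codes and b are mine, for illustration only.

```r
a <- factor(c(1, 2, 3, 2, 3, 2, 1), levels = 1:4,
            labels = c("Small", "Medium", "Large", "Huge"))

# The internal integer codes, one per element.
codes <- as.integer(a)
print(codes)

# Indexing the levels by the codes recovers the labels.
print(levels(a)[codes])

# droplevels() keeps the data but discards levels that never occur.
b <- droplevels(a)
print(levels(b))
```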

#### Missing values in factors

Missing values in factor variables can be a drag. By default they're invisible to the table() function; to count them you have to ask explicitly, for example with the exclude=NULL argument.
> a[3] <- NA # Make one entry NA
> a          # Sure enough
[1] Small  Medium <NA>   Medium Large  Medium Small
Levels: Small Medium Large Huge
#
# It's missing from table()...
#
> table (a)
a
Small Medium  Large   Huge
2      3      1      0
#
# ... unless we ask explicitly.
#
> table (a, exclude=NULL)
a
 Small Medium  Large   Huge   <NA>
     2      3      1      0      1
> sum (is.na (a))              # How many NAs are there in this vector?
[1] 1                          # Answer: 1
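
Besides exclude=NULL, base R offers two other ways to make the NAs show up. This sketch, which rebuilds the factor from the example above, shows the useNA argument of table() and the addNA() function, which turns NA into a level of its own. Which one you prefer is a matter of taste.

```r
a <- factor(c("Small", "Medium", NA, "Medium", "Large", "Medium", "Small"),
            levels = c("Small", "Medium", "Large", "Huge"))

# useNA = "ifany" adds an <NA> column only when NAs are actually present.
counts <- table(a, useNA = "ifany")
print(counts)

# addNA() makes NA an explicit level, so it travels with the factor.
a2 <- addNA(a)
print(levels(a2))
```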

#### Q: Why is my character variable a factor?

When you construct a data.frame with read.table() in versions of R before 4.0.0, the default decision is to turn every character variable into a factor. (From R 4.0.0 on, the default is stringsAsFactors = FALSE, so character columns stay character.) This may or may not be a good idea for you (see "When do I want a factor variable?" below). If you don't want factors, use the stringsAsFactors = FALSE argument to read.table(). For finer control there is the as.is argument. A single TRUE says "leave everything as is"; a vector of TRUEs and FALSEs results in the conversion of all the columns for which the as.is argument is FALSE. You can also give as.is a numeric vector naming the specific columns to leave as is.

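Here's a short sketch of how stringsAsFactors and as.is behave in read.table(). To keep it self-contained it reads from a string through the text= argument rather than from a file; the column names and data are made up for illustration.

```r
raw <- "name score\nann 3\nbob 5\ncat 4"

# Ask for factors explicitly (this was the default before R 4.0.0).
d1 <- read.table(text = raw, header = TRUE, stringsAsFactors = TRUE)

# Keep character columns as character.
d2 <- read.table(text = raw, header = TRUE, stringsAsFactors = FALSE)

# as.is can target individual columns: here column 1 is left as is.
d3 <- read.table(text = raw, header = TRUE, stringsAsFactors = TRUE, as.is = 1)

print(sapply(d1, class))
print(sapply(d2, class))
print(sapply(d3, class))
```
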
#### Q: Why is my numeric variable a factor?

This usually happens when your "numeric" variable actually contains some non-numeric entries (like "NA" or "Missing" or an empty space). R sees that the column is not numeric, so it treats it as if it were character, and factorizes it (see the preceding paragraph). If you don't mind a few warnings, you can convert such a column to numeric in the following way. Suppose your data frame is named Steve and the column is G. Then this line converts the entries in Steve$G to numeric, where possible. Non-numeric entries in G will be turned into NAs.
> Steve$G <- as.numeric(levels(Steve$G)[Steve$G]) # or <- as.numeric(as.character(Steve$G))
Remember that, internally, Steve$G is stored as integer codes. So indexing the levels by Steve$G is certainly possible.

Watch out! If your numeric gets converted to factor, then the levels will be what you want. The internal representation, the numbers 1, 2, and so on which R uses to keep track of things, will generally not be what you want. The reason is that by default level 1 gets assigned to the first value in alphabetical order, level 2 to the second value, and so on. So suppose that your values are 8, 25, 111, and "Missing". When that gets imported, it will be recognized as character data. Then it will be converted to a factor, with levels assigned by sorting the values as character strings. Of course the alphabetic sorting scheme is different from the numeric one. In this example, "111" would come first alphabetically. Here's another example:

> factor (c(1, 3, 17, 4, "NA", 5)) # Create and display a factor variable. The whole vector
[1] 1  3  17 4  NA 5               # is converted to character before being factorized.
Levels: 1 17 3 4 5 NA
#
# Suppose we save that as "fac" and try to convert to numeric
#
> fac <- factor (c(1, 3, 17, 4, "NA", 5))
> as.numeric (fac)
[1] 1 3 2 4 6 5                    # NOT what we want
> as.numeric (as.character(fac))   # Much better
[1]  1  3 17  4 NA  5
Warning message:
NAs introduced by coercion
In that example, the character string "17" comes between "1" and "2" (just as "Ag" comes between "A" and "B") and so the "17" gets level 2. The as.numeric() function converts the factor into its level numbers. That's almost certainly not what you wanted.
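
Before converting, it can help to see which entries are going to turn into NAs. This is a small sketch of one way to list them, assuming the factor came from a column with a stray "Missing" entry. The names as_text, as_num and bad are hypothetical.

```r
fac <- factor(c(1, 3, 17, 4, "Missing", 5))

# Go through character, never straight from factor to numeric.
as_text <- as.character(fac)
as_num <- suppressWarnings(as.numeric(as_text))

# Entries that were present as text but failed to convert.
bad <- unique(as_text[is.na(as_num) & !is.na(as_text)])
print(bad)
print(as_num)
```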

#### When do I want a factor variable?

Factor variables are useful in several places. First, some R functions that expect factors fail when given a character vector. (However, these are rare. Generally the modeling functions will convert character vectors to factors invisibly.) Second, it's sometimes handy to carry the set of levels around with you. Suppose you have a factor vector with four levels. Then table() is guaranteed to produce a four-entry table, whether you operate on the whole vector or on any subset. In contrast, that operation on a character vector will produce only as many entries in the table as there are unique elements in the subset. So if you're planning to compare the distribution of subsets, you'll want a factor. Third, factor variables can help make huge data smaller, since each observation is stored as an integer and the levels are only stored once.
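
The point about table() on subsets is easy to check. Here's a minimal sketch comparing a factor with the equivalent character vector; the data are invented for illustration.

```r
sizes_chr <- c("Small", "Medium", "Large", "Medium", "Huge", "Small")
sizes_fac <- factor(sizes_chr, levels = c("Small", "Medium", "Large", "Huge"))

first3 <- 1:3

# The factor remembers all four levels, so "Huge" appears with a count of 0.
t_fac <- table(sizes_fac[first3])
print(t_fac)

# The character vector only reports the values that occur in the subset.
t_chr <- table(sizes_chr[first3])
print(t_chr)
```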

#### When are factor variables a big pain?

Factor variables are a pain when you're cleaning your data because they're hard to update. My approach has always been to convert the variable to character with as.character(), then handle the variable as a character vector, and then convert it to factor (using factor() or as.factor()) at the end.
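
Here's a sketch of that character round trip on some deliberately messy made-up data. The cleaning steps (trimming whitespace, lower-casing) are only examples; yours will depend on the data.

```r
# A factor with inconsistent spellings: five levels where three are intended.
raw <- factor(c("small", " Small", "LARGE", "large ", "Medium"))
print(nlevels(raw))

# Step 1: convert to character.
txt <- as.character(raw)

# Step 2: clean up as ordinary strings.
txt <- tolower(trimws(txt))

# Step 3: convert back to factor at the end, with the levels you want.
clean <- factor(txt, levels = c("small", "medium", "large"))
print(table(clean))
```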

#### How do I convert factors to character vectors in a data frame?

The easiest way to keep factors out of your data.frames is to specify stringsAsFactors=FALSE at the time you create the data.frame. But I find that I do something like the below a lot. Suppose I have a data frame a with a bunch of factor columns, and also a bunch of other columns I don't want to touch. This loop may not be efficient, but it does what I want:
> for (i in seq_len (ncol (a))) if (is.factor (a[[i]])) a[[i]] <- as.character(a[[i]])
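
If you'd rather avoid the explicit loop, the same conversion can be sketched with vapply() and lapply(). The data frame here is invented for illustration, and this is just one possible way to write it.

```r
a <- data.frame(id = 1:3,
                colour = factor(c("red", "blue", "red")),
                size = factor(c("S", "M", "L")))

# Find the factor columns, then convert only those.
is_fac <- vapply(a, is.factor, logical(1))
a[is_fac] <- lapply(a[is_fac], as.character)

print(sapply(a, class))
```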
Here's an interesting fact. Remember how you can refer to columns of a data frame either in matrix style or in list style? When you use the matrix-style notation S-Plus will often factorize your character variables automatically. That's not true for list-style notation, so list-style is often what you want.

#### Reordering the levels of a factor

This question arises in some models. The first level is set to be the baseline in the usual "treatment contrasts" setup (see the discussion of contrasts.) Sometimes it's desirable to have a different level be the baseline. To do that, convert the vector to character, then call factor() passing the new levels in the levels= argument. The result will look like the original; only the ordering of the levels will have changed. For example:
> a <- factor (c("a", "b", "c", "b", "c", "b", "a", "c", "c"))
> a                     # Print a
[1] a b c b c b a c c
Levels: a b c
> table (a)             # The table is produced in order of the levels, which is, by default, alphabetical
a
a b c
2 3 4
#
# Convert a to character, then back to factor with a new vector of levels
#
> a <- factor (as.character(a), levels=c("c", "a", "b"))
> a                    # a is unchanged, but note the levels
[1] a b c b c b a c c
Levels: c a b
> table (a)            # The table is unchanged, too, but it's in a different order.
a
c a b
4 2 3
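
If all you need is a different baseline level, base R's relevel() function moves one level to the front and leaves the rest in order. This sketch compares it with the factor() approach shown above; b1 and b2 are hypothetical names.

```r
a <- factor(c("a", "b", "c", "b", "c", "b", "a", "c", "c"))

# relevel() makes "c" the first level; the other levels keep their order.
b1 <- relevel(a, ref = "c")
print(levels(b1))

# factor() with an explicit levels= argument gives full control of the order.
b2 <- factor(as.character(a), levels = c("c", "a", "b"))
print(levels(b2))
```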

# Ordered Factors

An "ordered" factor is a factor whose levels have a particular order. Ordered factors inherit from factors, so anything that you can do to a factor you can do to an ordered factor. Create ordered factors with the ordered() command, or by using factor() with the ordered=TRUE argument. Many R models ignore the ordering even if it is present.

In practice I don't make much use of ordered factors.
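
For completeness, here's a minimal sketch of what an ordered factor gives you: comparisons between elements and levels become meaningful. The data are made up for illustration.

```r
sz <- factor(c("Small", "Large", "Medium", "Small"),
             levels = c("Small", "Medium", "Large"), ordered = TRUE)
print(sz)

# Comparisons use the order of the levels, not alphabetical order.
print(sz < "Large")

# max() and min() work on ordered factors.
print(max(sz))

# An ordered factor is still a factor.
print(is.factor(sz))
```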