Knn With Categorical Variables

Version 0.1: August 2001

Introduction

This document describes software that performs k-nearest-neighbor (knn) classification with categorical variables. The basic idea is that each category is mapped into a real number in some optimal way, and then knn classification is performed using those numeric values. Details can be found in Buttrey (1998). The program to perform these computations has been compiled as a stand-alone (that is, command-line) Windows executable called ords, which you can get here. This is a complicated program with many command-line options in the old (one-hyphen) style. So, for example, to run the program using the data in the "mytrain" file as the training set, and the data in the "mytest" file as the test set, you would use the command
ords -T mytrain -t mytest
This assumes one nearest neighbor and that all variables are categorical. This command
ords -T mytrain -t mytest -N -k 1 3 5 7 9 -v
tells the program that all the variables are numeric and that five different numbers of nearest neighbors (1, 3, 5, 7, and 9) should be tried. The -v flag produces "verbose" output.

Data File Format

The training and test set files should have the same format: one row per record, with whitespace separating the fields in each record and a new-line terminating each record. If the lines of the file are longer than 10,000 characters, bad things will happen -- but 10,000 is a lot. The first field in each record should hold the class. (If you do not have a test set, use the training set for both the test and training sets and pass the -S argument.) Important: the classes and categories should start at 0 and go up by 1's. So do not have a category whose values are (1, 2, 3) or (0, 2, 4, 6) or (-1, -2, -3); they must be 0, 1, 2, and so on.
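For instance (with made-up values), a training file with two categorical predictors and two classes might look like this, the class being the first field in each row:
0 2 1
1 0 3
0 1 0
1 2 2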

Recap of Technique

Many of the options have to do with computations and stopping rules, so we recap the technique here. Suppose there are C classes and J categorical predictors. Each predictor has several categories; suppose there is a total of n+ categories across all predictors. Then we seek an n+-vector x with one entry for every category. This vector turns out to be the solution of the generalized eigen-equation Ux = µWx, where W is a block matrix whose (s, t) block is the cross-tabulation of variables s and t, and U is a block matrix in which the (p, q) element of the (s, t) block is the sum, as c goes from 1 to C, of (number of items in class c with category p for variable s) × (number of items in class c with category q for variable t) ÷ (number of items in class c). Neither U nor W will generally be of full rank, so we "ridge" both (that is, add a small amount to each diagonal element) by an amount set with the -r command-line argument.
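For readers who want to see the computation spelled out, here is a small sketch in Python/NumPy of how W and U can be built and the eigen-equation solved. This is my own illustration, not the ords source (which does this in C with clapack); the function and variable names are made up, and the default ridge value simply mirrors the -r default.

# A minimal sketch of the scoring step described above (illustration only).
import numpy as np
from scipy.linalg import eigh

def build_scores(X, y, ridge=0.003):
    """X: (n, J) array of 0-based category codes; y: (n,) array of 0-based class labels."""
    n, J = X.shape
    sizes = [X[:, j].max() + 1 for j in range(J)]        # categories per variable
    offsets = np.concatenate([[0], np.cumsum(sizes)])    # where each variable's block starts
    n_plus = offsets[-1]

    # Indicator matrix Z: one row per record, one column per category.
    Z = np.zeros((n, n_plus))
    for j in range(J):
        Z[np.arange(n), offsets[j] + X[:, j]] = 1.0

    # W's (s, t) block is the cross-tabulation of variables s and t.
    W = Z.T @ Z

    # U's (s, t) block, element (p, q): sum over classes c of
    # (# class-c items with p for s) * (# class-c items with q for t) / (# class-c items).
    U = np.zeros((n_plus, n_plus))
    for c in np.unique(y):
        a_c = Z[y == c].sum(axis=0)                      # per-category counts within class c
        U += np.outer(a_c, a_c) / (y == c).sum()

    # "Ridge" both matrices so they are invertible (the -r argument).
    U += ridge * np.eye(n_plus)
    W += ridge * np.eye(n_plus)

    # Generalized eigenproblem U x = mu W x; keep the leading eigenvalue and eigenvector.
    mu, vecs = eigh(U, W)
    return mu[-1], vecs[:, -1]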

Discriminant Analysis

The -d flag, as for example
ords -T mytrain -t mytest -d
uses a different technique, in which both classes and categories are mapped into real numbers in a space of dimension (number of classes - 1) and then a discriminant analysis is performed to do the classification. Computations are similar. Details can be found in Buttrey (2001).

Variable Selection

At each stage, the eigenvalue corresponding to the eigenvector measures the quality of the fit. Variables enter by stepwise forward addition. By default, addition stops when no variable increases the eigenvalue by more than the "improvement" parameter (see -i below). This number is somewhat arbitrary, so as an alternative the user can choose to do a permutation test. In this case the increase in the relevant eigenvalue for a particular variable is compared to the increases obtained when the categories for that variable are randomly permuted. This tests the null hypothesis that the variable has no predictive power. The user supplies the number of permutations and a threshold: the number of permutations for which the permuted eigenvalue must exceed the unpermuted one before the variable is excluded. (Normally this threshold will be 1: if any permutation does better than the original variable, the original variable is probably not very helpful.)
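For concreteness, here is a rough Python sketch of the permutation test for a single candidate variable, reusing the build_scores sketch from the recap above. The helper names are mine, not part of ords; n_perm and permute_tail correspond to the two numbers given to -P.

# A rough sketch of the permutation test for one candidate variable (illustration only).
import numpy as np

def leading_eigenvalue(X, y, cols):
    # Largest eigenvalue from the build_scores sketch, using only the listed columns.
    return build_scores(X[:, cols], y)[0]

def passes_permutation_test(X, y, in_model, candidate, n_perm=100, permute_tail=1, rng=None):
    """Return True if 'candidate' should enter the model built from the columns in 'in_model'."""
    rng = np.random.default_rng() if rng is None else rng
    base = leading_eigenvalue(X, y, in_model) if in_model else 0.0
    gain = leading_eigenvalue(X, y, in_model + [candidate]) - base   # observed improvement

    better = 0
    Xp = X.copy()
    for _ in range(n_perm):
        # Permute the candidate's categories, breaking any link to the class.
        Xp[:, candidate] = rng.permutation(X[:, candidate])
        if leading_eigenvalue(Xp, y, in_model + [candidate]) - base >= gain:
            better += 1
    # Exclude the variable if at least permute_tail permutations beat the real one.
    return better < permute_tail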

Numeric Variables

Numeric variables are represented with a linear spline basis. The number of knots is set by the -K option and is the same for all numeric variables (the default is five knots). The -N flag indicates that all variables are numeric. When some variables are numeric and others are not, the user will need to prepare a small file (called the fascinating file) to inform the program about which variables have which type.
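The exact spline basis ords uses is not spelled out here; the sketch below shows one common choice, a truncated-power ("hinge") linear basis with knots at quantiles of the variable, just to illustrate the idea. The function name and knot placement are my own assumptions.

# One plausible linear spline expansion for a numeric column (illustrative guess).
import numpy as np

def linear_spline_basis(x, n_knots=5):
    """Expand numeric vector x into columns [x, (x - t1)+, ..., (x - tK)+]."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])   # interior knots at quantiles
    hinges = np.maximum(x[:, None] - knots[None, :], 0.0)          # (x - t_k)+ hinge terms
    return np.column_stack([x, hinges])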

Ordered Variables

The original paper (Buttrey, 1998) described a scheme for handling ordered variables. In this scheme we compute as above, and then use that as a starting point for a non-linear optimization routine in which the ordering is enforced. That routine was taken from the NAG libraries, though, and I am uncertain whether I am permitted to distribute code using pieces of those libraries. For the moment I have used public-domain or open-source code for the eigen-decomposition (using clapack from netlib), for sorting (using slatec from netlib) and for permutation (using ranlib from openresources.com). When I find a good public-domain optimizer I will try to include it. If you have access to the NAG libraries, you can add the -DNAG flag to the set of compiler flags: the program will then use NAG for all these tasks. This will (well, it should) also make the code for ordered variables available although it has not been well tested.

Building a Windows DLL For S-Plus

I am currently trying to build a DLL for S-Plus under Windows. This will allow the routine to be called directly from S-Plus, with the relevant data matrices and parameters passed in. The -DCALL_FROM_S flag is required; this has worked in the past on Solaris systems, with the resulting object loaded by S-Plus's dyn.load() function. I will dig up and make available the necessary S-Plus code. However, I have not yet been able to build a DLL for Windows.

A Note on the Current Version

The version I'm using right now (this one) was compiled with Microsoft Visual C++ with the debug configuration. When I compile into the "release" configuration the program crashes. This makes me nervous. On the other hand the results I'm getting are reasonable. I think it has something to do with memory allocation (doesn't it always?). I'm continuing to look into this as well as playing with the cygwin compilers.

Source Code

You are welcome to the source code. You can find a zipped version here. It has no warranty of any sort, but I believe it works, at least pretty often under many circumstances. Do with it what you will. As mentioned above some pieces were not written by me and may carry open-source-type or other copyrights. If you don't already have them you'll need the libraries lapack.lib, blas.lib, f77.lib, and libi77.lib. The links will let you download versions for Windows (you'll probably need to right-click and choose "Save As..."); for other platforms you'll need to find or build your own. You may also need to re-name these for your compiler.

I Am Not a Software Engineer

My code works but it's not very pretty. Furthermore I'm not great at version control. If you'd like to use my software I'll be happy to help out, at least to the extent that I can and have time available, but I am not a software engineer. I just thought that was important enough to get its own paragraph.

Command-line Options:

-C   Name of cost file. Default: none; all misclassifications cost 1.
-c   Use classification (rather than discrimination). The default.
-d   Use discrimination (rather than classification). By default nearest-neighbor classification is used.
-F   Name of status file, to which "verbose"-type messages are sent. Default: messages are sent to stderr. Even when verbose is 0, there is a small set of messages (like mappings and error rates) you can expect. Messages arising from problems encountered while reading the options are always sent to stderr (since these might arise before the -F option has been parsed).
-f   Name of fascinating file. The fascinating file gives information about each of the variables in the case where some variables are categorical (or ordered) and others numeric. Default: no file; all variables are unordered categorical (see -N and -U).
-i   Improvement. This is the minimum amount by which the first eigenvalue must go up in order for the stepwise addition to continue. That is, if the increase in the first eigenvalue is smaller than "improvement," the model is complete. Default: 1e-9. This is ignored if the permutation test is used (see -P).
-K   Number of knots per continuous variable. Default: 5.
-k   Vector of k's for k-nearest-neighbor classification. Default: k = 1 only.
-m   Missing value threshold. Any item smaller than this number is presumed to represent a missing value. Default: -99.
-N   All variables are numeric. Default: not true (see -U).
-o   Once out, always out. If present, any variable that is deleted from the model because it failed the permutation test is excluded from the model permanently, to save time. Default: not true.
-p   Prior probabilities, one for each class. Default: the prior probability for class i is estimated by the proportion of training set observations in class i.
-P   Permutation test (instead of using an improvement threshold; see -i). This should be followed by two numbers: the first is the number of permutations to perform, and the second is the "permute_tail." This item tells how many of the permutations need to be better than the unpermuted value before a variable is excluded. Default: 0 (that is, use "improvement"; see -i); if present, permute_tail defaults to 1.
-r   Ridge value, used to make some matrices invertible. Default: 0.003. Zero is not currently permitted, but it should be (?).
-s   Slots, that is, the number of nearest neighbors to actually look for. (This is a little bigger than the largest value in k_vec because of possible ties.) Default: the largest k + 15. Note: this argument currently has no effect!
-S   If present, the training and test sets are identical. In that case row i of the training set is not considered as a possible neighbor to row i of the test set. Default: not present. This is not implemented for discrimination, so using this and -d together will give artificially small error rates.
-T   Name of the training set.
-t   Name of the test set.
-U   All variables are unordered. Default: true.
-v   Verbosity level. Default is 0; each "-v" adds 1.
-x   Number of cross-validations to do when deciding which k and which variables to use. Default: 10.
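To tie several of these together: a run that reads variable types from a fascinating file, uses seven knots for the numeric variables, tries k = 1, 3, and 5, and replaces the improvement threshold with a 100-permutation test (all file names here are made up) might look like
ords -T mytrain -t mytest -f myfasc -K 7 -k 1 3 5 -P 100 1 -v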

The Cost File

The cost file specifies misclassification costs that differ across classes. It should take the form of a matrix whose ij-th entry gives the cost of misclassifying a class i item as being in class j. Diagonal elements are ignored. A cost file for the two-class case might look like this:
0 5
1 0
In this example, classifying a class 0 observation wrongly costs five times as much as classifying a class 1 observation wrongly. As the paper notes, the program handles the 2x2 case, or the case when the cost matrix has constant rows (excepting the diagonal), but not the general case.

The Fascinating File

The "fascinating file" holds information about each of the variables. It will normally be needed only when the variables are of mixed type, some continuous and some categorical. It should be an ordinary text file. Lines that start with "#" are ignored. The first line may be "all numeric" or "all unordered" in which case processing is complete; otherwise, each line should have a variable number (zero-based), some white space, and then one of the strings "numeric," "ordered" or "unordered" for that variable. The variables need not be in order. An example of a valid fascinating file is:
# Fascinating file example
# Lines starting with "#" are ignored
0 numeric
1 unordered
2 ordered
3 numeric
"Ordered" will be changed to "unordered" in a discrimination problem since ordering is only supported in the classification problem. Actually it is currently never supported (see ordered variables above).

References

Buttrey (1998): Buttrey, S.E., "Nearest-Neighbor Classification with Categorical Variables," Computational Statistics and Data Analysis, Vol. 28, No. 2, (1998), pp. 157-169.

Buttrey (2001): Buttrey, S.E., "Discrimination with Categorical Variables," in submission.