A Genetic Algorithm for Variable Selection

What is a Genetic Algorithm?

A Genetic Algorithm is a search-based algorithm inspired by the concepts of natural selection and survival of the fittest. Genetic Algorithms are commonly used to generate good-quality solutions to difficult optimization and search problems by relying on genetics-inspired operators such as selection, crossover and mutation. A Genetic Algorithm repeatedly modifies a population of individual solutions; at each step, the algorithm selects individuals from the current population to be parents and uses them to produce the offsprings, subject to potential mutations, for the next generation. Over successive generations, the population evolves toward better solutions.

In R-sw Discriminant, a Genetic Algorithm is used to identify the combination(s) of variables able to most accurately predict cluster membership, variables chosen from the set of variables specified by the user.

Population

A population is a subset of candidate solutions. A candidate solution is a specific combination of variables that is assessed in terms of how accurately it is able to predict the cluster membership.

The Genetic Algorithm is based on a generational model:

an initial population is populated with random solutions (random initialization);
a solution could be characterized by a specific and pre-defined number of variables (e.g., 5 variables) or the user could allow the number of variables in the solution to vary between a minimum and a maximum, e.g. between 4 and 7 variables;
it is possible to force one or more variables into each solution;
at each iteration (generation) n x 2 parents are identified and n off-springs are generated, where n is the population size. Therefore the entire population is replaced by the new one at the end of the iteration;
the maximum number of generations to be considered is specified by the user.

The Fitness Function

The fitness function is a function that considers a candidate solution and produces, as output, the level of fit of the solution, i.e, how accurately the variables in the solution predict cluster membership. Objective of the Genetic Algorithm is to maximize the fitness function, i.e., identifying those candidate solutions that have the highest accuracy in predicting cluster membership.

Details:

The fitness function is based on either a linear discriminant analysis model or a quadratic discriminant analysis model.
The discriminant model could be fit to the full dataset and tested against the same, or the model could be cross-validated through repeated random sub-sampling validation. In the latter case, the dataset is split into training and testing datasets. The training dataset is characterized by 50% of the records (randomly extracted), while the testing dataset is characterized by the remaining records. The discriminant model is fit to the training dataset and then its predictive accuracy is assessed using the testing dataset. The process is repeated as many times as specified by the user and the results are then averaged to produce the reported fit.
The fit reported by the fitness function could be adjusted by the number of variables in the candidate solution. This could be useful as the algorithm might be asked to considers subsets of variables of different size and one might want to set a specific amount to be subtracted from the classification accuracy of the model for each variable appearing in the solution. By applying an adjustment, the more variables there are in the model, the lower its reported classification accuracy; this is done to privilege models with fewer variables. If an adjustment is not requested, a larger number of variables will tend to be selected as these are likely to result in a higher classification accuracy.

Parent selection

Fitness Proportionate Selection (FPS) is the approach used for parent selection.

Members of the population can be deemed eligible to become parents if they meet the minimum level of fit (i.e., classification accuracy) specified by the user. If the threshold is too high, most members of the population won’t be able to become parents (have offspring) and the Genetic Algorithm will not work well.
Every eligible member of the population can become a parent with a probability proportional to its fitness (classification accuracy). More accurate candidate solutions have a higher chance of mating and propagating their features to the next generation, thus there is the tendency to evolve better solutions over time.
It is possible to affect the FPS of parents through a scaling parameter. The larger this scaling parameter, the larger is the likelihood for parents with high fit to be selected versus parents with low fit.

Offspring generation

An offspring is produced using the genetic material of two parents.

The approach used is multi point crossover. Features from each parent (i.e., a subset of the variables characterizing the parent solution) are randomly selected to create the offspring.
The offspring has got a small probability to go through a mutation, which is a random tweak in its features. Mutations are required to maintain and introduce diversity in the population and attempt to prevent the Genetic Algorithm converging to a local minimum.
If the offspring is selected for mutation, one of the variables in the offspring solution is randomly selected and dropped and one variable currently not in the offspring solution is randomly selected and added to it.

Termination conditions

The Genetic Algorithm process stops when:

all members of the population have the same classification accuracy;
the target classification accuracy has been reached by at least one member of the current population;
there has been no improvement in the classification accuracy for a pre-specified number of generations;
the number of pre-specified generations has been reached.