Average Distance Classifier

Classification/Search Algorithm

The Average Distance Classifier is a simplified version of the Nearest Neighbor Classifier. It is easy to implement compared to other classification algorithms. Because it discards much of the data given to it, its accuracy can be worse than that of other classifiers. The algorithm can be thought of as a nearest-neighbor classifier in which each label is represented by a single neighbor: the average of that label's rows.

Comparison with Nearest-Neighbor

Let's say you have a dataset with 3 features for each data point and 15 labels in total. You have 20 examples for each label, giving you a total of 300 rows.

Instead of storing every row like the classic nearest-neighbor, you only store the per-feature average of the rows for each label. This means that while you would store 300 rows for nearest neighbor, you only need to store 15 rows for the average distance classifier.

This is a 20-fold reduction in stored rows, which should make your final model easier to store and faster to execute. At the same time, it can make your accuracy considerably worse, because the variation within each label is averaged away. Still, given the ease of implementation and the minuscule computational requirements, the Average Distance Classifier might be an acceptable choice if your classification task is a simple one.
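
To illustrate how averaging can cost accuracy, here is a minimal sketch using made-up one-dimensional data. Label "a" is split into two clusters, so its average lands near the middle, close to where label "b" sits. The data, the function names, and the use of Euclidean distance are all assumptions chosen for illustration.

```python
import math

# Made-up training data: label "a" forms two clusters (near 0 and near 12),
# while label "b" sits in between (near 5).
rows = [
    ((0.0,), "a"), ((1.0,), "a"), ((11.0,), "a"), ((12.0,), "a"),
    ((4.0,), "b"), ((5.0,), "b"), ((6.0,), "b"),
]

def nearest_neighbor(point):
    # Compare the point against every stored row.
    closest = min(rows, key=lambda row: math.dist(point, row[0]))
    return closest[1]

def average_distance(point):
    # Compare the point against one averaged row per label.
    sums = {}
    for features, label in rows:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + features[0], count + 1)
    averages = {label: (total / count,) for label, (total, count) in sums.items()}
    return min(averages, key=lambda label: math.dist(point, averages[label]))

query = (0.5,)
print(nearest_neighbor(query))  # a
print(average_distance(query))  # b
```

The query point lies inside one of the "a" clusters, so nearest neighbor labels it "a". The averages are 6.0 for "a" and 5.0 for "b", so the averaged version picks "b" instead.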

Pseudocode

Training

  1. For each label that you want to train your classifier on
    1. Take all the rows that match the label
    2. For each of the features
      1. Take the arithmetic mean of that feature across those rows
    3. Store the arithmetic means together with the label in your classifier data structure
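
The training steps above could be sketched in Python roughly as follows. This is only one possible shape: the `train` function name, the `(features, label)` row format, and the dictionary used as the classifier data structure are assumptions, and the example data is made up.

```python
from collections import defaultdict

def train(rows):
    """Return a dict mapping each label to its per-feature arithmetic means.

    `rows` is assumed to be a list of (features, label) pairs, where
    `features` is a sequence of numbers with the same length in every row.
    """
    # Take all the rows that match each label.
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)

    # For each label, average every feature column across its rows.
    model = {}
    for label, feature_rows in grouped.items():
        columns = zip(*feature_rows)
        model[label] = [sum(column) / len(feature_rows) for column in columns]
    return model

# Made-up example data: two labels, two features per row.
rows = [
    ([1.0, 2.0], "cat"),
    ([3.0, 4.0], "cat"),
    ([10.0, 20.0], "dog"),
    ([30.0, 40.0], "dog"),
]
model = train(rows)
print(model)  # {'cat': [2.0, 3.0], 'dog': [20.0, 30.0]}
```

Four input rows shrink to two stored rows, one per label.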

Classification

  1. For each label in the classifier data structure
    1. Calculate the distance between the unknown data point and the stored averages for that label
  2. Return the label that has the least distance
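
The classification steps above might look like this in Python. The text does not say which distance measure to use, so Euclidean distance (via `math.dist`, available from Python 3.8) is an assumption here, as are the `classify` function name and the made-up model contents.

```python
import math

def classify(model, point):
    """Return the label whose stored averages are closest to `point`.

    `model` is assumed to map each label to a list of per-feature averages.
    """
    best_label = None
    best_distance = math.inf
    for label, averages in model.items():
        # Euclidean distance is an assumption; other metrics would also work.
        distance = math.dist(point, averages)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

# Made-up model: one averaged row per label.
model = {"cat": [2.0, 3.0], "dog": [20.0, 30.0]}
print(classify(model, [4.0, 5.0]))    # cat
print(classify(model, [18.0, 25.0]))  # dog
```

Each prediction costs one distance calculation per label, regardless of how many rows were used in training.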
