KNN for classification and regression

  • The k-nearest neighbors algorithm (k-NN, kNN, or KNN) is an instance-based, non-parametric method, originally developed by Evelyn Fix and Joseph Hodges and later expanded by Thomas Cover, used for classification (when the output is a class membership) and for regression (when the output is a value).
  • K-NN is an instance-based algorithm: it compares new problem instances with instances seen during training and makes predictions based on the existing instances that are "close" to the new problem instance.
  • In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors and assigned to the class most common among its k nearest neighbors (k is a small positive integer). If k = 1, the object is simply assigned to the class of its single nearest neighbor.
  • In k-NN regression, the output is a value: the average of the values of its k nearest neighbors.
  • For both classification and regression, a useful technique is to weight the contributions of the neighbors, so that nearer neighbors contribute more to the prediction than more distant ones. A common weighting scheme is 1/d, where d is the distance between the new instance and the neighbor (see the sketch after this list).
  • Nearest neighbor search algorithms: brute force vs. those utilizing space-partitioning data structures (e.g., a ball tree or a k-d tree).
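
A minimal NumPy sketch of the plurality vote and the averaging rule described above, using brute-force search; the function name, the 1/d weighting option, and the toy data are assumptions made for illustration:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3, weighted=False, classify=True):
        """Brute-force k-NN: compute distances to all training instances, then take
        a plurality vote (classification) or an average (regression) over the k nearest."""
        dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every training instance
        nn = np.argsort(dists)[:k]                        # indices of the k nearest neighbors
        w = 1.0 / (dists[nn] + 1e-12) if weighted else np.ones(k)   # optional 1/d weights
        if classify:
            votes = Counter()
            for i, wi in zip(nn, w):
                votes[y_train[i]] += wi                   # (weighted) plurality vote
            return votes.most_common(1)[0][0]
        return np.dot(w, y_train[nn]) / w.sum()           # (weighted) average of neighbor values

    # Toy data, illustrative only
    X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.9]])
    y_cls = np.array([0, 0, 1, 1])
    print(knn_predict(X, y_cls, np.array([1.1, 1.0]), k=3))                   # -> 0
    y_reg = np.array([2.0, 2.2, 8.0, 8.4])
    print(knn_predict(X, y_reg, np.array([4.1, 4.0]), k=2, classify=False))   # -> 8.2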
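
For the search strategies in the last bullet, scikit-learn exposes the choice between brute-force and tree-based search directly, along with the 1/d weighting; a minimal sketch, assuming scikit-learn is installed (parameter values are illustrative):

    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    # Brute force computes all pairwise distances; a k-d tree or ball tree partitions
    # the space so each query only needs to visit part of the training set.
    clf = KNeighborsClassifier(n_neighbors=5, weights='distance',  # 'distance' = 1/d weighting
                               algorithm='kd_tree', metric='euclidean')
    reg = KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='ball_tree')
    # Both follow the usual fit/predict interface:
    # clf.fit(X_train, y_cls); clf.predict(X_new)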

Some notes

  • How to set the parameters? What k to use? What distance metric?
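
A common way to approach these questions is to tune k, the distance metric, and the weighting by cross-validation; a minimal sketch using scikit-learn's GridSearchCV, assuming scikit-learn is installed (grid values are illustrative):

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    param_grid = {
        'n_neighbors': [1, 3, 5, 7, 11, 15],     # candidate values of k
        'metric': ['euclidean', 'manhattan'],     # candidate distance metrics
        'weights': ['uniform', 'distance'],       # unweighted vs. 1/d-weighted votes
    }
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    # search.fit(X_train, y_train)
    # print(search.best_params_, search.best_score_)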