KNN Classifier to detect potential credit card fraud

Through this project, I set out to understand the K-nearest-neighbors (KNN) classification algorithm and the process of selecting the optimal hyperparameters for the estimator.

Understanding the dataset

The dataset was a CSV file in which certain transaction information had been replaced by PCA values to protect consumer privacy. The Amount feature is the amount of money in that particular transaction, and the Class feature contains two classes: safe and fraud.

Objective

The goal was to find the optimal hyperparameters of the KNN estimator using cross-validation, and then to provide a final estimate of the model’s generalization performance on the held-out test set.

Methodology

A grid search was performed to optimize the following hyperparameters:

  • The number of neighbors (n_neighbors)
  • The weighting scheme (weights): uniform or distance, that is, whether each neighbor is assigned an equal weight or a weight proportional to the inverse of its distance from the query point
  • The distance metric (metric): minkowski or chebyshev, to see which distance measure is better suited to the dataset
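
To make the weights and metric options concrete, here is a minimal pure-Python sketch of a KNN vote. It is an illustration under stated assumptions, not the code used in this project: the helper names (minkowski, chebyshev, knn_predict) and the three toy points are hypothetical, and p=2 (the Euclidean case) is assumed for the Minkowski distance.

```python
from collections import defaultdict


def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=2 is the Euclidean distance)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chebyshev(a, b):
    """Chebyshev distance: the largest difference along any one coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3, weights="uniform", metric=minkowski):
    """Predict a label for `query` from its k nearest training points."""
    # Pair each training point's distance to the query with its label,
    # sort by distance, and keep the k closest.
    neighbors = sorted(
        ((metric(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]

    votes = defaultdict(float)
    for dist, label in neighbors:
        if weights == "uniform":
            # Every neighbor counts the same.
            votes[label] += 1.0
        else:
            # Inverse-distance weighting: closer neighbors count more.
            # An exact match (distance 0) dominates the vote.
            votes[label] += float("inf") if dist == 0 else 1.0 / dist
    return max(votes, key=votes.get)


# Hypothetical toy data: one close "fraud" point, two distant "safe" points.
toy_X = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
toy_y = ["fraud", "safe", "safe"]
toy_query = (0.5, 0.5)

print(knn_predict(toy_X, toy_y, toy_query, weights="uniform"))
print(knn_predict(toy_X, toy_y, toy_query, weights="distance"))
```

With uniform weights the two distant safe points outvote the single nearby fraud point. With distance weights the nearby point carries more weight than the other two combined, so the prediction flips.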

The grid search was run with 5-fold cross-validation.
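
The procedure described above could be sketched with scikit-learn roughly as follows. This is only an approximation of the setup. A synthetic dataset from make_classification stands in for the fraud CSV, and the candidate n_neighbors values, the 80/20 train/test split, and the random seeds are all assumptions. Only the three hyperparameter names, the two options for weights and metric, and the 5 folds come from the description.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the fraud dataset (assumption: the real CSV is not loaded here).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold out a test set that the grid search never sees (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

param_grid = {
    "n_neighbors": [3, 5, 7, 9],          # assumed candidate values
    "weights": ["uniform", "distance"],
    "metric": ["minkowski", "chebyshev"],
}

# cv=5 runs 5-fold cross-validation for every combination in the grid.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_test, y_test))
```

Here best_score_ is the mean accuracy across the five validation folds for the winning combination, while score on the held-out split gives the estimate of generalization performance.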

Results

The best parameters are {'metric': 'minkowski', 'n_neighbors': 3, 'weights': 'uniform'}
The best accuracy on the training data is 0.9546875
The best accuracy on the testing data is 0.90625

https://github.com/vigneshsundararajan/KNN-credit-fraud-dataset