To detect potentially fraudulent healthcare providers from claim data, I apply techniques recommended by relevant literature.

maxruther/HCP_Fraud_Detection

Healthcare Provider Fraud Detection

Overview

I follow numerous recommendations from the literature I consulted to build a classifier of potentially fraudulent healthcare providers that is as precise as possible. The datasets come from a Kaggle post, and their labels resemble those of the List of Excluded Individuals/Entities (LEIE), one of the few publicly available sources of labelled provider fraud data.

Please see my full analysis in a Google Colab notebook.

If the full notebook is too large to load, please see the segments into which I have split it:

  1. Setup
  2. Integration I
  3. Feature Engineering
  4. Integration II
  5. EDA
  6. Sampling
  7. Modelling

Highlights

The highlights of this work cover the following topics, each found in the linked notebook(s):

| Topic of Highlight | Resident Notebook(s) |
| --- | --- |
| Data integration and feature engineering | Integration I, Feature Engineering, Integration II |
| Exploratory visualization | EDA |
| Correlation-based feature selection | EDA |
| Resampling to mitigate class imbalance | Sampling |
| Automated hyperparameter tuning, testing, and logging of numerous models | Modelling |



Recommendations from the Literature

The recommendations I implement from the consulted literature, which mostly concerns approaches to fraud detection in healthcare claim data, include the following:

  • Feature engineering to aggregate claim and patient data to the provider level. [1]
  • Correlation-based feature selection to interpret variable importances and drastically reduce training times. [2]
  • Alleviating class imbalance to a 75-25 ratio. [3]
  • Classifying with ensemble learning methods, which were described as particularly effective on small samples, as well as those with rebalanced class ratios. [3]
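As a rough illustration of the first and third recommendations, the sketch below aggregates claim rows to one row per provider and then undersamples the majority class to a 75-25 ratio. It is not the code from the notebooks: the field names (`provider`, `patient`, `reimbursed`), the function names, the toy data, and the choice of random undersampling as the resampling method are all assumptions made for the example. It uses only the standard library.

```python
import random
from collections import defaultdict


def aggregate_to_provider(claims):
    """Collapse claim-level rows into one feature row per provider.

    The field names used here are hypothetical, not the dataset's columns.
    """
    totals = defaultdict(
        lambda: {"n_claims": 0, "total_reimbursed": 0.0, "patients": set()}
    )
    for claim in claims:
        row = totals[claim["provider"]]
        row["n_claims"] += 1
        row["total_reimbursed"] += claim["reimbursed"]
        row["patients"].add(claim["patient"])
    return {
        provider: {
            "n_claims": row["n_claims"],
            "mean_reimbursed": row["total_reimbursed"] / row["n_claims"],
            "n_patients": len(row["patients"]),
        }
        for provider, row in totals.items()
    }


def undersample_majority(labels, majority_share=0.75, seed=0):
    """Return sorted row indices that keep every minority row (label 1)
    and a random subset of majority rows (label 0) sized so that the
    majority makes up `majority_share` of the result.

    Random undersampling is an assumption; other resampling methods exist.
    """
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # n_major / (n_major + n_minor) = share  =>  n_major = n_minor * share / (1 - share)
    target = round(len(minority) * majority_share / (1 - majority_share))
    target = min(target, len(majority))  # cannot keep more rows than exist
    rng = random.Random(seed)
    return sorted(minority + rng.sample(majority, target))


# Toy claim data (invented for illustration).
claims = [
    {"provider": "P1", "patient": "A", "reimbursed": 100.0},
    {"provider": "P1", "patient": "B", "reimbursed": 300.0},
    {"provider": "P2", "patient": "A", "reimbursed": 50.0},
]
features = aggregate_to_provider(claims)

# Toy labels: 10 flagged providers among 100.
labels = [1] * 10 + [0] * 90
idx = undersample_majority(labels)
```

With 10 minority rows, a 75-25 ratio calls for 30 majority rows, so `idx` selects 40 rows in total.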

Bibliography

  1. Kumaraswamy, Nishamathi, et al. "Healthcare fraud data mining methods: A look back and look ahead." Perspectives in Health Information Management 19.1 (2022).

  2. Bolón-Canedo, Verónica, et al. "A review of microarray datasets and applied feature selection methods." Information Sciences 282 (2014): 111-135.

  3. Herland, Matthew, Richard A. Bauder, and Taghi M. Khoshgoftaar. "Approaches for identifying US medicare fraud in provider claims data." Health Care Management Science 23 (2020): 2-19.



Findings

Assessing the Recommended Techniques

Overall, I found these recommendations beneficial. Below I briefly comment on my findings for each:

| Recommended Technique | My Assessment in this Application |
| --- | --- |
| Feature engineering to aggregate to the provider level | So plainly necessary for my application that I cannot count it as a recommendation taken. |
| Correlation-based feature selection | Beneficial for interpreting variable importances and drastically reducing training times. It usually cost the models only a little precision. |
| Alleviating class imbalance | Beneficial: models trained on the rebalanced samples were nearly always more precise than their counterparts. |
| Classifying with ensemble learning methods | Mixed findings. The best models did come from two ensemble learning methods, but the two non-ensemble methods, Naive Bayes and SVM, achieved higher peak scores than Gradient Boosting (an ensemble method). Also counter to the notion of an ensemble advantage, a Naive Bayes model handily outperformed the others on the more class-imbalanced datasets. |



Optimal Classifier of Potentially Fraudulent Providers

The most precise classifier came from applying AdaBoost to the rebalanced sample reduced by correlation-based feature selection (CFS). It scored a precision of 0.864.
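For readers unfamiliar with the metric, precision is the share of providers flagged as potentially fraudulent that are actually labelled fraudulent: TP / (TP + FP). The sketch below computes it on made-up labels; the function name and the data are illustrative assumptions and do not reproduce the 0.864 result.

```python
def precision(y_true, y_pred):
    """Precision = true positives / (true positives + false positives).

    Returns 0.0 when nothing is predicted positive, a convention chosen
    here to avoid dividing by zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)


# Made-up labels: 1 = potentially fraudulent, 0 = not.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]
score = precision(y_true, y_pred)  # 2 true positives, 1 false positive
```

Optimizing for precision favors a classifier whose fraud flags are rarely wrong, at the possible cost of missing some fraudulent providers.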


