In this project I implement recommendations from the literature for building an optimally precise classifier of healthcare providers' potential fraudulence. The datasets come from a Kaggle post, and their labels resemble one of the few publicly available sources of labelled provider fraud data, the List of Excluded Individuals/Entities (LEIE).
Please see my full analysis in a Google Colab notebook.
If the notebook of my full analysis is too large to load, please see the segments into which I've broken it up, linked in the table below alongside the topics they highlight:
| Topic of Highlight | Resident Notebook(s) |
|---|---|
| data integration and feature engineering | Integration I, Feature Engineering, Integration II |
| exploratory visualization | EDA |
| correlation-based feature selection | EDA |
| resampling to mitigate class imbalance | Sampling |
| hypertuning, testing, and logging numerous models through automated means | Modelling |
The recommendations I implement from the consulted literature, which mostly concerns approaches to fraud detection in healthcare claim data, include the following:
- Feature engineering to aggregate claim and patient data to the provider level (sketched below). [1]
- Correlation-based feature selection to interpret variable importances and drastically reduce training times (sketched below). [2]
- Alleviating class imbalance by resampling to a 75-25 ratio (sketched below). [3]
- Classifying with ensemble learning methods, which were described as particularly effective both on small samples and on samples with rebalanced class ratios (sketched at the end of this post). [3]
1. Kumaraswamy, Nishamathi, et al. "Healthcare fraud data mining methods: A look back and look ahead." *Perspectives in Health Information Management* 19.1 (2022).
2. Bolón-Canedo, Verónica, et al. "A review of microarray datasets and applied feature selection methods." *Information Sciences* 282 (2014): 111-135.
3. Herland, Matthew, Richard A. Bauder, and Taghi M. Khoshgoftaar. "Approaches for identifying US Medicare fraud in provider claims data." *Health Care Management Science* 23 (2020): 2-19.
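To illustrate the first recommendation, here is a minimal sketch of aggregating claim-level records to the provider level with pandas. The file path and column names (`Provider`, `ClaimID`, `InscClaimAmtReimbursed`, `BeneID`) are assumptions modelled on the Kaggle data's conventions, not excerpts from my notebooks:

```python
import pandas as pd

# Hypothetical path; the column names below are assumed stand-ins
# for the Kaggle claim data's schema.
claims = pd.read_csv("claims.csv")

provider_features = claims.groupby("Provider").agg(
    claim_count=("ClaimID", "count"),                    # claim volume per provider
    total_reimbursed=("InscClaimAmtReimbursed", "sum"),  # total payout
    mean_reimbursed=("InscClaimAmtReimbursed", "mean"),  # typical claim size
    unique_patients=("BeneID", "nunique"),               # breadth of patient base
)
```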
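For the second recommendation, the sketch below shows one simple correlation-based filter: from each highly correlated pair of features, drop whichever correlates less with the label. The function and threshold are illustrative assumptions; the exact criterion in my notebooks may differ:

```python
import pandas as pd

def correlation_filter(X: pd.DataFrame, y: pd.Series, threshold: float = 0.9) -> list:
    """Return the feature names kept after a pairwise correlation filter."""
    corr = X.corr().abs()              # feature-feature correlations
    target_corr = X.corrwith(y).abs()  # feature-label correlations
    keep = list(X.columns)
    for i, a in enumerate(X.columns):
        for b in X.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > threshold:
                # Of the redundant pair, drop the feature that tells us
                # less about the label.
                keep.remove(a if target_corr[a] < target_corr[b] else b)
    return keep

# Usage: X_reduced = X[correlation_filter(X, y)]
```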
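For the third recommendation, a sketch with imbalanced-learn's `RandomUnderSampler` shows how a 75-25 ratio can be reached; the synthetic data here is only a stand-in for the real provider-level table:

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for the provider table: roughly 5% positive class.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

# sampling_strategy is the desired minority:majority ratio after resampling;
# 25/75 yields the 75-25 split recommended in [3].
sampler = RandomUnderSampler(sampling_strategy=25 / 75, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print(y_res.mean())  # ~0.25
```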
Ultimately, I found these recommendations beneficial overall. Below I briefly comment on my findings for each:
| Recommended Technique | My Assessment in this Application |
|---|---|
| Feature engineering to aggregate to the provider level | So plainly necessary for my application that I can hardly count it as a recommendation taken. |
| Correlation-based feature selection | Beneficial for interpreting variable importances and drastically reducing training times, usually at only a slight cost to the models' precision scores. |
| Alleviating class imbalance | Beneficial, as models trained on rebalanced samples were nearly always more precise than their counterparts. |
| Classifying with ensemble learning methods | Mixed findings. The best models did come from two ensemble methods, but the two non-ensemble methods, Naive Bayes and SVM, outperformed Gradient Boosting (an ensemble method) at their best. Also counter to the notion of an ensemble advantage, a Naive Bayes model handily outperformed the others on the more class-imbalanced datasets. |
The optimally precise classifier came from applying AdaBoost to the rebalanced, CFS-reduced sample; it scored a precision of 0.864.
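As a closing sketch of that winning configuration, the snippet below trains AdaBoost on a synthetic stand-in for the rebalanced, CFS-reduced sample and scores its precision. The hyperparameters are illustrative assumptions, not my tuned values, so the printed score will not match the 0.864 reported above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rebalanced (75-25), CFS-reduced provider sample.
X, y = make_classification(n_samples=1000, weights=[0.75], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = AdaBoostClassifier(n_estimators=200, random_state=0)  # illustrative size
model.fit(X_train, y_train)
print(precision_score(y_test, model.predict(X_test)))
```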