This project examines how client characteristics influence annual insurance payouts. Using statistical tests within an Exploratory Data Analysis (EDA) framework, it looks for patterns and relationships in the data. The results are intended to help an insurance company understand how client attributes affect the size of payouts.
The dataset comprises annual insurance payouts alongside client characteristics, structured as follows:
- age: Client's age.
- sex: Client's gender (female/male).
- bmi: Body Mass Index, a measure of body weight relative to height.
- children: Number of dependents covered by the insurance.
- smoker: Smoking status of the client (yes/no).
- region: Client's residential area in the US (northeast, southeast, southwest, northwest).
- charges: Annual insurance payouts to the client.
Source: "Medical Cost Personal Datasets" (kaggle.com)
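The column list above can be turned into a small loading check. The following is a minimal sketch using only the standard library; the function name `load_rows` and the constants are hypothetical and not taken from the project's notebooks, and it assumes the data is available as a CSV file whose header uses the column names listed above.

```python
import csv
import io

# Column names and category values as described in the dataset overview.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
ALLOWED_VALUES = {
    "sex": {"female", "male"},
    "smoker": {"yes", "no"},
    "region": {"northeast", "southeast", "southwest", "northwest"},
}


def load_rows(handle):
    """Read CSV rows from a text handle, validate them, and convert numeric fields.

    Hypothetical helper: raises ValueError on a missing column or an
    unexpected categorical value.
    """
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, raw in enumerate(reader, start=2):
        for column, allowed in ALLOWED_VALUES.items():
            if raw[column] not in allowed:
                raise ValueError(f"line {line_no}: unexpected {column}={raw[column]!r}")
        rows.append({
            "age": int(raw["age"]),
            "sex": raw["sex"],
            "bmi": float(raw["bmi"]),
            "children": int(raw["children"]),
            "smoker": raw["smoker"],
            "region": raw["region"],
            "charges": float(raw["charges"]),
        })
    return rows


if __name__ == "__main__":
    # Made-up row for illustration only; not taken from the real dataset.
    sample = io.StringIO(
        "age,sex,bmi,children,smoker,region,charges\n"
        "30,male,25.5,1,no,northwest,1234.5\n"
    )
    print(load_rows(sample))
```

With a real file, the same function could be called as `load_rows(open(path, newline=""))`, where `path` points to wherever the downloaded CSV is stored.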
The project seeks to answer the following questions with a significance level of
- Are insurance payouts for male clients higher than for female clients?
- Are insurance payouts for non-smokers less than for smokers?
- Does the region of residence affect the size of payouts?
- Is there a relationship between smoking and gender?
- Data Preprocessing: Initial data cleaning and preparation for analysis.
- Exploratory Data Analysis (EDA): Conducted to identify patterns and relationships in the data.
- Statistical Testing: Employed various statistical tests to answer the research questions, including:
  - Shapiro-Wilk test for normality.
  - Mann-Whitney U test and Kruskal-Wallis test for comparing distributions.
  - Chi-square test for independence between categorical variables.
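The tests listed above could be wired together roughly as follows. This is a minimal sketch with SciPy and NumPy, not the code from the project's notebooks: the helper names (`is_normal`, `compare_groups`, `independence_test`) are hypothetical, the significance level is passed in as `alpha` rather than fixed, and the demo at the bottom runs on synthetic data with an illustrative `DEMO_ALPHA`.

```python
import numpy as np
from scipy import stats


def is_normal(sample, alpha):
    """Shapiro-Wilk test: True if normality is not rejected at level alpha."""
    _, p_value = stats.shapiro(sample)
    return bool(p_value >= alpha)


def compare_groups(groups, alpha, alternative="two-sided"):
    """Non-parametric comparison of payout distributions.

    Two groups: Mann-Whitney U (supports one-sided alternatives).
    Three or more groups: Kruskal-Wallis (the alternative argument is ignored).
    Returns (p_value, reject_null).
    """
    if len(groups) == 2:
        _, p_value = stats.mannwhitneyu(groups[0], groups[1], alternative=alternative)
    else:
        _, p_value = stats.kruskal(*groups)
    return float(p_value), bool(p_value < alpha)


def independence_test(labels_a, labels_b, alpha):
    """Chi-square test of independence for two categorical label sequences."""
    # Map each label to an integer code, then count co-occurrences.
    _, a_codes = np.unique(np.asarray(labels_a), return_inverse=True)
    _, b_codes = np.unique(np.asarray(labels_b), return_inverse=True)
    table = np.zeros((a_codes.max() + 1, b_codes.max() + 1), dtype=int)
    np.add.at(table, (a_codes, b_codes), 1)
    _, p_value, _, _ = stats.chi2_contingency(table)
    return float(p_value), bool(p_value < alpha)


if __name__ == "__main__":
    DEMO_ALPHA = 0.05  # illustrative only; not necessarily the project's level
    rng = np.random.default_rng(0)
    # Synthetic, right-skewed "payouts" standing in for the real charges column.
    non_smokers = rng.lognormal(mean=9.0, sigma=1.0, size=200)
    smokers = rng.lognormal(mean=11.0, sigma=1.0, size=200)

    print("normal?", is_normal(non_smokers, DEMO_ALPHA))
    # alternative="less": the first group tends to have lower values than the second.
    print(compare_groups([non_smokers, smokers], DEMO_ALPHA, alternative="less"))
```

The normality check comes first because it motivates the choice of tests: when Shapiro-Wilk rejects normality, rank-based tests such as Mann-Whitney U and Kruskal-Wallis are a common alternative to the t-test and ANOVA.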
- No significant difference in insurance payouts between male and female clients.
- Insurance payouts for non-smokers are significantly lower than for smokers.
- No evidence that the region of residence affects the size of insurance payouts.
- A significant relationship exists between smoking status and gender, indicating that smoking preferences differ between genders.
The statistical analysis provides valuable insights into factors influencing insurance payouts, highlighting the importance of smoking status over gender or region of residence. These findings can aid insurance companies in refining their risk assessment models and tailoring insurance policies more effectively.
- Clone the repository to access the Jupyter notebooks.
- Install the required Python packages listed in requirements.txt.
- Explore the notebooks to see the data analysis, statistical testing, and conclusions in detail.
- Incorporate more client characteristics to explore additional factors influencing insurance payouts.
- Apply machine learning models to predict insurance payouts based on client attributes.
- Extend the analysis to compare different insurance products and their dependency on client attributes.
The project is distributed under the MIT license. You can freely use and distribute this code for personal and commercial purposes with a mandatory link to the author.
Credits to data providers, contributors, and any references used in the development of this project.