# Insurance Data

This Jupyter Notebook takes an insurance data set from Kaggle and looks into the relationships between the different parameters given. First, we check for a correlation (a linear relationship) between BMI and insurance charges, and then we use an A/B test to see whether being a smoker influences what you’re charged by insurance companies.

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# plt.style.use('ggplot')
# read table
insurance = pd.read_csv('insurance.csv')
insurance.head()
```

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |

## Data Exploration

The best place to start with any data-oriented project is to figure out how the data look. To this end, we take a look at the distributions of the different data in the `insurance` DataFrame. The figure created below has a histogram or bar chart for each column to show the counts of the data contained therein.
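One way such a figure could be drawn is sketched here. This is only an illustration: the helper name `plot_distributions` and its `ncols` parameter are assumptions, and the toy frame reuses two columns from the five rows shown above rather than the full data set.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_distributions(df, ncols=3):
    """Histogram for each numeric column, bar chart of counts for every other column."""
    nrows = -(-len(df.columns) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, df.columns):
        if pd.api.types.is_numeric_dtype(df[col]):
            ax.hist(df[col].dropna(), bins=20)
        else:
            counts = df[col].value_counts()
            ax.bar(counts.index.astype(str), counts.values)
        ax.set_title(col)
    # hide any unused panels in the grid
    for ax in flat_axes[len(df.columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes

# toy frame standing in for `insurance` (values taken from the head of the table)
toy = pd.DataFrame({
    'bmi': [27.900, 33.770, 33.000, 22.705, 28.880],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
})
fig, axes = plot_distributions(toy, ncols=2)
```

With the real data, the call would presumably be `plot_distributions(insurance)`.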

## Is there a correlation between BMI and insurance charges?

Correlation is calculated by taking two variables, converting each to standard units, multiplying the paired values elementwise, and then finding the mean of the products (all of this is defined in the `correlation` function below). The value of $r$, hereafter referred to as the correlation, ranges from -1 to 1; a value near 1 indicates a positive linear relationship (i.e. a line with a positive slope), near -1 indicates a negative linear relationship, and near 0 indicates little/no linear relationship.

Accompanying the calculation of $r$ are a scatter plot and a joint density plot of the data, with BMI on the $x$-axis and insurance charges on the $y$-axis.

```
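# standard units: subtract the mean, then divide by the standard deviation
standard_units = lambda x: (x - np.mean(x)) / np.std(x)
# correlation r: mean of the products of the standardized values
correlation = lambda x, y: np.mean(standard_units(x) * standard_units(y))
# slope and intercept of the regression line
slope = lambda x, y: correlation(x, y) * np.std(y) / np.std(x)
intercept = lambda x, y: np.mean(y) - slope(x, y) * np.mean(x)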
```
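As a sanity check on the standard-units definition, the sketch below computes $r$ by hand on a few made-up values and compares it with NumPy's built-in `np.corrcoef`. The names `to_standard_units`, `pearson_r`, `toy_x`, and `toy_y` are illustrative assumptions, and the numbers are not from the insurance data.

```python
import numpy as np

def to_standard_units(x):
    """Center x on its mean and scale by its (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - np.mean(x)) / np.std(x)

def pearson_r(x, y):
    """Mean of the elementwise products of the two standardized arrays."""
    return np.mean(to_standard_units(x) * to_standard_units(y))

# toy values, chosen to lie close to a line with positive slope
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_manual = pearson_r(toy_x, toy_y)
r_numpy = np.corrcoef(toy_x, toy_y)[0, 1]
print(r_manual, r_numpy)  # the two values should agree up to rounding
```

If the two numbers disagree, the standard-units step is the first place to look: it must subtract the mean and divide by the standard deviation, in that order.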

Since the value of $r$ was around 0.1, there might be a *small* correlation between BMI and insurance charges, but the relationship is weak; at most, the plots suggest something (very) loosely positive. To check whether the relationship is closer to linear on a different scale, we look at BMI against the insurance charges on a log scale:
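A minimal sketch of this kind of check, using made-up numbers rather than the insurance data (the names `pearson_r`, `toy_bmi`, and `toy_charges` are illustrative assumptions): it computes $r$ on the raw values and again after applying `np.log` to the charges.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation as the mean product of the two variables in standard units."""
    zx = (x - np.mean(x)) / np.std(x)
    zy = (y - np.mean(y)) / np.std(y)
    return np.mean(zx * zy)

toy_bmi = np.array([20.0, 24.0, 28.0, 32.0, 36.0, 40.0])
# toy charges built to grow exponentially with BMI, so their log is exactly linear in BMI
toy_charges = 1000.0 * np.exp(0.1 * toy_bmi)

r_raw = pearson_r(toy_bmi, toy_charges)
r_log = pearson_r(toy_bmi, np.log(toy_charges))
print(r_raw, r_log)
```

On the real data, the same comparison would presumably use `insurance['bmi']` and `np.log(insurance['charges'])`.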

### Conclusion

As it turns out, although the computed correlation is 0.7, the plot does not show much of a pattern. So while there may be *some* linear relationship between BMI and insurance charges (or their log), it’s not very apparent.

## Does being a smoker affect what you’re charged by the insurance company?

**Null Hypothesis**: Being a smoker does not affect your charges; any differences in the observed values are due to random chance.

**Alternative Hypothesis**: Being a smoker *does* affect what you are charged in insurance premiums.

This question, from a data science perspective, is asking whether or not the charges for the groups `smoker` and `non-smoker` come from the same underlying distribution. To find this out, we use an A/B test, which involves shuffling the data in question (that is, sampling *without replacement*), computing a test statistic on each shuffle, and finding the p-value. By convention, if the p-value is less than .05 (meaning fewer than 5% of the shuffled data sets give a test statistic at least as extreme as that of the original data set), we lean in the direction of the alternative hypothesis. In this A/B test, the test statistic will be the absolute difference between the mean charges for smokers and non-smokers.

Before doing the actual permutation test, we will run through a single permutation to demonstrate the process that will eventually be done thousands of times to obtain a p-value. The first part of the permutation test is to shuffle up the `charges` column of the `insurance` DataFrame and create a new column with the shuffled-up charges. It is from this *permuting* of the sample that permutation tests get their name.

```
smoker_and_charges = insurance[['smoker', 'charges']]
charges_shuffled = list(smoker_and_charges.sample(frac=1)['charges'])
shuffled_charges_df = smoker_and_charges.assign(shuffled_charges=charges_shuffled)
shuffled_charges_df.head()
```

| | smoker | charges | shuffled_charges |
|---|---|---|---|
| 0 | yes | 16884.92400 | 6600.3610 |
| 1 | no | 1725.55230 | 12231.6136 |
| 2 | no | 4449.46200 | 3484.3310 |
| 3 | no | 21984.47061 | 8442.6670 |
| 4 | no | 3866.85520 | 42983.4585 |

The next part of the test is to compute the value of the test statistic, which is the absolute difference between the mean charges for smokers and non-smokers. (A high value of this statistic points in the direction of the alternative hypothesis.) To this end, we define the function `ts`, which takes a DataFrame and a column name as its arguments and returns the absolute difference of the mean value of `col_name` after `df` has been grouped by the column `smoker`.

```
def ts(df, col_name):
    # mean of every numeric column within each smoker group
    df_grouped = df.groupby('smoker').mean()
    return abs(df_grouped[col_name].iloc[0] - df_grouped[col_name].iloc[1])

# computing the test statistic on the table shuffled_charges_df
test_stat_1 = ts(shuffled_charges_df, 'shuffled_charges')
test_stat_1
```

```
668.2917138336725
```
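To make the statistic concrete, here is a small worked example on a hand-made table where the group means are easy to check by eye. The helper name `abs_mean_diff` and the toy values are assumptions for illustration only.

```python
import pandas as pd

def abs_mean_diff(df, col_name, group_col='smoker'):
    """Absolute difference between the two group means of col_name."""
    group_means = df.groupby(group_col)[col_name].mean()
    return abs(group_means.iloc[0] - group_means.iloc[1])

# toy table: smokers average (30 + 50) / 2 = 40, non-smokers (10 + 20) / 2 = 15
toy = pd.DataFrame({
    'smoker': ['yes', 'no', 'yes', 'no'],
    'charges': [30.0, 10.0, 50.0, 20.0],
})

toy_stat = abs_mean_diff(toy, 'charges')
print(toy_stat)  # 25.0
```

Because the absolute value is taken, the order in which `groupby` returns the two groups does not matter.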

Finally, we are ready for the permutation test. The function `perm_test` below takes a DataFrame as its argument and the number of replications, `reps`, to go through. For each replication, it permutes `df` as we did above and computes the value of the test statistic, collecting them in the list `stats`. After collecting these values, it computes the test statistic for the *original* data, and returns a p-value by taking the proportion of test statistics that are *greater than or equal to* the observed value.

```
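def perm_test(df, reps):
    stats = []
    for _ in np.arange(reps):
        # shuffle the charges and attach them as a new column
        charges_shuffled = list(df.sample(frac=1)['charges'])
        df = df.assign(shuffled_charges=charges_shuffled)
        stat = ts(df, 'shuffled_charges')
        stats += [stat]
    # test statistic for the original (unshuffled) data
    observed_ts = ts(df, 'charges')
    # proportion of shuffled statistics at least as large as the observed one
    return np.count_nonzero(np.array(stats) >= observed_ts) / len(stats)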
```

```
# run the permutation test with 100,000 repetitions
perm_test(smoker_and_charges, 100000)
```

```
0.0
```
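For comparison, the same idea can be sketched with NumPy alone and a seeded random generator, which makes the result reproducible from run to run. This is an alternative sketch, not the function used above: `perm_test_np`, its `seed` parameter, and the toy arrays are all assumptions.

```python
import numpy as np

def perm_test_np(values, is_group_a, reps, seed=0):
    """Estimate a permutation-test p-value for the absolute difference in group means."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_group_a = np.asarray(is_group_a, dtype=bool)

    def abs_mean_diff(v):
        return abs(v[is_group_a].mean() - v[~is_group_a].mean())

    observed = abs_mean_diff(values)
    # rng.permutation returns a shuffled copy, i.e. a sample without replacement
    simulated = np.array([abs_mean_diff(rng.permutation(values)) for _ in range(reps)])
    return np.count_nonzero(simulated >= observed) / reps

# toy data: two clearly separated groups of four values each
toy_values = np.array([50.0, 55.0, 60.0, 65.0, 10.0, 12.0, 14.0, 16.0])
toy_is_smoker = np.array([True, True, True, True, False, False, False, False])
p_separated = perm_test_np(toy_values, toy_is_smoker, reps=2000)

# identical values in both groups: every shuffle ties the observed statistic of 0
p_identical = perm_test_np(np.ones(8), toy_is_smoker, reps=200)
print(p_separated, p_identical)
```

The second call illustrates why the comparison is *greater than or equal to*: when the groups do not differ at all, every shuffle counts and the p-value is 1.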

### Conclusion

Because the p-value is 0, we know that *none* of the 100,000 shuffled sets were as far or farther in the direction of the alternative hypothesis than was the original data set; this means that in all likelihood, the observed differences are *not* due to random chance. Thus, we lean in the direction of the alternative hypothesis: that being a smoker affects what you’re charged by insurance companies. Conventional wisdom, I know, but it is still nice to see it supported empirically.