Predictive Power Score: an introduction
   6 min read    Rohit Pruthi

1.0 What is correlation?

In non-mathematical terms, correlation is easy to understand with the picture below.

  • Do two things move together?
  • Do they move independently of each other?
  • Do they move opposite of each other?

[Image: three panels illustrating variables that move together, independently, and opposite to each other]

We can intuitively understand it.

  • The number of hours put into studying is correlated with marks.
  • Net caloric intake is likely highly correlated with weight.

We can infer these because the inherent causation is clear in these cases. However, in complex multi-variable problems, causation is not always clear. And who better than xkcd to explain this.

[Image: xkcd comic on confusing correlation with causation]

If you have used data science or statistics before, you have probably heard the caution ‘Correlation is not causation’.

In practice, sometimes correlation is not even the correct measure of dependence.

We need to understand how it is defined before making use of it, and there are some experiments in better ways of looking at dependence.

2.0 Of course, I know correlation - Spearman rank, Pearson, Kendall...

  • The score ranges from -1 to 1
  • Tells you if there is a strong linear (or, for Spearman and Kendall, monotonic) relationship, in either a positive or negative direction.

What about a non-linear relationship? A sine curve, a step function? Anyway.
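Here is a quick toy demonstration (my own example, not from any library's docs): Pearson correlation can completely miss a perfect cosine relationship, because the linear covariance cancels out over a symmetric interval.

import numpy as np
import pandas as pd

# y is fully determined by x, yet the linear correlation is ~0
x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)
toy = pd.DataFrame({"x": x, "y": np.cos(x)})
print(toy["x"].corr(toy["y"]))  # Pearson, approximately 0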

  • Did you drop all the categorical columns already? Well, you have to.

What if the majority of my columns are ordinal? Can I one-hot encode them, and would that even be meaningful?
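For genuinely ordinal columns there is a gentler option than one-hot encoding (a sketch of my own, with made-up data): map the ordered categories to integer codes and use a rank-based correlation such as Spearman, which only cares about order.

import pandas as pd

# Hypothetical ordinal column: sizes have a natural order
sizes = pd.Series(["small", "medium", "large", "medium", "large"])
order = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
codes = sizes.astype(order).cat.codes  # small=0, medium=1, large=2
prices = pd.Series([1.0, 2.1, 3.4, 2.0, 3.6])
print(codes.corr(prices, method="spearman"))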

3.0 Enter PPS

There is a different way of thinking about correlation.

PPS, the Predictive Power Score, uses decision trees as the basis for calculating the relationship between variables.

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).
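At its core the idea is simple: train a model (ppscore uses a decision tree) to predict one column from the other, and normalize its error against a naive baseline. Here is a minimal sketch of that idea for the regression case; it is a simplification under my own choices (median baseline, 4-fold cross-validation), not the library's exact implementation.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def naive_pps(df, feature, target):
    # Cross-validated mean absolute error of a single-feature decision tree
    model_mae = -cross_val_score(
        DecisionTreeRegressor(), df[[feature]], df[target],
        scoring="neg_mean_absolute_error", cv=4,
    ).mean()
    # Naive baseline: always predict the median of the target
    baseline_mae = (df[target] - df[target].median()).abs().mean()
    # 1 = perfect prediction, 0 = no better than the baseline
    return max(0.0, 1 - model_mae / baseline_mae)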

Of course, I won’t go into all the details here, but you can always read about it. Let us try it out.

4.0 Time to install ppscore

!pip install ppscore
Requirement already satisfied: ppscore in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (1.2.0)
Requirement already satisfied: pandas<2.0.0,>=1.0.0 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from ppscore) (1.1.5)
Requirement already satisfied: scikit-learn<1.0.0,>=0.20.2 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from ppscore) (0.23.2)
Requirement already satisfied: numpy>=1.15.4 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from pandas<2.0.0,>=1.0.0->ppscore) (1.19.5)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from pandas<2.0.0,>=1.0.0->ppscore) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from pandas<2.0.0,>=1.0.0->ppscore) (2021.1)
Requirement already satisfied: six>=1.5 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from python-dateutil>=2.7.3->pandas<2.0.0,>=1.0.0->ppscore) (1.15.0)
Requirement already satisfied: joblib>=0.11 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from scikit-learn<1.0.0,>=0.20.2->ppscore) (1.0.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from scikit-learn<1.0.0,>=0.20.2->ppscore) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from scikit-learn<1.0.0,>=0.20.2->ppscore) (1.5.4)

5.0 Import some libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import ppscore as pps

6.0 Create data for exploration

This is a square function, which will intentionally fail the correlation test (I know, wickedly simple! It is copied from the ppscore documentation).

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = (df["x"]*df["x"]) + df["error"]

sns.scatterplot(data=df, x='x', y='y')
[scatter plot: a parabola-shaped point cloud of y against x]

x and y are obviously related; there is a square relationship between them.

Let us see how correlation and PPS handle this.

7.0 Check correlation

I love how easy it is now to get correlation.

Let’s get Pearson first

sns.heatmap(df[['x','y']].corr(), vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: Pearson correlation between x and y is close to 0]

and then Spearman

sns.heatmap(df[['x','y']].corr(method='spearman'), vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: Spearman correlation between x and y is close to 0]

and Kendall

sns.heatmap(df[['x','y']].corr(method='kendall'), vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: Kendall correlation between x and y is close to 0]

So, what do you think? All three correlation scores come out close to zero, even though y is almost completely determined by x.

Of course, we knew that would happen.

Over to PPS now

8.0 Using pps

The calculation is fairly similar to the correlation matrix, and the result can be plotted the same way with a heatmap. One wrinkle: recent versions of ppscore return the matrix as a long-format dataframe, so we pivot it into a square matrix before plotting.

df_matrix = pps.matrix(df[['x','y']])[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(df_matrix, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: high PPS for predicting y from x, near 0 for predicting x from y]

Why would the score for predicting x from y be 0? Let us look into pps.score.

pps.score(df, x='y', y='x', task=None, sample=5000)
{'x': 'y',
 'y': 'x',
 'task': 'regression',
 'ppscore': 0.0347825646932195,
 'metric': 'mean absolute error',
 'baseline_score': 1.0072856030153572,
 'model_score': 0.9722496263639269,
 'model': DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')}

Interesting. The model's mean absolute error (about 0.97) is barely better than the baseline (about 1.01), which is why the PPS lands near 0: for any given y there are two equally likely values of x, one positive and one negative, so y cannot predict x.
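For contrast, we can score the opposite direction. Predicting y from x should come out high, since y is essentially x squared plus a little noise:

pps.score(df, x='x', y='y', task=None, sample=5000)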

What if we flip to the square root of x?

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.1, 0.1, 1_000_000)
df["y"] = np.sqrt(abs(df["x"])) + df["error"]

sns.scatterplot(data=df, x='x', y='y')
[scatter plot: y against x tracing the y = sqrt(|x|) curve]

df_matrix = pps.matrix(df[['x','y']])[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(df_matrix, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: PPS for predicting x from y is still 0]

Still the same 0 for predicting x from y.

This is because of the large variation in x for a given y: in the chart below, the horizontal line crosses the curve at two widely separated points, so a single y maps to two possible x values.

sns.scatterplot(data=df, x='x', y='y').axhline(1.0)

[scatter plot of y against x with a horizontal line at y = 1.0]

9.0 Non-simulated data

Let us start by looking at a baseball data source.

example3 = pd.read_csv('Example3.csv', header=None)
example3.columns = ['Pitching', 'Defense', 'Hitting', 'Win Percentage']
example3.head()

Pitching Defense Hitting Win Percentage
0 0.487013 3.98 0.975 0.692104
1 0.402597 4.94 0.969 0.717472
2 0.610390 3.90 0.967 0.809883
3 0.464052 4.40 0.964 0.814853
4 0.640523 3.82 0.965 0.835017

This basically says how good a team is in each of these aspects, and how that correlates to its win percentage.

sns.heatmap(example3.corr(), vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: high correlations between the individual skills and Win Percentage]

While the correlation scores are high (see the picture above),

sns.heatmap(pps.matrix(example3)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore'), vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: low PPS scores for predicting Win Percentage from any single variable]

individually, none of the variables would be able to determine the win percentage.
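Keep in mind that PPS is univariate: it scores one predictor at a time. The three skills together might still predict the win percentage well; one quick (hypothetical) way to check is to fit a single model on all of them:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Cross-validated R^2 of a tree using all three features together
X = example3[['Pitching', 'Defense', 'Hitting']]
y = example3['Win Percentage']
print(cross_val_score(DecisionTreeRegressor(), X, y, cv=4).mean())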

10.0 Looking at categorical data - with Iris

iris = sns.load_dataset('iris')
iris.head()

sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Can you predict the species of a flower from its measurements?

sns.heatmap(iris.corr(), vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: correlations among the four numeric iris columns; species is absent]

Notice that only the numerical variables get picked up by correlation. Let us check with PPS.

df_matrix = pps.matrix(iris)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(df_matrix, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

[heatmap: PPS matrix including species, which is highly predictable from the petal columns]

Better. We can see that species is almost entirely predictable from petal width and petal length, and not so much from the sepal measurements.

sns.pairplot(iris, hue='species')

[pair plot of the iris features, colored by species]
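You can also drill into a single cell of the matrix with pps.score; for example (my choice of pair), the score for predicting species from petal_width, which ppscore treats as a classification task:

pps.score(iris, x='petal_width', y='species')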

That was helpful going into a model, or maybe it stopped us from going into a model at all!

11.0 Summary

PPS is not a silver bullet, and neither is correlation. In fact, there are no silver bullets, except of course actual silver bullets.

Anyway, we need to use them both in conjunction to get the best outcome.

Check https://github.com/8080labs/ppscore for more details.