1.0 What is correlation?
In non-mathematical terms, correlation is easy to understand by asking a few simple questions.
 Do two things move together?
 Do they move independently of each other?
 Do they move opposite of each other?
We can intuitively understand it.
 The number of hours put into studying is correlated with marks.
 Net caloric intake is likely highly correlated with weight.
We can infer these because in these cases the inherent causation is clear. However, in complex multivariable problems, causation is not always clear. And who better than xkcd to explain this.
If you have used data science or statistics earlier, you would have heard the caution ‘Correlation is not causation’.
In practice, correlation is sometimes not even the correct measure of dependence.
We need to understand how it is defined before making use of it, and there are some experiments with better ways of looking at dependence.
2.0 Of course, I know correlation: Spearman rank, Pearson, Kendall...
 The score ranges from -1 to 1
 Tells if there is a strong linear relationship — either in a positive or negative direction.
What about non-linear relationships? A sine curve, a step function? Anyway.
 Did you drop all the categorical columns already? Well, you have to.
What if the majority of my columns are ordinal? Can I one-hot encode them, and would that be meaningful?
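Before moving on, a quick illustration (my own sketch, not from the original notebook): even a perfectly deterministic but non-linear relationship can produce a near-zero Pearson score.

```python
import numpy as np
import pandas as pd

# y is completely determined by x, yet the linear correlation vanishes
# because the relationship is symmetric around zero.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.uniform(-2, 2, 10_000)})
df["y"] = df["x"] ** 2

print(df["x"].corr(df["y"], method="pearson"))  # close to 0
```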
3.0 Enter PPS
There is another different way of thinking about correlation.
PPS (Predictive Power Score) uses decision trees as the basis for calculating the relationship between variables.
The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).
Of course, I won’t go into all the details here, but you can always read about it. Let us try it out.
4.0 Time to install ppscore
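The install cell itself did not survive the export; it was presumably just:

```shell
pip install ppscore
```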


Requirement already satisfied: ppscore in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (1.2.0)
Requirement already satisfied: pandas<2.0.0,>=1.0.0 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from ppscore) (1.1.5)
Requirement already satisfied: scikit-learn<1.0.0,>=0.20.2 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from ppscore) (0.23.2)
Requirement already satisfied: numpy>=1.15.4 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from pandas<2.0.0,>=1.0.0->ppscore) (1.19.5)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from pandas<2.0.0,>=1.0.0->ppscore) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from pandas<2.0.0,>=1.0.0->ppscore) (2021.1)
Requirement already satisfied: six>=1.5 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from python-dateutil>=2.7.3->pandas<2.0.0,>=1.0.0->ppscore) (1.15.0)
Requirement already satisfied: joblib>=0.11 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from scikit-learn<1.0.0,>=0.20.2->ppscore) (1.0.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from scikit-learn<1.0.0,>=0.20.2->ppscore) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in c:\users\pruthir\appdata\local\continuum\anaconda3\envs\gallup\lib\site-packages (from scikit-learn<1.0.0,>=0.20.2->ppscore) (1.5.4)
5.0 Import some libraries




6.0 Create data for exploration
This is a square function, which will intentionally fail the correlation test (I know, wickedly simple, and copied from the ppscore documentation).
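The data-creation cell is missing; reconstructed below along the lines of the ppscore README example it was likely copied from (with a smaller sample so it runs quickly):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# A square function plus uniform noise.
df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 10_000)
df["error"] = np.random.uniform(-0.5, 0.5, 10_000)
df["y"] = df["x"] * df["x"] + df["error"]

sns.scatterplot(data=df, x="x", y="y")
```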


[scatter plot of y vs x]
x and y are obviously related: there is a square relationship between them.
Let us see how correlation and PPS handle this.
7.0 Check correlation
I love how easy it is now to get correlation.
Let’s get Pearson first
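The cell is missing, but it was presumably a one-liner along these lines (a self-contained sketch; swapping the method string gives the Spearman and Kendall versions below):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Rebuild the square-function data from the previous section.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(-2, 2, 5_000)})
df["y"] = df["x"] ** 2 + rng.uniform(-0.5, 0.5, 5_000)

corr = df.corr(method="pearson")  # or method="spearman" / "kendall"
sns.heatmap(corr, vmin=-1, vmax=1, annot=True)
```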


[Pearson correlation heatmap]
and then Spearman


[Spearman correlation heatmap]
and Kendall


[Kendall correlation heatmap]
So, what do you think?
x is not linearly related to y.
Of course, we know that.
Over to PPS now
8.0 Using pps
The calculation is fairly similar to a correlation matrix, and the result can be plotted the same way with a heatmap.


[PPS matrix heatmap]
Why would the score for predicting x from y be 0? Let us look into pps.score.


{'x': 'y',
'y': 'x',
'task': 'regression',
'ppscore': 0.0347825646932195,
'metric': 'mean absolute error',
'baseline_score': 1.0072856030153572,
'model_score': 0.9722496263639269,
'model': DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')}
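The dict above makes the score easy to verify by hand: for a regression task, ppscore normalizes the model's mean absolute error against a naive baseline's.

```python
# ppscore (regression) = 1 - model MAE / baseline MAE,
# using the numbers from the output above.
baseline_score = 1.0072856030153572  # MAE of the naive baseline
model_score = 0.9722496263639269     # MAE of the decision tree
ppscore_value = 1 - model_score / baseline_score
print(round(ppscore_value, 6))  # 0.034783, matching 'ppscore' above
```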
Interesting: the model's error (0.97) is only marginally better than the baseline's (1.01), which is why the PPS comes out close to 0.
What if we flip to square root of x?




Still the same 0 between y and x.
This is because of the large variation in x for a given y (see below chart and horizontal line)
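The chart cell is missing; a sketch of it: scatter the parabola and draw a horizontal line to show that a single y value maps back to two distant ranges of x.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(-2, 2, 2_000)})
df["y"] = df["x"] ** 2 + rng.uniform(-0.5, 0.5, 2_000)

plt.scatter(df["x"], df["y"], s=4)
plt.axhline(y=2, color="red")  # one y value, two very different x regions
plt.xlabel("x")
plt.ylabel("y")
```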


[scatter plot with a horizontal reference line]
9.0 Non-simulated data
Let us start by looking at some baseball data.
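The original loading cell (and the data source) did not survive the export; as a stand-in so the section can be followed, here is a hypothetical DataFrame holding just the five rows shown below:

```python
import pandas as pd

# Hypothetical stand-in for the missing load cell: the five rows shown below.
df = pd.DataFrame({
    "Pitching":       [0.487013, 0.402597, 0.610390, 0.464052, 0.640523],
    "Defense":        [3.98, 4.94, 3.90, 4.40, 3.82],
    "Hitting":        [0.975, 0.969, 0.967, 0.964, 0.965],
    "Win Percentage": [0.692104, 0.717472, 0.809883, 0.814853, 0.835017],
})
df.head()
```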






Pitching  Defense  Hitting  Win Percentage  

0  0.487013  3.98  0.975  0.692104 
1  0.402597  4.94  0.969  0.717472 
2  0.610390  3.90  0.967  0.809883 
3  0.464052  4.40  0.964  0.814853 
4  0.640523  3.82  0.965  0.835017 
This basically says how good a team is in each of these aspects and how that correlates with their win percentage.


[correlation heatmap]
While the correlation scores are high (see the heatmap above),


[PPS matrix heatmap]
individually, none of the variables would be able to determine the win rate.
10.0 Looking at categorical data with Iris
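The load cell is missing; the column names match seaborn's bundled iris dataset (sns.load_dataset("iris")), and the same frame can be built offline from scikit-learn:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the seaborn-style iris frame from scikit-learn's bundled copy.
iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target").rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})
df["species"] = iris.target_names[iris.target]
df.head()
```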


sepal_length  sepal_width  petal_length  petal_width  species  

0  5.1  3.5  1.4  0.2  setosa 
1  4.9  3.0  1.4  0.2  setosa 
2  4.7  3.2  1.3  0.2  setosa 
3  4.6  3.1  1.5  0.2  setosa 
4  5.0  3.6  1.4  0.2  setosa 
Can you predict the species of a flower from its measurements?


[correlation heatmap of the numeric iris columns]
Notice that only the numerical variables get picked up by correlation. Let us check with PPS.


[PPS matrix heatmap, including species]
Better: we can see that species is almost entirely predictable from petal width and petal length, and not so much from the sepal measurements.
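To see where that petal-width score comes from, here is a rough hand-rolled version of the idea behind ppscore for a categorical target: compare a decision tree's cross-validated weighted F1 against a most-frequent-class baseline, then normalize the lift (a sketch of the concept, not the library's exact implementation):

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris(as_frame=True)
X = iris.frame[["petal width (cm)"]]  # single predictor
y = iris.target_names[iris.target]    # species labels

tree_f1 = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X, y, cv=4, scoring="f1_weighted").mean()
naive_f1 = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X, y, cv=4, scoring="f1_weighted").mean()

# Normalize the improvement over the naive baseline into a 0-1 score.
pps_like = (tree_f1 - naive_f1) / (1 - naive_f1)
print(round(pps_like, 2))
```

A score near 1 for petal width alone matches the heatmap above; running the same snippet with a sepal column gives a much lower score.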


[seaborn pairplot of the iris features, colored by species]
That was helpful going into a model, or maybe stopped us from going into a model at all!
11.0 Summary
PPS is not a silver bullet, and neither is correlation. In fact, there are no silver bullets, except of course actual silver bullets.
We need to use them both in conjunction to get the best outcome.
Check https://github.com/8080labs/ppscore for more details.