Distance metrics with mixed data types
   10 min read    Rohit Pruthi

1.0 What is distance?

At its simplest, distance is a quantitative measure of how far apart two objects are in space. I am sure you know that. Spatial distance has been part of human perception since time immemorial, and has been quantified and standardized across the world over at least the last few thousand years.

Another way to think of this, particularly in statistics, is as dissimilarity: closer means similar, and further means dissimilar.

1.1 What is the role of distance in machine learning?

Distance measures play an important role in machine learning. This similarity, or distance, is the basic building block for tasks such as

  • Recommendation engines,
  • Clustering,
  • Classification problems - such as classifying email as spam or ham

1.2 Practical aspects of using distance metrics

Most distance measures were built for numerical features. In a simple two-dimensional world, the simplest form of distance is the well-known - as the crow flies - Euclidean distance.

[Image: straight line between two points, calculated using the Pythagorean formulation (from Wikipedia)]
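As a minimal sketch in numpy (sample points assumed for illustration):

import numpy as np

## euclidean distance between two 2-D points via the pythagorean formulation
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
print(np.sqrt(np.sum((p - q) ** 2)))  ## sqrt(3**2 + 4**2) = 5.0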

However, as the data becomes multi-dimensional, the question of the optimal calculation inherently pops up. With the curse of dimensionality, the Manhattan (or cityblock) distance becomes a better choice; the linked paper goes into some detail on the subject.

[Image: Manhattan (cityblock) distance illustration]

There are many more distance metrics available to choose from:

  • minkowski
  • correlation
  • cosine
  • mahalanobis

For a detailed view of these, please refer to [2]. A quick scipy sketch follows below.
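Here is a minimal sketch using scipy on an assumed pair of sample points (mahalanobis additionally needs a covariance estimate, so it is omitted here):

from scipy.spatial.distance import pdist

X = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]

## the same pair of points under a few different metrics
print(pdist(X, metric='cityblock'))       ## [9.] - manhattan
print(pdist(X, metric='minkowski', p=3))  ## [4.3267...]
print(pdist(X, metric='cosine'))          ## [0.1146...]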

Further, as more data types are introduced, the distance calculation becomes more subjective. Binary or categorical variables can be handled with specific metrics such as

  • dice
  • yule
  • kulsinski
  • russellrao

Dice distance can be calculated using the formulation below, counting agreements and disagreements between two boolean vectors $u$ and $v$:

$$d(u, v) = \frac{c_{TF} + c_{FT}}{2\,c_{TT} + c_{TF} + c_{FT}}$$

where $c_{XY}$ is the number of positions at which $u = X$ and $v = Y$.
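For example, with scipy's implementation on two boolean vectors:

from scipy.spatial.distance import dice

u = [True, True, False, True]
v = [True, False, False, False]

## c_TT = 1, c_TF = 2, c_FT = 0  ->  (2 + 0) / (2*1 + 2 + 0) = 0.5
print(dice(u, v))  ## 0.5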

It is with a mixture of data types that we start to consider composite measures, which combine selected metrics.

A lot of recent work has focused on similarity between text features, which is emerging as an altogether different area of research. Although it isn’t covered here, interested readers are encouraged to explore it; an image from that work is reproduced below.

[Image: overview of text similarity approaches]

1.3 Mixed Variable Distance measures

In June 2020, Sudha Bishnoi and B. K. Hooda surveyed a wide variety of methods and published their findings in the International Journal of Chemical Studies [3]. Below is a snapshot from their summary. We will mostly be looking at one of the classical measures: Gower distance.

[Image: snapshot of the methods summary from Bishnoi and Hooda]

Gower distance is a quantitative measure of the similarity between two rows of a dataset consisting of mixed-type attributes. It uses the concept of

  • Manhattan distance for continuous variables and
  • Dice distance for measuring similarity between binary variables (a toy sketch follows below).
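As a toy illustration of the idea, with hypothetical values and equal feature weights (a sketch, not the library's exact implementation):

## toy Gower sketch: one continuous and one categorical feature
row_a = {'age': 25, 'colour': 'red'}
row_b = {'age': 40, 'colour': 'blue'}
age_range = 50  ## assumed range of 'age' over the whole dataset

## continuous part: range-normalized absolute (manhattan-style) difference
d_num = abs(row_a['age'] - row_b['age']) / age_range        ## 0.3

## categorical part: 0 if the levels match, 1 otherwise
d_cat = 0.0 if row_a['colour'] == row_b['colour'] else 1.0  ## 1.0

## Gower distance: average of the per-feature distances
print((d_num + d_cat) / 2)                                  ## 0.65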

Let’s see how we go about calculating this measure.

2.0 Data Introduction - Create data from MS malware data

Let us start by reading the Microsoft malware data from Kaggle. We will be using only a fraction of this data for the analysis, filtered on product name: we restrict ourselves to the ‘mse’ product.

import pandas as pd

## set the file name
filename = 'C:/Users/PruthiR/Documents/2021/Scarecrow/train/train.csv'

## read a minimum number of columns (and all rows) to find the relevant rows
trial = pd.read_csv(filename, usecols=['ProductName', 'MachineIdentifier'])

## get the indices of the rows with the product of interest
index_to_read = set(trial[trial['ProductName'] == 'mse'].index)

## read only the relevant rows; file row x maps to dataframe index x - 1
## because of the header row, which we also skip here
mse = pd.read_csv(filename, header=None,
                  skiprows=lambda x: x == 0 or (x - 1) not in index_to_read)

## get column names from the header
col_names = pd.read_csv(filename, nrows=5).columns

## apply the column names as required
mse.columns = col_names

## sample a small fraction of the data
work_data = mse.sample(frac=0.02, random_state=42)

## delete the unused dataframes; garbage collection may be needed here to improve memory use
del trial, mse, col_names
work_data.dtypes.value_counts()
float64    36
object     30
int64      17
dtype: int64
int(((work_data.shape[0])**0.5)/2)
21

It looks like we have 30 object columns and 53 numerical columns (36 float64 and 17 int64).

2.1 When is an ‘object’ a ‘category’?

Not all objects are actually categorical. A column should be included in the model as a category only if its values repeat often enough to be captured by an algorithm. Identifiers masquerading as categories, for instance, may end up making the process noisier.

Rule of thumb: a column should be treated as categorical only if its number of unique values is less than half the square root of the number of rows available; otherwise it should be dropped.

Of course, this threshold is a hyperparameter and should be iterated on, but a good starting point can help accelerate its optimization.

## subset to get the object columns in a separate dataframe
work_data_object = work_data.select_dtypes(include=['object'])

## extract the number of rows and use it to define an upper limit for category definition
cap_cat = int(((work_data_object.shape[0])**0.5)/2)

## get the number of unique values of each object column to decide on category conversion
unique_vals = work_data_object.nunique()

## columns with too many unique values (threshold cap_cat; can be treated as a hyperparameter)
cols_high_cardinality = list(unique_vals[unique_vals > cap_cat].index)

## remove the high-cardinality columns
work_data_sub = work_data.drop(columns=cols_high_cardinality)

## drop columns containing NAs for now; a proper imputation scheme would be an improvement
work_data_sub_clean = work_data_sub.dropna(axis=1, how='any')

## convert the remaining object columns to category dtype
trial = work_data_sub_clean.copy()
obj_cols = trial.select_dtypes(include=['object']).columns
trial[obj_cols] = trial[obj_cols].astype('category')

## check the dtype distribution in the clean data; each category column carries
## its own distinct dtype, so value_counts lists them one by one
trial.dtypes.value_counts()
int64       17
float64      2
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
dtype: int64

2.2 Trying out Gower Distance

Gower distance uses Manhattan for calculating distance between continuous datapoints and Dice for calculating distance between categorical datapoints.

This is based on an implementation of Gower's classical 1971 statistical work addressing the mixed data type challenge in distance metrics:

https://www.jstor.org/stable/2528823?seq=1
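For reference, Gower's general coefficient averages per-feature similarity scores (the weights $w_{ijk}$ are typically 1 whenever a comparison is possible):

$$S_{ij} = \frac{\sum_{k=1}^{p} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{p} w_{ijk}}$$

where $s_{ijk} = 1 - |x_{ik} - x_{jk}|/R_k$ for a continuous feature with range $R_k$, and $s_{ijk} = 1$ if the categories match (0 otherwise) for a categorical feature. The Gower distance is then $1 - S_{ij}$.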

Improvements on the original formulation have been suggested as well and are worth looking into, for example: Modifications of the Gower Similarity Coefficient, October 2016, Conference: Applications of Mathematics and Statistics in Economics 2016.

2.3 Calculate the Gower matrix and find the top neighbors

The result is a square matrix, where the value at (i, j) represents the distance between datapoints i and j, with 0 meaning they are exactly the same.

We use the gower library (see its documentation for details). Normalization is handled inside the library directly.

import gower
import numpy as np

## functions which use the gower library

def get_gower_dist(dataset, target_col):

    ## remove the target column before calculating distances
    data_to_model = dataset.drop(columns=target_col)

    ## convert to a numpy array; gower then sniffs which columns are
    ## categorical from the values themselves
    X = np.asarray(data_to_model)

    ## calculate the distance matrix
    distmat = gower.gower_matrix(X)

    distmat_df = pd.DataFrame(distmat)

    return distmat_df

def get_neighbors_gower(dataset, row, neighbors, target='HasDetections'):

    dist_mat = get_gower_dist(dataset, target)

    ## get the indices of the closest points (the row itself comes first, at distance 0)
    interest_points = list(dist_mat[row].sort_values()[0:neighbors].index)

    ## subset the data for those points
    dataset_sub = dataset.iloc[interest_points, :]

    return dataset_sub

With this, you can now get the 5 closest neighbors of a given row (row 5 below).

get_neighbors_gower(trial, 5, 5)

ProductName IsBeta IsSxsPassiveMode HasTpm CountryIdentifier GeoNameIdentifier LocaleEnglishNameIdentifier Platform Processor OsVer ... Census_OSUILocaleIdentifier Census_OSWUAutoUpdateOptionsName Census_IsPortableOperatingSystem Census_GenuineStateName Census_ActivationChannel Census_FlightRing Census_IsSecureBootEnabled Census_IsTouchEnabled Census_IsPenCapable HasDetections
43194 win8defender 0 0 1 155 201.0 231 windows10 x64 10.0.0.0 ... 34 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
54597 win8defender 0 0 1 166 167.0 227 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
39773 win8defender 0 0 1 166 167.0 227 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
23288 win8defender 0 0 1 141 167.0 227 windows10 x64 10.0.0.0 ... 34 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 1
1126 win8defender 0 0 1 81 107.0 224 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 1

5 rows × 38 columns

In the heatmap below, the lighter a cell, the closer the two points. Only a 2-level color range has been used, to keep it as a simple ‘similar’ vs ‘different’ scale.

import matplotlib.pyplot as plt
import seaborn as sns

# create the distance matrix
df = get_gower_dist(trial, 'HasDetections')

# make it discrete: split each column at its median into 2 levels
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.to_numeric(pd.qcut(df[col], 2, labels=list(range(2))))

plt.figure(figsize=(20, 10))
sns.heatmap(df_q, cmap='Blues')

<matplotlib.axes._subplots.AxesSubplot at 0x4d155348>

[Heatmap: discretized Gower distance matrix, two color levels]

Looking at the 5 closest neighbors of row 5 across all the data, 2 of the 5 (40%) have a malware detection.
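That figure is simply the mean of the HasDetections flag over the neighbor subset returned above:

## share of malware detections among the 5 nearest neighbors of row 5
print(get_neighbors_gower(trial, 5, 5)['HasDetections'].mean())  ## 0.4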

2.4 Flexible Gower distance calculation

As discussed earlier, each data type has multiple options for calculating its distance metric. The intention of this section is to enable combining different methods into a flexible composite distance metric.

from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import MinMaxScaler
def get_dist_mixed(dataset, target='HasDetections'):

    ## feature weights can be learnt from user behavior or background models
    ## and fed into the distance metric to improve the recommendations over time

    ## remove the target variable
    dataset = dataset.drop(columns=target)

    ## subset to the numerical features
    num_feat = dataset.select_dtypes(include=['int64', 'float64'])

    ## scale to the 0-1 range
    scaler = MinMaxScaler()
    scaled_num_feat = scaler.fit_transform(num_feat)

    ## calculate the pairwise euclidean distance; can be changed to
    ## manhattan by using metric='cityblock'
    dist_mat_num = squareform(pdist(scaled_num_feat, metric='euclidean'))

    ## subset to the categorical features; one-hot encode them into booleans
    ## so that the dice metric applies
    cat_feat = pd.get_dummies(dataset.select_dtypes(include=['category'])).astype(bool)
    dist_mat_cat = squareform(pdist(cat_feat, metric='dice'))

    ## combine both matrices and rescale (column-wise 0-1 scaling)
    dist_mat_comb_df = pd.DataFrame(scaler.fit_transform(dist_mat_cat + dist_mat_num))

    return dist_mat_comb_df

def get_neighbors(dataset, row, neighbors):

    ## get the indices of the topn closest points
    interest_points = list(get_dist_mixed(dataset)[row].sort_values()[0:neighbors].index)

    ## subset for the topn points
    dataset_sub = dataset.iloc[interest_points, :]

    return dataset_sub
    
get_neighbors(trial, 5, 5)

ProductName IsBeta IsSxsPassiveMode HasTpm CountryIdentifier GeoNameIdentifier LocaleEnglishNameIdentifier Platform Processor OsVer ... Census_OSUILocaleIdentifier Census_OSWUAutoUpdateOptionsName Census_IsPortableOperatingSystem Census_GenuineStateName Census_ActivationChannel Census_FlightRing Census_IsSecureBootEnabled Census_IsTouchEnabled Census_IsPenCapable HasDetections
43194 win8defender 0 0 1 155 201.0 231 windows10 x64 10.0.0.0 ... 34 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
69148 win8defender 0 0 1 155 201.0 231 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 1
54597 win8defender 0 0 1 166 167.0 227 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
39773 win8defender 0 0 1 166 167.0 227 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
22758 win8defender 0 0 1 141 167.0 227 windows10 x64 10.0.0.0 ... 34 UNKNOWN 0 IS_GENUINE OEM:DM Retail 1 0 0 1

5 rows × 38 columns

# create the distance matrix with the flexible metric
df = get_dist_mixed(trial, 'HasDetections')

# make it discrete: split each column at its median into 2 levels
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.to_numeric(pd.qcut(df[col], 2, labels=list(range(2))))

plt.figure(figsize=(20, 10))
sns.heatmap(df_q, cmap='Blues')

<matplotlib.axes._subplots.AxesSubplot at 0x2b462548>

[Heatmap: discretized flexible mixed distance matrix, two color levels]

3.0 Further steps

3.1 Weighted distances

  1. Can we give features weights based on a model running in the background? (a hypothetical sketch follows below)
  2. Is it possible to mix in data types related to text as well? Or images? Moving further into mixed data types.
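For point 1, a hypothetical sketch of the direction (not part of the pipeline above): expose a weight when combining the two matrices inside get_dist_mixed, and let a background model tune it.

## hypothetical: blend the numeric and categorical distance matrices with a
## tunable weight alpha, instead of the fixed 1:1 sum used in get_dist_mixed
def combine_weighted(dist_mat_num, dist_mat_cat, alpha=0.5):
    return alpha * dist_mat_num + (1 - alpha) * dist_mat_cat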