Distance metrics with mixed data types

## 1.0 What is distance?

Distance is a quantitative measure of how far apart two objects are in space. Spatial distance has been part of human perception since time immemorial, and has been quantified and standardized across the world for at least the last few thousand years.

Another way to think of this, particularly in terms of statistics, is dissimilarity. Closer means similar, and further means dissimilar.

## 1.1 What is the role of distance in machine learning?

Distance measures play an important role in machine learning. This similarity or distance is the very basic building block for activities such as

• Recommendation engines,
• Clustering,
• Classification problems, such as labelling email as spam or ham

## 1.2 Practical aspects of using distance metrics

Most distance measures were built for numerical features. In a simple two-dimensional world, the simplest form of distance is the well-known, as-the-crow-flies Euclidean distance.

It is the straight line between two points, calculated using the Pythagorean formula. However, as the data becomes multi-dimensional, the question of optimal calculation inherently pops up. With the curse of dimensionality, the Manhattan (or city block) distance becomes a better choice. There are many more distance metrics to choose from:

• minkowski
• correlation
• cosine
• mahalanobis

For a detailed view of these, the SciPy spatial distance documentation is a good reference.
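For a quick feel of how these metrics differ, SciPy's `pdist` computes several of them directly. The two points below are made up purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist

# two illustrative points in 3-D space
X = np.array([[1.0, 0.0, 0.0],
              [3.0, 4.0, 0.0]])

print(pdist(X, metric='euclidean'))       # straight-line distance: sqrt(20) ~ 4.472
print(pdist(X, metric='cityblock'))       # Manhattan distance: 2 + 4 = 6.0
print(pdist(X, metric='minkowski', p=3))  # generalizes both (p=2 is Euclidean, p=1 is Manhattan)
print(pdist(X, metric='cosine'))          # 1 - cosine similarity: 0.4
```

Note how the Manhattan distance, being a sum of absolute coordinate differences, is always at least as large as the Euclidean distance between the same pair of points.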

Further, as more data types are introduced, the distance calculation can become more subjective. Binary or categorical variables can be handled with specific metrics like

• dice
• yule
• kulsinski
• russellrao

Dice distance between two binary vectors can be calculated as d = (b + c) / (2a + b + c), where a counts the positions where both vectors are 1, and b and c count the two kinds of mismatch. It is here, with a mixture of data types available, that we start to consider composite measures that combine selected metrics.
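As a quick sanity check on a pair of made-up binary vectors, the mismatch-count formulation of Dice agrees with SciPy's implementation:

```python
import numpy as np
from scipy.spatial.distance import dice

u = np.array([1, 1, 0, 1, 0], dtype=bool)
v = np.array([1, 0, 0, 1, 1], dtype=bool)

a = np.sum(u & v)    # both 1: a = 2
b = np.sum(u & ~v)   # 1 in u, 0 in v: b = 1
c = np.sum(~u & v)   # 0 in u, 1 in v: c = 1
manual = (b + c) / (2 * a + b + c)

print(manual, dice(u, v))  # both 0.333...
```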

A lot of recent work has focused on similarity between text features, which is emerging as an altogether different area of research. It isn't covered here, but interested readers are encouraged to explore it.

## 1.3 Mixed Variable Distance measures

In June 2020, Sudha Bishnoi and BK Hooda surveyed a wide variety of methods and published their findings in the International Journal of Chemical Studies. We will mostly be looking at one of the classical measures: Gower distance.

Gower distance is a quantitative measure of the dissimilarity between two rows of a dataset consisting of mixed-type attributes. It uses the concept of

• Manhattan distance for continuous variables, and
• Dice distance for binary variables.

Let’s see how we go about calculating this measure.
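Before reaching for a library, the idea can be sketched by hand on a toy frame. The columns and values below are made up for illustration; numeric columns use a range-normalized Manhattan distance, and categorical columns a simple match/mismatch score standing in for Dice:

```python
import numpy as np
import pandas as pd

def gower_pair(df, i, j):
    """Toy Gower distance between rows i and j: average of per-feature distances."""
    dists = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # numeric: absolute difference, normalized by the column's range
            rng = df[col].max() - df[col].min()
            d = abs(df[col].iloc[i] - df[col].iloc[j]) / rng if rng else 0.0
        else:
            # categorical: 0 if the values match, 1 otherwise
            d = 0.0 if df[col].iloc[i] == df[col].iloc[j] else 1.0
        dists.append(d)
    return float(np.mean(dists))

toy = pd.DataFrame({'age': [20, 30, 40],
                    'os': ['win10', 'win10', 'win8']})

print(gower_pair(toy, 0, 1))  # |20-30|/20 = 0.5, os matches -> mean 0.25
print(gower_pair(toy, 0, 2))  # |20-40|/20 = 1.0, os differs -> mean 1.0
```

The averaging over features is what lets Gower mix data types: every feature contributes a value in [0, 1], regardless of its type.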

## 2.0 Data Introduction - Create data from MS malware data

Let us start by reading the Microsoft malware data from Kaggle. We will be using only a fraction of this data for the analysis, filtered on product name: we restrict ourselves to the 'mse' product.

```python
import pandas as pd

filename = 'C:/Users/PruthiR/Documents/2021/Scarecrow/train/train.csv'

## read a minimal set of columns (all rows) to locate the rows of interest
trial = pd.read_csv(filename, usecols=['ProductName', 'MachineIdentifier'])

## get the index of rows to read, based on product name
index_to_read = trial[trial['ProductName'] == 'mse'].index

## data row i sits at file row i + 1 because of the header line
rows_to_keep = set(index_to_read + 1)
mse = pd.read_csv(filename, header=None,
                  skiprows=lambda x: x not in rows_to_keep)

## get column names from the first few rows and apply them
col_names = pd.read_csv(filename, nrows=5).columns
mse.columns = col_names

## sample a small fraction of the data
work_data = mse.sample(frac=0.02, random_state=42)

## delete the unused dataframes; garbage collection may help performance here
del trial, mse, col_names
```
```python
work_data.dtypes.value_counts()
```
float64    36
object     30
int64      17
dtype: int64

```python
int(work_data.shape[0] ** 0.5 / 2)
```
21


It looks like we have 30 object columns and 53 numerical columns.

## 2.1 When is an ‘object’ a ‘category’?

Not all objects are actually categorical. A column should be treated as categorical only if its values repeat often enough for an algorithm to learn from them. Identifiers masquerading as categories, for instance, may end up making the process noisier.

Rule of thumb: a column should be treated as a category only if its number of unique values is less than half the square root of the number of rows available; otherwise it should be dropped.

Of course, this threshold is a hyperparameter and should be iterated on, but a good starting point often helps accelerate its optimization.

```python
## subset to get object columns in a separate dataframe
work_data_object = work_data.select_dtypes(include=['object'])

## use the number of rows to define an upper limit for category definition
cap_cat = int(work_data_object.shape[0] ** 0.5 / 2)

## number of unique values of each object column, to decide on category conversion
unique_val_df = pd.DataFrame(work_data_object.describe().iloc[1, :])

## high-cardinality columns; the cap_cat threshold can be user defined / a hyperparameter
cols_high_cardinality = list(unique_val_df[unique_val_df['unique'] > cap_cat].index)

## remove high-cardinality columns
work_data_sub = work_data.drop(columns=cols_high_cardinality)

## drop NA columns for now; could be improved with a better imputation scheme
work_data_sub_clean = work_data_sub.dropna(axis=1, how='any')

## convert object columns to category dtype
trial = pd.concat([
    work_data_sub_clean.select_dtypes(exclude=['object']),
    work_data_sub_clean.select_dtypes(['object']).astype('category')
], axis=1).reindex(work_data_sub_clean.columns, axis=1)

## check dtype distribution in the clean data
trial.dtypes.value_counts()
```
int64       17
float64      2
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
category     1
dtype: int64


## 2.2 Trying out Gower Distance

Gower distance uses Manhattan for calculating distance between continuous datapoints and Dice for calculating distance between categorical datapoints.

This is based on an implementation of the classical 1971 statistical work that tackled the mixed data types challenge in distance metrics:

https://www.jstor.org/stable/2528823?seq=1

Improvements on the original formulation have also been suggested and are worth a look, for example "Modifications of the Gower Similarity Coefficient" (Applications of Mathematics and Statistics in Economics conference, October 2016).

## 2.3 Calculate the Gower matrix and find the top neighbors

This is a square matrix, where the value at (i, j) represents the dissimilarity between datapoints i and j, with 0 meaning exactly the same.

We use the gower Python library, which normalizes the features internally.

```python
import numpy as np
import gower

## function which uses the gower library
def get_gower_dist(dataset, target_col):
    ## remove the target column before calculating distances
    data_to_model = dataset.drop(columns=target_col)
    ## convert to a numpy array
    X = np.asarray(data_to_model)
    ## calculate the distance matrix
    distmat = gower.gower_matrix(X)
    distmat_df = pd.DataFrame(distmat)
    return distmat_df

def get_neighbors_gower(dataset, row, neighbors, target='HasDetections'):
    dist_mat = get_gower_dist(dataset, target)
    ## get the closest points to the given row
    interest_points = list(dist_mat[row].sort_values()[0:neighbors].index)
    ## subset the data for those points
    dataset_sub = dataset.iloc[interest_points, :]
    return dataset_sub
```

With this, you can now get the 5 closest neighbors of any datapoint, as below.

```python
get_neighbors_gower(trial, 5, 5)
```

ProductName IsBeta IsSxsPassiveMode HasTpm CountryIdentifier GeoNameIdentifier LocaleEnglishNameIdentifier Platform Processor OsVer ... Census_OSUILocaleIdentifier Census_OSWUAutoUpdateOptionsName Census_IsPortableOperatingSystem Census_GenuineStateName Census_ActivationChannel Census_FlightRing Census_IsSecureBootEnabled Census_IsTouchEnabled Census_IsPenCapable HasDetections
43194 win8defender 0 0 1 155 201.0 231 windows10 x64 10.0.0.0 ... 34 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
54597 win8defender 0 0 1 166 167.0 227 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
39773 win8defender 0 0 1 166 167.0 227 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
23288 win8defender 0 0 1 141 167.0 227 windows10 x64 10.0.0.0 ... 34 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 1
1126 win8defender 0 0 1 81 107.0 224 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 1

5 rows × 38 columns

The whiter a cell is, the closer the two points are. For this plot, only a two-color range has been used, to keep it on a simple 'similar' vs 'different' scale.

```python
import matplotlib.pyplot as plt
import seaborn as sns

## create the distance matrix
df = get_gower_dist(trial, 'HasDetections')

## make it discrete: two quantile-based bins per column
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.to_numeric(
        pd.qcut(df[col], 2, labels=list(range(2)))
    )

plt.figure(figsize=(20, 10))
sns.heatmap(df_q, cmap='Blues')
plt.show()
```
For row 5, looking at its 5 closest neighbors across all the data, there is a 40% incidence of malware (2 of the 5 have HasDetections = 1).

## 2.4 Flexible Gower distance calculation

As we talked about earlier, each data type has multiple options for calculation of distance metrics. The intention of this section is to enable combination of different methods, to give a flexible distance metric.

```python
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial.distance import pdist, squareform
import numpy as np
```
```python
def get_dist_mixed(dataset, target='HasDetections'):
    ## feature weights could be learnt from user behaviour or background models
    ## and fed into the distance metric to improve recommendations over time

    ## remove the target variable
    dataset = dataset.drop(columns=target)

    ## subset to numerical features
    num_feat = dataset.select_dtypes(include=['int64', 'float64'])

    ## scale to a 0-1 range
    scaler = MinMaxScaler()
    scaled_num_feat = scaler.fit_transform(num_feat)

    ## pairwise Euclidean distance - can be changed to 'cityblock' (Manhattan) here
    dist_mat_num = squareform(pdist(scaled_num_feat, metric='euclidean'))

    ## subset to categorical features, one-hot encode, and calculate the Dice metric
    cat_feat = pd.get_dummies(dataset.select_dtypes(include=['category']))
    dist_mat_cat = squareform(pdist(cat_feat.astype(bool), metric='dice'))

    ## combine both matrices and rescale
    dist_mat_comb_df = pd.DataFrame(scaler.fit_transform(dist_mat_cat + dist_mat_num))
    return dist_mat_comb_df

def get_neighbors(dataset, row, neighbors):
    ## get the top-n closest points
    interest_points = list(get_dist_mixed(dataset)[row].sort_values()[0:neighbors].index)
    ## subset for the top-n points
    dataset_sub = dataset.iloc[interest_points, :]
    return dataset_sub
```
```python
get_neighbors(trial, 5, 5)
```

ProductName IsBeta IsSxsPassiveMode HasTpm CountryIdentifier GeoNameIdentifier LocaleEnglishNameIdentifier Platform Processor OsVer ... Census_OSUILocaleIdentifier Census_OSWUAutoUpdateOptionsName Census_IsPortableOperatingSystem Census_GenuineStateName Census_ActivationChannel Census_FlightRing Census_IsSecureBootEnabled Census_IsTouchEnabled Census_IsPenCapable HasDetections
43194 win8defender 0 0 1 155 201.0 231 windows10 x64 10.0.0.0 ... 34 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
69148 win8defender 0 0 1 155 201.0 231 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 1
54597 win8defender 0 0 1 166 167.0 227 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
39773 win8defender 0 0 1 166 167.0 227 windows10 x64 10.0.0.0 ... 35 Notify 0 IS_GENUINE OEM:DM Retail 1 0 0 0
22758 win8defender 0 0 1 141 167.0 227 windows10 x64 10.0.0.0 ... 34 UNKNOWN 0 IS_GENUINE OEM:DM Retail 1 0 0 1

5 rows × 38 columns

```python
## create the distance matrix
df = get_dist_mixed(trial, 'HasDetections')

## make it discrete: two quantile-based bins per column
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.to_numeric(
        pd.qcut(df[col], 2, labels=list(range(2)))
    )

plt.figure(figsize=(20, 10))
sns.heatmap(df_q, cmap='Blues')
plt.show()
```
## 3.1 Weighted distances

1. Can we weight features based on a model running in the background?
2. Is it possible to also mix in text data types? Or images? That would move us even further into mixed data types.
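On the first question, one lightweight sketch: SciPy's `pdist` accepts a `w` argument on many metrics, so feature importances from a background model could scale each feature's contribution to the distance. The weights below are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [1.0, 1.0]])

# hypothetical importances, e.g. from a background model's feature_importances_
w = np.array([0.9, 0.1])

unweighted = squareform(pdist(X, metric='cityblock'))
weighted = squareform(pdist(X, metric='cityblock', w=w))

print(unweighted[0, 1], weighted[0, 1])  # 2.0 vs 0.9*1 + 0.1*1 = 1.0
```

The same idea would slot into `get_dist_mixed` above by passing the numeric-feature weights to `pdist`, and weighting the categorical matrix separately before combining.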