1.0 What is distance?
Basically, distance means a quantitative measure of how far apart two objects are in space. I am sure you know that. Spatial distance has been part of human life in perception since time eternal, and has been quantified and standardized across world over last few thousand years at least.
Another way to think of this, particularly in terms of statistics, is dissimilarity. Closer means similar, and further means dissimilar.
1.1 What is the role of distance in machine learning?
Distance measures play an important role in machine learning. This similarity or distance is the very basic building block for activities such as
 Recommendation engines,
 Clustering,
 Different classification problems  like Email spam or ham classification problems
1.2 Practical aspects of using distance metrics
Most distance measures were built for numerical features. In the simple two dimensional world, the simplest form of distance is the well known  as the raven flies  Euclidean distance.
Straight line between two points calculated using the pythagorean formulation [Reference Image from Wikipedia]
However, as the data becomes multi dimensional, the question of optimal calculation inherently pops up. With the curse of dimensionality, the manhattan or cityblock distance becomes a better choice. The paper link goes into some detail on the subject
There are many more distance metrics available to choose from
 minkowski
 correlation
 cosine
 mahalanobis
For a detailed view on these please refer [2]
Further, as more data types are introduced, the distance calculation can become more subjective. Binary or categorical variables can be handled with specific metrics like
 dice
 yule
 kulsinski
 russelrao
Dice Distance can be calculated using the below formulation
It is here with a mixture of the data types available, when we start to consider composite measures that focus on combining selected metrics.
A lot of the recent work has focused on similarity in text features, which is emerging as an altogether different area of research. Although it isn’t covered here, interested readers are encouraged to read here an image from this is reproduced below.
1.3 Mixed Variable Distance measures
[3] In June 2020, Sudha Bishnoi and BK Hooda surveyed a wide variety of methods and published their findings in International Journal of Chemical studies. The below is a snapshot from their summary. We will mostly be looking at one of the classical measures  Gower Distance
Gower disatance is a quantitative measure of the similarity between two rows of a dataset consisting of mixed type attributes. It uses the concept of
 Manhattan distance for continuous variables and
 dice distance for measuring similarity between Binary variables.
Let’s see how we go about calculating this measure.
2.0 Data Introudction  Create data from MS malware data
Let us start by reading the microsoft malware data from KAGGLE. We will be using only a fraction of this data for the analysis. This is currently based on a product name. We are restricting our selves to ‘mse’ product




float64 36
object 30
int64 17
dtype: int64


21
It looks like we have 30 object types, and 53 numerical data types.
2.1 When is an ‘object’ a ‘category’?
Not all objects are actually categorical. Only if a type of object has sufficient repitiitions to be captured by an algorithm, should it be included in the model. Identifiers for instance masquerading as categories may end up making the process noisier.
Thumb rule The number of unique values of a category should be less than half of the square root of the rows available for it to be categorized, otherwise it should be dropped.
Of course, this is a hyper parameter and should be iterated, but often a good starting point can help accelerate the optimization of this parameter.


int64 17
float64 2
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
dtype: int64
2.2 Trying out Gower Distance
Gower distance uses Manhattan for calculating distance between continuous datapoints and Dice for calculating distance between categorical datapoints.
This is based on implementation of the classical 1971 statistical work for mixed data types challenge in distance metrics in economics.
https://www.jstor.org/stable/2528823?seq=1
Improvements in the original implementation have been suggested as well which can be looked into. example  Modifications of the Gower Similarity Coefficient, October 2016, Conference: Applications of Mathematics and Statistics in Economics 2016
2.3 Calculate gower matrix and calculate top neighbors
This is a square matrix, where Value(i, j) represents the similarity between i and j datapoint 0 being exactly same
Using gower library. Read about the library here. There is inherent normalization in the library directly.


With this, you can now get the closest 5 neighbors to the data, see below.


ProductName  IsBeta  IsSxsPassiveMode  HasTpm  CountryIdentifier  GeoNameIdentifier  LocaleEnglishNameIdentifier  Platform  Processor  OsVer  ...  Census_OSUILocaleIdentifier  Census_OSWUAutoUpdateOptionsName  Census_IsPortableOperatingSystem  Census_GenuineStateName  Census_ActivationChannel  Census_FlightRing  Census_IsSecureBootEnabled  Census_IsTouchEnabled  Census_IsPenCapable  HasDetections  

43194  win8defender  0  0  1  155  201.0  231  windows10  x64  10.0.0.0  ...  34  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  0 
54597  win8defender  0  0  1  166  167.0  227  windows10  x64  10.0.0.0  ...  35  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  0 
39773  win8defender  0  0  1  166  167.0  227  windows10  x64  10.0.0.0  ...  35  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  0 
23288  win8defender  0  0  1  141  167.0  227  windows10  x64  10.0.0.0  ...  34  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  1 
1126  win8defender  0  0  1  81  107.0  224  windows10  x64  10.0.0.0  ...  35  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  1 
5 rows × 38 columns
The whiter it is, the closer the points are. For this plot, only 2 color range has been used to keep it as a ‘similar’ and ‘different’ scale.


<matplotlib.axes._subplots.AxesSubplot at 0x4d155348>
For row 5, looking at all the data, if we see the top 5 close neighbors, there is a 40% favor for it having a malware.
2.4 Flexible Gower distance calculation
As we talked about earlier, each data type has multiple options for calculation of distance metrics. The intention of this section is to enable combination of different methods, to give a flexible distance metric.






ProductName  IsBeta  IsSxsPassiveMode  HasTpm  CountryIdentifier  GeoNameIdentifier  LocaleEnglishNameIdentifier  Platform  Processor  OsVer  ...  Census_OSUILocaleIdentifier  Census_OSWUAutoUpdateOptionsName  Census_IsPortableOperatingSystem  Census_GenuineStateName  Census_ActivationChannel  Census_FlightRing  Census_IsSecureBootEnabled  Census_IsTouchEnabled  Census_IsPenCapable  HasDetections  

43194  win8defender  0  0  1  155  201.0  231  windows10  x64  10.0.0.0  ...  34  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  0 
69148  win8defender  0  0  1  155  201.0  231  windows10  x64  10.0.0.0  ...  35  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  1 
54597  win8defender  0  0  1  166  167.0  227  windows10  x64  10.0.0.0  ...  35  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  0 
39773  win8defender  0  0  1  166  167.0  227  windows10  x64  10.0.0.0  ...  35  Notify  0  IS_GENUINE  OEM:DM  Retail  1  0  0  0 
22758  win8defender  0  0  1  141  167.0  227  windows10  x64  10.0.0.0  ...  34  UNKNOWN  0  IS_GENUINE  OEM:DM  Retail  1  0  0  1 
5 rows × 38 columns


<matplotlib.axes._subplots.AxesSubplot at 0x2b462548>
3.0 Further steps
3.1 Weighted distances
 Can we give features weightage based on a model in the background?
 Is it possible to mix data types related to text as well? Or images? Move further to mixed data types.