Data driven organization design

Author details

Rohit Pruthi, Decision Scientist @ R2DL

References

  1. AI for HR - LinkedIn Learning course
  2. NetworkX library

Why data driven organization design

We all know the collective power of a group. Stories abound across the world of teams accomplishing goals, be it in sports, business or politics. Over time, we have come to recognize this power instinctively.

However, building effective teams has remained more of an art than a science, and it is considered a valued leadership skill. One has to understand the interaction dynamics between team members, both within and across teams, and weigh how responsibilities and work overlap. Is it more beneficial to group by project, or by technology?
It is something that requires intelligence and consideration combined with experience, and it is difficult to get right. Perhaps, in the modern world, we can look at the data around us to help with this.

Here, I started looking at the communication patterns among team members, with the intent of understanding whether any of them can be used for structuring the organization.

Further reading list

Rupert Morrison has written a book on this evolving subject, which I plan to read at some point in the near future. If you have read the book, or anything else on this subject, please reach out. Book link

Library set up

## Library installation - Networkx & matplotlib

import sys
!conda install --yes --prefix {sys.prefix} networkx matplotlib
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.
import networkx as nx
import matplotlib.pyplot as plt

from csv import reader
import pandas as pd

A note on networkx [2]

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

It provides the following components for complex networks (a minimal example follows the list):

  • Data structures for graphs, digraphs, and multigraphs
  • Many standard graph algorithms
  • Network structure and analysis measures
  • Generators for classic graphs, random graphs, and synthetic networks
  • Nodes can be “anything” (e.g., text, images, XML records)
  • Edges can hold arbitrary data (e.g., weights, time-series)
  • Open source 3-clause BSD license
  • Well tested with over 90% code coverage
  • Additional benefits from Python include fast prototyping, ease of teaching, and multi-platform support
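
As a minimal illustration of the last few points, nodes can be plain strings and edges can carry arbitrary attribute data. The names below are made up purely for the example.

# Minimal sketch: string nodes, edges with arbitrary attribute data
g = nx.Graph()
g.add_edge('Asha', 'Bruno', weight=3, channel='chat')
g.add_edge('Bruno', 'Chen', weight=1)

print(list(g.nodes()))               # ['Asha', 'Bruno', 'Chen']
print(list(g.edges(data=True)))      # edges with their attribute dictionaries
print(g['Asha']['Bruno']['weight'])  # 3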

Quick look at data

print(pd.read_csv('chat_groups.csv', encoding='utf-8-sig', header=None).head())
      0        1    2    3    4
0  Ravi    Geeta  NaN  NaN  NaN
1  Ravi  Sushant  NaN  NaN  NaN
2  Ravi    Sudha  NaN  NaN  NaN
3  Ravi    Geeta  NaN  NaN  NaN
4  Ravi  Sushant  NaN  NaN  NaN

It looks like this is chat group data: each chat is recorded as a row, and with five columns a conversation can involve up to five participants. It is also clear that only the connections are represented here, nothing about the content of the chats.

A few attributes that could enrich this data further are

  • time spent in each chat
  • issues discussed
  • conversation intensity
  • number of smileys used

All of this is experimentally generated data, not real information. Moreover, collecting further data and information of this kind requires a careful understanding of privacy concerns, which is only the first of the challenges faced by data driven organization design.

## Challenge 1 - data privacy

Back to the analysis

Preparing Network Data

A network is a way of representing relationships: nodes stand for entities (here, employees) and edges for the relationships between them (here, chat interactions). There is an entire field around networks that deserves more than I can offer here, so it is recommended to go through the graph theory and network analysis literature for details.

We will be using networkx as the base library, which requires some data engineering on the current form of the data before we can do some exploration.

Basically, the aim is to arrive at a data structure with

  • ‘to’,
  • ‘from’ and
  • ‘frequency’,

which can then be used to create a network representation.

#Input file with one record per chat collaboration
chat_csv = "chat_groups.csv"

#Data frame to store employee pairs
employee_pairs = pd.DataFrame(columns=['First', 'Second', 'Count'])

#Read file and extract pairs and weights
with open(chat_csv, 'r', encoding="utf-8-sig") as read_obj:

    # Pass the file object to reader() to get the reader object
    csv_reader = reader(read_obj)

    # Iterate over each row (one chat) in the csv
    for row in csv_reader:
        # Sort by employee name so each pair is always stored in the same order
        row.sort()
        # Keep only valid (non-empty) names
        filtered_row = [emp for emp in row if len(emp) > 0]

        # Generate employee pairs

        # Iterate for the first employee
        for i in range(0, len(filtered_row) - 1):
            # Iterate for the second employee
            for j in range(i + 1, len(filtered_row)):

                first = filtered_row[i]
                second = filtered_row[j]

                # Create the pair record: if it already exists,
                # update the count; if not, create it
                curr_rec = employee_pairs[
                                (employee_pairs['First'] == first)
                                & (employee_pairs['Second'] == second)]

                if curr_rec.empty:
                    new_df = pd.DataFrame([{'First': first,
                                            'Second': second,
                                            'Count': 1}])
                    # DataFrame.append was removed in pandas 2.x; pd.concat works everywhere
                    employee_pairs = pd.concat([employee_pairs, new_df],
                                               ignore_index=True)
                else:
                    # Increment the count on the existing record
                    employee_pairs.loc[curr_rec.index[0], 'Count'] += 1

#print(employee_pairs)
print(employee_pairs)
    First   Second Count
0   Geeta     Ravi     7
1    Ravi  Sushant     4
2    Ravi    Sudha     8
3   Geeta    Sudha     4
4   Geeta  Sushant     4
5   Sudha  Sushant     3
6   Geeta     Mike     5
7    Mike     Ravi     5
8    Mike    Sudha     3
9    Mike  Sushant     3
10   Lisa     Ravi     6
11   Lisa    Mason     7
12   Lisa     Mike     2
13   Lisa    Sudha     3
14  Mason     Mike     2
15  Mason     Ravi     2
16  Mason    Sudha     2
17  Geeta     Lisa     2
18  David    Geeta     1
19  David     Lisa     6
20  David     Ravi     1
21  David    Mason     4
22  David    Sofia     3
23   Lisa    Sofia     4
24  Mason    Sofia     3
25   Ravi    Sofia     1
26  Sofia    Sudha     1

There are 27 unique pairs in the data and 9 employees.
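
As an aside, the pair-counting loop above can be written more compactly with itertools.combinations and collections.Counter; a sketch that should yield the same counts on the same chat_groups.csv:

from itertools import combinations
from collections import Counter

# Count every unordered pair of participants, chat by chat
pair_counts = Counter()
with open('chat_groups.csv', 'r', encoding='utf-8-sig') as read_obj:
    for row in reader(read_obj):
        names = sorted(emp for emp in row if emp)   # drop empty cells, keep a stable order
        pair_counts.update(combinations(names, 2))

employee_pairs_alt = pd.DataFrame(
    [(first, second, count) for (first, second), count in pair_counts.items()],
    columns=['First', 'Second', 'Count'])
#print(employee_pairs_alt)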

Create and visualize the network

Let’s see how the employees are connected to each other within this team. For this, we first create a graph structure.

#Create a networkX graph
graph_emps  = nx.Graph()

#Add edges based on the dataframe (nodes get added automatically)
for i,row in employee_pairs.iterrows():
    graph_emps.add_edge(row['First'],  
                        row['Second'],   
                        weight=row['Count'])



We can summarize a graph by printing its information.

#Print network summary
#(nx.info was removed in NetworkX 3.x; print(graph_emps) gives a similar summary there)
print("Network summary: \n-----------------\n", nx.info(graph_emps))

Network summary: 
-----------------
 Name: 
Type: Graph
Number of nodes: 9
Number of edges: 27
Average degree:   6.0000

Average degree is the average number of connections per node. Each edge contributes to the degree of two nodes, so in a graph with 27 edges and 9 nodes the average degree is 2 × 27 / 9 = 6.
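
These numbers can also be read back from the graph object directly, as a quick sanity check of the formula average degree = 2 × edges / nodes:

# Sanity check: average degree = 2 * edges / nodes
n_nodes = graph_emps.number_of_nodes()
n_edges = graph_emps.number_of_edges()
print(n_nodes, n_edges, 2 * n_edges / n_nodes)   # expected: 9 27 6.0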

Let us try to visualize this graph.

Graph layouts

Networks can be plotted in a wide variety of layouts. A layout is essentially the positioning of the nodes. In a spatial graph, for example a map, the positioning is governed by, say, latitude and longitude.

In an abstract graph like this one, however, there is no such natural positioning. It might be a good idea to add the physical location of each employee in the office layout as a reference as well, given it should be a factor in the design.

In general, graph layout is a hard problem and there are algorithms devoted specifically to this area.

A few of the layout types that networkx provides are

"bipartite_layout",
"circular_layout",
"kamada_kawai_layout",
"random_layout",
"rescale_layout",
"rescale_layout_dict",
"shell_layout",
"spring_layout",
"spectral_layout",
"planar_layout",
"fruchterman_reingold_layout",
"spiral_layout",
"multipartite_layout",
nx.draw_kamada_kawai(graph_emps, with_labels = True)

(Figure: employee network, Kamada-Kawai layout)

nx.draw_spring(graph_emps, with_labels = True)

(Figure: employee network, spring layout)

nx.draw_circular(graph_emps, with_labels = True)

(Figure: employee network, circular layout)

pos = nx.fruchterman_reingold_layout(graph_emps)

nx.draw_networkx(graph_emps, pos)

(Figure: employee network, Fruchterman-Reingold layout)

A few layout methods and examples are included above. For most simple purposes the spring layout works fine, but exploring layouts suited to the problem at hand can be particularly rewarding.
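
One practical note: spring_layout and fruchterman_reingold_layout are randomized, so repeated plots of the same graph place the nodes differently. Passing a seed and reusing the resulting position dictionary keeps the pictures comparable across plots, for example:

# Fix the seed so the layout is reproducible, then reuse the positions
pos = nx.spring_layout(graph_emps, seed=42)

nx.draw_networkx(graph_emps, pos, with_labels=True)
plt.axis('off')
plt.show()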

In this case, for example, the physical locations of colleagues could be an important indicator: people sitting next to each other may not use chat as a medium of communication at all, which leads to the second challenge of data driven organization design.

## Challenge 2 - multi-type data, latent communications

Again, back to the existing sample dataset

Visualizing employee graph

It might be useful to add dimensions to the chart, for example through the colour and thickness of edges. In the code below, cohesion (how often a pair chats) determines the edge colour and width.

# Create different types of edges based on their cohesion

#Pairs with Count > 5 for high cohesion
elarge = [(x1, x2) for (x1, x2, data) in graph_emps.edges(data=True) 
          if data['weight'] > 5]

#Pairs with Count between 4 and 5 for medium cohesion
emedium = [(x1, x2) for (x1, x2, data) in graph_emps.edges(data=True) 
          if  3 < data['weight'] <= 5]

#Pairs with Count less than 4 for low cohesion
esmall = [(x1, x2) for (x1, x2, data) in graph_emps.edges(data=True) 
          if data['weight'] <= 3]

pos = nx.spring_layout(graph_emps)  # positions for all nodes

## Setup the Graph
# nodes
nx.draw_networkx_nodes(graph_emps, pos, 
                       node_size=700,
                       node_color='orange')


nx.draw_networkx_edges(graph_emps, pos, 
                       edgelist=elarge,
                       width=6,
                       edge_color='blue')

nx.draw_networkx_edges(graph_emps, pos, 
                       edgelist=emedium,
                       width=4,
                       edge_color='green')

nx.draw_networkx_edges(graph_emps, pos, 
                       edgelist=esmall,
                       width=2, 
                       edge_color='gray')

# labels
nx.draw_networkx_labels(graph_emps, 
                        pos, 
                        font_size=16)


plt.axis('off')
plt.show();

(Figure: employee network with edge colour and width indicating cohesion)

What the above plot tells us (a quick cross-check against the raw edge weights follows the list):

  • Ravi is closely connected to Sudha, Lisa and Geeta
  • Lisa in turn is strongly connected to Mason and David, and moderately to Sofia
  • Sushant is moderately connected to Ravi and Geeta, and weakly connected to Sudha and Mike
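
These readings can be verified directly from the weighted graph rather than the picture; a small sketch that lists each employee's connections, strongest first:

# For each employee, list connections sorted by chat count (strongest first)
for emp in sorted(graph_emps.nodes()):
    neighbours = sorted(graph_emps[emp].items(),
                        key=lambda item: item[1]['weight'],
                        reverse=True)
    print(emp, '->', ', '.join(f"{other} ({data['weight']})"
                               for other, data in neighbours))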

Analyzing the network

There are various metrics associated with networks that can be explored. The code and results below show

  • Degree Centrality
  • Betweenness Centrality
  • Clustering coefficient

The first two measure the importance of a particular node to the network: degree centrality reflects how many direct connections a node has, while betweenness centrality reflects how often a node lies on the shortest paths between other nodes.

#Function to sort a dictionary by value (descending) and print it
def sort_dict(d):
    sorted_items = sorted(d.items(), key=lambda x: x[1], reverse=True)

    for key, value in sorted_items:
        print(key, " = ", value)


#Find centrality of nodes
print("\nDegree Centrality :\n---------------")
sort_dict(nx.degree_centrality(graph_emps))

print("\nBetweenness:\n--------------")
sort_dict(nx.betweenness_centrality(graph_emps))

Degree Centrality :
---------------
Ravi  =  1.0
Sudha  =  0.875
Lisa  =  0.875
Geeta  =  0.75
Mike  =  0.75
Mason  =  0.75
David  =  0.625
Sofia  =  0.625
Sushant  =  0.5

Betweenness:
--------------
Ravi  =  0.09761904761904762
Sudha  =  0.06369047619047619
Geeta  =  0.04285714285714285
Lisa  =  0.041071428571428564
Mike  =  0.02797619047619047
Mason  =  0.024999999999999998
David  =  0.01607142857142857
Sofia  =  0.007142857142857143
Sushant  =  0.0

A high score implies that these nodes (Ravi, followed by Sudha and Lisa) play an important role in the flow of information.

In social network analysis, one of the fields in which network theory has gained a lot of traction, another measure of importance is clustering.

The clustering coefficient metric differs from measures of centrality. It is more akin to the density metric for whole networks, but focused on egocentric networks.

If your “friends” (connections in this case) all know each other, you have a high clustering coefficient. If your “friends” (connections) don’t know each other, then you have a low clustering coefficient.

A high clustering coefficient means you play a less critical role in the communication network, since all of your connections also communicate directly with each other. Conversely, in our case a low clustering coefficient flags the roles that should be looking after cross-team communication.
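
A toy example makes the intuition concrete: in a triangle, a node's two contacts know each other (coefficient 1), while in a star, the hub's contacts never talk to one another (coefficient 0).

# Triangle: A's contacts (B, C) are themselves connected -> coefficient 1
triangle = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C')])

# Star: Hub's contacts (X, Y, Z) are not connected to each other -> coefficient 0
star = nx.Graph([('Hub', 'X'), ('Hub', 'Y'), ('Hub', 'Z')])

print(nx.clustering(triangle, 'A'))   # 1.0
print(nx.clustering(star, 'Hub'))     # 0.0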

#clustering - how close a team they form
print("\nClustering Co-efficient:\n----------------------")
sort_dict(nx.clustering(graph_emps))
Clustering Co-efficient:
----------------------
Sushant  =  1.0
Sofia  =  0.9
Mike  =  0.8
Mason  =  0.8
David  =  0.8
Lisa  =  0.7619047619047619
Geeta  =  0.7333333333333333
Sudha  =  0.7142857142857143
Ravi  =  0.6785714285714286

Source of errors

Since the contents of the chats, the experience of team members and communications outside this data set are not part of this assessment, a distinct possibility emerges: if Ravi, for instance, were a new entrant to the team who is still getting acquainted with the team members, that would produce a similar graph. Of course, limited data leads to limited insight, which is the final challenge of data driven organization design.

## Challenge 3 - is it possible to have all the data?