Text analysis introduction
   7 min read    Praveen D, Rohit P

0.0 The tangentially relevant cartoon

(cartoon image)

1.0 Relevance of text data

The written word is all around us. In any organisation, large or small, there is always a component of the data that is textual. It could be

  • reports from inspections
  • comments from customers
  • tweets regarding a product
  • search history of a user
  • comments from employees
  • email patterns of clients
  • logs generated by machines
  • logs generated by logs generated by machines

There is knowledge in text, but like Dilbert says, maybe it comes with strings attached.

Text mining, the use of data analysis tools to analyse textual information, is a mature field. In this post we will understand how to

  • read and transform text data for use
  • extract n-grams and plot them
  • perform topic modeling on the data

2.0 The libraries and infrastructure

The infrastructure for text analysis is quite well developed in the Python ecosystem. Here is a quick background of some of the available libraries:

  • We recommend NLTK only as an education and research tool. Its modularized structure makes it excellent for learning and exploring NLP concepts, but it’s not meant for production.
  • TextBlob is built on top of NLTK, and it’s more easily accessible. This is our favorite library for fast prototyping or building applications that don’t require highly optimized performance. Beginners should start here.
  • Stanford’s CoreNLP is a Java library with Python wrappers. It’s in many existing production systems due to its speed.
  • SpaCy is an NLP library that’s designed to be fast, streamlined, and production-ready. It’s not as widely adopted, but if you’re building a new application, you should give it a try.
  • Gensim is most commonly used for topic modeling and similarity detection. It’s not a general-purpose NLP library, but for the tasks it does handle, it does them well.

There are more, like pycaret and scikit-learn, which offer NLP sub-modules, and more specialized ones like Quepy and the excellent pyLDAvis.

3.0 Scope

In this notebook, we mostly focus on

  • Word cloud representations of n-grams
  • Topic modeling
  • Setup for the application with Streamlit

4.0 Let’s begin

4.1 Importing relevant libraries

There are a lot of NLP libraries in the Python ecosystem. Read here for details on some of them.

In this work, we will mostly be looking at NLTK, scikit-learn and pycaret.

import nltk ## the standard natural language toolkit

nltk.download('stopwords')

import re ## regex
import numpy as np ## numpy for matrix operations
import pandas as pd ## pandas for dataframe operations

import string ## string manipulations

# Wordcloud
from wordcloud import WordCloud, STOPWORDS


import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

from sklearn.feature_extraction.text import CountVectorizer ## scikit-learn for turning text into count vectors

from collections import Counter ## counter

from nltk.util import ngrams ## to compute two/three-word combinations
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PruthiR\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
from pycaret.nlp import * ##pycaret for autoML experiment
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

4.2 Read and explore data

Note here that the file path setup can later be replaced easily with content from a Streamlit file uploader. This simple step aligns us to deployment with Streamlit. The data is an open collection of news items related to aviation from the Telegraph, Flight Global and Reuters, from 2018.
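For reference, a minimal sketch of how the same read could be wired to a Streamlit uploader (app code for later deployment, not part of this notebook; the widget label and variable names are assumptions):

import streamlit as st ## streamlit for the app front end
import pandas as pd

## st.file_uploader returns a file-like object that pd.read_csv can consume directly
uploaded = st.file_uploader('Upload the news CSV', type='csv')

if uploaded is not None:
    news_data = pd.read_csv(uploaded, encoding='unicode_escape')
    st.dataframe(news_data.head())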

w = r'C:\Users\PruthiR\Documents\NLP\AMI\ami\user\manojk\AMI_Data.csv' ##set the path
def read_data(w): ## read the csv file at path w
    
    dataread = pd.read_csv(w, encoding = 'unicode_escape')
    
    return dataread
read_data(w).head()

Content Source label
0 When a group of us decided to submit to the Ai... Telegraph 0
1 Qatar Airways chief executive officer Akbar Al... Telegraph 0
2 Low-cost airline Ryanair has been blown off co... Telegraph 0
3 Sir Tim Clark, the Briton who runs Emirates, h... Telegraph 0
4 We've noticed you're adblocking.\r\n\r\nWe rel... Telegraph 0

As you can see, this dataset consists of

  • the text of the articles in the column ‘Content’,
  • the news source in the column ‘Source’, all in the space of aviation,
  • and the column ‘label’ recording whether the article was read by an analyst (note ‘an’, not ‘the’).
def subset_data(dataread, sentiment, sentimentcol): ## subset the data for the given category values
    
    datasub = dataread[dataread[sentimentcol].isin(sentiment)]
    
    return datasub
news_data = read_data(w)
news_data['label'].value_counts()
0    798
1     73
Name: label, dtype: int64

Note that roughly 8% of the articles (73 of 871) are marked ‘1’, which means they were read by an analyst.
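For reference, the same proportion can be read off directly (a one-liner not in the original notebook):

news_data['label'].value_counts(normalize=True) ## fractions rather than counts; label 1 is roughly 0.08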

news_data['Source'].value_counts()
Flight Global    542
Reuters          206
Telegraph        123
Name: Source, dtype: int64
pd.crosstab(news_data['Source'],news_data['label'])

label 0 1
Source
Flight Global 473 69
Reuters 202 4
Telegraph 123 0
interest = subset_data(news_data, [1], 'label') ## articles read by an analyst

no_interest = subset_data(news_data, [0], 'label') ## articles not read

The distribution of articles in the data is skewed towards Flight Global, and the articles labelled ‘1’ are even more heavily concentrated in Flight Global (69 of 73).

4.3 Clean data

Text data can require a lot of cleanup depending on the source and the way of extraction. This step is by no means comprehensive. Some of the things which may be present, but are not in this case (see the sketch after this list for one way to handle a couple of them):

  • email IDs
  • domain-specific alphanumeric tokens
  • abbreviations
  • special characters
  • redacted personal information
  • various forms of punctuation
  • and more
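As an illustration only, since none of these appear in this dataset, a hedged sketch of stripping email IDs and redaction markers with regular expressions; the patterns and the [REDACTED] convention are assumptions:

import re

def strip_emails_and_redactions(text):
    ## drop email addresses (a simple pattern; real-world addresses vary)
    text = re.sub(r'\S+@\S+\.\S+', ' ', text)
    ## drop bracketed redaction markers such as [REDACTED] (a hypothetical convention)
    text = re.sub(r'\[REDACTED\]', ' ', text, flags=re.IGNORECASE)
    return text

strip_emails_and_redactions('Contact jane.doe@example.com or [REDACTED] for details') ## removes both tokens

The cleanup actually applied to this dataset is the simpler function below.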
def clean_data(dataset, column_name):
    
    ## concatenate all rows of the text column into one string
    text_data = dataset[column_name].str.cat()
    
    ## replace every punctuation character with a space
    text_data = text_data.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation)))
    
    ## drop newline and carriage-return characters
    text_data = re.sub("\n","", text_data)
    text_data = re.sub("\r","", text_data)
    
    ## drop mojibake characters left over from encoding issues (e.g. â\x80\x99 for a curly apostrophe)
    text_data = re.sub("[â\x80\x99]","", text_data)

    return text_data
news_cleaned = clean_data(interest, 'Content')
news_no_interest = clean_data(no_interest, 'Content')

4.4 Extract words and n_grams

def extract_words(text, stopwords, min_length):
    ## tokenize, then keep only words that are not stopwords and are longer than min_length characters
    words = nltk.word_tokenize(text)
    
    words = [w for w in words if w not in stopwords if len(w) > min_length]
      
    text = ' '.join(words)

    return text
news_words = extract_words(news_cleaned, stop_words, 4)
news_words_no_interest = extract_words(news_no_interest, stop_words, 4)
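To see what this does, a tiny hedged example on a made-up sentence:

extract_words('The engines were inspected and the engines were cleared', stop_words, 4)
## 'engines inspected engines cleared' -- stopwords and words of four letters or fewer are dropped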
def create_ngram(text, n_gram):
    
    ## build n-grams with nltk and join each tuple back into a space-separated phrase
    gram_list = list(nltk.ngrams(text.split(" "), n_gram))
       
    dictionary = [' '.join(tup) for tup in gram_list]

    ## use CountVectorizer to count the frequency of each n-gram
    vectorizer = CountVectorizer(ngram_range=(n_gram, n_gram))
    bag_of_words = vectorizer.fit_transform(dictionary)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    ## return the n-grams and their counts as a dictionary (used by the word cloud below)
    words_dict = dict(words_freq)
    
    return words_dict
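A quick hedged example of what the function returns, again on a made-up sentence:

create_ngram('boeing delivered aircraft boeing delivered engines', 2)
## roughly {'boeing delivered': 2, 'delivered aircraft': 1, 'aircraft boeing': 1, 'delivered engines': 1},
## i.e. bigrams mapped to their counts, sorted in descending order of frequency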

4.5 Create word cloud

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt ## matplotlib is used by the plotting function below but was not imported earlier
def plot_wordcloud(text, 
                   mask=None, 
                   max_words=400, 
                   max_font_size=120, 
                   figure_size=(16.0,12.0), 
                   title = None, 
                   title_size=40, 
                   image_color=False):
    
    ## extend the NLTK stopwords with domain words that would otherwise dominate the cloud
    stopwords = set(stop_words)
    more_stopwords = {'rolls', 'royce'}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color='white',
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    mask = mask)

    ## the input is a dictionary of n-gram -> count, so generate from frequencies rather than raw text
    wordcloud.generate_from_frequencies(text)
    
    plt.figure(figsize=figure_size)
    if image_color:
        ## colour the words using the colours of the mask image
        image_colors = ImageColorGenerator(mask)
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud)
        plt.title(title, fontdict={'size': title_size, 'color': 'green', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off')
    plt.tight_layout()

Reference for the word cloud code.

plot_wordcloud(text = create_ngram(news_words, 2), mask = np.array(Image.open('user.png')))

(word cloud of bigrams from the articles of interest)
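The same helpers can be reused for other n-gram sizes, for example trigrams (a variation not shown in the original output):

plot_wordcloud(text = create_ngram(news_words, 3), mask = np.array(Image.open('user.png')), title = 'Trigrams')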

4.6 Build topic model

We will be using PyCaret’s autoML routines for NLP this time. Note that we use the news data frame directly, with no preprocessing of our own.

def build_topic_model(dataset, column_name, topics):
    
    ## pycaret setup: preprocesses the text column and initialises the experiment
    exp_nlp101 = setup(data = dataset, target = column_name, session_id = 123)

    ## train a Latent Dirichlet Allocation model with the requested number of topics
    lda = create_model('lda', num_topics = topics)

    ## attach the dominant topic and topic weights back onto the original data
    lda_results = assign_model(lda)
         
    return lda_results
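The function returns only the labelled data frame. If the topic keywords are also of interest, pycaret builds its LDA with gensim, so (as a hedged sketch, assuming you keep a handle to the lda object from create_model) something like this would print them:

for topic in lda.print_topics(num_topics = 6, num_words = 8): ## gensim's print_topics on the LDA model
    print(topic)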
a = build_topic_model(news_data, 'Content', 6)
pd.crosstab(a['Source'],a['Dominant_Topic'])

Dominant_Topic Topic 0 Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
Source
Flight Global 0 201 231 16 78 16
Reuters 22 100 4 67 2 11
Telegraph 1 38 9 36 3 36
plot_model(a, plot = 'umap')

(UMAP plot of the topic model)

4.7 Text similarity

Will be added in a future revision.