A tutorial on simple natural language processing in Python


This month's theme is NLP, and we'll help you get started on one possibility: using pandas (http://pandas.pydata.org/) and NLTK (http://www.nltk.org/) in Python to analyze the content of your Gmail inbox.

NLP projects are full of possibilities:

  • Sentiment analysis (http://en.wikipedia.org/wiki/Sentiment_analysis) measures the emotional content of text such as online reviews and social media posts. For example, do tweets on a topic tend to be positive or negative? Does a news site cover topics using more positive/negative terms, or terms often associated with certain emotions? Is this "positive" Yelp review being ironic? Good luck with that last one!
  • Analyzing the use of language in literature to measure trends in vocabulary or writing style over time/region/author.
  • Flagging a message as spam or not by identifying key characteristics of the language it uses.
  • Using topic modeling (http://en.wikipedia.org/wiki/Topic_model) to group comments into similar categories based on the topics they cover.
  • Measuring word similarity on the Twitter API by combining NLTK's corpora with Elasticsearch and WordNet (http://wordnet.princeton.edu/), then building a better real-time Twitter search.
  • Joining the NaNoGenMo project (https://github.com/dariusk/NaNoGenMo-2014) to generate your own novel with code; you can start with this large collection of ideas and resources (https://github.com/dariusk/NaNoGenMo-2014/issues/1).

Load your Gmail inbox into pandas

Let's start with an example project! First we need some data. Prepare your Gmail data archive (including your spam and trash folders) at:

https://www.google.com/settings/takeout

Now go take a walk. For my 5.1 GB mailbox, the 2.8 GB archive took over an hour to arrive.

Once you have the data and have configured a local environment for the project, use the following script to read the data into pandas (a library I strongly recommend for data analysis):
 


from mailbox import mbox
import pandas as pd

def store_content(message, body=None):
    if not body:
        body = message.get_payload(decode=True)
    if len(message):
        contents = {
            "subject": message['subject'] or "",
            "body": body,
            "from": message['from'],
            "to": message['to'],
            "date": message['date'],
            "labels": message['X-Gmail-Labels'],
            "epilogue": message.epilogue,
        }
        return df.append(contents, ignore_index=True)

# Create an empty DataFrame with the relevant columns
df = pd.DataFrame(
    columns=("subject", "body", "from", "to", "date", "labels", "epilogue"))

# Import your downloaded mbox file
box = mbox('All mail Including Spam and Trash.mbox')

fails = []
for message in box:
    try:
        if message.get_content_type() == 'text/plain':
            df = store_content(message)
        elif message.is_multipart():
            # Grab any plaintext from multipart messages
            for part in message.get_payload():
                if part.get_content_type() == 'text/plain':
                    df = store_content(message, part.get_payload(decode=True))
                    break
    except:
        fails.append(message)

The script above uses Python's mailbox module (https://docs.python.org/2/library/mailbox.html) to read and parse mail in mbox format. It could certainly be done more elegantly (for example, messages contain a lot of redundant, duplicated data, such as the embedded ">>>" symbols in quoted replies). Another problem is that some special characters can't be processed; for simplicity, we discard those messages. Make sure you aren't ignoring an important chunk of your mailbox in this step.
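As a quick sanity check on that last point, you can compare how many messages parsed successfully against how many were discarded (a minimal snippet using the df and fails built above):

# How much of the mailbox survived parsing?
print "Parsed %d messages, discarded %d" % (len(df), len(fails))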

It's worth noting that we won't actually use anything beyond the subject line in what follows. But you could do all sorts of interesting things with the timestamps, the message bodies, sorting by labels, and so on. Since this is just an introductory article (which happens to show results from my own mailbox), I don't want to go into too much detail.
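For instance, here is a hypothetical sketch of one such direction, counting messages per year (not used further in this tutorial; assumes a reasonably recent pandas that supports errors='coerce' and the .dt accessor):

# Hypothetical example: message volume per year
df['timestamp'] = pd.to_datetime(df.date, errors='coerce')
print df.groupby(df.timestamp.dt.year).size()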

Find common words

Now that we have some data, let's find the ten most common words across all the subject lines:
 


# Top 10 most common subject words
from collections import Counter
 
subject_word_bag = df.subject.apply(lambda t: t.lower() + " ").sum()
 
Counter(subject_word_bag.split()).most_common()[:10]
 
[('re:', 8508), ('-', 1188), ('the', 819), ('fwd:', 666), ('to', 572), ('new', 530), ('your', 528), ('for', 498), ('a', 463), ('course', 452)]

Well, those are all far too common to be interesting, so let's try filtering out common stopwords:
 


from nltk.corpus import stopwords
stops = [unicode(word) for word in stopwords.words('english')] + ['re:', 'fwd:', '-']
subject_words = [word for word in subject_word_bag.split() if word.lower() not in stops]
Counter(subject_words).most_common()[:10]
 
[('new', 530), ('course', 452), ('trackmaven', 334), ('question', 334), ('post', 286), ('content', 245), ('payment', 244), ('blog', 241), ('forum', 236), ('update', 220)]

In addition to manually removing a few of the most valueless words, we also use NLTK's stopwords corpus, which needs to be downloaded (see http://www.nltk.org/data.html) before use. Now you can see some words that are typical of my inbox, but not necessarily typical of English text in general.
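If you haven't fetched the corpus yet, a one-time download via NLTK's standard data interface does it:

# One-time download of the stopwords corpus
import nltk
nltk.download('stopwords')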

Bigrams and collocations

Another interesting property NLTK lets us measure is collocations (http://en.wikipedia.org/wiki/Collocation). First, let's look at the most common bigrams, pairs of words that frequently appear together in sequence:
 


from nltk import collocations
bigram_measures = collocations.BigramAssocMeasures()
bigram_finder = collocations.BigramCollocationFinder.from_words(subject_words)
 
# Ignore bigrams that occur fewer than 20 times;
# otherwise this will take a LONG time to analyze
bigram_finder.apply_freq_filter(20)
for bigram in bigram_finder.score_ngrams(bigram_measures.raw_freq)[:10]:
    print bigram
 
(('forum', 'content'), 0.005839453284373725)
(('new', 'forum'), 0.005839453284373725)
(('blog', 'post'), 0.00538045695634435)
(('domain', 'names'), 0.004870461036311709)
(('alpha', 'release'), 0.0028304773561811506)
(('default', 'widget.'), 0.0026519787841697267)
(('purechat:', 'question'), 0.0026519787841697267)
(('using', 'default'), 0.0026519787841697267)
(('release', 'third'), 0.002575479396164831)
(('trackmaven', 'application'), 0.002524479804161567)

We can repeat the same steps for trigrams (or n-grams in general) to find longer phrases. In this case, "new forum content" is the most frequent trigram, but in the list above it gets split into two pieces, with both constituent bigrams near the top of the list.
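A minimal sketch of the trigram version, reusing the subject_words from above (TrigramAssocMeasures and TrigramCollocationFinder are NLTK's trigram counterparts to the classes used earlier):

trigram_measures = collocations.TrigramAssocMeasures()
trigram_finder = collocations.TrigramCollocationFinder.from_words(subject_words)

# Again, skip trigrams that occur fewer than 20 times
trigram_finder.apply_freq_filter(20)
for trigram in trigram_finder.score_ngrams(trigram_measures.raw_freq)[:10]:
    print trigram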

A slightly different measure of collocation is based on pointwise mutual information (http://en.wikipedia.org/wiki/Pointwise_mutual_information). Essentially, it measures how likely we are to see the two words together, relative to how often each appears on its own across the whole corpus. For example, if my message subjects use the words "blog" and/or "post" a lot in general, then the bigram "blog post" is not an interesting signal, because each word may still appear frequently without the other. By this criterion, we get a different set of bigrams:
 


for bigram in bigram_finder.nbest(bigram_measures.pmi, 5):
    print bigram
 
('4:30pm', '5pm')
('motley', 'fool')
('60,', '900,')
('population', 'cap')
('simple', 'goods')

In other words, I don't get many email subjects mentioning the words "motley" or "fool", but when I do see either one, it's probably as part of "motley fool".

Sentiment analysis

Finally, let's try some sentiment analysis. For a quick start, we can use the NLTK-based TextBlob library (http://textblob.readthedocs.org/en/dev/index.html), which provides simple access to a large number of common NLP tasks. We can use its built-in sentiment analysis (http://textblob.readthedocs.org/en/dev/quickstart.html#sentiment-analysis, based on the pattern library, http://www.clips.ua.ac.be/pages/pattern-en#sentiment) to calculate a "polarity" for each subject, ranging from -1 for highly negative sentiment to 1 for highly positive, with 0 as neutral (or lacking a clear signal).
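A minimal sketch of this, assuming TextBlob is installed (pip install textblob) and using the df built earlier:

from textblob import TextBlob

# Polarity of each subject line: -1 (negative) to 1 (positive)
df['polarity'] = df.subject.apply(lambda s: TextBlob(s).sentiment.polarity)
print df.polarity.describe()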

Where to go next: analyze your inbox over time; see whether you can classify mail by sender/label/spam from basic attributes of the text; use latent semantic indexing (http://en.wikipedia.org/wiki/Latent_semantic_indexing) to reveal the most common general topics covered; feed your sent folder into a Markov model and, combined with part-of-speech tagging, generate coherent-sounding automatic replies.

Please write to us at engineroom@trackmaven.com if you try an interesting NLP project of your own; including an open-source library will be a plus. Check out the previous presentations at challenge.hackpad.com for more inspiration!

