Text Mining Is No Longer Intimidating

In the world of nonprofit data management, we work hard to apply smart coding to every interaction with our audience so that we can report on them successfully. But ours is a relationship industry, and relationships often require detailed explanation, which, translated into data speak, means free text. And there is our conundrum.

Processing free text is the domain of natural language processing, a branch of artificial intelligence that we nonprofit data scientists are learning now. The Python programming language has a package called Beautiful Soup that parses web pages, extracting the text within them along with the HTML tags marking any variety of content. The R language has a package called SentimentAnalysis, which assigns a sentiment score to words using a dictionary, called a lexicon.

But we can do some of this analysis without an artificial intelligence program. Here are some steps to work your way into trying out text parsing, starting with the easy stuff and working toward the sophisticated stuff.

  1. **Bag-of-Words Model:** This model splits text into single words, assigns a predetermined value to each parsed word, and then counts them up. Your version of this could be a word cloud, which we at Staupell have built using Excel and Tableau. Assigning meaning to the words requires a separate lexicon, which you can then use to add value to your word cloud.
  2. **Phrase Parsing:** Called n-grams in some programs, parsing text into 1-, then 2-, then 3-word phrases lets you find the phrases that indicate the sentiment you're looking for. I use WEKA to process text this way, but IBM's Text Analytics software also identifies phrases intuitively, depending on the lexicon (dictionary) that you use. I have even used SQL to do the work. When I have worked with IBM's product, I have set my lexicon for client satisfaction, but the product can also build a custom lexicon, so that a phrase like "promoted to" can be marked with a "career" tag.
  3. **Specific Triggers:** This exercise can even be done in Excel, where the SEARCH function will tell you whether a word appears in a cell. Words like "sold" or "gave", or your organization's name alongside a quote, can be identified. If you are using Python or R, use regular expressions to find them, then flag them.
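The bag-of-words step can be sketched in a few lines of Python. The lexicon below is a tiny made-up illustration, not a published sentiment dictionary, and the sample contact report is invented:

```python
import re
from collections import Counter

# Illustrative word scores only; a real project would load a
# published sentiment lexicon instead.
LEXICON = {"grateful": 2, "pledged": 1, "declined": -1, "upset": -2}

def bag_of_words(text):
    """Split text into lowercase words and count each one."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def score(text):
    """Sum the lexicon value of every word in the text."""
    return sum(LEXICON.get(word, 0) * n
               for word, n in bag_of_words(text).items())

report = "Donor was grateful and pledged again, though upset about the gala."
print(bag_of_words(report).most_common(3))
print(score(report))  # grateful(+2) + pledged(+1) + upset(-2) = 1
```

The word counts are exactly what a word-cloud tool consumes; the lexicon step is what turns counting into scoring.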
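Phrase parsing can also be tried without WEKA or IBM's product. A minimal n-gram sketch, assuming a hypothetical phrase lexicon along the lines of the "promoted to" → "career" example:

```python
def ngrams(text, n):
    """Return all n-word phrases (n-grams) from the text, in order."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical phrase lexicon mapping phrases to tags.
PHRASE_TAGS = {"promoted to": "career", "sold his company": "wealth event"}

def tag_phrases(text):
    """Collect tags for every 1-, 2-, and 3-gram found in the lexicon."""
    found = set()
    for n in (1, 2, 3):
        for phrase in ngrams(text, n):
            if phrase in PHRASE_TAGS:
                found.add(PHRASE_TAGS[phrase])
    return found

note = "She was promoted to partner and sold his company last year."
print(tag_phrases(note))  # {'career', 'wealth event'}
```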
There was a study I heard about years ago (and I wish I could find it now) in which a suicide hotline used data science to identify the keywords that indicated a caller really meant to cause self-harm. To me, that is the best use of text analytics. Your work, since you are in a nonprofit, is also for a noble cause.

Try some of these tricks and see what you can glean from contact reports. And let us know what you find.