Understanding Text Analytics Methods

over 6 years ago by Ryan Stuart • 1 min read

There’s no getting around it: text analytics is fast becoming a must-have for businesses. Here are the four key approaches: linguistic, statistical, supervised and unsupervised.

There’s no getting around it: text analytics is fast becoming a must-have for businesses. With so much hype, you’d be forgiven for thinking there would be a wealth of information available on the topic. Unfortunately, this just isn’t the case. So we thought we’d put together a succinct summary of the four key approaches to text analytics: linguistic, statistical, supervised and unsupervised.

Linguistic

Linguistic text analysis is, in short, the who and the what. It relies on a set of language rules to identify the players in the text and what is happening. Unfortunately, language usage can vary even within small geographical areas (for example, the term sick can express positive sentiment in Australia). Combine this with the boundaries between natural languages, and you can see why the rulesets require constant fine-tuning. As a result, this method has fallen out of favour. Modern implementations do exist, and they focus on using machine learning to build the rulesets automatically.

Statistical

Nowadays, most text analytics platforms favour statistical analysis to identify the various components. As the name suggests, this method focuses on mathematical relationships between terms, with metrics such as frequency and co-occurrence providing contextual information, thus removing the need for language rules. The statistical relationships between terms in a dataset allow us to gain some insight into the data. For instance, if your product is frequently mentioned alongside the term price, then price may well be a concern for your customers.
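
To make frequency and co-occurrence a little more concrete, here is a minimal sketch in Python. The comments, the stopword list and the product name "widget" are all invented for illustration, and counting each term once per comment is just one reasonable choice among several; this is not how any particular platform computes its metrics.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical customer comments, invented purely for illustration.
comments = [
    "The price of the widget is too high",
    "Love the widget but the price went up again",
    "Widget arrived quickly and works well",
    "Great taste, shame about the price",
]

# A tiny, hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "of", "is", "too", "but", "and", "about", "a", "up"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


term_freq = Counter()      # number of comments each term appears in
cooccurrence = Counter()   # number of comments each pair of terms shares

for comment in comments:
    terms = set(tokenize(comment))  # count each term once per comment
    term_freq.update(terms)
    # Sorting gives every unordered pair a single, consistent key.
    cooccurrence.update(combinations(sorted(terms), 2))

print(term_freq["price"])                  # 3
print(cooccurrence[("price", "widget")])   # 2
```

In this toy dataset, "price" shows up in three of the four comments and shares two of them with "widget", which is the kind of signal the paragraph above describes.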

Supervised

A common approach in text analytics is to track groups of terms with a strong statistical relationship in the data, often referred to as topics, so we can start to identify trends. This can be accomplished by picking a set of topics you wish to track and asking a computer to look for them in the data. This type of methodology, known as supervised text analytics, involves specifying the topics and seeing how they change across multiple datasets and over time. For example, results from NPS surveys for a food product are likely to talk about price, taste, and health impacts, so in a supervised approach we ask the computer to identify these issues and track them over time.

A supervised approach is not without its drawbacks, though. It is often time-consuming to set up because you need a large amount of training data labelled with the topics you want to track, which is what allows the machine to learn how to identify these topics in unseen text. By far its biggest drawback, though, is its inability to identify emerging issues that haven’t been manually identified by the user during training.
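
As a rough illustration of the supervised idea, the sketch below trains a very small naive Bayes classifier (one common technique, swapped in here as an example rather than anything the approach requires) on a handful of labelled comments. The training sentences and the three topic labels are hypothetical, and a real system would need far more labelled data than this.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labelled training data, invented for illustration.
training = [
    ("far too expensive for what you get", "price"),
    ("the price keeps going up", "price"),
    ("cheap and good value", "price"),
    ("tastes delicious and fresh", "taste"),
    ("the flavour is bland", "taste"),
    ("delicious flavour, great taste", "taste"),
    ("too much sugar and salt", "health"),
    ("a healthy low sugar snack", "health"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per topic, comments per topic, and the overall vocabulary."""
    word_counts = defaultdict(Counter)
    topic_counts = Counter()
    vocab = set()
    for text, topic in examples:
        tokens = tokenize(text)
        word_counts[topic].update(tokens)
        topic_counts[topic] += 1
        vocab.update(tokens)
    return word_counts, topic_counts, vocab


def predict(text, model):
    """Return the most probable known topic using add-one smoothing."""
    word_counts, topic_counts, vocab = model
    total_docs = sum(topic_counts.values())
    scores = {}
    for topic, n_docs in topic_counts.items():
        score = math.log(n_docs / total_docs)  # log prior for the topic
        total_words = sum(word_counts[topic].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log(
                (word_counts[topic][token] + 1) / (total_words + len(vocab))
            )
        scores[topic] = score
    return max(scores, key=scores.get)


model = train(training)
print(predict("the price is too expensive", model))  # price
print(predict("low sugar", model))                   # health
```

Note that `predict` can only ever answer with one of the three topics it was trained on. A comment about, say, damaged packaging would still be forced into price, taste or health, which is the emerging-issues drawback described above.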

Unsupervised

This brings us to unsupervised text analytics. In this instance, there is no user input other than the data itself, and the machine builds the topics for you. The downside to this method is that tracking specific topics can become difficult, as the topics identified by the machine may change from dataset to dataset, creating an apples and oranges situation.

Having said that, unsupervised text analysis is much quicker than supervised, as it doesn’t require the user to manually prepare labelled training data and then train the machine extensively. Additionally, it analyses the full gamut of issues in the dataset, so you won’t miss emergent ones, keeping you abreast of the market.
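
To give a feel for how a machine might build topics with no labels at all, here is a deliberately simple sketch: it links terms that co-occur in at least two comments and treats each connected group of terms as a topic. The comments, the stopword list and the `min_cooccurrence` threshold are assumptions made for illustration, and this is a stand-in technique, not a description of any production topic model.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical unlabelled comments, invented for illustration.
docs = [
    "price too high",
    "high price again",
    "delivery was late",
    "late delivery again",
    "packaging damaged",
    "damaged packaging on arrival",
]

STOPWORDS = {"too", "was", "on"}  # tiny hand-picked list (an assumption)


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def discover_topics(documents, min_cooccurrence=2):
    """Group terms into topics by linking pairs that co-occur often enough."""
    pair_counts = Counter()
    for doc in documents:
        terms = sorted(set(tokenize(doc)))
        pair_counts.update(combinations(terms, 2))

    # Build an undirected graph whose edges are the frequent pairs.
    graph = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)

    # Each connected component of the graph becomes one topic.
    topics, seen = [], set()
    for term in sorted(graph):
        if term in seen:
            continue
        stack, component = [term], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        topics.append(sorted(component))
    return topics


print(discover_topics(docs))
# [['damaged', 'packaging'], ['delivery', 'late'], ['high', 'price']]
```

Nobody told the code to look for packaging problems, yet they surface as a topic of their own. Equally, running the same function on a different batch of comments would produce a different set of groups, which is the apples and oranges problem mentioned above.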

Our Approach

At Kapiche, we have always had a strong focus on unsupervised, statistical text analysis. This gives you quick and simple analysis of all issues arising from your dataset. For long-term tracking, we are about to roll out new functionality that will enable you to customise and freeze the created topic models. We see this as an ideal intersection of the supervised and unsupervised worlds. With this new feature, you will be able to identify trends around your key metrics without missing new developments.