Understanding our data

Amazon Comprehend

Stahl Tamas
5 min read · Dec 8, 2020
Photo by Max Duzij on Unsplash

We all come across large amounts of unstructured data every day, such as emails, social media posts or product reviews, and these surely contain valuable information that we could put to use. They can offer insight into people's sentiment, which is worth analyzing to learn how our products, our services or even we ourselves are generally perceived. While researching for this article, I came across several such sentiment analyses.

Let me show you what sentiment analysis is. A good example is the 2020 US election and the perception of the two candidates, Donald Trump and Joe Biden. People analyzed tweets (Twitter being really popular in the US) that mentioned either Donald Trump or Joe Biden and classified the sentiment of each tweet as positive, negative, neutral or mixed. Based on this sentiment analysis and the geographical location from which the tweets were sent, we could see which state preferred which candidate. It is important to note that the external validity of such an analysis is questionable, as we cannot be sure that either side was represented properly.

To do such an analysis, I would suggest using Amazon Comprehend. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text, and using it requires no machine learning experience.

Besides the sentiment analysis described above, Amazon Comprehend has the following really useful and interesting features:

  • Entity Detection,
  • Key Phrases Extraction,
  • Language Detection,
  • Topic Modelling on Large Document Collections, etc.
  • +1 Amazon Comprehend Medical
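
As a rough sketch of how two of the listed features might be called from R: the aws.comprehend package (the same one loaded in the setup further down) exports `detect_entities()` and `detect_phrases()`. The wrapper name `extract_entities_and_phrases` and the sample sentence are made up for illustration, and the calls only succeed once AWS credentials are configured, so the example call is guarded.

```r
# Hypothetical wrapper (not part of aws.comprehend): run entity detection
# and key phrase extraction on the same piece of text.
extract_entities_and_phrases <- function(text, language = "en") {
  list(
    entities = aws.comprehend::detect_entities(text, language = language),
    phrases  = aws.comprehend::detect_phrases(text, language = language)
  )
}

# Only call AWS if the package is installed and credentials are present.
if (requireNamespace("aws.comprehend", quietly = TRUE) &&
    nzchar(Sys.getenv("AWS_ACCESS_KEY_ID"))) {
  res <- extract_entities_and_phrases(
    "Anton Chekhov was born in Taganrog in 1860."
  )
  print(res$entities)  # entities found in the text
  print(res$phrases)   # key phrases found in the text
}
```

Topic modelling is not covered here, since it runs as a batch job over a document collection rather than as a single call on one string.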

Amazon Comprehend

Amazon Comprehend can analyze a collection of documents and other text files (such as social media posts) and automatically organize them by relevant terms or topics. These topics are valuable because they can be used to deliver personalized content to customers or to provide richer search and navigation.

Amazon Comprehend Medical identifies medical information, such as health conditions or medications, and determines the relationships between them (e.g., which medicine is taken and how frequently). It can also provide context for analysts about a selected disease, such as whether a patient has tested positive or negative.

As the picture shows, Amazon Comprehend automatically extracts the key phrases, entities, sentiments, etc., and this data can then be used for the analysis. In the Master’s program I am attending, we have been using R for data analysis, so I will show you how to use these features in R.

How Amazon Comprehend works (Source: aws.amazon.com)

Use Cases

First, we need to set up access to AWS:

# XYaccessKeys.csv is the CSV downloaded from AWS containing your access and secret keys
keyTable <- read.csv("XYaccessKeys.csv", header = TRUE)
AWS_ACCESS_KEY_ID <- as.character(keyTable$Access.key.ID)
AWS_SECRET_ACCESS_KEY <- as.character(keyTable$Secret.access.key)
# make the credentials and region available to the AWS packages
Sys.setenv("AWS_ACCESS_KEY_ID" = AWS_ACCESS_KEY_ID,
           "AWS_SECRET_ACCESS_KEY" = AWS_SECRET_ACCESS_KEY,
           "AWS_DEFAULT_REGION" = "eu-west-1")
# load the required package
library("aws.comprehend")
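
Before calling the service, it can help to confirm that the three environment variables were actually set, since a wrong column name in the CSV would silently leave one empty. The snippet below is a minimal sketch in base R; the helper name `missing_aws_vars` is an assumption made for illustration and is not part of any AWS package.

```r
# Hypothetical helper: return the names of required AWS variables that are unset.
missing_aws_vars <- function(vars = c("AWS_ACCESS_KEY_ID",
                                      "AWS_SECRET_ACCESS_KEY",
                                      "AWS_DEFAULT_REGION")) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

missing <- missing_aws_vars()
if (length(missing) > 0) {
  message("Not set: ", paste(missing, collapse = ", "))
} else {
  message("AWS credentials and region are set.")
}
```

This only checks that the variables are non-empty; it does not verify that the keys are valid.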

Let’s look at some use cases, with R code for several of the features:

Sentiment analysis

“It is not reprehensible for anyone to sneeze anywhere. Peasants sneeze and so do police superintendents, and sometimes even privy councillors. All men sneeze. Tchervyakov was not in the least confused, he wiped his face with his handkerchief, and like a polite man, looked round to see whether he had disturbed any one by his sneezing.”

To find out whether the text above (a snippet from one of Anton Chekhov’s short stories) is positive, negative, neutral or mixed, see the R code below.

detect_sentiment("It is not reprehensible for anyone to sneeze anywhere. Peasants sneeze and so do police superintendents, and sometimes even privy councillors. All men sneeze. Tchervyakov was not in the least confused, he wiped his face with his handkerchief, and like a polite man, looked round to see whether he had disturbed any one by his sneezing.")

In this case we will get the following result:

The result is quite straightforward. The overall sentiment of the text is negative, and the result also gives a confidence score for each of the four classes: mixed, negative, neutral and positive. In our case the negative class received a score of more than 58%.
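
To read such a result programmatically, one option is to pick the class with the highest score. The sketch below assumes that `detect_sentiment()` returns a one-row data frame with the columns `Index`, `Sentiment`, `Mixed`, `Negative`, `Neutral` and `Positive`; the helper `top_sentiment` is hypothetical, and the numbers are placeholders rather than the real output for the Chekhov snippet.

```r
# Placeholder result in the assumed column layout; the numbers are
# illustrative only and are NOT the real output for the Chekhov snippet.
example_result <- data.frame(
  Index = 0,
  Sentiment = "NEGATIVE",
  Mixed = 0.05, Negative = 0.60, Neutral = 0.30, Positive = 0.05,
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the class with the highest confidence score.
top_sentiment <- function(res) {
  score_cols <- c("Mixed", "Negative", "Neutral", "Positive")
  scores <- unlist(res[1, score_cols])
  best <- names(scores)[which.max(scores)]
  sprintf("%s (%.1f%%)", toupper(best), 100 * scores[[best]])
}

top_sentiment(example_result)
```

With a live connection, the same helper could be applied directly to the value returned by `detect_sentiment()`, provided the column layout matches the assumption above.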

Language detection

We have the following three sentences in Russian, Tamil and Swiss German. Now let’s see whether Amazon Comprehend recognizes the languages.

  • Минэкономразвития предлагает мобилизовать деньги бизнеса, банков, населения и ЦБ
detect_language("Минэкономразвития предлагает мобилизовать деньги бизнеса, банков, населения и ЦБ")
Pretty accurate result for the Russian language test
  • கொரோனா வைரஸால் பாதிக்கப்படுவோரின் எண்ணிக்கை இந்தியாவில் குறைந்து கொண்டிருக்கிறது.
detect_language("கொரோனா வைரஸால் பாதிக்கப்படுவோரின் எண்ணிக்கை இந்தியாவில் குறைந்து கொண்டிருக்கிறது.")
Amazon Comprehend is 100% sure that the sentence is in Tamil
  • Schwyzerdütsch isch ä Sommelbezeichnig fyr diejenige alemannische Dialekte, wu in dr Schwyz un in Liechtestai gsproche wärre.
detect_language("Schwyzerdütsch isch ä Sommelbezeichnig fyr diejenige alemannische Dialekte, wu in dr Schwyz un in Liechtestai gsproche wärre.")
For Swiss German the result is still pretty accurate, though there is some uncertainty, perhaps because of the dialect

All in all, we can say that Amazon Comprehend successfully recognized all three languages, even Swiss German. Good job! It was least certain about the Swiss German text, while it reported a 100% match for Tamil.
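
A comparison like this can also be put into a small table. The sketch below assumes that `detect_language()` returns a data frame with the columns `Index`, `LanguageCode` and `Score`; the helper `summarise_languages` is hypothetical, and the data are placeholders (only the 100% score for Tamil comes from the results above, and the `de` code for the Swiss German text is an assumption).

```r
# Placeholder results in the assumed layout; apart from the Tamil score,
# the values (and the "de" code for Swiss German) are illustrative only.
example_results <- list(
  russian      = data.frame(Index = 0, LanguageCode = "ru", Score = 0.99,
                            stringsAsFactors = FALSE),
  tamil        = data.frame(Index = 0, LanguageCode = "ta", Score = 1.00,
                            stringsAsFactors = FALSE),
  swiss_german = data.frame(Index = 0, LanguageCode = "de", Score = 0.90,
                            stringsAsFactors = FALSE)
)

# Hypothetical helper: keep the highest-scoring language for each text.
summarise_languages <- function(results) {
  rows <- lapply(names(results), function(label) {
    res <- results[[label]]
    best <- res[which.max(res$Score), ]
    data.frame(text = label, language = best$LanguageCode,
               score = best$Score, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

summary_table <- summarise_languages(example_results)
# Sort so that the least certain detection comes first.
summary_table[order(summary_table$score), ]
```

In a live session, each list element would instead be the value returned by the corresponding `detect_language()` call.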

I hope you liked this brief summary of the features of Amazon Comprehend, with some examples at the end.

Thanks for reading!
