Building a Natural Language Processing Pipeline | by Sangramsing Kayte | Dec, 2020
Copenhagen is the capital and most populous city of Denmark, and it sits on the coastal islands of Zealand and Amager. It’s linked to Malmo in southern Sweden by the Oresund Bridge. Indre By, the city’s historic centre, contains Frederiksstaden, an 18th-century rococo district, home to the royal family’s Amalienborg Palace. Nearby is Christiansborg Palace and the Renaissance-era Rosenborg Castle, surrounded by gardens and home to the crown jewels.
This paragraph contains several useful facts. It would be great if a computer could read this text and understand that Copenhagen is a city, that Copenhagen sits on coastal islands, and that Copenhagen is home to the royal family’s Amalienborg Palace. But to get there, we first have to teach our computer the most basic concepts of written language and then move up from there.
The first step in the pipeline is to break the text apart into separate sentences. That gives us this:
1. “Copenhagen is the capital and most populous city of Denmark and the coastal islands.”
2. “It’s linked to Malmo in southern Sweden by the Oresund Bridge.”
3. “Indre By, the city’s historic centre, contains Frederiksstaden, an 18th-century rococo district, home to the royal family’s Amalienborg Palace.”
4. “Nearby is Christiansborg Palace and the Renaissance-era Rosenborg Castle, surrounded by gardens and home to the crown jewels.”
We can assume that each sentence in English is a separate thought or idea. It will be a lot easier to write a program to understand a single sentence than to understand a whole paragraph.
Coding a Sentence Segmentation model can be as simple as splitting apart sentences whenever you see a punctuation mark. But modern NLP pipelines often use more complex techniques that work even when a document isn’t formatted cleanly.
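As a minimal sketch of the simple approach described above, the snippet below splits text after a sentence-ending punctuation mark when the next word starts with a capital letter. The function name `split_sentences` and the shortened sample paragraph are assumptions made for illustration, not part of any particular library.

```python
import re

def split_sentences(text):
    """Naive segmentation: split after ., ! or ? when whitespace and a capital letter follow."""
    # The look-behind keeps the punctuation with its sentence;
    # the look-ahead requires the next sentence to start with an upper-case letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]

# Shortened sample text (an assumption, kept small for readability).
paragraph = (
    "Copenhagen is the capital of Denmark. "
    "It is linked to Malmo by the Oresund Bridge. "
    "Nearby is Christiansborg Palace."
)
sentences = split_sentences(paragraph)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

A rule like this breaks on abbreviations: "Dr. Smith arrived." would be split after "Dr.", which is one reason real pipelines use more complex techniques.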
Now that we’ve split our document into sentences, we can process them one at a time. Let’s start with the first sentence from our document:
“Copenhagen is the capital and most populous city of Denmark and the coastal islands.”
The next step in our pipeline is to break this sentence into separate words or tokens. This is called tokenization. This is the result:
“Copenhagen”, “is”, “the”, “capital”, “and”, “most”, “populous”, “city”, “of”, “Denmark”, “and”, “the”, “coastal”, “islands”, “.”
Tokenization is easy to do in English. We’ll just split apart words whenever there’s a space between them. And we’ll also treat punctuation marks as separate tokens since punctuation also has meaning.
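A rough sketch of that idea in Python might look like the following. The `tokenize` helper is a hypothetical name, and the regular expression is just one simple way to separate words from punctuation.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize(
    "Copenhagen is the capital and most populous city of Denmark and the coastal islands."
)
print(tokens)
```

Note that this simple pattern would also split a contraction such as "It’s" into three tokens, so a production tokenizer usually needs extra rules.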
Next, we’ll look at each token and try to guess its part of speech: whether it is a noun, a verb, an adjective and so on. Knowing the role of each word in the sentence will help us start to figure out what the sentence is talking about.
We can do this by feeding each word (and some extra words around it for context) into a pre-trained part-of-speech classification model.
The part-of-speech model was originally trained by feeding it millions of English sentences with each word’s part of speech already tagged and having it learn to replicate that behavior.
Keep in mind that the model is completely based on statistics — it doesn’t actually understand what the words mean in the same way that humans do. It just knows how to guess a part of speech based on similar sentences and words it has seen before.
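To make the "completely based on statistics" point concrete, here is a toy most-frequent-tag tagger. It is a deliberately simplified baseline, not the pre-trained model described above: it ignores the surrounding context words, and the tiny hand-tagged training set and tag names are assumptions invented for this sketch.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (hypothetical; real models learn from millions of sentences).
TAGGED_SENTENCES = [
    [("Copenhagen", "PROPN"), ("is", "VERB"), ("the", "DET"), ("capital", "NOUN"), (".", "PUNCT")],
    [("Denmark", "PROPN"), ("is", "VERB"), ("a", "DET"), ("country", "NOUN"), (".", "PUNCT")],
    [("the", "DET"), ("city", "NOUN"), ("and", "CCONJ"), ("the", "DET"), ("islands", "NOUN"), (".", "PUNCT")],
]

def train(tagged_sentences):
    """Count how often each word carries each tag and keep the most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag(tokens, model, default="NOUN"):
    """Tag each token with its most frequent training tag; unseen words get `default`."""
    return [(token, model.get(token.lower(), default)) for token in tokens]

model = train(TAGGED_SENTENCES)
tagged = tag(["Copenhagen", "is", "the", "capital", "."], model)
print(tagged)
```

The tagger never learns what "capital" means. It only repeats the tag it saw most often, and it guesses a default for any word it has not seen.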
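After processing the whole sentence, we’ll have a part-of-speech tag for every token: “Copenhagen” is tagged as a proper noun, “is” as a verb, “capital” and “city” as nouns, and so on.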
With this information, we can already start to glean some very basic meaning. For example, we can see that the nouns in the sentence include “Copenhagen” and “capital”, so the sentence is probably talking about Copenhagen.
In English (and most languages), words appear in different forms. Look at these two sentences:
I had a MacBook.
I had two MacBooks.
Both sentences talk about the noun MacBook, but they are using different inflections. When working with text in a computer, it is helpful to know the base form of each word so that you know that both sentences are talking about the same concept. Otherwise, the strings “MacBook” and “MacBooks” look like two totally different words to a computer.
In NLP, we call this process lemmatization: figuring out the most basic form, or lemma, of each word in the sentence.
The same thing applies to verbs. We can also lemmatize verbs by finding their root, unconjugated form. So “I had two MacBooks” becomes “I [have] two [macbook].”
Lemmatization is typically done by having a look-up table of the lemma forms of words based on their part of speech and possibly having some custom rules to handle words that you’ve never seen before.
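The look-up-table-plus-rules approach can be sketched as follows. The table entries, the suffix rules and the `lemmatize` function are all hypothetical and far smaller than what a real lemmatizer would ship with.

```python
# Hypothetical look-up table keyed by (lower-cased word, part of speech).
LEMMA_TABLE = {
    ("is", "VERB"): "be",
    ("was", "VERB"): "be",
    ("had", "VERB"): "have",
    ("has", "VERB"): "have",
}

def lemmatize(word, pos):
    """Return the lemma from the table, else apply simple fallback rules for unseen words."""
    lower = word.lower()
    if (lower, pos) in LEMMA_TABLE:
        return LEMMA_TABLE[(lower, pos)]
    if pos == "NOUN":
        if lower.endswith("ies") and len(lower) > 4:
            return lower[:-3] + "y"   # ponies -> pony
        if lower.endswith("s") and not lower.endswith("ss"):
            return lower[:-1]         # macbooks -> macbook
    return lower

tagged = [("I", "PRON"), ("had", "VERB"), ("two", "NUM"), ("MacBooks", "NOUN")]
lemmas = [lemmatize(word, pos) for word, pos in tagged]
print(lemmas)
```

The part of speech matters here: the plural rule is only tried for nouns, so a verb ending in "s" is not stripped by mistake.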
After lemmatization adds in the root form of our verb, our sentence looks almost the same as before. The only change we made was turning “is” into “be”.
Next, we want to consider the importance of each word in the sentence. English has a lot of filler words that appear very frequently like “and”, “the”, and “a”. When doing statistics on text, these words introduce a lot of noise since they appear way more frequently than other words. Some NLP pipelines will flag them as stop words — that is, words that you might want to filter out before doing any statistical analysis.
In our sentence, words such as “the” and “and” would be flagged as stop words.
Stop words are usually identified just by checking a hardcoded list of known stop words. But there’s no standard list of stop words that is appropriate for all applications. The list of words to ignore can vary depending on your application.
For example, if you are building a rock band search engine, you want to make sure you don’t ignore the word “The”. Not only does the word “The” appear in a lot of band names, there’s a famous 1980s rock band called The The!
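A hardcoded stop list with an application-specific override could be sketched like this. The list contents and the helper names are assumptions chosen for the example, since there is no standard list.

```python
# A small hardcoded stop list (an assumption; real lists are longer and application-specific).
DEFAULT_STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "is"})

def flag_stop_words(tokens, stop_words=DEFAULT_STOP_WORDS):
    """Pair each token with True if it appears on the stop list."""
    return [(token, token.lower() in stop_words) for token in tokens]

def remove_stop_words(tokens, stop_words=DEFAULT_STOP_WORDS):
    """Keep only the tokens that are not flagged as stop words."""
    return [token for token, is_stop in flag_stop_words(tokens, stop_words) if not is_stop]

tokens = ["Copenhagen", "is", "the", "capital", "and", "most", "populous", "city", "of", "Denmark"]
content_words = remove_stop_words(tokens)
print(content_words)

# A band search engine might keep "the" by passing its own list.
band_stop_words = DEFAULT_STOP_WORDS - {"the"}
band_tokens = remove_stop_words(["The", "The", "and", "a", "band"], band_stop_words)
print(band_tokens)
```

Passing the list as a parameter, rather than baking it into the function, is what lets each application decide which words to ignore.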
The next step is to figure out how all the words in our sentence relate to each other. This is called dependency parsing.
The goal is to build a tree that assigns a single parent word to each word in the sentence. The root of the tree will be the main verb in the sentence. Here’s what the beginning of the parse tree will look like for our sentence:
But we can go one step further. In addition to identifying the parent word of each word, we can also predict the type of relationship that exists between those two words:
This parse tree shows us that the subject of the sentence is the noun “Copenhagen” and it has a “be” relationship with “capital”. We finally know something useful: Copenhagen is the capital! And if we followed the complete parse tree for the sentence (beyond what is shown), we would even find out that Copenhagen is the capital of Denmark.
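One way to picture such a tree in code is to store, for every token, the index of its parent and a label for the relationship. The parse below is written by hand for a shortened version of the sentence, and the label names (`nsubj`, `attr`, `prep`, `pobj`, `det`) are borrowed loosely from common dependency schemes. Both are assumptions for illustration, not the output of a real parser.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    head: int      # index of the parent token; the root points at itself
    relation: str  # label for the relationship to the parent

# Hand-annotated parse of "Copenhagen is the capital of Denmark" (hypothetical).
parse = [
    Token("Copenhagen", 1, "nsubj"),  # subject of "is"
    Token("is", 1, "ROOT"),           # main verb, root of the tree
    Token("the", 3, "det"),           # determiner of "capital"
    Token("capital", 1, "attr"),      # attribute linked to the subject by "is"
    Token("of", 3, "prep"),           # preposition attached to "capital"
    Token("Denmark", 4, "pobj"),      # object of the preposition "of"
]

def root_index(tokens):
    """The root is the only token whose parent is itself."""
    return next(i for i, token in enumerate(tokens) if token.head == i)

def children(tokens, parent, relation):
    """Texts of the tokens attached to `parent` with the given relation label."""
    return [
        token.text
        for i, token in enumerate(tokens)
        if i != parent and token.head == parent and token.relation == relation
    ]

root = root_index(parse)
subject = children(parse, root, "nsubj")
attribute = children(parse, root, "attr")
print(parse[root].text, subject, attribute)
```

Because every token stores exactly one `head`, the structure is a tree by construction, and following the `prep` and `pobj` links from "capital" leads to "Denmark".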
Just like how we predicted parts of speech earlier using a machine learning model, dependency parsing also works by feeding words into a machine learning model and outputting a result. But parsing word dependencies is a particularly complex task and would require an entire article to explain in any detail. If you are curious how it works, a great place to start reading is Matthew Honnibal’s excellent article “Parsing English in 500 Lines of Python”.
Now that we’ve done all that hard work, we can finally move beyond grade-school grammar and start actually extracting ideas.
In our sentence, we have the following nouns: “Copenhagen”, “capital”, “city”, “Denmark” and “islands”.
Some of these nouns represent real things in the world. For example, “Copenhagen”, “Denmark” and “Coastal Islands” represent physical places on a map. It would be nice to be able to detect that! With that information, we could automatically extract a list of real-world places mentioned in a document using NLP.
The goal of Named Entity Recognition, or NER, is to detect and label these nouns with the real-world concepts that they represent. Here’s what our sentence looks like after running each token through our NER tagging model:
But NER systems aren’t just doing a simple dictionary lookup. Instead, they are using the context of how a word appears in the sentence and a statistical model to guess which type of noun a word represents. A good NER system can tell the difference between “Brooklyn Decker” the person and the place “Brooklyn” using context clues.
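The role of context can be illustrated with a very small toy. The sketch below is not a statistical model: it uses a hypothetical list of place names plus one hand-written context rule, standing in for the clues a trained NER model would learn from data.

```python
# Hypothetical list of place names; a real NER model learns such clues from labeled text.
KNOWN_PLACES = {"Brooklyn", "Copenhagen", "Denmark"}

def tag_entities(tokens):
    """Label place-like names as PERSON or PLACE using the next token as a context clue."""
    entities = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if token in KNOWN_PLACES and next_token[:1].isupper():
            # A place-like name followed by another capitalized word is read as a full name.
            entities.append((token + " " + next_token, "PERSON"))
            i += 2
        elif token in KNOWN_PLACES:
            entities.append((token, "PLACE"))
            i += 1
        else:
            i += 1
    return entities

person_example = tag_entities(["Brooklyn", "Decker", "arrived", "."])
place_example = tag_entities(["She", "lives", "in", "Brooklyn", "."])
print(person_example, place_example)
```

A single rule like this fails quickly (for instance on two place names in a row), which is why real systems rely on a statistical model trained on many labeled examples.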
Here are just some of the kinds of objects that a typical NER system can tag:
· People’s names
· Company names
· Geographic locations (both physical and political)
· Product names
· Dates and times
· Amounts of money
· Names of events
NER has tons of uses since it makes it so easy to grab structured data out of text. It’s one of the easiest ways to quickly get value out of an NLP pipeline.