This NLP tutorial covers the basic and advanced concepts of Natural Language Processing. It is designed for both beginners and professionals.
Before learning NLP, you should have a basic knowledge of Python.
- What is NLP?
- History of NLP
- Advantages of NLP
- Disadvantages of NLP
- Components of NLP
- Applications of NLP
- How to build an NLP pipeline?
- Phases of NLP
- Why NLP is Difficult?
- NLP APIs
- NLP Libraries
- Difference between Natural language and Computer language
What is NLP?
NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence. It is the technology that machines use to understand, analyse, manipulate, and interpret human languages. It helps developers organize knowledge for tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
History of NLP
(1940-1960) – Focused on Machine Translation (MT)
Work on natural language processing started in the 1940s.
1948 – The first recognisable NLP application was introduced at Birkbeck College, London.
1950s – Linguistics and computer science held conflicting views of language. In 1957, Chomsky published his first book, Syntactic Structures, and claimed that language is generative in nature.
In the same year, Chomsky also introduced the idea of Generative Grammar, a rule-based description of syntactic structures.
(1960-1980) – Flavored with Artificial Intelligence (AI)
Between 1960 and 1980, the key developments were:
Augmented Transition Networks (ATN)
An Augmented Transition Network extends a finite state machine with recursion and registers, which lets it recognize more than the regular languages that a plain finite state machine can handle.
Case Grammar
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. In languages such as English, it expresses the relationship between nouns and verbs through prepositions.
In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.
For example: “Neha broke the mirror with the hammer”. Here Case Grammar identifies Neha as the agent, the mirror as the theme, and the hammer as the instrument.
Between 1960 and 1980, the key systems were:
SHRDLU
SHRDLU is a program written by Terry Winograd in 1968-70. It lets users talk to the computer and have it move objects. It can handle instructions such as “pick up the green ball” and answer questions like “What is inside the black box?” Its main importance is that it shows that syntax, semantics, and reasoning about the world can be combined to produce a system that understands a natural language.
LUNAR
LUNAR is the classic example of a natural language database interface system. It used ATNs and Woods’ Procedural Semantics, could translate elaborate natural language expressions into database queries, and handled 78% of requests without errors.
1980 – Current
Until 1980, natural language processing systems were based on complex sets of hand-written rules. After 1980, NLP introduced machine learning algorithms for language processing.
In the early 1990s, NLP started growing faster and achieved good accuracy, especially for English grammar. Around 1990, electronic text also became available, which provided a good resource for training and evaluating natural language programs. Other factors include the availability of computers with fast CPUs and more memory. The major factor behind the advancement of natural language processing was the Internet.
Modern NLP covers applications such as speech recognition, machine translation, and machine text reading. Combining these applications allows an artificial intelligence to gain knowledge of the world. Consider Amazon Alexa: you can ask this assistant a question, and it will reply to you.
Advantages of NLP
- NLP lets users ask questions about any subject and get a direct response within seconds.
- NLP offers exact answers to a question, that is, it does not return unnecessary or unwanted information.
- NLP helps computers communicate with humans in their own languages.
- It is very time efficient.
- Many companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.
Disadvantages of NLP
A list of disadvantages of NLP is given below:
- NLP may not show context.
- NLP is unpredictable.
- NLP may require more keystrokes.
- NLP systems adapt poorly to new domains and have limited functions, which is why an NLP system is usually built for a single, specific task.
Components of NLP
NLP has the following two components:
1. Natural Language Understanding (NLU)
Natural Language Understanding (NLU) helps the machine understand and analyse human language by extracting metadata from content, such as concepts, entities, keywords, emotion, relations, and semantic roles.
NLU is mainly used in business applications to understand the customer’s problem in both spoken and written language.
NLU involves the following tasks –
- Mapping the given input into a useful representation.
- Analyzing different aspects of the language.
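As a rough illustration of mapping an input into a useful representation, the sketch below turns one family of commands into a small dictionary. The pattern, the function name `understand`, and the dictionary keys are assumptions made for this example; a real NLU component would rely on trained models rather than a single regular expression.

```python
import re

# Hypothetical pattern covering one command family only.
COMMAND = re.compile(
    r"^(?P<action>pick up|put down) the (?P<color>\w+) (?P<object>\w+)$"
)

def understand(utterance):
    """Map a command string to a small dictionary, or None if it does not match."""
    cleaned = utterance.lower().strip().rstrip(".")
    match = COMMAND.match(cleaned)
    if match is None:
        return None
    return {
        "intent": match.group("action").replace(" ", "_"),
        "color": match.group("color"),
        "object": match.group("object"),
    }

print(understand("Pick up the green ball."))
```

Anything outside the pattern returns `None`, which shows how narrow a rule-based mapping is.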
2. Natural Language Generation (NLG)
Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural language representation. It mainly involves text planning, sentence planning, and text realization.
Note: NLU is more difficult than NLG.
Difference between NLU and NLG
|NLU||NLG|
|NLU is the process of reading and interpreting language.||NLG is the process of writing or generating language.|
|It produces non-linguistic outputs from natural language inputs.||It produces natural language outputs from non-linguistic inputs.|
Applications of NLP
NLP has the following applications:
1. Question Answering
Question Answering focuses on building systems that automatically answer the questions asked by humans in a natural language.
2. Spam Detection
Spam detection is used to detect unwanted e-mails before they reach a user’s inbox.
3. Sentiment Analysis
Sentiment Analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. It is implemented through a combination of NLP and statistics: the text is assigned a value (positive, negative, or neutral) and the mood of the context is identified (happy, sad, angry, etc.).
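Assigning a value to a text can be sketched with a simple word-count approach. The two word sets below are a hypothetical mini-lexicon made up for this example; production systems typically use much larger lexicons or trained classifiers.

```python
import re

# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "hate", "terrible"}

def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' from simple word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great phone"))
```

A counter like this ignores negation ("not good") and sarcasm, which is one reason statistical models are preferred in practice.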
4. Machine Translation
Machine translation is used to translate text or speech from one natural language to another.
Example: Google Translate
5. Spelling correction
Word processing and presentation software such as Microsoft Word and PowerPoint provides spelling correction.
6. Speech Recognition
Speech recognition is used for converting spoken words into text. It is used in applications such as mobile devices, home automation, video recovery, dictation to Microsoft Word, voice biometrics, voice user interfaces, and so on.
7. Chatbot
Implementing chatbots is one of the important applications of NLP. Many companies use them to provide chat services to customers.
8. Information extraction
Information extraction is one of the most important applications of NLP. It is used for extracting structured information from unstructured or semi-structured machine-readable documents.
9. Natural Language Understanding (NLU)
It converts large amounts of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate than natural language.
How to build an NLP pipeline
An NLP pipeline is built in the following steps:
Step 1: Sentence Segmentation
Sentence segmentation is the first step in building the NLP pipeline. It breaks the paragraph into separate sentences.
Example: Consider the following paragraph –
Independence Day is one of the important festivals for every Indian citizen. It is celebrated on the 15th of August each year ever since India got independence from the British rule. The day celebrates independence in the true sense.
Sentence segmentation produces the following result:
- “Independence Day is one of the important festivals for every Indian citizen.”
- “It is celebrated on the 15th of August each year ever since India got independence from the British rule.”
- “The day celebrates independence in the true sense.”
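The segmentation step can be approximated with a regular expression. This is a minimal sketch using only Python's standard `re` module; the rule "split after `.`, `!`, or `?` when a capital letter follows" is an assumption made for this example and would mis-split abbreviations such as "Dr. Smith".

```python
import re

def split_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph.strip())
    return [p for p in parts if p]

paragraph = (
    "Independence Day is one of the important festivals for every Indian citizen. "
    "It is celebrated on the 15th of August each year ever since India got "
    "independence from the British rule. "
    "The day celebrates independence in the true sense."
)
sentences = split_sentences(paragraph)
for sentence in sentences:
    print(sentence)
```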
Step 2: Word Tokenization
A word tokenizer breaks the sentence into separate words or tokens.
Example: JavaTpoint offers Corporate Training, Summer Training, Online Training, and Winter Training.
The word tokenizer generates the following result:
“JavaTpoint”, “offers”, “Corporate”, “Training”, “Summer”, “Training”, “Online”, “Training”, “and”, “Winter”, “Training”, “.”
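A rough sketch of word tokenization with Python's standard `re` module follows. The pattern is an assumption chosen for illustration; unlike the result listed above, it also emits each comma as a separate token.

```python
import re

def tokenize(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize(
    "JavaTpoint offers Corporate Training, Summer Training, "
    "Online Training, and Winter Training."
)
print(tokens)
```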
Step 3: Stemming
Stemming normalizes words into their base or root form. For example, celebrates, celebrated, and celebrating all originate from the single root word “celebrate.” The big problem with stemming is that it sometimes produces a root word that has no meaning.
For example, intelligence, intelligent, and intelligently are all reduced to the single root “intelligen.” In English, the word “intelligen” does not have any meaning.
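The "intelligen" case can be reproduced with a naive suffix stripper. The suffix list below is a toy chosen for this example only; it is not the Porter algorithm or any library's stemmer.

```python
# Toy suffix list for illustration; checked in order, first match wins.
SUFFIXES = ("tly", "ing", "ce", "ed", "es", "s", "t")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: stem(w) for w in ["intelligence", "intelligent", "intelligently"]}
print(stems)
```

All three words collapse to the same stem, which is not itself an English word.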
Step 4: Lemmatization
Lemmatization is quite similar to stemming. It groups the different inflected forms of a word under one base form, called the lemma. The main difference from stemming is that lemmatization produces a root word that has a meaning.
For example: in lemmatization, the words intelligence, intelligent, and intelligently have the root word intelligent, which has a meaning.
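Because a lemma must be a real word, lemmatization is usually dictionary-driven. The sketch below uses a small hypothetical lookup table; real lemmatizers combine a full dictionary with part-of-speech information.

```python
# Hypothetical lookup table for illustration only.
LEMMAS = {
    "celebrates": "celebrate",
    "celebrated": "celebrate",
    "celebrating": "celebrate",
    "went": "go",
    "mice": "mouse",
}

def lemmatize(word):
    """Return the dictionary form if known, otherwise the lower-cased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["Celebrating", "went", "table"]])
```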
Step 5: Identifying Stop Words
In English, a lot of words appear very frequently, like “is”, “and”, “the”, and “a”. NLP pipelines flag these words as stop words. Stop words may be filtered out before any statistical analysis.
Example: He is a good boy.
Note: If you are building a rock band search engine, do not ignore the word “The”, because it is part of many band names.
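Stop-word filtering can be sketched as a set lookup. The stop-word set here is a small hand-picked assumption (it includes "he" so that the example sentence reduces to its content words); libraries ship much longer lists.

```python
import re

# Small hand-picked stop-word set for illustration.
STOP_WORDS = {"a", "an", "and", "he", "is", "the"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Keep only the words that are not in the stop-word set."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in words if w.lower() not in stop_words]

print(remove_stop_words("He is a good boy."))
# For the band-name case, pass a set that does not contain "the".
print(remove_stop_words("The Who", stop_words=STOP_WORDS - {"the"}))
```

Passing the stop-word set as a parameter makes it easy to adjust the list for each application.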
Step 6: Dependency Parsing
Dependency parsing is used to find how all the words in the sentence are related to each other.
Step 7: POS tags
POS stands for parts of speech, which include noun, verb, adverb, and adjective. A POS tag indicates how a word functions, both in meaning and grammatically, within the sentence. A word can have one or more parts of speech depending on the context in which it is used.
Example: “Google” something on the Internet.
In the above example, Google is used as a verb, although it is a proper noun.
Step 8: Named Entity Recognition (NER)
Named Entity Recognition (NER) is the process of detecting named entities such as a person name, movie name, organization name, or location.
Example: Steve Jobs introduced iPhone at the Macworld Conference in San Francisco, California.
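One very simple way to approximate NER is a gazetteer, a lookup list of known names. The entries and labels below are assumptions written for the example sentence; real NER systems use statistical or neural models that also handle unseen names.

```python
# Hypothetical gazetteer for illustration only.
GAZETTEER = {
    "Steve Jobs": "PERSON",
    "iPhone": "PRODUCT",
    "Macworld Conference": "EVENT",
    "San Francisco": "LOCATION",
    "California": "LOCATION",
}

def find_entities(text):
    """Return (entity, label) pairs in order of appearance in the text."""
    found = [
        (text.find(name), name, label)
        for name, label in GAZETTEER.items()
        if name in text
    ]
    return [(name, label) for _, name, label in sorted(found)]

sentence = (
    "Steve Jobs introduced iPhone at the Macworld Conference "
    "in San Francisco, California."
)
entities = find_entities(sentence)
print(entities)
```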
Step 9: Chunking
Chunking collects individual pieces of information and groups them into bigger units, such as phrases.
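A minimal sketch of noun-phrase chunking over tokens that already carry POS tags. The tags are assigned by hand for this example, and the grouping rule (a run of determiner, adjective, and noun tags that ends in a noun) is a simplifying assumption; it would merge two adjacent noun phrases into one chunk.

```python
def noun_phrase_chunks(tagged):
    """Group runs of DET/ADJ/NOUN tags that end in a noun into chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append((word, tag))
            continue
        # Any other tag closes the current run.
        if current and current[-1][1] == "NOUN":
            chunks.append(" ".join(w for w, _ in current))
        current = []
    if current and current[-1][1] == "NOUN":
        chunks.append(" ".join(w for w, _ in current))
    return chunks

# Hand-tagged input; a POS tagger would normally supply these tags.
tagged = [
    ("The", "DET"), ("small", "ADJ"), ("bird", "NOUN"),
    ("pecks", "VERB"), ("the", "DET"), ("grains", "NOUN"),
]
print(noun_phrase_chunks(tagged))
```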
Phases of NLP
NLP has the following five phases:
1. Lexical and Morphological Analysis
The first phase of NLP is lexical analysis. This phase scans the input text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences, and words.
2. Syntactic Analysis (Parsing)
Syntactic analysis checks grammar and word arrangement, and shows the relationships among the words.
Example: Agra goes to the Poonam
In the real world, “Agra goes to the Poonam” does not make any sense, so this sentence is rejected by the syntactic analyzer.
3. Semantic Analysis
Semantic analysis is concerned with the meaning representation. It mainly focuses on the literal meaning of words, phrases, and sentences.
4. Discourse Integration
Discourse integration means that the meaning of a sentence depends upon the sentences that precede it and also influences the meaning of the sentences that follow it.
5. Pragmatic Analysis
Pragmatic analysis is the fifth and last phase of NLP. It helps you discover the intended effect by applying a set of rules that characterize cooperative dialogues.
For Example: “Open the door” is interpreted as a request instead of an order.
Why NLP is difficult?
NLP is difficult because ambiguity and uncertainty exist in language.
There are the following three types of ambiguity:
- Lexical Ambiguity
Lexical ambiguity exists when a single word has two or more possible meanings.
Example: Manya is looking for a match.
In the above example, the word match means either that Manya is looking for a partner or that Manya is looking for a game (a cricket match or another match).
- Syntactic Ambiguity
Syntactic ambiguity exists when a sentence has two or more possible meanings.
Example: I saw the girl with the binoculars.
In the above example, did I have the binoculars? Or did the girl have the binoculars?
- Referential Ambiguity
Referential ambiguity exists when it is unclear what a pronoun refers to.
Example: Kiran went to Sunita. She said, “I am hungry.”
In the above sentence, you do not know who is hungry, Kiran or Sunita.
NLP APIs
Natural Language Processing APIs allow developers to integrate human-to-machine communication and complete several useful tasks such as speech recognition, chatbots, spelling correction, sentiment analysis, etc.
A list of NLP APIs is given below:
- IBM Watson API
IBM Watson API combines different sophisticated machine learning techniques to enable developers to classify text into various custom categories. It supports multiple languages, such as English, French, Spanish, German, Chinese, etc. With the help of IBM Watson API, you can extract insights from texts, add automation in workflows, enhance search, and understand the sentiment. The main advantage of this API is that it is very easy to use.
Pricing: It offers a free 30-day trial with an IBM Cloud account. You can also opt for its paid plans.
- Chatbot API
Chatbot API allows you to create intelligent chatbots for any service. It supports Unicode characters, text classification, multiple languages, etc. It is very easy to use and helps you create a chatbot for your web applications.
Pricing: Chatbot API is free for 150 requests per month. You can also opt for its paid version, which ranges from $100 to $5,000 per month.
- Speech to text API
Speech to text API is used to convert speech to text.
Pricing: Speech to text API is free for converting 60 minutes per month. Its paid version ranges from $500 to $1,500 per month.
- Sentiment Analysis API
Sentiment Analysis API, also called ‘opinion mining’, is used to identify the tone of a user (positive, negative, or neutral).
Pricing: Sentiment Analysis API is free for less than 500 requests per month. Its paid version ranges from $19 to $99 per month.
- Translation API by SYSTRAN
The Translation API by SYSTRAN is used to translate the text from the source language to the target language. You can use its NLP APIs for language detection, text segmentation, named entity recognition, tokenization, and many other tasks.
Pricing: This API is available for free, but commercial users need its paid version.
- Text Analysis API by AYLIEN
Text Analysis API by AYLIEN is used to derive meaning and insights from textual content. It is available both free and paid, from $119 per month. It is easy to use.
Pricing: This API is available free for 1,000 hits per day. You can also use its paid version, which ranges from $199 to $1,399 per month.
- Cloud NLP API
The Cloud NLP API is used to improve the capabilities of an application using natural language processing technology. It allows you to carry out various natural language processing functions like sentiment analysis and language detection. It is easy to use.
Pricing: Cloud NLP API is available for free.
- Google Cloud Natural Language API
Google Cloud Natural Language API allows you to extract useful insights from unstructured text. This API allows you to perform entity recognition, sentiment analysis, content classification, and syntax analysis in more than 700 predefined categories. It also allows you to perform text analysis in multiple languages such as English, French, Chinese, and German.
Pricing: After performing entity analysis for 5,000 to 10,000,000 units, you need to pay $1.00 per 1000 units per month.
NLP Libraries
Scikit-learn: It provides a wide range of algorithms for building machine learning models in Python.
Natural Language Toolkit (NLTK): NLTK is a complete toolkit for all NLP techniques.
Pattern: It is a web mining module for NLP and machine learning.
TextBlob: It provides an easy interface for basic NLP tasks like sentiment analysis, noun phrase extraction, or POS tagging.
Quepy: Quepy is used to transform natural language questions into queries in a database query language.
spaCy: spaCy is an open-source NLP library used for data extraction, data analysis, sentiment analysis, and text summarization.
Gensim: Gensim works with large datasets and processes data streams.
Difference between Natural language and Computer Language
|Natural Language||Computer Language|
|Natural language has a very large vocabulary.||Computer language has a very limited vocabulary.|
|Natural language is easily understood by humans.||Computer language is easily understood by the machines.|
|Natural language is ambiguous in nature.||Computer language is unambiguous.|
Natural Language Processing (NLP) refers to the AI method of communicating with intelligent systems using a natural language such as English.
Processing of natural language is required when you want an intelligent system such as a robot to act on your instructions, when you want to hear a decision from a dialogue-based clinical expert system, and so on.
The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be −
- Written Text
Components of NLP
NLP has two components, as given below −
Natural Language Understanding (NLU)
Understanding involves the following tasks −
- Mapping the given input in natural language into useful representations.
- Analyzing different aspects of the language.
Natural Language Generation (NLG)
It is the process of producing meaningful phrases and sentences in the form of natural language from some internal representation.
It involves −
- Text planning − Retrieving the relevant content from the knowledge base.
- Sentence planning − Choosing the required words, forming meaningful phrases, and setting the tone of the sentence.
- Text realization − Mapping the sentence plan into a sentence structure.
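The three NLG stages can be sketched as three small functions chained together. The knowledge base, the slot names, and the weather data are all invented for this example; the sketch only shows how the stages hand data to each other, not how a real generator chooses words.

```python
def plan_text(knowledge_base, topic):
    """Text planning: retrieve the relevant facts from the knowledge base."""
    return knowledge_base[topic]

def plan_sentence(facts):
    """Sentence planning: choose the words that fill each part of the sentence."""
    return {
        "subject": facts["city"],
        "verb": "will be",
        "complement": f"{facts['condition']} with a high of {facts['high_c']} degrees",
    }

def realize(plan):
    """Text realization: map the sentence plan onto a sentence structure."""
    sentence = f"{plan['subject']} {plan['verb']} {plan['complement']}"
    return sentence[0].upper() + sentence[1:] + "."

# Hypothetical knowledge base with made-up values.
KB = {"weather": {"city": "Pune", "condition": "sunny", "high_c": 31}}
text = realize(plan_sentence(plan_text(KB, "weather")))
print(text)
```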
NLU is harder than NLG.
Difficulties in NLU
Natural language has an extremely rich form and structure.
It is very ambiguous. There can be different levels of ambiguity −
- Lexical ambiguity − It occurs at a very primitive level, such as the word level.
- For example, should the word “board” be treated as a noun or a verb?
- Syntax-level ambiguity − A sentence can be parsed in different ways.
- For example, “He lifted the beetle with red cap.” − Did he use a cap to lift the beetle, or did he lift a beetle that had a red cap?
- Referential ambiguity − Referring to something using pronouns. For example, Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?
- One input can have different meanings.
- Many inputs can mean the same thing.
NLP Terminology
- Phonology − It is the study of organizing sound systematically.
- Morphology − It is the study of the construction of words from primitive meaningful units.
- Morpheme − It is the primitive unit of meaning in a language.
- Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.
- Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.
- Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.
- Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
- World Knowledge − It includes general knowledge about the world.
Steps in NLP
There are generally five steps −
- Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.
- Syntactic Analysis (Parsing) − It involves analyzing the words in the sentence for grammar and arranging them in a manner that shows the relationships among the words. A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.
- Semantic Analysis − It draws the exact meaning or the dictionary meaning from the text. The text is checked for meaningfulness. It is done by mapping syntactic structures to objects in the task domain. The semantic analyzer rejects a phrase such as “hot ice-cream”.
- Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it also influences the meaning of the immediately succeeding sentence.
- Pragmatic Analysis − During this step, what was said is re-interpreted in terms of what it actually meant. It involves deriving those aspects of language which require real-world knowledge.
Implementation Aspects of Syntactic Analysis
Researchers have developed a number of algorithms for syntactic analysis, but we consider only the following simple methods −
- Context-Free Grammar
- Top-Down Parser
Let us see them in detail −
Context-Free Grammar
It is a grammar that consists of rules with a single symbol on the left-hand side of the rewrite rules. Let us create a grammar to parse the sentence −
“The bird pecks the grains”
Articles (DET) − a | an | the
Nouns (N) − bird | birds | grain | grains
Noun Phrase (NP) − Article + Noun | Article + Adjective + Noun = DET N | DET ADJ N
Verbs (V) − pecks | pecking | pecked
Verb Phrase (VP) − NP V | V NP
Adjectives (ADJ) − beautiful | small | chirping
The parse tree breaks down the sentence into structured parts so that the computer can easily understand and process it. For the parsing algorithm to construct this parse tree, a set of rewrite rules, which describe what tree structures are legal, needs to be constructed.
These rules say that a certain symbol may be expanded in the tree by a sequence of other symbols. For example, if there are two strings, a Noun Phrase (NP) and a Verb Phrase (VP), then the string formed by NP followed by VP is a sentence. The rewrite rules for the sentence are as follows −
S → NP VP
NP → DET N | DET ADJ N
VP → V NP
DET → a | the
ADJ → beautiful | perching
N → bird | birds | grain | grains
V → peck | pecks | pecking
The parse tree can be created as shown −
Now consider the above rewrite rules. Since V can be replaced by either “peck” or “pecks”, sentences such as “The bird peck the grains” are wrongly permitted, i.e., the subject-verb agreement error is accepted as correct.
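Because the rewrite rules above contain no recursion, the whole language they generate is finite and can be enumerated. The sketch below encodes the rules as a Python dictionary (the dictionary layout and the function name `expand` are choices made for this example) and checks which sentences the grammar permits.

```python
from itertools import product

# Rewrite rules from the text, with terminal words in lower case.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["a"], ["the"]],
    "ADJ": [["beautiful"], ["perching"]],
    "N": [["bird"], ["birds"], ["grain"], ["grains"]],
    "V": [["peck"], ["pecks"], ["pecking"]],
}

def expand(symbol):
    """Return every word sequence (as a tuple) that the symbol can produce."""
    if symbol not in GRAMMAR:  # a terminal word produces only itself
        return [(symbol,)]
    results = []
    for rule in GRAMMAR[symbol]:
        # Combine every expansion of each symbol on the right-hand side.
        for parts in product(*(expand(s) for s in rule)):
            results.append(tuple(word for part in parts for word in part))
    return results

language = {" ".join(words) for words in expand("S")}
print(len(language))
print("the bird peck the grains" in language)
```

The agreement error is a member of the generated language, so a recognizer built on these rules alone cannot reject it.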
Merit − It is the simplest style of grammar and therefore a widely used one.
Demerits −
- They are not highly precise. For example, “The grains peck the bird” is syntactically correct according to the parser; even though it makes no sense, the parser accepts it as a correct sentence.
- To achieve high precision, multiple sets of grammar need to be prepared. Parsing singular and plural variations, passive sentences, etc. may require completely different sets of rules, which can lead to a huge set of rules that is unmanageable.
Top-Down Parser
Here, the parser starts with the S symbol and attempts to rewrite it into a sequence of terminal symbols that matches the classes of the words in the input sentence, until it consists entirely of terminal symbols.
These are then checked against the input sentence to see if they match. If not, the process is started over again with a different set of rules. This is repeated until a specific rule is found which describes the structure of the sentence.
Merit − It is simple to implement.
Demerits −
- It is inefficient, as the search process has to be repeated if an error occurs.
- It is slow.
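The top-down procedure described above can be sketched as a small backtracking parser. It starts from S, tries each rewrite rule in turn, and backs up when a word does not match. The grammar dictionary repeats the rewrite rules given earlier; the function names and the tuple-based tree format are assumptions made for this example. The approach works here because the grammar has no left recursion.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["a"], ["the"]],
    "ADJ": [["beautiful"], ["perching"]],
    "N": [["bird"], ["birds"], ["grain"], ["grains"]],
    "V": [["peck"], ["pecks"], ["pecking"]],
}

def parse(symbol, words, start):
    """Yield (tree, next_position) for each way symbol can match words[start:]."""
    if symbol not in GRAMMAR:  # terminal: compare with the input word
        if start < len(words) and words[start] == symbol:
            yield symbol, start + 1
        return
    for rule in GRAMMAR[symbol]:  # try each rewrite rule in turn
        yield from match_sequence(symbol, rule, words, start, [])

def match_sequence(symbol, rule, words, position, children):
    """Match the remaining symbols of one rule; failed branches yield nothing."""
    if not rule:
        yield (symbol, children), position
        return
    for subtree, next_position in parse(rule[0], words, position):
        yield from match_sequence(
            symbol, rule[1:], words, next_position, children + [subtree]
        )

def parse_sentence(sentence):
    """Return the first parse tree covering the whole sentence, or None."""
    words = sentence.lower().rstrip(".").split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None

print(parse_sentence("The bird pecks the grains"))
```

Each failed branch is abandoned and the next rule is tried, which is the repeated search that makes this method slow on larger grammars.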