Natural language processing (NLP) is one of the most important frontiers in software. The basic idea – how to efficiently consume and generate human language – has been an ongoing effort since the dawn of digital computing. The effort continues today, with machine learning and graph databases at the forefront of the push to master natural language.
This article is a practical introduction to Apache OpenNLP, a Java-based machine learning project that provides primitives such as segmentation and lemmatization, both of which are required to build NLP-capable systems.
What is Apache OpenNLP?
A natural language processing system such as Apache OpenNLP typically has three parts:
- A corpus, which is a set of textual data to learn from (plural: corpora)
- A model generated from the corpus
- The use of the model to perform tasks on the target text
To make things even easier, OpenNLP has pre-trained models available for many common use cases. For more sophisticated requirements, you may need to train your own models. For simpler scenarios, you can just download an existing model and apply it to the task at hand.
Language detection with OpenNLP
Let’s build a basic application that we can use to see how OpenNLP works. We can start the layout with a Maven archetype, as shown in Listing 1.
Listing 1. Create a new project
~/apache-maven-3.8.6/bin/mvn archetype:generate -DgroupId=com.infoworld -DartifactId=opennlp -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
This archetype will scaffold a new Java project. Next, add the Apache OpenNLP dependency to the pom.xml file in the root directory of the project, as shown in Listing 2. (You can use whatever is the most current version of the OpenNLP dependency.)
Listing 2. The OpenNLP Maven dependency
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>2.0.0</version>
</dependency>
To make the program easier to run, also add the following entry to the <plugins> section of the pom.xml file:
Listing 3. Main class execution target for the Maven POM
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>3.0.0</version>
<configuration>
<mainClass>com.infoworld.App</mainClass>
</configuration>
</plugin>
Now run the program with mvn compile exec:java. (You'll need Maven and a JDK installed to run this command.) Running it now will just give you the familiar "Hello World!" output.
Download and set up a language detection model
We are now ready to use OpenNLP to detect the language in our sample program. The first step is to download a language detection model. Grab the latest Language Detector component from the OpenNLP models download page. As of this writing, the current version is langdetect-183.bin.
To make the model easy to access, let's go into the Maven project and make a new directory at /opennlp/src/main/resources, then copy the langdetect-*.bin file into it.
Now let’s modify an existing file with what you see in Listing 4. We’ll use /opennlp/src/main/java/com/infoworld/App.java
for this example.
Listing 4. App.java
package com.infoworld;

import java.util.Arrays;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.langdetect.LanguageDetectorModel;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.Language;

public class App {
  public static void main(String[] args) {
    System.out.println("Hello World!");
    App app = new App();
    try {
      app.nlp();
    } catch (IOException ioe) {
      System.err.println("Problem: " + ioe);
    }
  }

  public void nlp() throws IOException {
    InputStream is = this.getClass().getClassLoader().getResourceAsStream("langdetect-183.bin"); // 1
    LanguageDetectorModel langModel = new LanguageDetectorModel(is); // 2
    String input = "This is a test. This is only a test. Do not pass go. Do not collect $200. When in the course of human history."; // 3
    LanguageDetector langDetect = new LanguageDetectorME(langModel); // 4
    Language langGuess = langDetect.predictLanguage(input); // 5
    System.out.println("Language best guess: " + langGuess.getLang());

    Language[] languages = langDetect.predictLanguages(input);
    System.out.println("Languages: " + Arrays.toString(languages));
  }
}
Now you can run this program with the command mvn compile exec:java. When you do, you will get output similar to that shown in Listing 5.
Listing 5. Language analysis 1
Language best guess: eng
Languages: [eng (0.09568318011427969), tgl (0.027236092538322446), cym (0.02607472496029117), war (0.023722424236917564)...
The "ME" in LanguageDetectorME stands for maximum entropy. Maximum entropy is a concept from statistics that is used in natural language processing: among all the models consistent with the training data, pick the one that makes the fewest extra assumptions, which is the one with the highest entropy.
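To make the entropy idea concrete, here is a small, self-contained sketch (not part of OpenNLP's API) that computes the Shannon entropy of a candidate-language probability distribution; the distributions used are hypothetical examples, not OpenNLP output:

```java
// Illustrative sketch: Shannon entropy of a probability distribution.
// A distribution concentrated on one language has entropy 0 (no uncertainty);
// a distribution spread evenly over the candidates has maximal entropy.
class EntropyDemo {
  // H(p) = -sum over outcomes of p * log2(p), skipping zero-probability terms
  public static double entropy(double[] probs) {
    double h = 0.0;
    for (double p : probs) {
      if (p > 0) {
        h -= p * (Math.log(p) / Math.log(2));
      }
    }
    return h;
  }

  public static void main(String[] args) {
    double[] certain = {1.0, 0.0, 0.0};           // detector is sure: entropy 0
    double[] spread  = {1.0 / 3, 1.0 / 3, 1.0 / 3}; // maximally uncertain
    System.out.println("entropy(certain) = " + entropy(certain));
    System.out.println("entropy(spread)  = " + entropy(spread));
  }
}
```

A maximum-entropy learner applies this principle in reverse: it chooses the distribution with the highest entropy among those that still fit the observed training data.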
Evaluate the results
After running the program, you will see that the OpenNLP language detector accurately guessed that the language of the text in the example program was English. We've also output some of the probabilities the language detection algorithm came up with. After English, it guessed the language might be Tagalog, Welsh, or War-Jaintia. In the detector's defense, the language sample was small. Correctly identifying the language from just a handful of sentences, with no other context, is pretty impressive.
Before we move on, look back at Listing 4. The flow is pretty simple. Each commented line works like so:
1. Open the langdetect-183.bin file as an input stream.
2. Use the input stream to parameterize instantiation of the LanguageDetectorModel.
3. Create a string to use as input.
4. Create a language detector object, using the LanguageDetectorModel from line 2.
5. Run the langDetect.predictLanguage() method on the input from line 3.
Testing probability
If we add more English-language text to the string and run it again, the probability assigned to eng should go up. Let's try it by pasting the contents of the United States Declaration of Independence into a new file in our project directory: /src/main/resources/declaration.txt. We'll load that file and process it as shown in Listing 6, which replaces the inline string:
Listing 6. Load the Declaration of Independence text
String input = new String(this.getClass().getClassLoader().getResourceAsStream("declaration.txt").readAllBytes());
If you run this, you’ll see that English is still the detected language.
Detecting sentences with OpenNLP
You've seen the language detection model at work. Now, let's try out a model for detecting sentences. To start, return to the OpenNLP model download page, and add the latest English sentence detection model to your project's /resources directory. Note that knowing the language of the text is a prerequisite for detecting its sentences.
We’ll follow a similar pattern to what we did with the language detection model: load the file (in my case opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin
) and use it to instantiate a sentence detector. Then, we’ll use the detector on the input file. You can see the new code in Listing 7 (along with its imports); the rest of the code remains the same.
Listing 7. Detecting sentences
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
//...
InputStream modelFile = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin");
SentenceModel sentModel = new SentenceModel(modelFile);
SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentModel);
String[] sentences = sentenceDetector.sentDetect(input);
System.out.println("Sentences: " + sentences.length + " first line: " + sentences[2]);
Running the file now will produce something like what is shown in Listing 8.
Listing 8. Sentence detector output
Sentences: 41 first line: In Congress, July 4, 1776
The unanimous Declaration of the thirteen united States of America, When in the Course of human events, ...
Note that the sentence detector found 41 sentences, which seems about right. Also note that this detector model is fairly simple: it mostly looks for periods and spaces to find sentence breaks; it has no logic for grammar. That's why we used index 2 on the sentences array to get the actual preamble – the header lines were lumped together into two "sentences." (The founding documents are notoriously inconsistent with punctuation, and the sentence detector doesn't attempt to treat "When in the course…" as a new sentence.)
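To see why headers without terminal periods get lumped in with the text that follows them, here is a minimal pure-Java sketch of period-and-whitespace splitting. This is not OpenNLP's actual implementation (SentenceDetectorME is model-based), just an illustration of the heuristic the paragraph above describes:

```java
// Minimal sketch of period-and-whitespace sentence splitting.
// NOT OpenNLP's implementation -- just shows why a line with no terminal
// period merges into the "sentence" that follows it.
class NaiveSentenceSplitter {
  public static String[] split(String text) {
    // Break after a period that is followed by whitespace.
    return text.trim().split("(?<=\\.)\\s+");
  }

  public static void main(String[] args) {
    String header = "In Congress, July 4, 1776 The unanimous Declaration. "
        + "When in the Course of human events.";
    String[] sentences = split(header);
    // The date line has no terminal period, so it merges with the next text.
    System.out.println(sentences.length + " sentences");
    for (String s : sentences) {
      System.out.println("- " + s);
    }
  }
}
```

Under this heuristic, the date line and the line after it come out as a single "sentence," which is exactly the grouping we saw in Listing 8.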
Tokenization with OpenNLP
After dividing a document into sentences, tokenization is the next level of granularity. Tokenization is the process of breaking the document down into words and punctuation marks. We can use the code shown in Listing 9:
Listing 9. Tokenization
import opennlp.tools.tokenize.SimpleTokenizer;
//...
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize(input);
System.out.println("tokens: " + tokens.length + " : " + tokens[73] + " " + tokens[74] + " " + tokens[75]);
This will give output like what is shown in Listing 10.
Listing 10. Tokenizer output
tokens: 1704 : human events ,
So the tokenizer split the document into 1704 tokens. Indexing into the token array, elements 73, 74, and 75 hold the words "human" and "events" and the comma that follows them; each token occupies its own element.
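A rough way to picture what the tokenizer is doing is to split words and punctuation into separate tokens with a regular expression. This sketch only approximates OpenNLP's SimpleTokenizer (which splits on character classes rather than this exact regex), but it shows why the comma gets its own element:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough approximation of word/punctuation tokenization. OpenNLP's
// SimpleTokenizer uses character classes, not this regex; the sketch just
// shows why "human", "events", and "," become three separate tokens.
class NaiveTokenizer {
  private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

  public static String[] tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    Matcher m = TOKEN.matcher(text);
    while (m.find()) {
      tokens.add(m.group());
    }
    return tokens.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] tokens = tokenize("the course of human events, it becomes necessary");
    System.out.println(tokens.length + " tokens: " + Arrays.toString(tokens));
  }
}
```

Running this shows the comma emitted as its own token, just as in Listing 10.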
Name search with OpenNLP
Now we'll download the "Person name finder" model for English, called en-ner-person.bin. Note that this model is found on the Sourceforge models download page. Once you have the model, put it in your project's resources directory and use it to find names in the document, as shown in Listing 11.
Listing 11. Name lookup with OpenNLP
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.util.Span;
//...
InputStream nameFinderFile = this.getClass().getClassLoader().getResourceAsStream("en-ner-person.bin");
TokenNameFinderModel nameFinderModel = new TokenNameFinderModel(nameFinderFile);
NameFinderME nameFinder = new NameFinderME(nameFinderModel);
Span[] names = nameFinder.find(tokens);
System.out.println("names: " + names.length);
for (Span nameSpan : names){
System.out.println("name: " + nameSpan + " : " + tokens[nameSpan.getStart()] + " " + tokens[nameSpan.getEnd() - 1]);
}
In Listing 11, we load the model and use it to instantiate a NameFinderME object, which we then use to get an array of names, modeled as Span objects. A Span has a start and an end, which tell us where the detector thinks the name begins and ends in the token array. Note that the name finder expects an array of already-tokenized strings.
Marking up parts of speech with OpenNLP
OpenNLP allows us to tag parts of speech (POS) against tokenized strings. Listing 12 is an example of markup for parts of speech.
Listing 12. Marking up parts of speech
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
//…
InputStream posIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-pos-1.0-1.9.3.bin");
POSModel posModel = new POSModel(posIS);
POSTaggerME posTagger = new POSTaggerME(posModel);
String tags[] = posTagger.tag(tokens);
System.out.println("tags: " + tags.length);
for (int i = 0; i < 15; i++){
System.out.println(tokens[i] + " = " + tags[i]);
}
The process is similar to the previous examples: the model file is loaded into a model class and then used on the array of tokens. It produces something like Listing 13.
Listing 13. Output of parts of speech
tags: 1704
Declaration = NOUN
of = ADP
Independence = NOUN
: = PUNCT
A = DET
Transcription = NOUN
Print = VERB
This = DET
Page = NOUN
Note = NOUN
: = PUNCT
The = DET
following = VERB
text = NOUN
is = AUX
Unlike the name finder model, the POS tagger did a good job. It correctly identified several different parts of speech. Examples in Listing 13 include NOUN, ADP (which stands for adposition), and PUNCT (for punctuation).
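One simple thing to do with the tags array is to tally how often each part of speech occurs. The sketch below uses a hand-picked sample of the tags from Listing 13 rather than the real tagger output:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: tallying part-of-speech tag frequencies from a tags array.
// The sample tags are drawn from Listing 13, not produced by the tagger here.
class TagTally {
  public static Map<String, Integer> tally(String[] tags) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String tag : tags) {
      counts.merge(tag, 1, Integer::sum); // increment, starting at 1
    }
    return counts;
  }

  public static void main(String[] args) {
    String[] tags = {"NOUN", "ADP", "NOUN", "PUNCT", "DET", "NOUN"};
    System.out.println(tally(tags)); // {ADP=1, DET=1, NOUN=3, PUNCT=1}
  }
}
```

In the real program you would pass the tags array from Listing 12 straight into tally() to profile the whole document.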
Conclusion
In this article, you saw how to add Apache OpenNLP to a Java project and use its pre-trained models for natural language processing. In some cases you may need to train your own model, but the pre-existing models will often do the trick. In addition to the models demonstrated here, OpenNLP includes capabilities such as a document categorizer, a lemmatizer (which reduces words to their roots), a chunker, and a parser. All of these are building blocks of a natural language processing system, and they are freely available with OpenNLP.
Copyright © 2022 IDG Communications, Inc.