Goal
Stanford CoreNLP is an annotation-based NLP processing pipeline (Manning et al., 2014). In the context of deep-learning-based text summarization, CoreNLP was used by Fernandes et al. (2018) to provide structural annotations. As I tinker with the accompanying code, having CoreNLP up and running becomes crucial. In particular, I hope to be able to run the corenlp.sh script inside the CoreNLP git repo.
At first glance, the instructions on CoreNLP’s official website looked overwhelming, and I was mildly concerned that making it work on my Mac would be a painful process. It turned out to be an alright experience, once I realized that both steps shown on the webpage are necessary to make the script work.
My Initial Misunderstanding
When I first went through the instructions, I thought the two sections — Getting a copy and Steps to setup from the official release — had an “either…or…” relation. Given that I was most interested in making the git-repo script work, cloning the repo and following the associated steps were the only items I executed. Upon running the test example, however, I got the following error messages:
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file input.txt
Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLP
It was only after some more googling that it finally dawned on me: I still needed the CoreNLP framework itself, whereas the repo mostly contains the tools and utilities needed to use the framework. Below is a detailed guide:
Setup Steps
[STEP 1] Download CoreNLP and set up paths
1. Download the CoreNLP framework

Approach 1: Download via the web interface

Approach 2: Download via the command line, with either

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

or

curl -O http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
2. Put the following lines into your .bashrc or .zshrc for path settings
# Settings for Stanford CoreNLP
export CORENLP_ROOT="/Users/shandou/Software/stanford-corenlp-full-2018-10-05"
export CLASSPATH="$CORENLP_ROOT/javanlp-core.jar"
export CLASSPATH="$CLASSPATH:$CORENLP_ROOT/stanford-corenlp-models-current.jar"
for file in `find $CORENLP_ROOT -name "*.jar"`
do
export CLASSPATH="$CLASSPATH:`realpath $file`"
done
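To see what that loop actually produces, here is a self-contained sketch that runs the same logic against dummy jar files in a temporary directory (no CoreNLP required; DEMO_ROOT and the file names are made up for illustration):

```shell
# Mimic the .bashrc loop on a throwaway directory with fake jars
DEMO_ROOT="$(mktemp -d)"
touch "$DEMO_ROOT/a.jar" "$DEMO_ROOT/b.jar" "$DEMO_ROOT/readme.txt"

DEMO_CLASSPATH=""
for file in $(find "$DEMO_ROOT" -name "*.jar" | sort); do
  DEMO_CLASSPATH="$DEMO_CLASSPATH:$file"
done
DEMO_CLASSPATH="${DEMO_CLASSPATH#:}"  # trim the leading colon

# Only the .jar files end up on the path; readme.txt is skipped
echo "$DEMO_CLASSPATH"
rm -rf "$DEMO_ROOT"
```

The real loop appends to whatever CLASSPATH already holds, which is why the two explicit export lines come first.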
To test, type echo $CLASSPATH in your command line terminal. You should see output similar to this:
/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javanlp-core.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-models-current.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javax.json-api-1.0-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-api-2.4.0-b180830.0359-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2-models.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javax.activation-api-1.2.0-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/ejml-0.23.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javax.activation-api-1.2.0.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/slf4j-api.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/protobuf.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/joda-time.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/joda-time-2.9-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-impl-2.4.0-b180830.0438.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/xom-1.2.10-src.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/xom.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2-javadoc.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javax.json.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-api-2.4.0-b180830.0359.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jollyday.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/slf4j-simple.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-core-2.3.0.1-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-core-2.3.0.1.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jollyday-0.4.9-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-impl-2.4.0-b180830.0438-sources.jar
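Beyond eyeballing that wall of text, you can also verify that every entry on the CLASSPATH actually exists on disk; a stale path is a common cause of ClassNotFoundException. Here is a small sketch (the check_classpath helper is my own name, not part of CoreNLP):

```shell
# Print any colon-separated CLASSPATH entries that are missing on disk
check_classpath() {
  echo "$1" | tr ':' '\n' | while read -r entry; do
    if [ -n "$entry" ] && [ ! -e "$entry" ]; then
      echo "missing: $entry"
    fi
  done
  return 0
}

# Run it against your current CLASSPATH (source your .bashrc first);
# no output means every jar is where the path says it is
check_classpath "$CLASSPATH"
```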
3. Test whether CoreNLP itself is working, following the test example from the official setup guide:

# 1. Make a dummy input text file
echo "the quick brown fox jumped over the lazy dog" > input.txt

# 2. Test it out
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt
The processing takes a while to complete, and you should see stdout similar to what’s shown below:
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Searching for resource: StanfordCoreNLP.properties ... found.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.8 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.5 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.8 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585573 unique entries from 2 files
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
[main] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 10.184 (s)
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [11.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
[main] INFO edu.stanford.nlp.coref.statistical.SimpleLinearClassifier - Loading coref model edu/stanford/nlp/models/coref/statistical/ranking_model.ser.gz ... done [2.4 sec].
[main] INFO edu.stanford.nlp.pipeline.CorefMentionAnnotator - Using mention detector type: dependency
Processing file /Users/shandou/Software/input.txt ... writing to /Users/shandou/Software/input.txt.json
Annotating file /Users/shandou/Software/input.txt ... done [0.6 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.1 sec.
NERCombinerAnnotator: 0.2 sec.
DependencyParseAnnotator: 0.1 sec.
CorefAnnotator: 0.0 sec.
TOTAL: 0.6 sec. for 9 tokens at 15.9 tokens/sec.
Pipeline setup: 52.6 sec.
Total time for StanfordCoreNLP pipeline: 53.3 sec.
Great! You should now find an output file named input.txt.json.
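To poke at the annotations programmatically, a few lines of python3 are enough. Since I can’t paste the whole output here, the sketch below builds a hand-made miniature that mimics my understanding of CoreNLP’s JSON layout (a top-level "sentences" list whose items carry "tokens" with "word"/"lemma"/"pos" fields) and extracts the lemmas from it; for the real thing, point it at input.txt.json instead:

```shell
# A tiny stand-in for CoreNLP's JSON output, just to show the shape
cat > sample.json <<'EOF'
{"sentences": [{"index": 0, "tokens": [
  {"word": "jumped", "lemma": "jump", "pos": "VBD"},
  {"word": "dogs", "lemma": "dog", "pos": "NNS"}
]}]}
EOF

# Pull the lemma of every token out of the file
lemmas=$(python3 -c '
import json
doc = json.load(open("sample.json"))
print(" ".join(t["lemma"] for s in doc["sentences"] for t in s["tokens"]))
')
echo "$lemmas"   # jump dog
rm sample.json
```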
[STEP 2] Setup CoreNLP repo for script `corenlp.sh`
Now we are ready to move on to the steps needed to make corenlp.sh work.
1. Clone the repo: go to its GitHub repo and clone the package:
git clone https://github.com/stanfordnlp/CoreNLP.git
Go to the path under which you cloned the repo (for me, it is /Users/shandou/Software/CoreNLP) and then into the subfolder doc/corenlp/. You should see the script corenlp.sh there.
2. Set up Apache Ant

If you don’t already have ant, install it via brew install ant. Then run ant jar in your terminal.
3. Download the latest model with
wget http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar
or
curl -O http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar
4. Modify corenlp.sh before testing

This is quite a gotcha. Given that we have set up the path configuration assuming frequent use, the copying flags in the original script actually yield errors. We must apply the following changes before using the script:
Now give it a try following the comments in the script via
./corenlp.sh -file input.txt
DONE!
Additional Resources Regarding CoreNLP
- On the CoreNLP GUI: I have yet to play with CoreNLP’s graphical user interface. For more information about the interface, please refer to a nice blog article on cloudacademy.com
- On details of the annotations: please refer to the CoreNLP documentation for a full list of annotations
Epilogue
Takeaway from running CoreNLP annotation on DigitalOcean Droplet: RAM intensive (≥ 16GB RAM recommended)
CoreNLP annotation turns out to be surprisingly RAM intensive. When testing on my 16GB RAM MacBook, both simple tests and my actual annotation tasks ran through without incident (though it does take about 17 seconds to annotate each news article). Given the time-consuming nature of the task, I set up a DigitalOcean Droplet to run a bigger annotation job. At the very beginning I used a basic 8GB RAM configuration and kept getting “not enough memory” errors even for a simple one-sentence annotation. I first thought some cross-OS oddity had occurred, until it finally dawned on me (after running across a user forum thread discussing RAM issues) that I simply had not provisioned enough RAM for CoreNLP.
After resizing the Droplet to 16GB RAM, the annotation task has been running smoothly. If you run into similar issues on your Linux server, check the output of free -h and make sure that you have more than 8GB to spare (though I don’t know the exact RAM requirements at the moment).
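If you want to catch the problem before a long job starts, a small preflight check on Linux can read the total RAM from /proc/meminfo; the 16GB threshold below reflects my Droplet experience, not an official CoreNLP requirement:

```shell
# Warn when the machine has less RAM than what worked for me
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
total_gb=$((total_kb / 1024 / 1024))

if [ "$total_gb" -lt 16 ]; then
  echo "Warning: only ${total_gb} GB RAM; CoreNLP annotation may run out of memory"
else
  echo "RAM looks sufficient: ${total_gb} GB"
fi
```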