Introduction
Stanza is an open source NLP library from Stanford University based on modern neural networks. It enables comprehensive linguistic analysis of texts in over 70 languages. Stanza’s goal is to provide a complete pipeline system that includes all common processing steps: tokenization, part-of-speech tagging (POS), lemmatization, syntactic analysis (dependencies and constituency) as well as Named Entity Recognition (NER).
Stanza is suitable both for research purposes and for production applications, such as text classification, information extraction or preprocessing texts for retrieval-augmented generation (RAG). The models are pretrained but can also be fine-tuned. Internally, Stanza is based on the PyTorch framework.
A pipeline in natural language processing (NLP) refers to a predetermined sequence of processing steps used to analyze and structure a text. Each step takes the output of the previous one as input and enriches it with further linguistic information. The goal is to gradually transform raw text into a deeper linguistic representation that can be used for additional applications, e.g. text classification, information extraction or question answering.
Steps of a Typical Stanza Pipeline
The standard Stanza pipeline consists of the following modules (also called "processors"), which operate in this order:
Tokenization (tokenize)
Multi-word token expansion (mwt)
Part-of-speech tagging (pos)
Lemmatization (lemma)
Syntactic dependency parsing (depparse)
Constituency parsing (constituency) - only for English
Named entity recognition (ner)
Sentiment analysis (sentiment) - only for English
1. Tokenization
Tokenization is the first step in processing a text through an NLP pipeline. The input text is split into individual units. These units are called tokens. A token can be a word, a number, a punctuation mark or a symbol. Tokenization determines where a token begins and where it ends. This is important because all subsequent processing steps use these units.
Stanza uses language-specific models for tokenization. These models are trained on real language data and take into account the peculiarities of each language. In English, for example, Stanza recognizes that “U.S.A.” is a single token and not three. Abbreviations, number formats and emojis are also handled correctly.
What Stanza provides in this step:
A split of the text into sentences
A list of tokens per sentence
For each token, the start and end positions in the original text are stored
2. Multi-word token expansion (MWT)
Multi-word token expansion, or MWT, is an optional step in the Stanza pipeline. It is only enabled for languages in which a single written token can correspond to several syntactic words, such as Arabic or French. In English, this step is disabled by default because such contractions are not split; German, by contrast, does use MWT for contractions such as "im" ("in" + "dem").
When the MWT component is enabled, a token that contains multiple words is split into its components. The original token structure remains intact, but additional words are created. These words are the actual units with which further modules like POS or lemmatization work.
3. Part-of-speech Tagging (POS)
Part-of-speech tagging is a central step in natural language processing. Here, each word is assigned a grammatical category. Examples of such categories are noun, verb, adjective, adverb, article or preposition. This information is required for almost all subsequent steps because it reveals grammatical structures.
Stanza uses neural models to perform this categorization automatically. For each word, both a universal part-of-speech tag (UPOS) and a language-specific, more detailed tag (XPOS) are assigned. Additionally, morphological features are captured, such as gender, number, case, tense or verb form.
UPOS stands for "Universal Part of Speech". It is a classification of words into basic categories like noun (NOUN), verb (VERB), adjective (ADJ) or determiner (DET). This system is the same for all languages: a verb in German and a verb in English both receive the tag VERB.
XPOS is the part-of-speech tag as it is used in a specific language. In Stanza, it is defined differently for each language. For English, this means:
NN stands for a singular noun
VBZ stands for a third person singular present tense verb
JJ stands for an adjective
XPOS is thus a more detailed description of the part of speech as it is used in the respective language.
4. Lemmatization
Lemmatization is the process by which a word is reduced to its base form. This base form is called a lemma. The goal is to bring different grammatical forms of a word to a single form. This is important to compare or process words independently of tense, person or number.
Examples:
"went" becomes "go"
"dogs" becomes "dog"
Stanza uses a model for lemmatization that takes the context of the word into account. This allows it to correctly handle words with multiple meanings as well.
5. Syntactic Dependency Parsing
Syntactic dependency parsing examines the grammatical structure of a sentence. For each word, it is determined which other word it depends on and what role it plays. The result is a directed tree in which every word is subordinate to exactly one head; only the root of the sentence has no head within the sentence. The connections are called edges, and they carry grammatical labels such as subject, object or modifier.
Stanza uses a neural model that produces analyses following the Universal Dependencies scheme. This structure shows how the words in the sentence are connected: each word has a head, i.e. the word it is subordinate to, and a labeled relation to that head.
Important dependency relations
nsubj: nominal subject
obj: direct object
obl: oblique nominal, for example a prepositional phrase modifying the verb
root: root of the sentence, usually the main verb
det: determiner
amod: adjective as a modifier of a noun
case: preposition or case marker
punct: punctuation mark
6. Constituency Parsing - only for English
Constituency parsing examines which parts of a sentence make it up and how these parts are nested. It identifies which words together form a unit, for example a subject or an object. Such units are called phrases, for example noun phrase or verb phrase.
The analysis shows the structure of the sentence as a tree. Each sentence is split into progressively larger groups, for example: first individual words, then phrases, then the entire sentence.
Stanza's constituency parser is a neural model trained on the Penn Treebank, an annotated English corpus, so the phrase labels follow the Penn Treebank scheme. This works well for English texts. For German texts, constituency parsing in Stanza is currently not available; for German, one must additionally use the Berkeley Neural Parser (Benepar) library.
7. Named Entity Recognition (NER)
Named entity recognition is a step in NLP processing where certain words or groups of words are recognized as significant objects. These objects are called entities. They refer, for example, to persons, locations, organizations, dates or monetary amounts.
Stanza recognizes entities automatically based on a trained neural model. Each recognized entity is assigned to a fixed type. The model takes context into account and can also correctly capture multi-word entities like "New York City" or "United Nations".
Entity types in Stanza
PERSON: name of a person
GPE: geopolitical entity such as country or city
ORG: organization such as company or agency
DATE: date
TIME: time
MONEY: monetary amount
LOC: location without political function
PRODUCT: object or product
8. Sentiment Analysis (sentiment) - only for English
Sentiment analysis evaluates whether the content of a sentence is rather positive, neutral or negative. The model does not examine individual words, but the entire sentence in its context. This way, it can recognize, for example, that ironic or mitigating expressions can make an actually positive statement appear neutral or even negative.
Stanza currently offers sentiment analysis only for English texts. It is based on a neural model trained on the SSTplus corpus. The model classifies each sentence as a whole and assigns it to one of three categories.
Classification levels
0: negative
1: neutral
2: positive
Code Example
Output
PDF File Analysis

The problem with the above output is that the text from the PDF was passed directly and in its entirety to the Stanza pipeline without sufficient preprocessing. As a result, not only the actual content sentences but also all layout and formatting artifacts of the document, such as headers, page numbers, individual headings, footnotes or list items, are recognized and processed by Stanza as independent sentences. This leads to numerous semantically unrelated or even meaningless fragments appearing as individual sentences in the output.

Even semantically coherent sentences are interrupted by hard line breaks, as frequently found in PDFs, and analyzed by the pipeline as separate units. The consequence is that the linguistic analysis does not reflect the actual sentence structure but is dominated by the layout and technical peculiarities of the PDF.

For a meaningful and precise analysis, such artifacts must be cleaned up before passing the text to the Stanza pipeline: words broken across lines must be correctly rejoined and the text must be segmented into complete, grammatically coherent sentences. Only then can Stanza produce a semantically informative linguistic analysis that is usable for downstream processing steps.
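A minimal sketch of such a cleanup step, operating on plain text already extracted from a PDF. The function name and the regular expressions are illustrative; real documents usually need additional, layout-specific rules for headers and footers:

```python
import re

def clean_pdf_text(raw: str) -> str:
    """Rejoin hyphenated words and hard line breaks from PDF extraction."""
    # Join words broken across lines: "preprocess-\ning" -> "preprocessing"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines that consist only of a page number.
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    # Collapse remaining line breaks inside paragraphs into spaces,
    # keeping blank lines as paragraph boundaries.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)

raw = "This is a pre-\nprocessing example.\n\n3\n\nNext paragraph\ncontinues here."
print(clean_pdf_text(raw))
```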
Extraction and Linguistic Preprocessing via DeepSeek & Stanza
A careful look at the output from step 7, named entity recognition (NER), may cause some confusion. Although entities like OPITZ CONSULTING and STACKIT are correctly recognized as organizations (ORG), expected assignments are missing, such as for the date "March 1, 2024" or for product names like "STACKIT Cloud."
The reason for this is not an error in the script, but the underlying language model that Stanza uses by default for German. As the log output at the start of the script reveals, it uses the germeval2014 package. This model was trained on the “GermEval 2014 Shared Task” and therefore knows mainly the four entity types that were the focus of this competition: PER (person), LOC (location), ORG (organization) and OTH (other).
Categories like DATE (date), MONEY (monetary amount) or PRODUCT (product) are simply not provided for in this specific model. It can therefore only recognize what it was trained for. This is an important insight when working with off-the-shelf AI models: their performance and their “knowledge” are always limited by the data and objectives of their original training.
Hybrid Pipeline: Improvement of Named Entity Recognition (NER)
The original script has been developed into a robust, modular processing pipeline. The main changes are:
Replacement of the NER component: The decisive step was to remove Stanza’s default named entity recognition (NER), which is based on the older germeval2014 model. Instead, a modern model based on the Hugging Face transformers library is now used. This new model (domischwimmbeck/bert-base-german-cased-fine-tuned-ner) not only recognizes more entity types but generally also offers higher accuracy.
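With the transformers library, swapping in the model mentioned above takes only a few lines. A sketch (the model is downloaded from the Hugging Face Hub on first use):

```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="domischwimmbeck/bert-base-german-cased-fine-tuned-ner",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

for entity in ner("OPITZ CONSULTING arbeitet mit STACKIT zusammen."):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```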
Introduction of a hybrid pipeline: Instead of relying on a single library, the script now combines the strengths of two specialized tools.
New output of named entities
More German NER models are available on the Hugging Face Hub:
https://huggingface.co/models?library=pytorch&language=de&sort=likes
spaCy instead of Stanza
In the new version of the script, the complete linguistic core analysis previously carried out by the Stanza library has been replaced by the spaCy library. spaCy is a very popular, speed-optimized library that is often used in production applications. While Stanza is known for its high academic accuracy, spaCy offers an excellent balance of performance and precision and provides analysis results that are often perceived as particularly intuitive for further processing in software projects. The core logic of the hybrid pipeline – the combination of a core analysis with a specialized NER model – remains identical.
The following analysis steps are now carried out not by Stanza, but by spaCy:
Tokenization: splitting sentences into individual words (tokens).
Part-of-speech tagging (POS tagging): assigning a grammatical category to each word (e.g., noun, verb, adjective).
Lemmatization: reducing each word to its base form (e.g., “ging” → “gehen”).
Dependency parsing: the analysis of the syntactic sentence structure, i.e., which words in the sentence grammatically depend on each other.
