2020
Information Retrieval (IR) Stemmer & Engine
Using the Porter Stemmer, Snowball Stemmer, and Krovetz Lemmatization, stemmed the results of querying a corpus of New York Times opinion articles, horror books, and astrology books, and incorporated it into an IR engine we developed
Overview
In information retrieval systems, two of the biggest issues faced are storage and accuracy. Stemmers attempt to solve both of these issues. By stemming queries and documents in IR systems, IR systems can rank and retrieve documents more accurately. For example, if someone is searching for “Cars for Sale”, documents containing “cars” and the singular form “car” should be retrieved.
My peers and I aim to answer the question of whether light or heavy stemmers are better. We answered this question by testing the performance of three popular stemmers in query based information retrieval systems. The three stemmers we test are Porter, Snowball, and Krovetz. We were interested in how document ranking differs pre and post stemming. Does stemming indeed help with retrieval accuracy? How does the aggressiveness of the stemmer help/hinder retrieval accuracy?
Background
In this section, we will describe the three different stemming algorithms that we chose – Porter, Snowball, and Krovetz.
Porter Stemmer
This Porter Stemmer, named after its creator Martin Porter in the 1980s, is a rule-based stemming algorithm that removes common suffixes from English words in several stages. Each stage applies rules that strip endings like -ing or -ed when certain conditions are met. Thegoal is to reduce related words to a shared root while avoiding overly aggressive reductions.
Function porter_stemmer(input_file, output_file) {
Open input_file
For each line input file:
Strip punctuation and normalize to lower case
porter(normalized line)
Write line to output_file
}
Snowball Stemmer
The Snowball Stemmer is an updated version of the Porter Stemmer. It follows a similar rule-based progress but includes improvements that make the stemming more consistent and easier to adapt for multiple languages.
Function snowball_stemmer(input_file, output_file) {
Open input_file
For each line input file:
Strip punctuation and normalize to lower case
snowball(normalized line)
Write line to output_file
}
Krovetz Lemmatizer
The Krovetz stemming algorithm takes a different approach from traditional stemmers. Instead of aggressively removing suffixes, it checks words against a dictionary and converts them to their correct base form when possible. This often produces real words (lemmas) rather than artificial stems.
Function krovetz_stemmer(input_file,output_file) {
Open input_file
For each line in input file:
String punctuation and normalize to lower case
Krovetz (normalized line)
Write line to output_file
}
Comparisons
| Word | Porter | Snowball | Krovetz |
|---|---|---|---|
| running | run | run | run |
| studies | studi | studi | study |
| connections | connect | connect | connection |
These approaches were compared to observe how different normalization strategies affect the resulting word forms and the aggressiveness of text preprocessing.
Corpa Used
We selected four sets of documents for the stemming aggressiveness test. By seeing how the stemmers stem these sets of documents, we can see how the stemmers perform in different cases.
The Unix English Language Dictionary
This had 30,000 of the most commonly used English words. This is the most optimal use case for the stemmers since they are stemming the English Language in a cleaner manner as compared to stemming our datasets which are very varied. By stemming this dictionary, we can test the actual aggressiveness of each stemmer since it does not have any unexpected/unique words that the stemmers won’t expect, and thereby get consistent data.
New York Times Opinon Articles
Since articles (like those published by the New York Times) contain a mix of commonly used words in the English language and proper nouns, we used 83 articles published from September 2019 through February 2020. While it may not be as optimal as stemming the English language, it represents a common use case for stemmers. For example, if someone searches for "Simone Biles", they would not only want documents with the exact string "Simone Biles", but they would also want documents with the possessive token "Simone Biles's" and "Biles's".
Horror Books from the 1900s
Since horror books from the 1900s use older English, they also have more unique tokens. We expect this to be a difficult use case for stemmers since it uses words no longer used (or seldom used) today.
Astrology Books
The astrology corpus is made up of books of various branches of astrology and consists of many unique tokens, out of which many are not recognizable words in the English dictionary. Consequently, we expect this to also be a difficult use case for the stemmers.
The Corpa at a Glance
| Corpus | Description of Scope | Total # of Documents | Total # of Words | # of Unique Tokens |
|---|---|---|---|---|
| Unix English Dictionary | Built-In English Dictionary (in most modern Linux / Unix systems = /usr/share/dict/words) | 1 | 30,000 | 29,416 |
| New York Times Opinon Articles | Articles published from September 2019 through February 2020 | 83 | 100,315 | 13,287 |
| Horror Books | Books written by authors in the 1900s | 94 | 5,267,539 | 92,894 |
| Astrology Books | Books written about multiple forms of astrology (Hindu, etc) | 92 | 172,870 | 17,787 |
Stemmer Implementations
To use the three stemming algorithms discussed under section 2, we imported the Porter and Snowball Stemmers from Python's nltk package and the Krovetz Lemmatizer from Python's krovetz package. We imported the Krovetz Lemmatizer from the krovetz package found in the Krovetz Git repository.
For the Porter stemmer, upon importing PorterStemmer, we passed in the input file into our porter stemmer python script and the script outputted a stemmed file. We followed the same procedure upon importing SnowballStemmer and krovetz.
Analysis
Measuring Aggressiveness
Once we stemmed each corpus, measured the number of unique tokens before stemming and after stemming using the following algorithm –
Function getStats (file) {
Open file
Instantiate hashset of tokens
For each line in file:
For each word in line
If (hashset does not contain word)
Add word to hashset
Return size of hashset
}
We defined the aggressiveness of a given stemmer (s) on a given data set (d) to be Asd as follows –
Asd = (Ud - Usd) / Ud
This equation finds the percentage of text compression that the stemmer did. To find the overall aggressiveness of the stemmers, we took the average over all of the datasets.
Stemming Aggressiveness
These are the results after stemming the corpora with each respective stemmer. The percentages represent how much the number of unique tokens decreased after stemming, i.e. the aggressiveness vector A.
| Corpus | # of Unique Tokens | APorter | ASnowball | AKrovetz |
|---|---|---|---|---|
| Astrology Books | 17,787 | 27.24 % | 28.00 % | 27.95 % |
| English Word | 29,416 | 40.86 % | 42.41 % | 33.63 % |
| Horror Books | 92,894 | 23.70 % | 24.46 % | 20.19 % |
| NYT Articles | 13,287 | 29.46 % | 30.24 % | 21.50 % |
| Average | –––––– | 30.32 % | 31.28 % | 25.82 % |
Discussion
The most aggressive stemmer was the Snowball stemmer and the lightest stemmer was the Krovetz Lemmatization. Since the Snowball stemmer is built on top of Porter, this stemmer was the most aggressive. Since the Krovetz Lemmatization takes a dictionary-based approach, it only stems until it finds a root in the dictionary. This resulted in a lighter stemmer.
As mentioned before, the horror book data set provided a unique test case – a corpora with old English and many unique tokens. The text compression was lower than the other sets because of its unique vocabulary. The snowball stemmer was the most aggressive out of the three stemmers but not by much (only 1 – 3 % more). The New York Times opinion articles provided a semi optimal use case because it was made up of common English words but still had many unique tokens such as proper nouns and slang. As mentioned, this is the average use case since many documents online have the same landscape of vocabulary. The most aggressive stemmer was once again the Snowball stemmer. Interestingly, it was 8.5 more aggressive than Krovetz. The most difficult use case for the stemmers was the astrology books corpus. The stemmers performed the same in this case. This is because this corpus’s vocabulary had many unique tokens and often non-English tokens.
The most interesting case is how the stemmers stemmed the English dictionary. This case lets us see how much of the English language can be condensed into stems. The stemmers decreased the number of tokens by roughly 40%, which is almost half!
Once we averaged the aggressiveness values for each dataset, we see that the Snowball stemmer is the most aggressive stemmer and the Krovetz is the lightest stemmer. This is in alignment with our hypothesis. Now that we have assigned our “light” and “aggressive” stemmer, we can infer conclusions when comparing how these stemmers perform in the next two tests.
Stemming Accuracy
In this test, we wanted to go under the hood and see how each stemmer is stemming a group of words. These are some interesting cases we found –
| Original Word | Porter | Snowball | Krovetz | Preferred Stemmer |
|---|---|---|---|---|
| acknowledge acknowledged acknowledgement acknowledges acknowledging acknowledgment acknowledgments | acknowledg | acknowledg | acknowledge acknowledgement | Porter / Snowball |
| universal universally universe university | univers | univers | universal universally universe university | Krovetz |
| accord accordance accorded according accordingly | accord accordingli | accord | accord accordance accord according accordingly | Snowball |
| abudantly | abundantli | abund | abudantly | Krovetz |
Discussion
- Acknowledg* – According to our understanding, we prefer Porter/Snowball over Krovetz because of more text compression in Porter and Snowball (i.e, fewer number of stemmed words for a group of very similar words), while keeping the accuracy of the stem intact.
- Univers* – Even though all these words mean very different things, Porter and Snowball stem all these words to the same stem (i.e., "univers"), thereby compromising on accuracy. Meanwhile, Krovetz keeps some words intact and stems the rest retaining the accuracy of the stemmed results, even though we get lesser text compression. Hence, Krovetz is preferable for more accurate results.
- Accord* – For this example, Snowball deals with the edge case of "-li" ending and stems stem the word to its actual stem. Meanwhile, Krovetz keeps the word in its original form, compromising on text compression. So, Snowball is preferable.
- Abund* – Snowball again deals with the "-li" edge case better than Krovetz, but here retaining the actual word itself may improve accuracy, and hence Krovetz seems to be better than the rest.
Overall Analysis – Which stemmer is preferable also depends on our data set and the kind of words that we’re dealing with. Different stemmers could work in different ways on different sets of data, in terms of text compression and stemming accuracy. Here, Snowball seems to be the best stemmer in terms of text compression and stemming accuracy.
Query Based IR System Results
For the queries, we focused on looking at the opinion articles in the New York Times. The query results (below) are formatted as follows –
query
results (cosine similarity)
* Note that all of the queries with an asterisk next to it are queries that had more results than what was listed below, and results for all queries are not shown
Query 1
Pre-Stemming
harvard university *
Trump Wants Law and Order Front and Center (0.9898824527566188)
Four Key Things You Should Know About Health Care (0.9898824527566187)
Of All Trump’s Defenses, This Is the Lamest (0.9898824527566186)
Did I Just Get Yanged? (0.910232324498422)
How Much of Harvard’s $40 Billion Endowment Is Invested in Fossil Fuels? (0.8894358020106071)
Why Trump Persists (0.8437736313909587)
Porter
harvard univers *
did I just get yanged? (0.9988822529756158) → Ranking and score increases
trump want law and order front and center (0.9954059143710997)
four key thing you should know about health care (0.9712739692103212)
Of all trump’ defenses, thi Is the lamest (0.971273969210321)
whi trump persist (0.869214547476548)
how much of harvard’ $40 billion endow Is in lo vest in fossil fuels? (0.862236354767143)
Snowball
harvard univers *
did I just get yanged? (0.9988822529756158) → Ranking and score increases
trump want law and order front and center (0.9954059143710997)
four key thing you should know about health care (0.9712739692103212)
Of all trump’ defenses, thi Is the lamest (0.971273969210321)
whi trump persist (0.869214547476548)
how much of harvard’ $40 billion endow Is invest in fossil fuels? (0.862236354767143)
Krovetz
harvard university *
trump wants law and order front and center (0.9843960045703491)
four key things you should know about health care (0.984396004570349)
of all trump’s defenses, this is the lamest (0.984396004570349)
did i just get yanged? (0.9003632756592947)
We chose the query harvard university because university, universe, and universes all stem to univers in the Porter and the Snowball stemmers so that we can look at how the results would compare. Our interesting findings are noted in the observations.
Observations
- Cosine similarity score for the documents retrieved after stemming with Porter and Snowball is greater than the score pre-stemming
- Ranking of the documents changes after stemming for all the stemmers as compared to original
- Porter and Snowball function identically in our IR system
Query 2
Pre-Stemming
accord accordance accorded according accordingly *
Democrats’ Baffling 2020 Mess (0.6150016733379723)
Why Donald Trump Hates Your Dog (0.6150016733379723)
We Can’t Afford Trump as Our Commander in Chief (0.447213595499958)
Why Trump Persists (0.4472135954999579)
Coronavirus Spreads, and the World Pays for China’s Dictatorship (0.4472135954999579)
Porter
accord accord accord accord accordingli *
whi trump persist (0.9922778767136677) → Ranking and score increases
the white hous motto: watch your back (0.9922778767136677)
bewar the pandem panic (0.9922778767136677)
whi donald trump hate your dog (0.9922778767136677)
coronaviru spreads, and the world pay for china’ dictatorship (0.9922778767136677)
Snowball
accord accord accord accord accordingli *
whi trump persist (0.9922778767136677) → Ranking and score increases
the white hous motto: watch your back (0.9922778767136677)
bewar the pandem panic (0.9922778767136677)
whi donald trump hate your dog (0.9922778767136677)
coronaviru spreads, and the world pay for china’ dictatorship (0.9922778767136677)
Krovetz
accord accordance accord according accordingly *
democrats’ baffling 2020 mess (0.4146340435791247)
why donald trump hate your dog (0.4146340435791247)
the white house motto: watch your back (0.3015113445777637)
we can’t afford trump as our commander in chief (0.3015113445777637)
why trump persist (0.3015113445777637)
Because this group of word forms has been stemmed slightly differently by the Krovetz stemmer, we wanted to take a look at how this would compare if we looked at the word group. Our interesting findings are noted in the observations.
Observations
- Cosine similarity scores for the documents retrieved after stemming with Porter and Snowball are much greater than the score pre-stemming and the score post stemming with Krovetz. Thus, Porter and Snowball stemmers are increasing the accuracy of IR systems.
- Again, ranking of the documents changes after stemming for all the stemmers as compared to original
- Porter and Snowball function identically in our IR system
Query 3
Pre-Stemming
according to universe *
Trump’s Digital Advantage Is Freaking Out Democratic Strategists (0.8011326633004587)
The Next Decade Will Be Just as Bad (0.7310310741702064)
Warren and Klobuchar Teach the Boys a Lesson (0.5773502691896258)
What Does It Mean to Have a Serious Drinking Problem? (0.5773502691896258)
Did I Just Get Yanged? (0.5773502691896258)
Porter
accord to univers *
the sidney award (0.8095169351771415)
starr chamber: the sequel (0.8095169351771414)
coronaviru spreads, and the world pay for china’ dictatorship (0.8095169351771414)
pete ain’t It (0.8095169351771414)
the next decad will Be just as bad (0.8095169351771414)
Snowball
accord to univers *
the sidney award (0.8095169351771415)
starr chamber: the sequel (0.8095169351771414)
coronaviru spreads, and the world pay for china’ dictatorship (0.8095169351771414)
pete ain’t It (0.8095169351771414)
the next decad will Be just as bad (0.8095169351771414)
Krovetz
according to universe *
trump’s digital advantage is freak out democratic strategist (0.8011326633004587)
the next decade will be just as bad (0.7310310741702064)
china didn’t want us to know. now its own file are do the talking. (0.5773502691896258)
china’s new civil religion (0.5773502691896258)
warren and klobuchar teach the boys a lesson (0.5773502691896258)
Since the word group for according and the word group for universe are stemmed differently by the Krovetz stemmer, we wanted to look at the original “according to universe” query and how each stemmer would influence which results came up and/or what the cosine similarity would be. Our interesting findings were noted in the observations.
Observations
- Cosine similarity scores for the documents retrieved after stemming with Porter and Snowball are much greater than the score pre-stemming and the score post stemming with Krovetz. Thus, Porter and Snowball stemmers are increasing the accuracy of IR systems.
- Porter and Snowball function identically in our IR system.
Query 4
Pre-Stemming
acknowledge their validity *
Devin Nunes Is Danielle Steel (0.6482731536635694)
The Future of American Politics (0.6071375154698476)
Mayor Pete’s Gay Reckoning (0.5924625643048007)
The American Health Care Industry Is Killing People (0.5773502691896258)
Did I Just Get Yanged? (0.5773502691896258)
Porter
acknowledg their valid *
the flaw human of silicon valley (0.6904083374800494)
‘bojack horseman’ end with a necessari reckon (0.6742218541710693)
the white hous motto: watch your back (0.6385205984959307)
the shaki futur of .org domain (0.6385205984959306)
devin nune Is daniel steel (0.6289295076615469)
Snowball
acknowledg their valid *
the flaw human of silicon valley (0.6904083374800494)
‘bojack horseman’ end with a necessari reckon (0.6742218541710693)
the white hous motto: watch your back (0.6385205984959307)
the shaki futur of .org domain (0.6385205984959306)
devin nune Is daniel steel (0.6289295076615469)
Krovetz
acknowledge their valid *
the flaw humanity of silicon valley (0.685288250490692)
the white house motto: watch your back (0.6354414336083095)
the shaky future of .org domain (0.6354414336083095)
devin nun is danielle steel (0.6262963130238783)
the future of america politics (0.6169264603742076)
For similar reasons as the third set of queries, we wanted to look at the query “acknowledge their validity” in order to see how the different stemmed versions of the query lead to a different set of results and/or different cosine similarity measures. Our interesting findings are noted in the observations.
Observations
- Cosine similarity scores for the documents retrieved after stemming with Porter and Snowball are much greater than the score pre-stemming and the score post stemming with Krovetz. Thus, Porter and Snowball stemmers are increasing the accuracy of IR systems
- Porter and Snowball function identically in our IR system
Conclusions
According to our results in section 6, our hypothesis stands true and snowball seems to be the best stemmer out of Porter, Snowball and Krovetz.
- In terms of stemming aggressiveness, Snowball does a decent job at text compression while also retaining the meaning of the words and not producing non-recognizable stems. Meanwhile, Porter is not as effective as Snowball at text compression, and Krovetz is a much lighter stemmer than both Porter and Snowball.
- In terms of stemming accuracy (as explained in section 6.2), Snowball again proves to give us the most accurate and clean stems, that also deals with many of the edge cases while Porter does not.
- For query based IR systems, Porter and Snowball function in an almost identical manner for our datasets. This is not surprising as Snowball is an extension of Porter. Yet, document retrieval after stemming with Porter and Snowball returns documents with higher similarity scores than documents retrieved pre-stemming.
Even though Porter and Snowball function in very similar ways and are aggressive stemmers in nature, they are great at text compression and at improving IR system accuracy. For the aforementioned reasons, in our opinion, Snowball is still better than Porter and the most preferable stemmer out of all three in terms of text compression, stemming accuracy and document retrieval.
References
https://academicpublishingplatforms.com/downloads/pdfs/gnujet/volume1/201107241724_10-46-1-PB.pdf
https://github.com/rmit-ir/KrovetzStemmer
https://tartarus.org/martin/PorterStemmer
https://towardsdatascience.com/stemming-lemmatization-what-ba782b7c0bd8