Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
Class
It starts off with initializing 3 variables:
class Lemmatizer(object):
@classmethod
def load(cls, path, index=None, exc=None, rules=None):
return cls(index or {}, exc or {}, rules or {})
def __init__(self, index, exceptions, rules):
self.index = index
self.exc = exceptions
self.rules = rules
Now, looking at the self.exc
for english, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py where it's loading files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer
Why don't Spacy just read a file?
Most probably because declaring the string in-code is faster that streaming strings through I/O.
Where does these index, exceptions and rules come from?
Looking at it closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html
Rules
Looking at it even closer, the rules on https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py is similar to the _morphy
rules from nltk
https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749
And these rules originally comes from the Morphy
software https://wordnet.princeton.edu/man/morphy.7WN.html
Additionally, spacy
had included some punctuation rules that isn't from Princeton Morphy:
PUNCT_RULES = [
["“", """],
["”", """],
["u2018", "'"],
["u2019", "'"]
]
Exceptions
As for the exceptions, they were stored in the *_irreg.py
files in spacy
, and they look like they also come from the Princeton Wordnet.
It is evident if we look at some mirror of the original WordNet .exc
(exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc) and if you download the wordnet
package from nltk
, we see that it's the same list:
alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc cntlist.rev data.noun index.adv index.verb noun.exc
adv.exc data.adj data.verb index.noun lexnames README
citation.bib data.adv index.adj index.sense LICENSE verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc
1490 adj.exc
Index
If we look at the spacy
lemmatizer's index
, we see that it also comes from Wordnet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py and the re-distributed copy of wordnet in nltk
:
alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj
1 This software and database is being provided to you, the LICENSEE, by
2 Princeton University under the following license. By obtaining, using
3 and/or copying this software and database, you agree that you have
4 read, understood, and will comply with these terms and conditions.:
5
6 Permission to use, copy, modify and distribute this software and
7 database and its documentation for any purpose and without fee or
8 royalty is hereby granted, provided that you agree to comply with
9 the following copyright notice and statements, including the disclaimer,
10 and that the same appear on ALL copies of the software, database and
11 documentation, including modifications that you make for internal
12 use or for distribution.
13
14 WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved.
15
16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
18 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
23 OTHER RIGHTS.
24
25 The name of Princeton University or Princeton may not be used in
26 advertising or publicity pertaining to distribution of the software
27 and/or database. Title to copyright in this software, database and
28 any associated documentation shall at all times remain with
29 Princeton University and LICENSEE agrees to preserve same.
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 | being born or beginning; "the nascent chicks"; "a nascent insurgency"
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels
On the basis that the dictionary, exceptions and rules that spacy
lemmatizer uses is largely from Princeton WordNet and their Morphy software, we can move on to see the actual implementation of how spacy
applies the rules using the index and exceptions.
We go back to the https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
The main action comes from the function rather than the Lemmatizer
class:
def lemmatize(string, index, exceptions, rules):
string = string.lower()
forms = []
# TODO: Is this correct? See discussion in Issue #435.
#if string in index:
# forms.append(string)
forms.extend(exceptions.get(string, []))
oov_forms = []
for old, new in rules:
if string.endswith(old):
form = string[:len(string) - len(old)] + new
if not form:
pass
elif form in index or not form.isalpha():
forms.append(form)
else:
oov_forms.append(form)
if not forms:
forms.extend(oov_forms)
if not forms:
forms.append(string)
return set(forms)
Why is the lemmatize
method outside of the Lemmatizer
class?
That I'm not exactly sure but perhaps, it's to ensure that the lemmatization function can be called outside of a class instance but given that @staticmethod
and @classmethod
exist perhaps there are other considerations as to why the function and class has been decoupled
Morphy vs Spacy
Comparing spacy
lemmatize() function against the morphy()
function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/ created more than a decade ago), morphy()
, the main processes in Oliver Steele's Python port of the WordNet morphy are:
- Check the exception lists
- Apply rules once to the input to get y1, y2, y3, etc.
- Return all that are in the database (and check the original too)
- If there are no matches, keep applying rules until we find a match
- Return an empty list if we can't find anything
For spacy
, possibly, it's still under development, given the TODO
at line https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76
But the general process seems to be:
- Look for the exceptions, get them if the lemma from the exception list if the word is in it.
- Apply the rules
- Save the ones that are in the index lists
- If there are no lemma from step 1-3, then just keep track of the Out-of-vocabulary words (OOV) and also append the original string to the lemma forms
- Return the lemma forms
In terms of OOV handling, spacy returns the original string if no lemmatized form is found, in that respect, the nltk
implementation of morphy
does the same,e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'
Checking for infinitive before lemmatization
Possibly another point of difference is how morphy
and <