I ran your code and got the same error. For a working solution, see below. Here is the explanation:
LazyCorpusLoader is a proxy object that stands in for a corpus object before the corpus is loaded. (This prevents the NLTK from loading massive corpora into memory before you need them.) The first time this proxy object is accessed, however, it becomes the corpus you intend to load. That is to say, the LazyCorpusLoader proxy object transforms its __dict__ and __class__ into the __dict__ and __class__ of the corpus you are loading.
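To make the idea concrete, here is a minimal sketch of such a self-replacing proxy. The names RealCorpus and LazyProxy are made up for illustration, and this is a simplification of the pattern, not NLTK's actual implementation:

```python
class RealCorpus(object):
    """Hypothetical stand-in for a fully loaded corpus reader."""

    def __init__(self, name):
        self.name = name

    def words(self):
        return ["dog", "cat"]


class LazyProxy(object):
    """Hypothetical, simplified proxy; not NLTK's real LazyCorpusLoader."""

    def __init__(self, reader_cls, *args):
        # Name mangling stores these as _LazyProxy__reader_cls and _LazyProxy__args.
        self.__reader_cls = reader_cls
        self.__args = args

    def __load(self):
        corpus = self.__reader_cls(*self.__args)  # the expensive step
        # Become the loaded object: swap the attributes, then the class.
        self.__dict__ = corpus.__dict__
        self.__class__ = corpus.__class__

    def __getattr__(self, attr):
        # Only called when normal lookup fails, which for the proxy
        # means any attribute that belongs to the real corpus.
        self.__load()
        return getattr(self, attr)


proxy = LazyProxy(RealCorpus, "demo")
print(type(proxy).__name__)  # still the proxy class
proxy.words()                # first access triggers the load
print(type(proxy).__name__)  # now the loaded class
```

Note that the swap takes two separate assignments, so there is a window in which another thread can observe the object half-converted.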
If you compare your code to your errors above, you can see that you received 9 errors when you tried to create 10 instances of your class. The first transformation of the LazyCorpusLoader proxy object into a WordNetCorpusReader object was successful. This action was triggered when you accessed wordnet for the first time:
The First Thread
from nltk.corpus import wordnet as wn

def is_good_word(word):
    ...
    wn.ensure_loaded()  # `LazyCorpusLoader` conversion into `WordNetCorpusReader` starts
The Second Thread
When you begin to run your is_good_word function in a second thread, however, your first thread has not yet finished transforming the LazyCorpusLoader proxy object into a WordNetCorpusReader. At that point wn is still a LazyCorpusLoader proxy object, so the second thread begins the __load process again. By the time it reaches the point where it tries to convert its __class__ and __dict__ into those of a WordNetCorpusReader object, however, the first thread has already converted the LazyCorpusLoader proxy object into a WordNetCorpusReader. My guess is that you are running into an error in the line with my comment below:
class LazyCorpusLoader(object):
    ...
    def __load(self):
        ...
        corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)  # load corpus
        ...
        # self.__args == self._LazyCorpusLoader__args
        args, kwargs = self.__args, self.__kwargs  # most likely the line throwing the error
Once the first thread has transformed the LazyCorpusLoader proxy object into a WordNetCorpusReader object, the mangled names will no longer work. The WordNetCorpusReader object will not have LazyCorpusLoader anywhere in its mangled names. (Inside LazyCorpusLoader's methods, self.__args always means self._LazyCorpusLoader__args, and that attribute exists only while the object still carries the proxy's __dict__.) Thus you get the following error:
AttributeError: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
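The same failure can be reproduced without threads or NLTK. The sketch below uses two made-up classes, Lazy and Loaded, as stand-ins for LazyCorpusLoader and WordNetCorpusReader. A bound method captured before the swap plays the role of the second thread that is already inside the proxy's code:

```python
class Loaded(object):
    """Hypothetical stand-in for WordNetCorpusReader."""


class Lazy(object):
    """Hypothetical stand-in for LazyCorpusLoader."""

    def __init__(self):
        self.__args = ("wordnet",)  # actually stored as _Lazy__args

    def read_args(self):
        # Compiled inside Lazy, so this always looks up self._Lazy__args.
        return self.__args

    def become_loaded(self):
        loaded = Loaded()
        self.__dict__ = loaded.__dict__    # _Lazy__args disappears here
        self.__class__ = loaded.__class__


obj = Lazy()
second_thread_call = obj.read_args  # bound while obj is still a Lazy
obj.become_loaded()                 # the "first thread" finishes the swap

try:
    second_thread_call()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The method still runs Lazy's code, but the object it operates on no longer has the mangled attribute, which is the same shape as the error above.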
An Alternative
In light of this issue, you will want to access the wn object before you start any threads. Here is your code modified accordingly:
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.corpus.reader.wordnet import WordNetError
import sys
import time
import threading

cachedStopWords = stopwords.words("english")

def is_good_word(word):
    word = word.strip()
    if len(word) <= 2:
        return 0
    if word in cachedStopWords:
        return 0
    try:
        if len(wn.lemmas(str(word), lang='en')) == 0:  # no longer the first access of wn
            return 0
    except WordNetError as e:
        print("WordNetError on concept {}".format(word))
    except AttributeError as e:
        # format the exception itself; e.message does not exist on Python 3
        print("Attribute error on concept {}: {}".format(word, e))
    except:
        print("Unexpected error on concept {}: {}".format(word, sys.exc_info()[0]))
    else:
        return 1
    return 1

class ProcessMetaThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        is_good_word('dog')

def process_meta(numberOfThreads):
    print(wn.__class__)  # <class 'nltk.corpus.util.LazyCorpusLoader'>
    wn.ensure_loaded()   # first access to wn transforms it
    print(wn.__class__)  # <class 'nltk.corpus.reader.wordnet.WordNetCorpusReader'>
    threadsList = []
    for i in range(numberOfThreads):
        start = time.time()  # time.clock() was removed in Python 3.8
        t = ProcessMetaThread()
        print(time.time() - start)
        t.daemon = True
        t.start()
        threadsList.append(t)

    numComplete = 0
    while numComplete < numberOfThreads:
        # Iterate over the active processes
        for processNum in range(0, numberOfThreads):
            # If a process actually exists
            if threadsList is not None:
                # If the process is finished
                if threadsList[processNum] is not None:
                    if not threadsList[processNum].is_alive():
                        numComplete += 1
                        threadsList[processNum] = None
        time.sleep(5)

    print('Processes Finished')

if __name__ == '__main__':
    process_meta(10)
I have tested the above code and received no errors.
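If you cannot easily touch wn before the threads start (for example, because something else creates the threads), another option is to serialize the first access behind a lock. The following is only a sketch of that idea: SlowResource, safe_ensure_loaded and worker are made-up names, the fake resource merely counts how often it loads, and the pattern is untested against NLTK itself. It assumes the real object exposes ensure_loaded(), as wn does in the code above, and that every thread goes through the guarded call before using the object.

```python
import threading
import time


class SlowResource(object):
    """Hypothetical lazily loaded resource that counts its loads."""

    def __init__(self):
        self.load_count = 0
        self.loaded = False

    def ensure_loaded(self):
        if not self.loaded:
            time.sleep(0.01)  # simulate a slow load to widen the race window
            self.load_count += 1
            self.loaded = True


_load_lock = threading.Lock()


def safe_ensure_loaded(resource):
    # Only one thread at a time may run the load; later callers
    # find the resource already loaded and return immediately.
    with _load_lock:
        resource.ensure_loaded()


def worker(resource, results):
    safe_ensure_loaded(resource)
    results.append(resource.loaded)


resource = SlowResource()
results = []
threads = [threading.Thread(target=worker, args=(resource, results))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(resource.load_count, len(results))
```

Loading once in the main thread, as in the code above, remains the simpler fix; the lock only matters when you do not control where the first access happens.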