I am getting an OOM error while trying to run a federated learning simulation with TensorFlow Federated.
I want to train a model for multiclass text classification. I followed the TensorFlow Federated tutorial, changed the model, and prepared my dataset as a federated learning dataset.
The server I am running my simulations on has 1 GPU available.
- I tried allowing memory growth, using a batch size that is a power of two, and many other suggestions I found online, but nothing has worked for me so far.
- I am now trying to set the clients_per_thread parameter (as suggested), but it seems it is not an expected keyword argument.
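For reference, the memory-growth workaround I tried looks roughly like this (a sketch for the TF 2.x API, guarded so it is a no-op where TensorFlow is not installed):

```python
# Memory-growth workaround (TF 2.x): ask TF to allocate GPU memory on demand
# instead of reserving the whole GPU up front. Guarded import so the snippet
# degrades gracefully on machines without TensorFlow.
try:
    import tensorflow as tf

    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)
    configured = True
except ImportError:
    configured = False  # TensorFlow not available in this environment
```

This changed when the OOM appeared but did not make it go away.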
The code I use is:
tff.framework.sizing_executor_factory(clients_per_thread=5)
and the error I get is:
File "test.py", line 48, in <module>
tff.framework.sizing_executor_factory(clients_per_thread=5)
TypeError: sizing_executor_factory() got an unexpected keyword argument 'clients_per_thread'
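From reading the API docs, my guess (unverified, and probably version-dependent) is that clients_per_thread is a parameter of the local executor factory rather than the sizing one, so perhaps the call should be something like:

```python
# Guess: in the TFF version I am on (~0.17/0.18), clients_per_thread is
# accepted by local_executor_factory, not by sizing_executor_factory.
# Guarded so the snippet is harmless where tensorflow_federated is missing
# or the keyword does not exist in the installed version.
try:
    import tensorflow_federated as tff

    factory = tff.framework.local_executor_factory(clients_per_thread=5)
    created = factory is not None
except (ImportError, TypeError):
    created = None  # TFF not installed, or kwarg not supported in this version
```

Is that the right factory to use here, or am I misreading the docs?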
I believe the problem is with the model; maybe it's too big:
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 10)] 0
__________________________________________________________________________________________________
embedding (Embedding) (None, 10, 300) 17334000 input_1[0][0]
__________________________________________________________________________________________________
bi_lstm_0 (Bidirectional) (None, 10, 256) 439296 embedding[0][0]
__________________________________________________________________________________________________
bi_lstm_1 (Bidirectional) [(None, 10, 256), (N 394240 bi_lstm_0[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 256) 0 bi_lstm_1[0][1]
bi_lstm_1[0][3]
__________________________________________________________________________________________________
attention (Attention) ((None, 256), (None, 5151 bi_lstm_1[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 20) 5140 attention[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 20) 0 dense_3[0][0]
__________________________________________________________________________________________________
dense_4 (Dense) (None, 3) 63 dropout[0][0]
==================================================================================================
Total params: 18,177,890
Trainable params: 843,890
Non-trainable params: 17,334,000
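A back-of-the-envelope check (my own arithmetic, assuming float32 weights) of how many plain copies of this model would fit in the ~1.37 GB allocator limit reported in the logs below:

```python
# Rough memory estimate: float32 = 4 bytes per parameter.
total_params = 18_177_890            # from the model summary above
bytes_per_model = total_params * 4   # ~72.7 MB per full model copy
gpu_limit = 1_368_653_824            # "Limit" reported by the BFC allocator
copies = gpu_limit // bytes_per_model
print(copies)  # -> 18 copies, ignoring activations, gradients,
               #    and optimizer state
```

So around 18 bare model copies at most, which would make 12 clients plus activations and optimizer state plausibly enough to exhaust the GPU, if TFF keeps a copy per client in memory.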
Do you have any suggestions?
Currently, the only way I avoid the OOM error is to run federated learning with fewer than 12 clients.
Logs:
2021-01-24 16:09:00.752433: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats:
Limit: 1368653824
InUse: 1368433664
MaxInUse: 1368653056
NumAllocs: 2319
MaxAllocSize: 134217728
2021-01-24 16:09:00.752485: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ***********************************************************x*****************xxx******xxxx******xxx*
2021-01-24 16:09:00.752505: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at training_ops.cc:564 : Resource exhausted: OOM when allocating tensor with shape[128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2021-01-24 16:09:00.752507: W tensorflow/core/common_runtime/bfc_allocator.cc:434] Allocator (GPU_0_bfc) ran out of memory trying to allocate 256.0KiB (rounded to 262144)
Current allocation summary follows.
2021-01-24 16:09:00.752536: I tensorflow/core/common_runtime/bfc_allocator.cc:934] BFCAllocator dump for GPU_0_bfc
2021-01-24 16:09:00.752545: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (256): Total Chunks: 992, Chunks in use: 992. 248.0KiB allocated for chunks. 248.0KiB in use in bin. 25.9KiB client-requested in use in bin.
2021-01-24 16:09:00.752612: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-24 16:09:00.752644: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (1024): Total Chunks: 76, Chunks in use: 76. 77.8KiB allocated for chunks. 77.8KiB in use in bin. 59.8KiB client-requested in use in bin.
2021-01-24 16:09:00.752656: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (2048): Total Chunks: 102, Chunks in use: 101. 207.8KiB allocated for chunks. 205.8KiB in use in bin. 202.0KiB client-requested in use in bin.
2021-01-24 16:09:00.752666: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-24 16:09:00.752677: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (8192): Total Chunks: 77, Chunks in use: 75. 782.8KiB allocated for chunks. 764.8KiB in use in bin. 750.0KiB client-requested in use in bin.
2021-01-24 16:09:00.752687: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (16384): Total Chunks: 54, Chunks in use: 54. 1.10MiB allocated for chunks. 1.10MiB in use in bin. 1.03MiB client-requested in use in bin.
2021-01-24 16:09:00.752698: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (32768): Total Chunks: 1, Chunks in use: 1. 38.5KiB allocated for chunks. 38.5KiB in use in bin. 20.0KiB client-requested in use in bin.
2021-01-24 16:09:00.752707: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-24 16:09:00.752715: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (131072): Total Chunks: 1, Chunks in use: 0. 195.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-24 16:09:00.752726: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (262144): Total Chunks: 86, Chunks in use: 86. 22.96MiB allocated for chunks. 22.96MiB in use in bin. 21.50MiB client-requested in use in bin.
2021-01-24 16:09:00.752736: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (524288): Total Chunks: 127, Chunks in use: 127. 71.56MiB allocated for chunks. 71.56MiB in use in bin. 68.48MiB client-requested in use in bin.
.....
ERROR:concurrent.futures:exception calling callback for <Future at 0x7fc4ca1943c8 state=finished raised InternalError>
Traceback (most recent call last):
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
callback(self)
File "/usr/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state
dest_loop.call_soon_threadsafe(_set_state, destination, source)
File "/usr/lib/python3.6/asyncio/base_events.py", line 641, in call_soon_threadsafe
self._check_closed()
File "/usr/lib/python3.6/asyncio/base_events.py", line 381, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc58853ed68>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc588020348>()]>
... (the "ERROR:asyncio:Task was destroyed but it is pending!" block repeats several more times before the log is cut off)