
Out of Memory (OOM) error in TensorFlow Federated learning simulation when I set more than 12 federated clients

I am getting an OOM error while trying to run a federated learning simulation with TensorFlow Federated. I want to train a model for multiclass text classification; I followed the TensorFlow Federated tutorial, changed the model, and prepared my dataset as a federated learning dataset. The server I am running my simulations on has 1 GPU available.
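For context, my training loop follows the tutorial closely. A simplified sketch is below; build_keras_model, example_dataset, federated_train_data, and NUM_ROUNDS are placeholders for my own code and data, not the exact script:

import tensorflow as tf
import tensorflow_federated as tff

NUM_CLIENTS = 12  # the OOM appears once I go above roughly this many clients

def model_fn():
    # build_keras_model() builds the Keras model summarized further below
    keras_model = build_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=example_dataset.element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02))

state = iterative_process.initialize()
for round_num in range(NUM_ROUNDS):
    # federated_train_data is a list with one tf.data.Dataset per client
    state, metrics = iterative_process.next(
        state, federated_train_data[:NUM_CLIENTS])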

--I tried allowing memory growth, using a batch size that is a power of two, and many other suggestions I found online, but nothing has worked for me so far.
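For reference, the memory-growth setup I apply at the top of the script is roughly the following (a simplified sketch of the standard TF 2.x snippet):

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of
# reserving all of it up front; must run before any ops touch the GPU.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)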

--I am now trying to set the clients_per_thread parameter (as suggested), but it seems it is not an expected keyword argument. The code I use is:

tff.framework.sizing_executor_factory(clients_per_thread=5)

and the error I get is:

File "test.py", line 48, in <module>
    tff.framework.sizing_executor_factory(clients_per_thread=5)
TypeError: sizing_executor_factory() got an unexpected keyword argument 'clients_per_thread'

I believe the problem is with the model; maybe it is too big:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, 10)]         0
__________________________________________________________________________________________________
embedding (Embedding)           (None, 10, 300)      17334000    input_1[0][0]
__________________________________________________________________________________________________
bi_lstm_0 (Bidirectional)       (None, 10, 256)      439296      embedding[0][0]
__________________________________________________________________________________________________
bi_lstm_1 (Bidirectional)       [(None, 10, 256), (N 394240      bi_lstm_0[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 256)          0           bi_lstm_1[0][1]
                                                                 bi_lstm_1[0][3]
__________________________________________________________________________________________________
attention (Attention)           ((None, 256), (None, 5151        bi_lstm_1[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 20)           5140        attention[0][0]
__________________________________________________________________________________________________
dropout (Dropout)               (None, 20)           0           dense_3[0][0]
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 3)            63          dropout[0][0]
==================================================================================================
Total params: 18,177,890
Trainable params: 843,890
Non-trainable params: 17,334,000
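
My back-of-envelope estimate, assuming float32 weights and that each client executor keeps its own copy of the model, is:

# ~18.2M parameters * 4 bytes per float32 parameter
total_params = 18_177_890
print(total_params * 4 / 2**20)  # ~69 MiB per model copy

Against the ~1.27 GiB allocator limit shown in the logs below, that would leave room for roughly 18 such copies before counting activations, optimizer state, and LSTM workspaces, which seems consistent with the OOM appearing above 12 clients.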

Do you have any suggestions?

Currently, the only way I avoid the OOM error is to run federated learning with fewer than 12 clients.

Logs:

2021-01-24 16:09:00.752433: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats:
Limit:                  1368653824
InUse:                  1368433664
MaxInUse:               1368653056
NumAllocs:                    2319
MaxAllocSize:            134217728

2021-01-24 16:09:00.752485: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ***********************************************************x*****************xxx******xxxx******xxx*
2021-01-24 16:09:00.752505: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at training_ops.cc:564 : Resource exhausted: OOM when allocating tensor with shape[128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2021-01-24 16:09:00.752507: W tensorflow/core/common_runtime/bfc_allocator.cc:434] Allocator (GPU_0_bfc) ran out of memory trying to allocate 256.0KiB (rounded to 262144)
Current allocation summary follows.
2021-01-24 16:09:00.752536: I tensorflow/core/common_runtime/bfc_allocator.cc:934] BFCAllocator dump for GPU_0_bfc
2021-01-24 16:09:00.752545: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (256):   Total Chunks: 992, Chunks in use: 992. 248.0KiB allocated for chunks. 248.0KiB in use in bin. 25.9KiB client-requested in use in bin.
2021-01-24 16:09:00.752612: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-24 16:09:00.752644: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (1024):  Total Chunks: 76, Chunks in use: 76. 77.8KiB allocated for chunks. 77.8KiB in use in bin. 59.8KiB client-requested in use in bin.
2021-01-24 16:09:00.752656: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (2048):  Total Chunks: 102, Chunks in use: 101. 207.8KiB allocated for chunks. 205.8KiB in use in bin. 202.0KiB client-requested in use in bin.
2021-01-24 16:09:00.752666: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (4096):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-24 16:09:00.752677: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (8192):  Total Chunks: 77, Chunks in use: 75. 782.8KiB allocated for chunks. 764.8KiB in use in bin. 750.0KiB client-requested in use in bin.
2021-01-24 16:09:00.752687: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (16384):         Total Chunks: 54, Chunks in use: 54. 1.10MiB allocated for chunks. 1.10MiB in use in bin. 1.03MiB client-requested in use in bin.
2021-01-24 16:09:00.752698: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (32768):         Total Chunks: 1, Chunks in use: 1. 38.5KiB allocated for chunks. 38.5KiB in use in bin. 20.0KiB client-requested in use in bin.
2021-01-24 16:09:00.752707: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (65536):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-24 16:09:00.752715: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (131072):        Total Chunks: 1, Chunks in use: 0. 195.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-24 16:09:00.752726: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (262144):        Total Chunks: 86, Chunks in use: 86. 22.96MiB allocated for chunks. 22.96MiB in use in bin. 21.50MiB client-requested in use in bin.
2021-01-24 16:09:00.752736: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (524288):        Total Chunks: 127, Chunks in use: 127. 71.56MiB allocated for chunks. 71.56MiB in use in bin. 68.48MiB client-requested in use in bin.
.....


ERROR:concurrent.futures:exception calling callback for <Future at 0x7fc4ca1943c8 state=finished raised InternalError>
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/usr/lib/python3.6/asyncio/base_events.py", line 641, in call_soon_threadsafe
    self._check_closed()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 381, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc58853ed68>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc588020348>()]>
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc588301138>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc58853ec48>()]>
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc588301768>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc588470258>()]>
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc5883015e8>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc58be36888>()]>
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc588301b28>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc58be36a68>()]>
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc5883015b8>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc58be36858>()]>
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc588020708>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc58be36d38>()]>
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending coro=<trace.<locals>.async_trace() running at /home/.../lib/python3.6/site-packages/tensorflow_federated/python/common_libs/tracing.py:200> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.6/asyncio/futures.py:403, <TaskWakeupMethWrapper object at 0x7fc588020e88>()]> cb=[<TaskWakeupMethWrapper object at 0x7fc58be36fa8>()]>
ERRO


1 Reply

Waiting for answers
