optimization - CatBoost Machine Learning hyperparameters: why not always use `thread_count = -1`?

Question

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

With respect specifically to CatBoost:

1. Under what scenarios might one want to use fewer than the maximum number of threads of one's CPU? I cannot find an answer to this.
2. Is there a fixed cost/overhead associated with each core utilized? I.e., is more always better for all data set types/sizes?

Do the answers to the questions above generalize to all machine learning algorithms?

question from:https://stackoverflow.com/questions/65932060/catboost-machine-learning-hyperparameters-why-not-always-use-thread-count-1

1 Reply

深蓝 · Answer 1 · 2021-10-06T18:59:55+0000

I think that most of the reasons for changing `thread_count` are not CatBoost-specific. Other libraries such as scikit-learn offer an equivalent setting (for example, `n_jobs`). Reasons for not running on all CPUs include:

- Debugging: if there is a problem, it can be handy to run with a single thread, which makes the process simpler to follow.
- You want other processes on your machine to have CPU power. This matters especially on a server for in-memory data analysis shared by a team of data scientists. Your colleagues won't be happy if you take all the resources.
- Your job is so small that it simply does not need all the resources.
- You parallelize in another way: for example, you try different hyperparameters using cross-validation. Then it can make sense to dedicate one CPU to training each model, rather than training one model with all CPUs and then moving on to train the next model with all CPUs.
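The last point amounts to a thread budget: when several models are trained side by side, each one should get only a share of the cores. Here is a minimal sketch of that arithmetic, assuming an even split. The helper name `threads_per_worker` and its `reserve` argument are made up for illustration and are not part of CatBoost; only `thread_count` is a real CatBoost parameter.

```python
import os


def threads_per_worker(n_workers, total_cores=None, reserve=0):
    """Split the available cores evenly across parallel training jobs.

    n_workers   -- number of models trained at the same time
    total_cores -- cores on the machine (defaults to os.cpu_count())
    reserve     -- cores left free for colleagues or other processes
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    usable = max(1, total_cores - reserve)
    # Every worker gets at least one thread, even when oversubscribed.
    return max(1, usable // n_workers)


# Hypothetical usage (requires catboost to be installed):
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(thread_count=threads_per_worker(4, reserve=2))
```

On a 16-core machine with 2 cores reserved and 4 models trained at once, this gives each model 3 threads instead of letting all four compete for every core.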

I hope this answers question 1. These reasons also apply to other in-memory ML libraries such as scikit-learn.

Regarding question 2, I'm not sure. CatBoost does the parallelization in its C++ code, which the Python package calls via Cython. I assume each extra thread adds some overhead (coordinating parallel work always has a cost), but it's probably not much. You could find out by timing a few training runs with different `thread_count` values on your own data.
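One way to run such timing experiments is a small harness that calls the same training function with different thread counts and keeps the best wall-clock time for each. This is only a sketch: `time_thread_counts` and `catboost_train_fn` are hypothetical helper names, and `X`, `y` stand for whatever training data you have. The CatBoost parameters used (`iterations`, `thread_count`, `verbose`) are real.

```python
import time


def time_thread_counts(train_fn, thread_counts, repeats=3):
    """Return {thread_count: best wall-clock seconds} for train_fn(thread_count)."""
    results = {}
    for n_threads in thread_counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            train_fn(n_threads)
            best = min(best, time.perf_counter() - start)
        results[n_threads] = best
    return results


def catboost_train_fn(X, y, iterations=200):
    """Build a train_fn that fits a CatBoostClassifier (requires catboost)."""
    from catboost import CatBoostClassifier  # imported lazily on purpose

    def train(n_threads):
        model = CatBoostClassifier(
            iterations=iterations, thread_count=n_threads, verbose=False
        )
        model.fit(X, y)

    return train


# Hypothetical usage, with X and y being your own training data:
# timings = time_thread_counts(catboost_train_fn(X, y), [1, 2, 4, 8])
# for n_threads, seconds in timings.items():
#     print(n_threads, round(seconds, 2))
```

If the time stops dropping (or rises) beyond some thread count, that is the point where overhead outweighs the extra cores for that data set size.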
