XGBoost doesn't run multiple trees in parallel, as you noted: boosting is inherently sequential, because you need the predictions from each tree to update the gradients before the next tree can be fit.
Instead, it parallelizes WITHIN a single tree, using OpenMP to spread the work of growing that tree (evaluating candidate splits) across cores.
To observe this, build a giant dataset and train for a single boosting round (num_boost_round=1). You'll see all your cores firing on one tree. That's a big part of why it's so fast: it's well engineered.
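Here's a minimal sketch of that experiment, assuming xgboost and scikit-learn are installed. The dataset size, depth, and tree_method are illustrative; pick whatever your machine can hold comfortably.

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Large synthetic dataset so building one tree takes a noticeable amount of time.
# (Illustrative size -- scale it to your RAM.)
X, y = make_classification(n_samples=1_000_000, n_features=50, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "max_depth": 10,        # deeper tree = more split-finding work per round
    "tree_method": "hist",  # histogram-based split finding
    # no thread setting: XGBoost uses all available cores by default via OpenMP
}

# One boosting round = one tree. Watch htop / Task Manager while this runs:
# all cores light up even though only a single tree is being built.
bst = xgb.train(params, dtrain, num_boost_round=1)
```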