An analogy might help.
You have a bunch of letters you need delivered to various addresses around town. So you hire a guy with a motorcycle to deliver your letters.
The traffic signals in your town are perfect traffic signals. They are always green unless there is someone in the intersection.
The guy on the motorcycle zips around delivering a bunch of letters. Since there is no one else on the road, every light is green, which is awesome. But you think hey, this could be faster. I know, I'll hire another driver.
Trouble is, **you still only have one motorcycle**. So now your first driver drives around on the motorcycle for a while, and then every now and then stops, gets off, and the second driver runs up, hops on, and drives around.
Is this any faster? No, of course not. It's slower. Adding more threads doesn't make anything faster. Threads are not magic. If a processor is able to do a billion operations a second, adding another thread doesn't suddenly make another billion operations a second available. Rather, it steals resources from other threads, and every swap costs time during which nobody is delivering anything. If a motorcycle can go 100 miles per hour, stopping the bike and having another driver get on doesn't make it faster! Clearly, on average the letters are not being delivered any faster in this scheme; they're just being delivered in a different order.
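If you want to see the arithmetic, here is a rough sketch of the one-motorcycle case. It is a toy model, not a measurement: the function name `single_core_seconds`, the slice size and the 5-microsecond switch cost are all numbers I made up for illustration.

```python
import math

def single_core_seconds(total_ops, ops_per_second, threads, slice_ops, switch_cost_s):
    """Toy model of one core shared round-robin by `threads` threads.

    All parameters are hypothetical; real schedulers are far more complicated.
    """
    compute = total_ops / ops_per_second        # same amount of work no matter how many threads
    if threads <= 1:
        return compute                          # a lone driver never hands over the bike
    slices = math.ceil(total_ops / slice_ops)   # assumption: one driver swap after every slice
    return compute + slices * switch_cost_s     # swaps only ever add time

# A billion operations on a billion-ops-per-second core.
one = single_core_seconds(1_000_000_000, 1_000_000_000, 1, 10_000_000, 5e-6)
two = single_core_seconds(1_000_000_000, 1_000_000_000, 2, 10_000_000, 5e-6)
print(one, two)
```

In this model the second thread never reduces the compute term; all it can do is add swap time on top of it.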
OK, so what if you hire two drivers and two motorcycles? Now you have two processors and one thread per processor, so that'll be faster, right? No, because we forgot about the traffic lights. Before, there was only one motorcycle driving at speed at any one time. Now there are two drivers and two motorcycles, which means that now sometimes one of the motorcycles will have to wait because the other one is in the intersection. Again, adding more threads slows you down because you spend more time contending for locks. The more processors you add, the worse it gets; you end up with more and more time spent waiting at red lights and less and less time delivering letters.
Adding more threads can cause negative scalability if doing so causes locks to be contended. The more threads, the more contention, the slower things go.
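To put some made-up numbers on the traffic lights, here is a toy model of lock contention. Everything in it is an assumption for illustration: the name `run_seconds`, the idea that a fixed fraction of the work happens under a lock, and the guess that every thread costs every other thread a small fixed handoff delay. It is not a model of any real lock implementation.

```python
def run_seconds(threads, work_s, locked_fraction, handoff_s):
    """Toy model: estimated wall-clock time for `work_s` seconds of work on `threads` threads."""
    locked = work_s * locked_fraction                     # one motorcycle in the intersection at a time
    unlocked = work_s * (1 - locked_fraction) / threads   # open road: this part really does split up
    waiting = handoff_s * threads * (threads - 1)         # assumption: every pair of threads interferes
    return locked + unlocked + waiting

# Hypothetical workload: 1 second of work, 20% of it under a lock.
for n in (1, 2, 4, 8, 16):
    print(n, round(run_seconds(n, work_s=1.0, locked_fraction=0.2, handoff_s=0.02), 3))
```

With these particular numbers the model gets a little faster at first and then ends up slower than the single-threaded run as threads are added. Raise `locked_fraction` to 0.9 and `handoff_s` to 0.1 and even the second thread makes it slower, which is the situation in the story above.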
Suppose you make the engines faster -- now you have more processors, more threads, and faster processors. Does that always make it faster? NO. It frequently does not. Increasing processor speed can make multithreaded programs go slower. Again, think of traffic.
Suppose you have a city with thousands of drivers and sixty-four motorcycles, the drivers all running back and forth between the motorcycles, some of the motorcycles in intersections blocking other motorcycles. Now you make all those motorcycles run faster. Does that help? Well, in real life, when you're driving around, do you get where you're going twice as fast in a Porsche as in a Honda Civic? Of course not; most of the time in city driving you are stuck in traffic.
If you can drive faster, often you end up waiting in traffic longer because you end up driving into the congestion faster. If everyone drives towards congestion faster, then the congestion gets worse.
Multithreaded performance can be deeply counterintuitive. If you want extremely high performance I recommend not going with a multithreaded solution unless you have an application which is "embarrassingly parallel" -- that is, some application that is obviously amenable to throwing multiple processors at it, like computing Mandelbrot sets or doing ray tracing or some such thing. And then, do not throw more threads at the problem than you have processors. But for many applications, starting more threads slows you down.
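As a sketch of the "no more threads than processors" rule, here is a small Mandelbrot example in Python. The helper names (`mandelbrot_row`, `pool_size`, `render`) and the image dimensions are mine, chosen for illustration. One caveat I should be loud about: in standard CPython, threads do not run CPU-bound Python code in parallel, so this shows how to size the pool rather than how to get a real speedup; in practice you would likely swap in a process pool or use a language with truly parallel threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def mandelbrot_row(y, width=48, height=24, max_iter=40):
    """Escape-iteration counts for one row. Rows share no state, so no locks are needed."""
    row = []
    for x in range(width):
        c = complex(-2.0 + 3.0 * x / width, -1.0 + 2.0 * y / height)
        z = 0j
        n = 0
        while n < max_iter and abs(z) <= 2.0:
            z = z * z + c
            n += 1
        row.append(n)
    return row

def pool_size(jobs):
    """At most one worker per processor, and never more workers than jobs."""
    return max(1, min(os.cpu_count() or 1, jobs))

def render(height=24):
    # One motorcycle per driver: the pool is capped at the processor count.
    with ThreadPoolExecutor(max_workers=pool_size(height)) as pool:
        return list(pool.map(mandelbrot_row, range(height)))

image = render()
print(len(image), "rows computed with", pool_size(24), "workers")
```

Each row is computed independently of every other row, which is what makes the problem embarrassingly parallel: there are no intersections for the workers to fight over.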