Yes, I assume the message over the ring bus has some kind of tag that flags it as being to I/O space, not a physical memory address, and that the system agent sorts this out.
If anyone knows more details, that might be interesting, but this simple mental model is probably fine.
I don't know how port I/O turns into PCIe messages, but I think PCIe devices can have I/O ports in I/O space, not just MMIO.
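As a rough illustration of that mental model (not how any real uncore is implemented), here's a toy sketch where each request carries an address-space tag and a stand-in for the system agent routes on it. The names `Space`, `Request`, `system_agent_route` and the MMIO window bounds are all made up for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Space(Enum):
    MEMORY = "memory"  # ordinary physical address (DRAM or MMIO)
    IO = "io"          # x86 port I/O space, as used by IN/OUT


@dataclass
class Request:
    space: Space
    address: int


# Hypothetical MMIO window; real systems get this from firmware / PCI BARs.
MMIO_BASE, MMIO_LIMIT = 0xC000_0000, 0xFFFF_FFFF


def system_agent_route(req: Request) -> str:
    """Toy routing decision: the tag is checked first, then the address."""
    if req.space is Space.IO:
        return "io-port transaction"
    if MMIO_BASE <= req.address <= MMIO_LIMIT:
        return "mmio"
    return "dram"
```

The point is only that the same numeric address (say `0x3F8`) goes somewhere different depending on the tag: `Request(Space.IO, 0x3F8)` is routed as a port I/O transaction, while `Request(Space.MEMORY, 0x3F8)` is just a DRAM access.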
IN/OUT are pretty close to serializing, but not officially defined using that term for some reason (see "How many memory barriers instructions does an x86 CPU have?"). They do drain the store buffer before executing, and are full memory barriers.
the reservation station or the load/store buffer?
Both. For normal loads/stores, the front-end allocates a load buffer entry for a load, or a store buffer entry for a store, and issues the uop into the ROB and RS.
For example, when the RS dispatches a store-data uop to port 4, or a store-address uop to p2/p3 (the load / store-address ports), that execution unit will use the store-buffer entry as the place where it writes the data, or where it writes the address.
Having the store-buffer entry allocated by the issue/allocate/rename logic means that either store-address or store-data can execute first, whichever one has its inputs ready first, and free its space in the RS after completing successfully. The ROB entry stays allocated until the store retires. The store buffer entry stays allocated until some time after that, when the store commits to L1d cache. (Or for a store to uncacheable memory, commits to an LFB or something to be sent out through the memory hierarchy, where the system agent will pick it up if it's to an MMIO region.)
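To make the lifetimes concrete, here's a toy model of the bookkeeping described above: allocate everything at issue, let the two uops execute in either order, free the RS slots on execution, the ROB entry at retirement, and the store-buffer entry at commit. It's a simplified sketch, not a simulation of real hardware; `ToyCore`, `StoreBufferEntry` and the `"sta"` / `"std"` labels are names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreBufferEntry:
    address: Optional[int] = None  # written by the store-address uop
    data: Optional[int] = None     # written by the store-data uop


class ToyCore:
    """Tracks which structures still hold state for each in-flight store."""

    def __init__(self):
        self.rs = set()         # (store id, uop kind) pairs waiting to execute
        self.rob = set()        # store ids that have not retired yet
        self.store_buffer = {}  # store id -> StoreBufferEntry
        self.memory = {}        # stands in for L1d cache

    def issue_store(self, sid):
        # Issue/allocate/rename: SB entry, ROB entry and both RS entries.
        self.store_buffer[sid] = StoreBufferEntry()
        self.rob.add(sid)
        self.rs.add((sid, "sta"))
        self.rs.add((sid, "std"))

    def execute(self, sid, kind, value):
        # Either uop can go first; each fills its half of the SB entry
        # and then frees its own RS slot.
        entry = self.store_buffer[sid]
        if kind == "sta":
            entry.address = value
        else:
            entry.data = value
        self.rs.remove((sid, kind))

    def retire(self, sid):
        entry = self.store_buffer[sid]
        assert entry.address is not None and entry.data is not None
        self.rob.remove(sid)  # ROB entry freed; SB entry stays allocated

    def commit(self, sid):
        assert sid not in self.rob  # commit only happens after retirement
        entry = self.store_buffer.pop(sid)
        self.memory[entry.address] = entry.data
```

For example, `core.issue_store(0)` followed by `core.execute(0, "std", 42)` and then `core.execute(0, "sta", 0x1000)` models the case where the data is ready before the address; after `core.retire(0)` the store-buffer entry is still there, and only `core.commit(0)` releases it.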
Obviously IN/OUT are micro-coded as multiple uops, and all those uops are allocated in the ROB and reservation station as they issue from the front-end, like any other uop. (Well, some of them might not need a back-end execution unit, in which case they'd only be allocated in the ROB in an already-executed state, e.g. the uops for lfence are like this on Skylake.)
I'd assume they use the normal store buffer / load buffer mechanism for communicating off-core, but since they're more or less serializing there's no real performance implication to how they're implemented. (Later instructions can't start executing until after the "data phase" of the I/O transaction, and they drain the store buffer before executing.)