It depends on the instruction(s), for most x86 SSE load/store instructions (excluding unaligned variants), it will cause a fault, which means it'll probably crash your program or lead to lots of round trips to your exception handler (which means almost or all performance is lost). The unaligned load/store variants run at double the amount of cycles IIRC, as they perform partial read/writes, so 2 are required to perform the operation (unless you are lucky and its in cache, which greatly reduces the penalty).
For general x86 load/store instructions, the penalty is speed, as more cycles are required to do the read or write. unalignment may also affect caching, leading to cache line splitting, and cache boundary straddling. It also prevents atomicity on reads and writes (which are guaranteed for all aligned read/writes of x86, barriers and propagation is something else, but using LOCK'ed instruction on unaligned data may cause and exception or greatly increase the already massive penalty the bu lock incurs), which is a no-no for concurrent programming.
Intels x86 & x64 optimizations manual goes into great detail about each aforementioned problem, their side-effects and how to remedy them.
Agner Fog' optimization manuals should have the exact numbers you are looking for in terms of raw cycle throughput.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…