The Trap Flag (TF) in EFLAGS/RFLAGS makes the CPU single-step, i.e. take an exception after running one instruction.
So if you write a debugger, you can use the CPU's single-stepping capability to find instruction boundaries in a block of code, but only by running it. And if an instruction faults (e.g. a load from an unmapped address), you'll get that exception instead of the TF single-step exception.
(Most OSes have facilities for attaching to and single-stepping another process, e.g. Linux ptrace, so you could maybe create an unprivileged sandbox process where you could step through some unknown bytes of machine code...)
Or as @Rbmn points out, you can use OS-assisted debug facilities to single-step yourself.
@Harold and @MargaretBloom also point out that you can put bytes at the end of a page (followed by an unmapped page) and run them. See if you get a #UD, a page fault, or a #GP exception.
- #UD: the decoders saw a complete but invalid instruction.
- page fault on the unmapped page: the decoders hit the unmapped page before deciding that it was an illegal instruction.
- #GP: the instruction was privileged or faulted for other reasons.
To rule out decoding+running as a complete instruction and then faulting on the unmapped page, start with only 1 byte before the unmapped page, and keep adding more bytes until you stop getting page faults.
Breaking the x86 ISA by Christopher Domas goes into more detail about this technique, including using it to find undocumented illegal instructions, e.g. 9a13065b8000d7 is a 7-byte illegal instruction; that's when it stops page-faulting. (objdump -d just says 0x9a (bad) and decodes the rest of the bytes, but apparently real Intel hardware isn't satisfied that it's bad until it's fetched 6 more bytes).
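The "start with 1 byte and keep adding" search can be sketched in C roughly as follows. This is a minimal sketch, not Domas's actual tool: all names (find_insn_length, hw_probe, model_probe, the probe_result values) are made up for illustration, and hw_probe assumes a Linux-like POSIX system on x86 where a failed instruction fetch is reported as SIGSEGV with si_addr inside the unmapped page. It only looks at the fault address, so an instruction that runs to completion and then falls through into the unmapped page still looks like a fetch fault; a fuller tool would also check the saved instruction pointer or use TF. Running unknown bytes is risky even in a forked child, so treat hw_probe as something to try only in a throwaway sandbox.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

enum probe_result { PROBE_FETCH_FAULT, PROBE_UD, PROBE_OTHER };

/* A probe puts the first n bytes right before an unmapped page, runs them,
   and classifies what happened. */
typedef enum probe_result (*probe_fn)(const unsigned char *bytes, size_t n);

/* Grow the candidate one byte at a time until the result is no longer a
   page fault on the unmapped page. Returns that length, or -1 if none. */
int find_insn_length(const unsigned char *bytes, size_t avail, probe_fn probe)
{
    assert(bytes != NULL && probe != NULL);
    for (size_t n = 1; n <= avail && n <= 15; n++)  /* x86 limit: 15 bytes */
        if (probe(bytes, n) != PROBE_FETCH_FAULT)
            return (int)n;
    return -1;
}

/* Hypothetical stand-in probe for trying the loop without executing anything:
   it models the behaviour reported above for 9a 13 06 5b 80 00 d7
   (page faults until all 7 bytes are present, then #UD). */
enum probe_result model_probe(const unsigned char *bytes, size_t n)
{
    (void)bytes;
    return n < 7 ? PROBE_FETCH_FAULT : PROBE_UD;
}

static unsigned char *guard_page;  /* start of the unmapped page (child only) */
static size_t page_size;

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)ctx;
    uintptr_t a = (uintptr_t)si->si_addr, g = (uintptr_t)guard_page;
    if (sig == SIGILL)
        _exit(PROBE_UD);                /* #UD is delivered as SIGILL */
    if (a >= g && a < g + page_size)
        _exit(PROBE_FETCH_FAULT);       /* fault address is in the unmapped page */
    _exit(PROBE_OTHER);                 /* #GP, data page fault, ... */
}

/* Assumption-laden hardware probe: executes the bytes in a forked child. */
enum probe_result hw_probe(const unsigned char *bytes, size_t n)
{
    pid_t pid = fork();
    if (pid < 0)
        return PROBE_OTHER;
    if (pid == 0) {
        page_size = (size_t)sysconf(_SC_PAGESIZE);
        unsigned char *p = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            _exit(PROBE_OTHER);
        guard_page = p + page_size;
        memcpy(guard_page - n, bytes, n);          /* bytes end at the page end */
        mprotect(p, page_size, PROT_READ | PROT_EXEC);
        munmap(guard_page, page_size);             /* next page is now unmapped */

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGILL, &sa, NULL);
        sigaction(SIGBUS, &sa, NULL);
        alarm(1);                                  /* kill the child if it loops */
        ((void (*)(void))(guard_page - n))();      /* jump to the bytes */
        _exit(PROBE_OTHER);                        /* e.g. the bytes were a ret */
    }
    int status = 0;
    waitpid(pid, &status, 0);
    if (!WIFEXITED(status) || WEXITSTATUS(status) > PROBE_OTHER)
        return PROBE_OTHER;
    return (enum probe_result)WEXITSTATUS(status);
}
```

Calling find_insn_length(bytes, len, model_probe) exercises only the search loop; passing hw_probe instead runs the bytes for real, one child process per candidate length.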
HW performance counters like inst_retired.any also expose instruction counts, but without knowing anything about the end of an instruction, you don't know where to put an rdpmc instruction. Padding with 0x90 NOPs and seeing how many instructions total were executed probably wouldn't really work because you'd have to know where to cut and start padding.
I'm wondering, why wouldn't Intel and AMD introduce an instruction for that
For debugging, normally you want to fully disassemble an instruction, not just find insn boundaries. So you need a full software library.
It wouldn't make sense to put a microcoded disassembler behind some new opcode.
Besides, the hardware decoders are only wired up to work as part of the front-end in the code-fetch path, not to be fed arbitrary data, and they're already busy decoding instructions most cycles. Adding instructions that decode x86 machine-code bytes would almost certainly be done by replicating that hardware in an ALU execution unit, not by querying the decoded-uop cache or L1i (in designs where instruction boundaries are marked in L1i), or sending data through the actual front-end pre-decoders and capturing the result instead of queuing it for the rest of the front-end.
The only real high-performance use-case I can think of is emulation, or supporting new instructions the way Intel's Software Development Emulator (SDE) does. But if you want to run new instructions on old CPUs, the whole point is that the old CPUs don't know about those new instructions.
The amount of CPU time spent disassembling machine code is pretty tiny compared to the amount of time that CPUs spend doing floating point math, or image processing. There's a reason we have stuff like SIMD FMA and AVX2 vpsadbw in the instruction set to speed up those special-purpose things that CPUs spend a lot of time doing, but not for stuff we can easily do with software.
Remember, the point of an instruction-set is to make it possible to create high-performance code, not to get all meta and specialize in decoding itself.
At the upper end of special-purpose complexity, the SSE4.2 string instructions were introduced in Nehalem. They can do some cool stuff, but are hard to use. https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (also includes strstr, which is a real use-case where pcmpistri can be faster than SSE2 or AVX2, unlike for strlen / strcmp where plain old pcmpeqb / pminub works very well if used efficiently (see glibc's hand-written asm).) Anyway, these new instructions are still multi-uop even in Skylake, and aren't widely used. I think compilers have a hard time autovectorizing with them, and most string-processing is done in languages where it's not so easy to tightly integrate a few intrinsics with low overhead.
installing a trampoline (for hotpatching a binary function.)
Even this requires decoding the instructions, not just finding their length.
If the first few instruction bytes of a function use a RIP-relative addressing mode (or a jcc rel8/rel32, or even a jmp or call), moving them elsewhere will break the code. (Thanks to @Rbmn for pointing out this corner case.)
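To illustrate why length alone isn't enough, here is a rough C sketch of the check a hotpatcher would need on each instruction it wants to move. It assumes 64-bit mode and that some disassembler has already split the instruction into fields; struct decoded_insn and the function names are hypothetical, and the opcode list only covers the one-byte and 0F xx opcode maps, so it is not exhaustive.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimal view of one already-decoded instruction. A real
   disassembler library would supply something richer; these field names
   are made up for illustration. */
struct decoded_insn {
    bool    two_byte_opcode;  /* true for the 0F xx opcode map */
    uint8_t opcode;           /* opcode byte (the byte after 0F if two-byte) */
    bool    has_modrm;
    uint8_t modrm;            /* only meaningful if has_modrm is true */
};

/* 64-bit mode: ModRM.mod == 00 and ModRM.r/m == 101 selects [RIP + disp32]. */
bool is_rip_relative(const struct decoded_insn *in)
{
    return in->has_modrm && (in->modrm & 0xC7) == 0x05;
}

/* Branches whose target is a displacement from the end of the instruction. */
bool is_relative_branch(const struct decoded_insn *in)
{
    uint8_t op = in->opcode;
    if (in->two_byte_opcode)
        return op >= 0x80 && op <= 0x8F;   /* jcc rel32 */
    return (op >= 0x70 && op <= 0x7F)      /* jcc rel8 */
        || (op >= 0xE0 && op <= 0xE3)      /* loopne/loope/loop/jrcxz rel8 */
        || op == 0xE8                      /* call rel32 */
        || op == 0xE9 || op == 0xEB;       /* jmp rel32 / jmp rel8 */
}

/* True if copying the instruction to another address changes its meaning,
   so the displacement would have to be recomputed (or the code rewritten). */
bool needs_fixup_when_moved(const struct decoded_insn *in)
{
    return is_rip_relative(in) || is_relative_branch(in);
}
```

For example, lea rax, [rip+disp32] (opcode 8D, ModRM 05) and jne rel8 (opcode 75) would be flagged, while push rbp (opcode 55) or mov eax, [rbx] (opcode 8B, ModRM 03) could be copied as-is. Note that both checks depend on knowing where the opcode and ModRM bytes are, which is exactly the decoding work a length-only instruction wouldn't do.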