There are plenty of cache simulators out there; Dinero, for example (pun obviously intended), is fairly simple and is often used for educational purposes.
Note that this simulator is trace-driven: it consumes a list of memory access addresses and doesn't know how to run a binary. You can produce such traces by running your program under a binary instrumentation tool such as Intel Pin or Valgrind. Some of these tools already include internal cache simulators of their own (Valgrind's Cachegrind, for example) that you can play with.
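For illustration only, here is a tiny generator for a synthetic trace in a made-up one-access-per-line format (an R/W flag followed by a hex address). Real tools emit their own formats, so treat this purely as an example of the kind of input a trace-driven simulator eats:

```python
# Purely illustrative: write a small synthetic trace in a hypothetical
# "R <hex address>" per-line format, strided reads over a 256 KiB window.
with open("trace.txt", "w") as f:
    base = 0x10000000
    for i in range(10000):
        f.write(f"R 0x{base + (i * 64) % (256 * 1024):x}\n")
```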
Other simulators model full CPU/system behavior, not just the caches, and can therefore run a binary directly; most of them include a simulated cache hierarchy. gem5 is one example, and there are many others.
On the other hand, writing your own cache simulator is fairly simple if you can work on a memory trace (writing an actual frontend is far more complicated). You won't find a very detailed spec of the actual caches in Intel/AMD products, but the basic functionality is covered in any computer architecture textbook (or even Wikipedia), and the parameters (size, associativity, coherency policies) are mostly documented in the published guides, though they often change between product generations. You can always ask here if you run into any specific question :)
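To give a sense of how little is needed, here is a minimal sketch of a trace-driven simulator for a single set-associative cache with LRU replacement. The trace format matches the made-up one above, and the parameters (32 KiB, 8-way, 64-byte lines) are illustrative placeholders rather than the configuration of any specific CPU:

```python
# Minimal trace-driven cache simulator sketch (parameters are illustrative).
import sys
from collections import OrderedDict

CACHE_SIZE = 32 * 1024          # total capacity in bytes
LINE_SIZE = 64                  # bytes per cache line
WAYS = 8                        # associativity
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)

# One OrderedDict per set, ordered from least to most recently used tag.
sets = [OrderedDict() for _ in range(NUM_SETS)]
hits = misses = 0

def access(addr):
    global hits, misses
    line_addr = addr // LINE_SIZE
    set_idx = line_addr % NUM_SETS
    tag = line_addr // NUM_SETS
    cache_set = sets[set_idx]
    if tag in cache_set:
        hits += 1
        cache_set.move_to_end(tag)        # mark as most recently used
    else:
        misses += 1
        if len(cache_set) >= WAYS:
            cache_set.popitem(last=False)  # evict the LRU line
        cache_set[tag] = None

for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    access(int(fields[-1], 16))           # address is the last field

total = hits + misses
if total:
    print(f"accesses={total} hits={hits} misses={misses} "
          f"miss rate={misses / total:.2%}")
```

Run it as, say, `python3 cachesim.py < trace.txt`, then swap in the sizes and associativities from the vendor guides for whichever cache level you want to approximate; additional levels are mostly a matter of chaining instances of the same structure.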
Edit:
Regarding the second part of the question - there's no publicly available documentation of the exact cache implementation of Intel CPUs, but the dry "specs" (size, associativity, policies) are in the optimization guide:
Now, modeling these caches should be straightforward, but there may be some hidden caveats, like power-down features or specialized LRU behaviors. One such reported example can be found here: http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ (if it's accurate, it might be worth implementing for fidelity), but aside from that I believe the overall behavior shouldn't be affected too much by these details for any practical use.
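If you do want to experiment with such quirks, one approach is to keep the victim-selection decision behind a small function so alternative policies can be swapped in later. A hypothetical sketch, reusing the per-set OrderedDict from the code above:

```python
import random
from collections import OrderedDict

def choose_victim_lru(cache_set: OrderedDict):
    # The set is kept ordered from least to most recently used,
    # so plain LRU simply evicts the first tag.
    return next(iter(cache_set))

def choose_victim_random(cache_set: OrderedDict):
    # A stand-in for "something other than LRU"; an adaptive policy
    # (like the behavior reported in the link above) would slot in here,
    # typically with some extra per-set state.
    return random.choice(list(cache_set))

def insert_line(cache_set: OrderedDict, tag, ways, choose_victim=choose_victim_lru):
    # Evict according to the chosen policy, then install the new line.
    if len(cache_set) >= ways:
        del cache_set[choose_victim(cache_set)]
    cache_set[tag] = None
```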