If the cache line containing the byte or word you're loading is not already present in the cache, your CPU will request the 64 bytes that begin at the cache line boundary (the largest address below the one you need that is multiple of 64).
Modern PC memory modules transfer 64 bits (8 bytes) at a time, in a burst of eight transfers, so one command triggers a read or write of a full cache line from memory. (DDR1/2/3/4 SDRAM burst transfer size is configurable up to 64B; CPUs will select the burst transfer size to match their cache line size, but 64B is common)
As a rule of thumb, if the processor can't forecast a memory access (and prefetch it), the retrieval process can take ~90 nanoseconds, or ~250 clock cycles (from the CPU knowing the address to the CPU receiving data).
By contrast, a hit in L1 cache has a load-use latency of 3 or 4 cycles, and a store-reload has a store-forwarding latency of 4 or 5 cycles on modern x86 CPUs. Things are similar on other architectures.
Further reading: Ulrich Drepper's What Every Programmer Should Know About Memory. The software-prefetch advice is a bit outdated: modern HW prefetchers are smarter, and hyperthreading is way better than in P4 days (so a prefetch thread is typically a waste). Also, the x86 tag wiki has lots of performance links for that architecture.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…