This is an interesting problem.
First let's do some back of the envelope numbers. Any particular sequence of 20 digits will match one time in 10^20. If we go out to the n'th digit, we have roughly n^2/2 pairs of 20-digit sequences. So to have good odds of finding a match, we're probably going to need n to be a bit above 10^10. Assuming that we're taking 40 bytes per record, we're going to need something on the order of 400 GB of data. (We actually need more data than this, so we should be prepared for something over a terabyte of data.)
That gives us an idea of the needed data volume. Tens of billions of digits. Hundreds of GB of data.
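To make the back-of-envelope math concrete, here is a quick sanity check in Python. The 40 bytes per record is the same rough assumption as above, not a measured figure.

```python
import math

# A 20-digit window has 10**20 possible values, and n digits of pi give
# roughly n**2 / 2 pairs of windows, so we expect a repeat once
# n**2 / 2 is on the order of 10**20.
space = 10 ** 20                      # possible 20-digit sequences
n = math.isqrt(2 * space)             # digits needed for roughly even odds of a repeat
bytes_per_record = 40                 # assumed: 20-digit key plus its position, with overhead

print(f"digits needed: ~{n:.1e}")                              # ~1.4e10
print(f"data volume:   ~{n * bytes_per_record / 1e9:.0f} GB")  # several hundred GB
```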
Now here is the problem. If we use any data structure that requires random access, the access time is set by the disk speed. Suppose that your disk spins at 6000 rpm. That's 100 revolutions per second. On average the data you want is halfway around the disk, so you get about 200 random accesses per second. (This can vary by hardware.) Accessing the disk 10 billion times is going to take 50 million seconds, which is over a year. If you read, then write, and wind up needing 20 billion data points, you're exceeding the projected lifetime of your hard drive.
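The same arithmetic in code, using the figures from the paragraph above (6000 rpm, roughly one lookup per 20-digit window):

```python
rpm = 6000
revs_per_sec = rpm / 60                # 100 revolutions per second
accesses_per_sec = revs_per_sec * 2    # data is half a revolution away on average -> ~200/s
lookups = 10 ** 10                     # roughly one random lookup per window

seconds = lookups / accesses_per_sec
print(f"{seconds:.0e} s, about {seconds / (86400 * 365):.1f} years")  # 5e+07 s, ~1.6 years
```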
The alternative is to process a batch of data in a way where you do not access it randomly. The classic approach is a good external sort, such as a merge sort. Suppose that we have 1 terabyte of data, which we read 30 times and write 30 times during sorting. (Both estimates are higher than needed, but I'm painting a worst case here.) Suppose our hard drive has a sustained throughput of 100 MB/s. Then each pass takes 10,000 seconds, for 600,000 seconds total, which is slightly under a week. This is very doable! (In practice it should be faster than this.)
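And the streaming case, with the same deliberately pessimistic figures (1 TB, 30 read passes, 30 write passes, 100 MB/s):

```python
data_bytes = 10 ** 12              # ~1 TB of (window, position) records
throughput = 100 * 10 ** 6         # 100 MB/s sustained
passes = 30 + 30                   # 30 read passes plus 30 write passes

seconds = passes * data_bytes / throughput
print(f"{seconds:.0f} s, about {seconds / 86400:.1f} days")  # 600000 s, just under a week
```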
So here is the algorithm (a code sketch follows the list):
1. Start with a long list of digits, 3141...
2. Turn this into a much bigger file where each line is 20 digits, followed by the location where this appears in pi.
3. Sort this bigger file.
4. Search the sorted file for any duplications.
5. If any are found, return the first.
6. If none are found, repeat steps 1-3 with another big chunk of digits.
7. Merge this into the previous sorted file.
8. Repeat the search.
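Here is a minimal sketch of steps 2-5 in Python. It is not the literal implementation: it sorts in memory as a stand-in for the external merge sort, and the (window, position) record layout is an assumption; at terabyte scale you would write the records to a file and sort that with an external sort instead.

```python
def find_repeated_window(digits: str, width: int = 20):
    """Return (first position, second position, window) for the earliest
    repeated window of `width` digits, or None if there is no repeat."""
    # Step 2: one record per position -- the window and where it starts.
    records = sorted(
        (digits[i:i + width], i) for i in range(len(digits) - width + 1)
    )
    # Steps 4-5: after sorting, equal windows sit next to each other.
    best = None
    for (win_a, pos_a), (win_b, pos_b) in zip(records, records[1:]):
        if win_a == win_b:
            candidate = (min(pos_a, pos_b), max(pos_a, pos_b), win_a)
            if best is None or candidate[1] < best[1]:
                best = candidate   # keep the repeat that completes earliest
    return best

# Tiny demo with a made-up digit string and a 5-digit window.
print(find_repeated_window("31415926535897932384626433832795031415926", width=5))
```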
Now this is great, but what if we don't want to take a week? What if we want to throw multiple machines at it? This turns out to be surprisingly easy. There are well-known distributed sorting algorithms. If we split the initial file into chunks, we can parallelize both steps 1 and 4. And if after step 4 we don't find a match, then we can just repeat from the start with a bigger input chunk.
In fact this pattern is very common. All that really varies is turning the initial data into stuff to be sorted, and then looking at matching groups. This is the http://en.wikipedia.org/wiki/MapReduce algorithm. And this will work just fine for this problem.
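As a hedged sketch, the map and reduce halves for this problem might look like the following. The function names are mine and the code is framework-agnostic; a real run would plug the same logic into Hadoop, Spark, or a hand-rolled distributed sort.

```python
from itertools import groupby
from operator import itemgetter

def map_digits(chunk_offset: int, digits: str, width: int = 20):
    # Map: turn a chunk of pi's digits into (window, absolute position) records.
    for i in range(len(digits) - width + 1):
        yield digits[i:i + width], chunk_offset + i

def reduce_windows(sorted_records):
    # Reduce: after the shuffle/sort, records with the same window are adjacent;
    # any window seen at more than one position is a repeated sequence.
    for window, group in groupby(sorted_records, key=itemgetter(0)):
        positions = [pos for _, pos in group]
        if len(positions) > 1:
            yield window, sorted(positions)

# Toy run: two chunks, with a plain sort standing in for the shuffle phase.
records = sorted(
    list(map_digits(0, "3141592653", width=3)) + list(map_digits(10, "5897931415", width=3))
)
print(list(reduce_windows(records)))
```

One detail a real job would need: adjacent chunks should overlap by width-1 digits so that windows straddling a chunk boundary are not missed.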