So there are a couple of problems here. Others have already commented on Windows' IO caching as well as the actual hardware cache, so I'm going to leave that alone.
The other issue is that you're measuring the combined operations of read() + parse() and comparing that to the speed of just read(). Essentially you need to be conscious of the fact that A + B will always take longer than A alone (assuming both are non-negative).
So to find out if you are IO bound, you need to find out how long it takes just to read the file. You've done that. On my machine your test reads the file in about 220ms.
Now you need to measure how long it takes to parse that many strings. This is a little trickier to isolate, so let's leave read and parse together and subtract the read time from the combined time. Further, we aren't trying to measure what you're doing with the data, just the parsing, so throw out the List&lt;double&gt; and List&lt;int&gt; and let's just parse. Running this on my machine takes about 1000ms; less the 220ms for reading, your parse code takes about 780ms per 1 million rows.
So why is it so slow (3-4x slower than the read)? Again, let's eliminate some stuff: comment out the int.Parse and the double.Parse and run again. That's a lot better: 460ms, less the read time of 220ms, puts us at 240ms to parse. Of course at this point the 'parse' is only calling string.Split(). Hrmmm... looks like string.Split will cost you about as much as the disk IO, which is not surprising considering how .NET deals with strings.
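If you want to reproduce that breakdown, something like the following will isolate the three measurements. This is only a sketch: the file name is a placeholder and the "{double} {int}" line format is taken from your test data.

static void TimeStages(string fileName)
{
    // Stage 1: read only.
    var sw = System.Diagnostics.Stopwatch.StartNew();
    using (var reader = new StreamReader(fileName))
        while (reader.ReadLine() != null) { }
    Console.WriteLine("read only:    {0} ms", sw.ElapsedMilliseconds);

    // Stage 2: read + Split.
    sw = System.Diagnostics.Stopwatch.StartNew();
    using (var reader = new StreamReader(fileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            line.Split(' ');
    }
    Console.WriteLine("read + split: {0} ms", sw.ElapsedMilliseconds);

    // Stage 3: read + Split + Parse (the full per-line work, minus the Lists).
    sw = System.Diagnostics.Stopwatch.StartNew();
    using (var reader = new StreamReader(fileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split(' ');
            double d = double.Parse(parts[0]);
            int n = int.Parse(parts[1]);
        }
    }
    Console.WriteLine("read + parse: {0} ms", sw.ElapsedMilliseconds);
}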
So can C# parse as fast or faster than it can read from the disk? Well yes, it can, but you're going to have to get nasty. You see, int.Parse and double.Parse suffer from the fact that they are culture aware. Because of this, and because those parse routines handle many formats, they are somewhat expensive at the scale of your example. That is to say, we are parsing a double and an int every microsecond (one-millionth of a second), which isn't bad normally.
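One cheap mitigation before getting nasty: pass an explicit invariant culture and a restricted NumberStyles so the framework parsers skip the current-culture lookup and most of the format handling. Don't expect it to close the whole gap; measure it yourself. A sketch, assuming parts comes from line.Split(' ') as above:

// Requires: using System.Globalization;
string[] parts = line.Split(' ');
double d = double.Parse(parts[0], NumberStyles.Float, CultureInfo.InvariantCulture);
int    n = int.Parse(parts[1], NumberStyles.Integer, CultureInfo.InvariantCulture);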
So to match the speed of the disk read (and thus be IO bound) we need to rewrite how you process a text line. Here is a nasty example, but it works for your input...
int len = line.Length;
fixed (char* ln = line)                 // requires an unsafe context
{
    double d;
    long a = 0, b = 0;
    int ix = 0;

    // integer part of the double
    while (ix < len && char.IsNumber(ln[ix]))
        a = a * 10 + (ln[ix++] - '0');

    // optional fractional part
    if (ix < len && ln[ix] == '.')
    {
        ix++;
        long div = 1;
        while (ix < len && char.IsNumber(ln[ix]))
        {
            b = b * 10 + (ln[ix++] - '0');
            div *= 10;
        }
        d = a + ((double)b) / div;
    }
    else
        d = a;

    // skip the separating whitespace
    while (ix < len && char.IsWhiteSpace(ln[ix]))
        ix++;

    // the trailing int
    int i = 0;
    while (ix < len && char.IsNumber(ln[ix]))
        i = i * 10 + (ln[ix++] - '0');
}
Running this crappy code produces a runtime of about 450ms, or roughly 2x the read time. So, pretending for a moment that you thought the above code fragment was acceptable (which, god, I hope you don't), you could have one thread reading strings and another parsing, and you would be close to being IO bound. Put two threads on parsing and you will be IO bound. Should you do this is another question altogether.
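If you did want to try the reader-thread/parser-thread split, a BlockingCollection&lt;string&gt; makes it fairly painless on .NET 4. This is only a sketch of the idea, not code I'm recommending; the file name and line format are the same assumptions as before.

// One thread reads lines, one thread parses them; the bounded collection
// keeps the reader from racing too far ahead of the parser.
// Requires: System.Collections.Concurrent and System.Threading.Tasks.
static void ReadAndParseOverlapped(string fileName)
{
    var lines = new BlockingCollection<string>(10000);

    Task reader = Task.Factory.StartNew(() =>
    {
        foreach (string line in File.ReadLines(fileName))
            lines.Add(line);
        lines.CompleteAdding();
    });

    Task parser = Task.Factory.StartNew(() =>
    {
        foreach (string line in lines.GetConsumingEnumerable())
        {
            string[] parts = line.Split(' ');
            double d = double.Parse(parts[0]);
            int n = int.Parse(parts[1]);
        }
    });

    Task.WaitAll(reader, parser);
}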
So let's go back to your original question:
It is known that if you read data from disc you are IO bound and you can process/parse the read data much faster than you can read it from disc.
But this common wisdom (myth?)
Well, no, I would not call this a myth. In fact, I would argue that your original code is still IO bound. You happen to be running your test in isolation, so the impact is small: only about 1/6th of the time is spent reading from the device. But consider what would happen if that disk is busy. What if your antivirus scanner is churning through every file? Simply put, your program would slow down with the increased HDD activity, and it could become IO bound.
IMHO, the reason for this "common wisdom" is this:
It's easier to get IO bound on writes than on reads.
Writing to the device takes longer and is generally more expensive than producing the data. If you want to see IO bound in action, look at your CreateTestData method: it takes 2x as long to write the data to disk as it does to just call String.Format(...). And that is with full caching. Turn caching off (FileOptions.WriteThrough) and try it again... now CreateTestData is 3x-4x slower. Try it for yourself with the following methods:
static int CreateTestData(string fileName)
{
    FileStream fstream = new FileStream(fileName, FileMode.Create, FileAccess.Write,
                                        FileShare.None, 4096, FileOptions.WriteThrough);
    using (StreamWriter writer = new StreamWriter(fstream, Encoding.UTF8))
    {
        for (int i = 0; i < linecount; i++)
        {
            writer.WriteLine("{0} {1}", 1.1d + i, i);
        }
    }
    return linecount;
}
static int PrintTestData(string fileName)
{
    for (int i = 0; i < linecount; i++)
    {
        String.Format("{0} {1}", 1.1d + i, i);
    }
    return linecount;
}
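A trivial driver to time the two, assuming the same linecount field the methods above already use (the file name is a placeholder):

static void Main()
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    CreateTestData("test.txt");
    Console.WriteLine("CreateTestData: {0} ms", sw.ElapsedMilliseconds);

    sw = System.Diagnostics.Stopwatch.StartNew();
    PrintTestData("test.txt");
    Console.WriteLine("PrintTestData:  {0} ms", sw.ElapsedMilliseconds);
}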
This is just for starters. If you really want to get IO bound, start using direct IO; see the documentation for CreateFile and FILE_FLAG_NO_BUFFERING. Writing gets much slower as you start to bypass hardware caches and wait for IO completion. This is one major reason why a traditional database is very slow to write to: it must force the hardware to complete the write and wait upon it. Only then can it call a transaction 'committed', because the data is then in the file on the physical device.
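For what it's worth, FileOptions has no named member for FILE_FLAG_NO_BUFFERING, but the raw flag value (0x20000000) is passed through to CreateFile, so you can experiment with it from managed code. Treat this as a sketch: with that flag set, your reads/writes and buffer sizes have to be sector-aligned, which is exactly the kind of pain that makes direct IO slow and awkward.

// FILE_FLAG_NO_BUFFERING is not exposed as a named FileOptions value,
// so cast the raw Win32 flag. Buffers must then be sector-sized multiples.
const FileOptions NoBuffering = (FileOptions)0x20000000;

using (var fs = new FileStream("data.txt", FileMode.Create, FileAccess.Write,
                               FileShare.None, 4096,
                               NoBuffering | FileOptions.WriteThrough))
{
    byte[] sector = new byte[4096];   // assume a 4096-byte sector size
    fs.Write(sector, 0, sector.Length);
}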
UPDATED
Ok Alois, it appears you're just looking for how fast you can go. To go any faster you need to stop dealing with strings and characters and remove the allocations. The following code improves upon the line/character parser above by about an order of magnitude (adding about 30ms over just counting lines) while allocating only a single buffer on the heap.
WARNING You need to realize I'm demonstrating that it can be done fast. I'm not advising you to go down this road. This code has some serious limitations and/or bugs. What happens when you hit a double in the form of "1.2589E+19"? Frankly, I think you should stick with your original code and not worry about trying to optimize it this much. Either that or change the file format to binary instead of text (see BinaryWriter). If you are using binary, you can use a variation of the following code with BitConverter.ToDouble/ToInt32 and it would be even faster.
private static unsafe int ParseFast(string data)
{
    int count = 0, valid = 0, pos, stop, temp;
    byte[] buffer = new byte[ushort.MaxValue];

    const byte Zero = (byte)'0';
    const byte Nine = (byte)'9';
    const byte Dot = (byte)'.';
    const byte Space = (byte)' ';
    const byte Tab = (byte)'\t';
    const byte Line = (byte)'\n';

    fixed (byte* ptr = buffer)
    using (Stream reader = File.OpenRead(data))
    {
        while (0 != (temp = reader.Read(buffer, valid, buffer.Length - valid)))
        {
            valid += temp;
            pos = 0;
            stop = Math.Min(buffer.Length - 1024, valid);
            while (pos < stop)
            {
                double d;
                long a = 0, b = 0;

                // integer part of the double
                while (pos < valid && ptr[pos] >= Zero && ptr[pos] <= Nine)
                    a = a * 10 + (ptr[pos++] - Zero);

                // optional fractional part
                if (pos < valid && ptr[pos] == Dot)
                {
                    pos++;
                    long div = 1;
                    while (pos < valid && ptr[pos] >= Zero && ptr[pos] <= Nine)
                    {
                        b = b * 10 + (ptr[pos++] - Zero);
                        div *= 10;
                    }
                    d = a + ((double)b) / div;
                }
                else
                    d = a;

                // skip the separating whitespace
                while (pos < valid && (ptr[pos] == Space || ptr[pos] == Tab))
                    pos++;

                // the trailing int
                int i = 0;
                while (pos < valid && ptr[pos] >= Zero && ptr[pos] <= Nine)
                    i = i * 10 + (ptr[pos++] - Zero);

                DoSomething(d, i);
                count++;

                // advance to the start of the next line
                while (pos < stop && ptr[pos] != Line)
                    pos++;
                while (pos < stop && !(ptr[pos] >= Zero && ptr[pos] <= Nine))
                    pos++;
            }

            // move the partial line at the end of the buffer to the front
            if (pos < valid)
                Buffer.BlockCopy(buffer, pos, buffer, 0, valid - pos);
            valid -= pos;
        }
    }
    return count;
}
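And since the warning above suggests switching to a binary format: here is a minimal sketch of what that looks like with BinaryWriter/BinaryReader, assuming a fixed record of one double followed by one int and the same DoSomething consumer used in ParseFast. No Split, no Parse, no per-line string allocations.

static void WriteBinary(string fileName, int lineCount)
{
    using (var writer = new BinaryWriter(File.Create(fileName)))
        for (int i = 0; i < lineCount; i++)
        {
            writer.Write(1.1d + i);   // 8 bytes
            writer.Write(i);          // 4 bytes
        }
}

static void ReadBinary(string fileName)
{
    using (var reader = new BinaryReader(File.OpenRead(fileName)))
        while (reader.BaseStream.Position < reader.BaseStream.Length)
        {
            double d = reader.ReadDouble();
            int i = reader.ReadInt32();
            DoSomething(d, i);        // same consumer as ParseFast above
        }
}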