I once wrote a crawler in .NET. To improve its scalability, I tried to take advantage of the asynchronous APIs of .NET.
System.Net.HttpWebRequest has the asynchronous API pair BeginGetResponse/EndGetResponse. However, this pair only yields the HTTP response headers and a Stream instance from which the response content can be extracted. So my strategy was: use BeginGetResponse/EndGetResponse to asynchronously get the response Stream, then use BeginRead/EndRead to asynchronously read the bytes from that Stream instance.
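A minimal sketch of that pattern (class and member names are illustrative; error handling and timeouts are omitted):

```csharp
using System;
using System.IO;
using System.Net;

class AsyncPageDownload
{
    private readonly byte[] _buffer = new byte[8192];
    private readonly MemoryStream _content = new MemoryStream();
    private Stream _responseStream;

    public void Start(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        // Asynchronously wait for the response headers.
        request.BeginGetResponse(OnGetResponse, request);
    }

    private void OnGetResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        WebResponse response = request.EndGetResponse(ar);
        _responseStream = response.GetResponseStream();
        // Kick off the first asynchronous read of the body.
        // Note: _buffer stays pinned until the matching EndRead completes.
        _responseStream.BeginRead(_buffer, 0, _buffer.Length, OnRead, null);
    }

    private void OnRead(IAsyncResult ar)
    {
        int bytesRead = _responseStream.EndRead(ar);
        if (bytesRead > 0)
        {
            _content.Write(_buffer, 0, bytesRead);
            // Keep reading until the server signals end of stream.
            _responseStream.BeginRead(_buffer, 0, _buffer.Length, OnRead, null);
        }
        else
        {
            _responseStream.Close(); // end of body; _content holds the page
        }
    }
}
```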
Everything seemed perfect until the crawler went into stress testing. Under stress, the crawler suffered from high memory usage. I checked the memory with WinDbg + SOS and found out that lots of byte arrays were pinned by System.Threading.OverlappedData instances. After some searching on the internet, I found this KB from Microsoft: http://support.microsoft.com/kb/947862.
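For reference, this is roughly the SOS session that reveals the problem (the commands are standard SOS; `mscorwks` applies to .NET 2.0, `clr` to .NET 4+, and `<address>` stands for an instance address taken from the dump output):

```
$$ Load SOS for the workstation CLR (.NET 2.0; use ".loadby sos clr" on .NET 4+)
.loadby sos mscorwks
$$ List the OverlappedData instances that pin the read buffers
!dumpheap -type System.Threading.OverlappedData
$$ Trace which roots keep one instance (and its pinned byte[]) alive
!gcroot <address>
```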
According to the KB, the number of outstanding asynchronous I/O operations should have an "upper bound", but it doesn't suggest what that bound should be. So, in my eyes, this KB helps nothing. This is obviously a .NET bug. Finally, I had to drop the idea of asynchronously extracting bytes from the response Stream and just do it synchronously.
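The fallback looks roughly like this (a sketch; the method name and buffer size are my choices):

```csharp
using System.IO;
using System.Net;

// Synchronous fallback: the headers can still be fetched asynchronously,
// but the body is read with blocking Read calls, so no buffer stays
// pinned while waiting for an asynchronous completion.
static byte[] ReadBodySynchronously(WebResponse response)
{
    using (Stream stream = response.GetResponseStream())
    using (MemoryStream content = new MemoryStream())
    {
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            content.Write(buffer, 0, bytesRead);
        }
        return content.ToArray();
    }
}
```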
The .NET library that allows Asynchronous IO with dot net sockets (Socket.BeginSend / Socket.BeginReceive / NetworkStream.BeginRead / NetworkStream.BeginWrite) must have an upper bound on the amount of buffers outstanding (either send or receive) with their asynchronous IO.

The network application should have an upper bound on the number of outstanding asynchronous IO that it posts.
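If one wanted to follow that advice instead of going synchronous, it would look something like this (a sketch; the limit of 64 outstanding reads is my assumption, since the KB names no value):

```csharp
using System;
using System.IO;
using System.Threading;

// Cap the number of outstanding asynchronous reads so that only a bounded
// number of buffers is pinned at any time. The limit of 64 is an assumed
// value; the KB does not suggest one.
static class ThrottledReader
{
    private static readonly Semaphore Outstanding = new Semaphore(64, 64);

    public static void BeginBoundedRead(Stream stream, byte[] buffer,
                                        Action<int> onBytesRead)
    {
        Outstanding.WaitOne(); // block until a read slot is free
        stream.BeginRead(buffer, 0, buffer.Length, ar =>
        {
            try
            {
                onBytesRead(stream.EndRead(ar));
            }
            finally
            {
                Outstanding.Release(); // free the slot for the next read
            }
        }, null);
    }
}
```

Note that WaitOne blocks the calling thread, so in a real crawler the throttle would belong at the point where new downloads are scheduled, not inside a completion callback.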
Edit: added some questions.
Does anybody have experience doing asynchronous I/O on Socket and NetworkStream?
Generally speaking, do production crawlers do their internet I/O synchronously or asynchronously?