I'm working on a project that involves some basic web crawling. I've been using HttpWebRequest and HttpWebResponse quite successfully. For cookie handling I just have one CookieContainer that I assign to HttpWebRequest.CookieContainer each time. It automatically gets populated with the new cookies each time and requires no additional handling from me. This had all been working fine until a little while ago, when one of the web sites that used to work suddenly stopped working. I'm reasonably sure it's a problem with the cookies, but I didn't keep a record of the cookies from when it used to work, so I'm not 100% sure.
I've managed to simulate the issue as I see it with the following code:
CookieContainer cookieJar = new CookieContainer();
Uri uri1 = new Uri("http://www.somedomain.com/some/path/page1.html");
CookieCollection cookies1 = new CookieCollection();
cookies1.Add(new Cookie("NoPathCookie", "Page1Value"));
cookies1.Add(new Cookie("CookieWithPath", "Page1Value", "/some/path/"));
Uri uri2 = new Uri("http://www.somedomain.com/some/path/page2.html");
CookieCollection cookies2 = new CookieCollection();
cookies2.Add(new Cookie("NoPathCookie", "Page2Value"));
cookies2.Add(new Cookie("CookieWithPath", "Page2Value", "/some/path/"));
Uri uri3 = new Uri("http://www.somedomain.com/some/path/page3.html");
// Add the cookies from page1.html
cookieJar.Add(uri1, cookies1);
// Add the cookies from page2.html
cookieJar.Add(uri2, cookies2);
// We should now have 3 cookies
Console.WriteLine(string.Format("CookieJar contains {0} cookies", cookieJar.Count));
Console.WriteLine(string.Format("Cookies to send to page1.html: {0}", cookieJar.GetCookieHeader(uri1)));
Console.WriteLine(string.Format("Cookies to send to page2.html: {0}", cookieJar.GetCookieHeader(uri2)));
Console.WriteLine(string.Format("Cookies to send to page3.html: {0}", cookieJar.GetCookieHeader(uri3)));
This simulates visiting two pages, both of which set two cookies. It then checks which of those cookies would be sent to each of three pages.
Of the two cookies, one is set without specifying a path and the other has a path specified. When a path is not specified, I had assumed that the cookie would be sent back to any page in that domain, but it seems to only get sent back to that specific page. I'm now assuming that is correct as it is consistent.
The main problem for me is the handling of cookies with a path specified. Surely, if a path is specified then the cookie should be sent to any page contained within that path. So, in the code above, 'CookieWithPath' should be valid for any page within /some/path/, which includes page1.html, page2.html and page3.html. Certainly if you comment out the two 'NoPathCookie' instances, then the 'CookieWithPath' gets sent to all three pages as I would expect. However, with the inclusion of 'NoPathCookie' as above, then 'CookieWithPath' only gets sent to page2.html and page3.html, but not page1.html.
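To be explicit about the rule I'm expecting, here is a minimal sketch of plain prefix matching on the path. The class `CookiePathSketch` and the helper `PathMatches` are hypothetical names of my own; this only illustrates my expectation and is not a claim about how CookieContainer works internally.

```csharp
using System;

static class CookiePathSketch
{
    // Hypothetical helper: true when the request path starts with the cookie's path.
    // This is the behaviour I expect, not necessarily what CookieContainer does.
    public static bool PathMatches(string cookiePath, string requestPath)
    {
        return requestPath.StartsWith(cookiePath, StringComparison.Ordinal);
    }

    static void Main()
    {
        string cookiePath = "/some/path/";
        string[] pages = new string[] { "page1.html", "page2.html", "page3.html" };

        foreach (string page in pages)
        {
            // Uri.AbsolutePath yields e.g. "/some/path/page1.html"
            Uri uri = new Uri("http://www.somedomain.com/some/path/" + page);
            Console.WriteLine("{0}: {1}", uri.AbsolutePath, PathMatches(cookiePath, uri.AbsolutePath));
        }
    }
}
```

Under this rule all three pages match "/some/path/", which is why I expected 'CookieWithPath' to be sent to page1.html as well.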
Why is this, and is it correct?
Searching for this issue I have come across discussion about a problem with domain handling in CookieContainer, but have not been able to find any discussion about path handling.
I'm using Visual Studio 2005 / .NET 2.0.