Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c# - HtmlAgilityPack WebGet.Load gives error "Object reference not set to an instance of an object"

I am working on a project that collects new car prices from dealers' websites. I can fetch the HTML of most sites, but for one of them the webGet.Load(url) call (webGet is an HtmlWeb instance) throws "Object reference not set to an instance of an object." I couldn't find any difference between these websites.

Examples of URLs that load normally:

http://www.renault.com.tr/page.aspx?id=1715

http://www.hyundai.com.tr/tr/Content.aspx?id=fiyatlistesi

The problematic website:

http://www.fiat.com.tr/Pages/tr/otomobiller/grandepunto_fiyat.aspx

Thank you for your help.

var webGet = new HtmlWeb();  
var document = webGet.Load("http://www.fiat.com.tr/Pages/tr/otomobiller/grandepunto_fiyat.aspx");

With this URL the document is not loaded.


1 Reply

The actual problem is in the HtmlAgilityPack internals. The page that fails has this meta content type: <META http-equiv="Content-Type" content="text/html; charset=8859-9"> where charset=8859-9 seems to be incorrect. HtmlAgilityPack tries to get an encoding for this string with something like Encoding.GetEncoding("8859-9"), and this throws an error (I think the actual encoding should be iso-8859-9).
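To illustrate the failure mode described above, here is a minimal sketch of resolving a declared charset defensively. CharsetHelper and Resolve are hypothetical names, not part of HtmlAgilityPack; the only library call relied on is Encoding.GetEncoding, which throws ArgumentException for a name it does not recognize.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper (not an HtmlAgilityPack API): resolves a declared
    // charset name and returns the caller's fallback instead of null or an
    // exception when the name is missing or unknown.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }

        try
        {
            return Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            // Unrecognized name such as "8859-9": use the fallback.
            return fallback;
        }
    }
}
```

For example, CharsetHelper.Resolve("8859-9", Encoding.UTF8) would hand back the UTF-8 fallback rather than failing. Note that on .NET Core and .NET 5+ legacy code pages such as iso-8859-9 are typically only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).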

All you actually need is to tell HtmlAgilityPack not to read the encoding declared in the document (just set HtmlDocument.OptionReadEncoding = false), but this seems to be impossible with HtmlWeb.Load (setting HtmlWeb.AutoDetectEncoding doesn't help here). So the workaround is to read the URL manually (the simplest way):

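// Required namespaces for this snippet.
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

var document = new HtmlDocument();
// Do not let the parser read the (invalid) charset from the META tag.
document.OptionReadEncoding = false;

var url =
   new Uri("http://www.fiat.com.tr/Pages/tr/otomobiller/grandepunto_fiyat.aspx");
var request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
using (var response = (HttpWebResponse)request.GetResponse())
{
    using (var stream = response.GetResponseStream())
    {
        // Supply the correct encoding explicitly instead of the declared one.
        document.Load(stream, Encoding.GetEncoding("iso-8859-9"));
    }
}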

This works, and successfully parses the page.

EDIT: @Simon Mourier: yes, it raises a NullReferenceException because it catches the ArgumentException and sets _declaredencoding = null there. The later _declaredencoding.WindowsCodePage access then throws the null reference.

Here is the relevant code block from the ReadDocumentEncoding method in HtmlDocument.cs:

try
{
    _declaredencoding = Encoding.GetEncoding(charset);
}
catch (ArgumentException)
{
    _declaredencoding = null;
}
if (_onlyDetectEncoding)
{
    throw new EncodingFoundException(_declaredencoding);
}

if (_streamencoding != null)
{
    if (_declaredencoding.WindowsCodePage != _streamencoding.WindowsCodePage)
    {
        AddError(
            HtmlParseErrorCode.CharsetMismatch,
            _line, _lineposition,
            _index, node.OuterHtml,
            "Encoding mismatch between StreamEncoding: " +
            _streamencoding.WebName + " and DeclaredEncoding: " +
            _declaredencoding.WebName);
    }
}
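As a rough illustration of why the excerpt above fails, the following sketch isolates the comparison and adds the null check that the excerpt lacks. EncodingCheck and DescribeMismatch are hypothetical names written for this answer, not HtmlAgilityPack code, and this is only one possible way to guard the comparison.

```csharp
using System.Text;

public static class EncodingCheck
{
    // Hypothetical stand-in for the comparison in ReadDocumentEncoding.
    // Returns a mismatch message, or null when there is nothing to report.
    public static string DescribeMismatch(Encoding declared, Encoding stream)
    {
        // Guard: a charset that failed to resolve leaves 'declared' null,
        // so it must be checked before WindowsCodePage is read.
        if (declared == null || stream == null)
        {
            return null;
        }

        if (declared.WindowsCodePage == stream.WindowsCodePage)
        {
            return null;
        }

        return "Encoding mismatch between StreamEncoding: " +
               stream.WebName + " and DeclaredEncoding: " +
               declared.WebName;
    }
}
```

With the guard in place, a null declared encoding is simply skipped instead of raising a NullReferenceException.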

And here is my stack trace:

System.NullReferenceException was unhandled
  Message=Object reference not set to an instance of an object.
  Source=HtmlAgilityPack
  StackTrace:
       at HtmlAgilityPack.HtmlDocument.ReadDocumentEncoding(HtmlNode node) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1916
       at HtmlAgilityPack.HtmlDocument.PushNodeEnd(Int32 index, Boolean close) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1805
       at HtmlAgilityPack.HtmlDocument.Parse() in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1468
       at HtmlAgilityPack.HtmlDocument.Load(TextReader reader) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 769
       at HtmlAgilityPack.HtmlDocument.Load(Stream stream, Boolean detectEncodingFromByteOrderMarks) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 597
       at HtmlAgilityPack.HtmlWeb.Get(Uri uri, String method, String path, HtmlDocument doc, IWebProxy proxy, ICredentials creds) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1515
       at HtmlAgilityPack.HtmlWeb.LoadUrl(Uri uri, String method, WebProxy proxy, NetworkCredential creds) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1563
       at HtmlAgilityPack.HtmlWeb.Load(String url, String method) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1152
       at HtmlAgilityPack.HtmlWeb.Load(String url) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1107
       at test.console.Program.Main(String[] args) in W:\Projects\Me\test.console\test.console\Program.cs:line 54
       at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
       at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException: 
