c# - Cannot extract <link> element using HtmlAgilityPack and XPath
I am using HTML Agility Pack to select out textual data from within an RSS XML. For every other node type (title, pubDate, guid, etc.) I can select out the inner text using XPath conventions, but when querying "//link" or indeed "item/link", empty strings are returned.
    public static IEnumerable<string> ExtractAllLinks(string rssSource)
    {
        // Create a new document.
        var document = new HtmlDocument();
        // Populate the document with the RSS file.
        document.LoadHtml(rssSource);
        // Select all of the required nodes.
        var itemNodes = document.DocumentNode.SelectNodes("item/link");
        // If no nodes were found, return an empty list; otherwise return the content of the nodes.
        return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
    }

Does anyone have an understanding of why this element behaves differently from the others?
Additional: running "item/link" returns 0 nodes. Running "//link" returns the correct number of nodes, but the inner text is 0 characters in length.
Using the test data below, "//name" returns a single record, "fred", but for "//link" a single record with an empty string is returned.
    <site><link>hello world</link><name>fred</name></site>

I believe this is because of the word "link". If I change it to "linkz", it works perfectly.
The workaround below works perfectly. However, I would like to understand why searching on "//link" does not work when other elements do.
    public static IEnumerable<string> ExtractAllLinks(string rssSource)
    {
        // Rename the <link> tags so the parser treats them as ordinary elements.
        rssSource = rssSource.Replace("<link>", "<link-renamed>");
        rssSource = rssSource.Replace("</link>", "</link-renamed>");
        // Create a new document.
        var document = new HtmlDocument();
        // Populate the document with the RSS file.
        document.LoadHtml(rssSource);
        // Select all of the required nodes.
        var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
        // If no nodes were found, return an empty list; otherwise return the content of the nodes.
        return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
    }
If you print DocumentNode.OuterHtml, you will see the problem:
    var html = @"<site><link>hello world</link><name>fred</name></site>";
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    Console.WriteLine(doc.DocumentNode.OuterHtml);

Output:
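    <site><link>hello world<name>fred</name></site>

Note that the closing </link> tag is gone: link happens to be one of the special tags* that HAP treats as self-closing, so the text "hello world" is not parsed as its content. You can alter this behavior by setting ElementsFlags before parsing the HTML, for example: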
    var html = @"<site><link>hello world</link><name>fred</name></site>";
    HtmlNode.ElementsFlags.Remove("link"); // remove link from the list of special tags
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    Console.WriteLine(doc.DocumentNode.OuterHtml);
    var links = doc.DocumentNode.SelectNodes("//link");
    foreach (HtmlNode link in links)
    {
        Console.WriteLine(link.InnerText);
    }

Output:
    <site><link>hello world</link><name>fred</name></site>
    hello world

*) The complete list of special tags besides link, included in the ElementsFlags dictionary by default, can be seen in the source code of HtmlNode.cs. Some of the popular ones among them are <meta>, <img>, <frame>, <input>, <form>, <option>, etc.
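For completeness, the ElementsFlags fix can be folded back into the method from the question. This is only a minimal sketch: the wrapper class name RssLinkExtractor is hypothetical, and it assumes the HtmlAgilityPack package is referenced. Keep in mind that HtmlNode.ElementsFlags is static, so removing "link" affects every document parsed afterwards in the same process.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical wrapper class, introduced only for this illustration.
public static class RssLinkExtractor
{
    public static IEnumerable<string> ExtractAllLinks(string rssSource)
    {
        // Stop HAP from treating <link> as a self-closing tag.
        // ElementsFlags is static, so this change is process-wide.
        HtmlNode.ElementsFlags.Remove("link");

        // Create a new document and populate it with the RSS source.
        var document = new HtmlDocument();
        document.LoadHtml(rssSource);

        // Select every <link> element anywhere in the document.
        var itemNodes = document.DocumentNode.SelectNodes("//link");

        // SelectNodes returns null when nothing matches, so guard against that.
        return itemNodes == null
            ? new List<string>()
            : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
    }
}
```

With the test data from the question, RssLinkExtractor.ExtractAllLinks("<site><link>hello world</link><name>fred</name></site>") should yield the single string "hello world", matching the output shown above, and the Replace-based workaround is no longer needed.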