C# - Cannot extract <link> element using HtmlAgilityPack and XPath
I am using Html Agility Pack to select textual data from within RSS XML. For every other node type (title, pubDate, guid, etc.) I can select the inner text using XPath conventions, but when querying "//link" or indeed "item/link", empty strings are returned.
    public static IEnumerable<string> ExtractAllLinks(string rssSource)
    {
        // Create a new document.
        var document = new HtmlDocument();
        // Populate the document with the RSS file.
        document.LoadHtml(rssSource);
        // Select out all of the required nodes.
        var itemNodes = document.DocumentNode.SelectNodes("item/link");
        // If 0 nodes are found, return an empty list; otherwise return the content of the nodes.
        return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
    }
Does anyone have an understanding of why this element behaves differently from the others?
Additional: running "item/link" returns 0 nodes. Running "//link" returns the correct number of nodes, but the inner text of each is 0 characters in length.
Using the test data below, "//name" returns a single record, "fred", but for "//link" a single record containing an empty string is returned.
<site><link>hello world</link><name>fred</name></site>
I think it is because of the word "link". If I change it to "linkz", it works perfectly.
The workaround below works perfectly. However, I would like to understand why searching on "//link" does not work when other elements do.
    public static IEnumerable<string> ExtractAllLinks(string rssSource)
    {
        // Rename the <link> tags so the parser does not treat them specially.
        rssSource = rssSource.Replace("<link>", "<link-renamed>");
        rssSource = rssSource.Replace("</link>", "</link-renamed>");
        // Create a new document.
        var document = new HtmlDocument();
        // Populate the document with the RSS file.
        document.LoadHtml(rssSource);
        // Select out all of the required nodes.
        var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
        // If 0 nodes are found, return an empty list; otherwise return the content of the nodes.
        return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
    }
If you print DocumentNode.OuterHtml, you will see the problem:
    var html = @"<site><link>hello world</link><name>fred</name></site>";
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    Console.WriteLine(doc.DocumentNode.OuterHtml);
Output:
<site><link>hello world<name>fred</name></site>
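link happens to be one of the special tags* that HAP treats as self-closing, so the closing </link> is dropped and the text ends up outside the element. You can alter this behavior by changing ElementsFlags before parsing the HTML, for example: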
    var html = @"<site><link>hello world</link><name>fred</name></site>";
    // Remove link from the list of special tags.
    HtmlNode.ElementsFlags.Remove("link");
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    Console.WriteLine(doc.DocumentNode.OuterHtml);
    var links = doc.DocumentNode.SelectNodes("//link");
    foreach (HtmlNode link in links)
    {
        Console.WriteLine(link.InnerText);
    }
Output:
<site><link>hello world</link><name>fred</name></site>
hello world
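One thing to keep in mind is that ElementsFlags is a static dictionary, so removing "link" affects every document parsed afterwards in the same process. If that matters, a possible approach is to remove the flag only for the duration of one parse and then put it back. The following is a minimal sketch under that assumption; the class name RssLinkExtractor and the method name ExtractAllLinksScoped are made up for illustration, and the code is not thread-safe, since another thread parsing at the same time would see the temporary change.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class RssLinkExtractor
{
    public static List<string> ExtractAllLinksScoped(string rssSource)
    {
        // Remember the current flag for "link" (if any) so it can be restored later.
        bool hadFlag = HtmlNode.ElementsFlags.TryGetValue("link", out HtmlElementFlag savedFlag);
        HtmlNode.ElementsFlags.Remove("link");
        try
        {
            var document = new HtmlDocument();
            document.LoadHtml(rssSource);

            // SelectNodes returns null when nothing matches.
            var nodes = document.DocumentNode.SelectNodes("//link");
            return nodes == null
                ? new List<string>()
                : nodes.Select(node => node.InnerText).ToList();
        }
        finally
        {
            // Restore the global state so later HTML parsing behaves as before.
            if (hadFlag)
            {
                HtmlNode.ElementsFlags["link"] = savedFlag;
            }
        }
    }
}
```

Calling RssLinkExtractor.ExtractAllLinksScoped with the test data from the question should give a list containing the single string "hello world", while real HTML pages parsed later still get the default handling of <link>.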
*) The complete list of special tags besides link, which are included in the ElementsFlags dictionary by default, can be seen in the source code of HtmlNode.cs. Some of the more popular among them are <meta>, <img>, <frame>, <input>, <form>, <option>, etc.
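Since the input in the question is RSS, which is XML rather than HTML, another option may be to avoid the HTML-specific tag handling altogether and read the feed with an XML parser. The sketch below uses System.Xml.Linq from the .NET base class library instead of HAP; the class name RssXmlLinks is made up for illustration. It assumes the feed is well-formed XML (XDocument.Parse throws an XmlException otherwise) and that the link elements have no XML namespace, as in the test data above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class RssXmlLinks
{
    public static List<string> ExtractAllLinks(string rssSource)
    {
        // Parse the feed as XML; malformed input raises an XmlException.
        XDocument document = XDocument.Parse(rssSource);

        // Collect the text of every un-namespaced <link> element at any depth.
        return document.Descendants("link")
                       .Select(element => element.Value)
                       .ToList();
    }
}
```

For the test data from the question, RssXmlLinks.ExtractAllLinks returns a list with the single entry "hello world". The trade-off is that an XML parser is strict, so this only suits feeds that are valid XML, whereas HAP tolerates broken markup.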