c# - Cannot extract <link> element using HtmlAgilityPack and XPath -

- March 15, 2013

I am using HTML Agility Pack to select textual data out of RSS XML. For every other node type (title, pubDate, guid, etc.) I can select the inner text using XPath conventions, but when querying "//link" or indeed "item/link", empty strings are returned.

public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    // Create a new document.
    var document = new HtmlDocument();
    // Populate the document with the RSS file.
    document.LoadHtml(rssSource);
    // Select all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("item/link");
    // If no nodes are found, return an empty list; otherwise return the content of the nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

Does anyone have an understanding of why this element behaves differently from the others?

Additional: running "item/link" returns 0 nodes. Running "//link" returns the correct number of nodes, but the inner text of each is 0 characters in length.

Using the test data below, "//name" returns a single record, "fred", but for "//link" a single record with an empty string is returned.

<site><link>hello world</link><name>fred</name></site>

I think this is because of the word "link". If I change it to "linkz", it works perfectly.

The workaround below works perfectly. However, I would like to understand why searching on "//link" does not work when other elements do.

public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    // Rename the <link> elements so the parser does not treat them specially.
    rssSource = rssSource.Replace("<link>", "<link-renamed>");
    rssSource = rssSource.Replace("</link>", "</link-renamed>");
    // Create a new document.
    var document = new HtmlDocument();
    // Populate the document with the RSS file.
    document.LoadHtml(rssSource);
    // Select all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
    // If no nodes are found, return an empty list; otherwise return the content of the nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

If you print DocumentNode.OuterHtml, you will see the problem:

var html = @"<site><link>hello world</link><name>fred</name></site>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);

Output:

<site><link>hello world<name>fred</name></site>
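To see how the parser actually nested the input, a small sketch like the following can dump the node tree. The `TreeDump` class and `Dump` helper are hypothetical names introduced for illustration; only `HtmlDocument`, `LoadHtml`, `DocumentNode`, `ChildNodes`, `NodeType`, `Name` and `InnerText` come from HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class TreeDump
{
    // Prints one line per node, indented by depth, so the nesting chosen by the parser is visible.
    public static void Dump(HtmlNode node, int depth = 0)
    {
        string indent = new string(' ', depth * 2);
        Console.WriteLine($"{indent}{node.NodeType} '{node.Name}' InnerText='{node.InnerText}'");
        foreach (HtmlNode child in node.ChildNodes)
        {
            Dump(child, depth + 1);
        }
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<site><link>hello world</link><name>fred</name></site>");
        Dump(doc.DocumentNode);
    }
}
```

Checking where the "hello world" text node ends up in this dump, relative to the link element, shows why InnerText on the link node comes back empty.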

link happens to be one of the special tags* that HAP treats as a self-closing tag. You can alter this behavior by changing ElementsFlags before parsing the HTML, for example:

var html = @"<site><link>hello world</link><name>fred</name></site>";
HtmlNode.ElementsFlags.Remove("link");  // remove link from the list of special tags
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
    Console.WriteLine(link.InnerText);
}


Output:

<site><link>hello world</link><name>fred</name></site>
hello world
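Because ElementsFlags is a static member, removing "link" affects every document parsed afterwards in the same process. A minimal sketch of the question's method that removes the flag only for the duration of one parse and then restores it might look like this. It assumes a HtmlAgilityPack version where ElementsFlags is a `Dictionary<string, HtmlElementFlag>`; the `RssLinks` class name is hypothetical, and the approach is not thread-safe if other threads parse HTML at the same time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RssLinks
{
    public static List<string> ExtractAllLinks(string rssSource)
    {
        // Remember the current flag (if any) so it can be put back afterwards.
        bool hadFlag = HtmlNode.ElementsFlags.TryGetValue("link", out HtmlElementFlag savedFlag);
        HtmlNode.ElementsFlags.Remove("link");
        try
        {
            var document = new HtmlDocument();
            document.LoadHtml(rssSource);
            var itemNodes = document.DocumentNode.SelectNodes("//link");
            // SelectNodes may return null when nothing matches.
            return itemNodes == null
                ? new List<string>()
                : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
        }
        finally
        {
            // Restore the global state for any code that parses real HTML later.
            if (hadFlag)
            {
                HtmlNode.ElementsFlags["link"] = savedFlag;
            }
        }
    }
}
```

Restoring the flag matters mainly when the same application also parses ordinary HTML pages, where treating link as self-closing is the desired behavior.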

*) The complete list of special tags besides link that are included in the ElementsFlags dictionary by default can be seen in the source code of HtmlNode.cs. Some of the popular ones among them are <meta>, <img>, <frame>, <input>, <form>, <option>, etc.
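As an alternative to reading the source, the defaults can also be listed at runtime. This is a small sketch assuming ElementsFlags is a `Dictionary<string, HtmlElementFlag>`, as in recent HtmlAgilityPack versions; the `FlagList` class name is hypothetical, and the exact entries printed depend on the installed version.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class FlagList
{
    public static void Main()
    {
        // Print each special tag with its flags, sorted by tag name.
        foreach (var entry in HtmlNode.ElementsFlags.OrderBy(e => e.Key))
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```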
