What is the best way to parse html in C#? [closed]

66 votes, 184 favorites

I'm looking for a library/method to parse an html file with more html specific features than generic xml parsing libraries.

closed as not constructive by Kev Nov 15 '11 at 17:09


As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. If this question can be reworded to fit the rules in the help center, please edit the question.





locked by Kev Nov 15 '11 at 17:09


This question exists because it has historical significance, but it is not considered a good, on-topic question for this site, so please do not use it as evidence that you can ask similar questions here. This question and its answers are frozen and cannot be changed. More info: help center.

Tags: c# .net html parsing html-content-extraction

edited Jan 3 '10 at 8:29 by Charles Stewart
asked Sep 11 '08 at 9:16 by Luke

15 Answers

Answer (138 votes):

Html Agility Pack

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to that of System.Xml, but for HTML documents (or streams).
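
As a rough illustration of the object model described above, the sketch below loads a deliberately sloppy HTML string and collects its link targets with an XPath query. It assumes the HtmlAgilityPack package is referenced; the class name LinkLister and the sample markup are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkLister
{
    // Returns the href value of every <a> element that has one.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode anchor in anchors)
            links.Add(anchor.GetAttributeValue("href", string.Empty));

        return links;
    }

    public static void Main()
    {
        // Hypothetical input: unclosed <p> tags and a missing </html>.
        string html = "<html><body><p>First <a href=\"/one\">one</a>"
                    + "<p>Second <a href='/two'>two</a></body>";

        foreach (string link in GetLinks(html))
            Console.WriteLine(link);
    }
}
```

The null check matters in practice: a page without any matching element would otherwise cause a NullReferenceException in the loop.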






• It's worth noting that it doesn't deal well with self-closing tags like <p> (which it interprets as empty) and really badly with optional end-tags like <li> (which it interprets as missing an end tag, and so nests consecutive li tags). – Eamon Nerbonne, May 14 '11 at 16:48

Answer (27 votes):

You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.

Another alternative would be to use the built-in engine, mshtml:

    using mshtml;
    ...
    // IHTMLDocument2.write expects an object array; html is the markup string
    object[] oPageText = { html };
    HTMLDocument doc = new HTMLDocumentClass();
    IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
    doc2.write(oPageText);

This allows you to use JavaScript-like functions such as getElementById().

• (4 votes) Call me crazy, but I am having trouble figuring out how to use mshtml. Do you have any good links? – Alex Baranosky, Jan 9 '09 at 5:52
• (1 vote) @Alex You need to reference Microsoft.mshtml. You can find a bit more info here: msdn.microsoft.com/en-us/library/aa290341(VS.71).aspx – Wilfred Knievel, Jan 12 '10 at 23:17
• I have a blog post about Tidy.Net and ManagedTidy; both are capable of parsing and validating (X)HTML files. If you do not need to validate anything, I'd go with the HtmlAgilityPack. jphellemons.nl/post/… – JP Hellemons, Oct 25 '11 at 7:03

Answer (16 votes):

I found a project called Fizzler that takes a jQuery/Sizzle approach to selecting HTML elements. It's based on Html Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors over nasty XPath.

http://code.google.com/p/fizzler/
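
To give a feel for the CSS-selector style, here is a minimal sketch. It assumes both the HtmlAgilityPack and Fizzler packages are referenced and that Fizzler's QuerySelectorAll extension method on HtmlNode is available through the Fizzler.Systems.HtmlAgilityPack namespace; the class name CssSelectDemo and the markup are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

public static class CssSelectDemo
{
    // Returns the inner text of every node matched by a CSS selector.
    public static List<string> SelectText(string html, string cssSelector)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // QuerySelectorAll is the extension method Fizzler adds to HtmlNode.
        return doc.DocumentNode
                  .QuerySelectorAll(cssSelector)
                  .Select(node => node.InnerText)
                  .ToList();
    }

    public static void Main()
    {
        // Hypothetical markup; every <li> is explicitly closed.
        string html = "<ul><li class=\"item\">Apple</li>"
                    + "<li class=\"item\">Pear</li><li>Other</li></ul>";

        foreach (string text in SelectText(html, "li.item"))
            Console.WriteLine(text);
    }
}
```

The equivalent XPath would need a class test such as contains(@class, 'item'), which is the kind of noise the selector syntax avoids.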





• (1 vote) Thank you, this looks interesting! I've been surprised, what with jQuery's popularity, that it has been so hard to find a C# project inspired by it. Now if only I could find something where document manipulation and more advanced traversal were also part of the package... :) – Funka, May 14 '10 at 1:33
• I just used this today and I have to say, it is very easy to use if you know jQuery. – Chi Chan, Oct 14 '10 at 20:56

Answer (10 votes):

You can do a lot without going nuts on third-party products and mshtml (i.e. interop). Use System.Windows.Forms.WebBrowser. From there, you can call methods such as GetElementById on an HtmlDocument or GetElementsByTagName on an HtmlElement. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (IMO a lesser evil than interop) to do it:

    var wb = new WebBrowser();

... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:

    var doc = wb.Document;
    var elem = doc.GetElementById(elementId);
    object obj = elem.DomElement;
    System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
    mi.Invoke(obj, new object[0]);

You can do similar reflection stuff to submit forms, etc.

Enjoy.

Answer (9 votes):

I've written some code that provides "LINQ to HTML" functionality and thought I would share it here. It is based on Majestic-12: it takes the Majestic-12 results and produces LINQ to XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:

    IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);

    foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {

        if (anchorTag.Attribute("href") == null)
            continue;

        Console.WriteLine(anchorTag.Attribute("href").Value);
    }

I wanted to use Majestic-12 because I know it has a lot of built-in knowledge about HTML as it is found in the wild. What I've found, though, is that mapping the Majestic-12 results to something that LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use it you will find pages that are rejected, and you'll need to fix up the code to address that. When an exception is thrown, check exception.Data["source"], as it is likely set to the HTML tag that caused the exception. Handling the HTML in a nice manner is at times not trivial...

So now that expectations are realistically low, here's the code :)


    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using Majestic12;
    using System.IO;
    using System.Xml.Linq;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    namespace Majestic12ToXml {
        public class Majestic12ToXml {

            static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {

                HTMLparser parser = OpenParser();
                parser.Init(htmlAsBytes);

                XElement currentNode = new XElement("document");

                HTMLchunk m12chunk = null;

                int xmlnsAttributeIndex = 0;

                // only filled in by the catch block below, for error reporting
                string originalHtml = "";

                while ((m12chunk = parser.ParseNext()) != null) {

                    try {

                        Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting

                        XNode newNode = null;
                        XElement newNodesParent = null;

                        switch (m12chunk.oType) {
                            case HTMLchunkType.OpenTag:

                                // Tags are added as a child to the current tag,
                                // except when the new tag implies the closure of
                                // some number of ancestor tags.

                                newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                                if (newNode != null) {
                                    currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                    newNodesParent = currentNode;

                                    newNodesParent.Add(newNode);

                                    currentNode = newNode as XElement;
                                }

                                break;

                            case HTMLchunkType.CloseTag:

                                if (m12chunk.bEndClosure) {

                                    newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                                    if (newNode != null) {
                                        currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                        newNodesParent = currentNode;
                                        newNodesParent.Add(newNode);
                                    }
                                }
                                else {
                                    XElement nodeToClose = currentNode;

                                    string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                                    while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                        nodeToClose = nodeToClose.Parent;

                                    if (nodeToClose != null)
                                        currentNode = nodeToClose.Parent;

                                    Debug.Assert(currentNode != null);
                                }

                                break;

                            case HTMLchunkType.Script:

                                newNode = new XElement("script", "REMOVED");
                                newNodesParent = currentNode;
                                newNodesParent.Add(newNode);
                                break;

                            case HTMLchunkType.Comment:

                                newNodesParent = currentNode;

                                if (m12chunk.sTag == "!--")
                                    newNode = new XComment(m12chunk.oHTML);
                                else if (m12chunk.sTag == "![CDATA[")
                                    newNode = new XCData(m12chunk.oHTML);
                                else
                                    throw new Exception("Unrecognized comment sTag");

                                newNodesParent.Add(newNode);

                                break;

                            case HTMLchunkType.Text:

                                currentNode.Add(m12chunk.oHTML);
                                break;

                            default:
                                break;
                        }
                    }
                    catch (Exception e) {
                        var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);

                        // the original html is copied for tracing/debugging purposes
                        originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                            .Take(m12chunk.iChunkLength)
                            .Select(B => (char)B).ToArray());

                        wrappedE.Data.Add("source", originalHtml);

                        throw wrappedE;
                    }
                }

                while (currentNode.Parent != null)
                    currentNode = currentNode.Parent;

                return currentNode.Nodes();
            }

            static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {

                string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                XElement discoveredParent = null;

                // Get a list of all ancestors
                List<XElement> ancestors = new List<XElement>();
                XElement ancestor = nextPotentialParent;
                while (ancestor != null) {
                    ancestors.Add(ancestor);
                    ancestor = ancestor.Parent;
                }

                // Check if the new tag implies a previous tag was closed.
                if ("form" == m12chunkCleanedTag) {

                    discoveredParent = ancestors
                        .Where(XE => m12chunkCleanedTag == XE.Name)
                        .Take(1)
                        .Select(XE => XE.Parent)
                        .FirstOrDefault();
                }
                else if ("td" == m12chunkCleanedTag) {

                    discoveredParent = ancestors
                        .TakeWhile(XE => "tr" != XE.Name)
                        .Where(XE => m12chunkCleanedTag == XE.Name)
                        .Take(1)
                        .Select(XE => XE.Parent)
                        .FirstOrDefault();
                }
                else if ("tr" == m12chunkCleanedTag) {

                    discoveredParent = ancestors
                        .TakeWhile(XE => !("table" == XE.Name
                            || "thead" == XE.Name
                            || "tbody" == XE.Name
                            || "tfoot" == XE.Name))
                        .Where(XE => m12chunkCleanedTag == XE.Name)
                        .Take(1)
                        .Select(XE => XE.Parent)
                        .FirstOrDefault();
                }
                else if ("thead" == m12chunkCleanedTag
                    || "tbody" == m12chunkCleanedTag
                    || "tfoot" == m12chunkCleanedTag) {

                    discoveredParent = ancestors
                        .TakeWhile(XE => "table" != XE.Name)
                        .Where(XE => m12chunkCleanedTag == XE.Name)
                        .Take(1)
                        .Select(XE => XE.Parent)
                        .FirstOrDefault();
                }

                return discoveredParent ?? nextPotentialParent;
            }

            static string CleanupTagName(string originalName, string originalHtml) {

                string tagName = originalName;

                tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >

                if (tagName.Contains(':'))
                    tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);

                return tagName;
            }

            static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);

            static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {

                result = null;
                string attributeName = originalName;

                if (string.IsNullOrEmpty(originalName))
                    return false;

                if (_startsAsNumeric.IsMatch(originalName))
                    return false;

                //
                // transform xmlns attributes so they don't actually create any XML namespaces
                //
                if (attributeName.ToLower().Equals("xmlns")) {

                    attributeName = "xmlns_" + xmlnsIndex.ToString();
                    xmlnsIndex++;
                }
                else {
                    if (attributeName.ToLower().StartsWith("xmlns:")) {
                        attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
                    }

                    //
                    // trim trailing "
                    //
                    attributeName = attributeName.TrimEnd(new char[] { '"' });

                    attributeName = attributeName.Replace(":", "_");
                }

                result = attributeName;

                return true;
            }

            static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
            static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
            static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"

            static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {

                if (string.IsNullOrEmpty(m12chunk.sTag)) {

                    if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
                        return new XElement("doctype");

                    if (_weirdTag.IsMatch(originalHtml))
                        return new XElement("REMOVED_weirdBlockParenthesisTag");

                    if (_aspnetPrecompiled.IsMatch(originalHtml))
                        return new XElement("REMOVED_ASPNET_PrecompiledDirective");

                    if (_shortHtmlComment.IsMatch(originalHtml))
                        return new XElement("REMOVED_ShortHtmlComment");

                    // Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
                    return null;
                }

                string tagName = CleanupTagName(m12chunk.sTag, originalHtml);

                XElement result = new XElement(tagName);

                List<XAttribute> attributes = new List<XAttribute>();

                for (int i = 0; i < m12chunk.iParams; i++) {

                    if (m12chunk.sParams[i] == "<!--") {

                        // an HTML comment was embedded within a tag. This comment and its contents
                        // will be interpreted as attributes by Majestic-12... skip these attributes
                        // up to and including the parameter that ends the comment
                        for (; i < m12chunk.iParams; i++) {

                            if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
                                break;
                        }

                        continue;
                    }

                    if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
                        continue;

                    string attributeName = m12chunk.sParams[i];

                    if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
                        continue;

                    attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
                }

                // If attributes are duplicated with different values, we complain.
                // If attributes are duplicated with the same value, we remove all but 1.
                var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);

                foreach (var duplicatedAttribute in duplicatedAttributes) {

                    if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
                        throw new Exception("Attribute value was given different values");

                    attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
                    attributes.Add(duplicatedAttribute.First());
                }

                result.Add(attributes);

                return result;
            }

            static HTMLparser OpenParser() {
                HTMLparser oP = new HTMLparser();

                // The code+comments in this function are from the Majestic-12 sample documentation.

                // ...

                // This is optional, but if you want high performance then you may
                // want to set chunk hash mode to FALSE. This would result in tag params
                // being added to string arrays in HTMLchunk object called sParams and sValues, with number
                // of actual params being in iParams. See code below for details.
                //
                // When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
                oP.SetChunkHashMode(false);

                // if you set this to true then original parsed HTML for given chunk will be kept -
                // this will reduce performance somewhat, but may be desirable in some cases where
                // reconstruction of HTML may be necessary
                oP.bKeepRawHTML = false;

                // if set to true (it is false by default), then entities will be decoded: this is essential
                // if you want to get strings that contain final representation of the data in HTML, however
                // you should be aware that if you want to use such strings into output HTML string then you will
                // need to do Entity encoding or same string may fail later
                oP.bDecodeEntities = true;

                // we have option to keep most entities as is - only replace stuff like &nbsp;
                // this is called Mini Entities mode - it is handy when HTML will need
                // to be re-created after it was parsed, though in this case really
                // entities should not be parsed at all
                oP.bDecodeMiniEntities = true;

                if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
                    oP.InitMiniEntities();

                // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
                // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
                // this only works if auto extraction is enabled
                oP.bAutoExtractBetweenTagsOnly = true;

                // if true then comments will be extracted automatically
                oP.bAutoKeepComments = true;

                // if true then scripts will be extracted automatically:
                oP.bAutoKeepScripts = true;

                // if this option is true then whitespace before start of tag will be compressed to single
                // space character in string: " ", if false then full whitespace before tag will be returned (slower)
                // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
                // a waste of CPU cycles
                oP.bCompressWhiteSpaceBeforeTag = true;

                // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
                // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
                // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
                // or open
                oP.bAutoMarkClosedTagsWithParamsAsOpen = false;

                return oP;
            }
        }
    }




• (1 vote) btw HtmlAgilityPack has worked well for me in the past, I just prefer LINQ. – Frank Schwieterman, Mar 8 '09 at 22:21
• What's the performance like when you add the LINQ conversion? Any idea how it compares with HtmlAgilityPack? – user29439, Aug 3 '11 at 22:42
• I never did a performance comparison. These days I use HtmlAgilityPack, much less hassle. Unfortunately the code above has lots of special cases I didn't bother to write tests for, so I can't really maintain it. – Frank Schwieterman, Aug 4 '11 at 0:40

Answer (7 votes):

The Html Agility Pack has been mentioned before - if you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.

Answer (3 votes):

I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:

SgmlReader

Answer (2 votes):

A solution with no third-party library: the WebBrowser class, run on its own STA thread so that it also works in a console application or in ASP.NET.

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Windows.Forms;
    using System.Threading;

    class ParseHTML
    {
        public ParseHTML() { }
        private string ReturnString;

        public string doParsing(string html)
        {
            // WebBrowser must be created on a single-threaded apartment thread
            Thread t = new Thread(TParseMain);
            t.SetApartmentState(ApartmentState.STA);
            t.Start((object)html);
            t.Join();
            return ReturnString;
        }

        private void TParseMain(object html)
        {
            WebBrowser wbc = new WebBrowser();
            wbc.DocumentText = "feces of a dummy"; // magic words
            HtmlDocument doc = wbc.Document.OpenNew(true);
            doc.Write((string)html);
            this.ReturnString = doc.Body.InnerHtml + " do here something";
            return;
        }
    }

Usage:

    string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
    Console.WriteLine("before:" + myhtml);
    myhtml = (new ParseHTML()).doParsing(myhtml);
    Console.WriteLine("after:" + myhtml);

Answer (1 vote):

The trouble with parsing HTML is that it isn't an exact science. If it were XHTML that you were parsing, things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.
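
To make the difference concrete, here is a small sketch that uses only the .NET base class library. It feeds one XHTML-style string and one ordinary HTML string to XDocument.Parse; the class name XmlVsHtml and both sample strings are invented for this illustration.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlVsHtml
{
    // Returns true if the markup parses as well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for unclosed elements, mismatched tags, and similar problems.
            return false;
        }
    }

    public static void Main()
    {
        // Every element is closed, so a generic XML parser accepts it.
        string xhtml = "<html><body><p>One</p><br /></body></html>";

        // Valid HTML, but <br> and <p> are never closed, so it is not XML.
        string html = "<html><body><p>One<br><p>Two</body></html>";

        Console.WriteLine(IsWellFormedXml(xhtml)); // prints True
        Console.WriteLine(IsWellFormedXml(html));  // prints False
    }
}
```

The second string is perfectly ordinary HTML, which is why a tolerant, HTML-aware parser is usually needed before any XML tooling can be applied.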





• (1 vote) Isn't parsing well-formed HTML, as specified by the W3C, as exact a science as parsing XHTML? – pupeno, Dec 8 '09 at 12:56
• It should be, but people don't do it. – DMan, Feb 16 '10 at 3:54
• @J. Pablo Not nearly as easy though (and hence the reason for a library :p)... for instance, <p> tags do not need to be explicitly closed under HTML4/5. Yikes! – user166390, Dec 22 '10 at 4:13

Answer (1 vote):

I've used ZetaHtmlTidy in the past to load random websites and then query various parts of the content with XPath (e.g. /html/body//p[@class='textblock']). It worked well, but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.

Answer (0 votes):

You could use an HTML DTD and the generic XML parsing libraries.

• Can you clarify this? – Luke, Sep 11 '08 at 9:44
• (8 votes) Very few real-world HTML pages will survive an XML parsing library. – Frank Krueger, Sep 11 '08 at 11:07

Answer (0 votes):

Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]

Answer (0 votes):

Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.

Answer (0 votes):

Try this script.

http://www.biterscripting.com/SS_URLs.html

When I use it with this url,

    script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")

It shows me all the links on the page for this thread.

    http://sstatic.net/so/all.css
    http://sstatic.net/so/favicon.ico
    http://sstatic.net/so/apple-touch-icon.png
    .
    .
    .

You can modify that script to check for images, variables, whatever.

Answer (0 votes):

I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.

You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c.

There's also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.
                            15 Answers








                            up vote
                            138
                            down vote













                            Html Agility Pack




                            This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).
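To make the description above concrete, here is a minimal sketch of loading a string and listing its links with an XPath query. It assumes the HtmlAgilityPack assembly is referenced; the inline HTML and the XPath expression are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Illustrative, deliberately sloppy markup: the <p> and <b> tags are never closed.
        const string html =
            "<html><body><p>See <a href='http://example.com/a'>A</a> " +
            "and <a href=\"/b\">B</a><p><b>bold</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return;

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue falls back to the supplied default if the attribute is absent.
            Console.WriteLine(link.GetAttributeValue("href", "") + " -> " + link.InnerText);
        }
    }
}
```

The null check matters in practice: iterating the result directly throws when the page contains no matching element.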



























                            • It's worth noting that it doesn't deal well with self-closing tags like <p> (which it interprets as empty) and really badly with optional end-tags like <li> (which it interprets as missing an end tag, and so nests consecutive li tags).
                              – Eamon Nerbonne
                              May 14 '11 at 16:48

























                            answered Sep 19 '08 at 8:05









                            Mark Cidade













                            up vote
                            27
                            down vote













                            You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
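As a rough sketch of the second half of that suggestion, the snippet below queries a document that is already well-formed XHTML with LINQ to XML. It assumes the Tidy conversion has already happened; the inline string is a hand-written stand-in for Tidy's output, and no Tidy API is shown.

```csharp
using System;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Stand-in for XHTML produced by a tidy step (assumption: already well-formed).
        const string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<p><a href=\"/one\">One</a></p>" +
            "<p><a href=\"/two\">Two</a></p>" +
            "</body></html>";

        XDocument doc = XDocument.Parse(xhtml);

        // XHTML elements live in this namespace, so element names must be qualified with it.
        XNamespace ns = "http://www.w3.org/1999/xhtml";

        foreach (XElement anchor in doc.Descendants(ns + "a"))
        {
            // The cast yields null instead of throwing when the attribute is missing.
            Console.WriteLine((string)anchor.Attribute("href"));
        }
    }
}
```

Forgetting the namespace is the usual reason a query such as Descendants("a") comes back empty on XHTML input.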



                            Another alternative is to use the built-in mshtml engine:



                            using mshtml;
                            ...
                            // write() expects an object array, so wrap the HTML string in one
                            object[] oPageText = { html };
                            HTMLDocument doc = new HTMLDocumentClass();
                            IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
                            doc2.write(oPageText);


                            This allows you to use JavaScript-like functions such as getElementById().






















                            • 4




                              Call me crazy, but I am having trouble figuring out how to use mshtml. Do you have any good links?
                              – Alex Baranosky
                              Jan 9 '09 at 5:52






                            • 1




                              @Alex you need to include Microsoft.mshtml. You can find a bit more info here: msdn.microsoft.com/en-us/library/aa290341(VS.71).aspx
                              – Wilfred Knievel
                              Jan 12 '10 at 23:17










                            • I have a blog post about Tidy.Net and ManagedTidy; both are capable of parsing and validating (X)HTML files. If you do not need to validate anything, I'd go with the Html Agility Pack. jphellemons.nl/post/…
                              – JP Hellemons
                              Oct 25 '11 at 7:03

























                            answered Sep 11 '08 at 10:35









                            Erlend











                            up vote
                            16
                            down vote













                            I found a project called Fizzler that takes a jQuery/Sizzle approach to selecting HTML elements. It's based on HTML Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors instead of nasty XPath.



                            http://code.google.com/p/fizzler/
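For a feel of what CSS selection looks like next to XPath, here is a small sketch. It assumes both the HtmlAgilityPack and Fizzler.Systems.HtmlAgilityPack assemblies are referenced; the markup and the selector are invented for the example.

```csharp
using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // brings in the QuerySelectorAll extension method

class Program
{
    static void Main()
    {
        // Illustrative markup only.
        const string html =
            "<ul class='nav'>" +
            "<li><a href='/home'>Home</a></li>" +
            "<li><a href='/about'>About</a></li>" +
            "</ul>" +
            "<p><a href='/elsewhere'>Not in the nav</a></p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // CSS selector: anchors inside list items of the list with class "nav".
        foreach (HtmlNode link in doc.DocumentNode.QuerySelectorAll("ul.nav li a"))
        {
            Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}
```

The nodes returned are ordinary Html Agility Pack nodes, so anything the parser supports (attributes, InnerText, editing) is still available after selection.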






















                            • 1




                              Thank you, this looks interesting! I've been surprised, what with jQuery's popularity, that it has been so hard to find a C# project inspired by it. Now if only I could find something where document manipulation and more advanced traversal were also part of the package... :)
                              – Funka
                              May 14 '10 at 1:33










                            • I just used this today and I have to say, it is very easy to use if you know jQuery.
                              – Chi Chan
                              Oct 14 '10 at 20:56

























                            answered Dec 18 '09 at 4:51









                            Rob Volk











                            up vote
                            10
                            down vote













                            You can do a lot without going nuts on third-party products and mshtml (i.e. interop). Use System.Windows.Forms.WebBrowser. From there, you can call methods such as "GetElementById" on an HtmlDocument or "GetElementsByTagName" on an HtmlElement. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (IMO a lesser evil than interop) to do it:



                            var wb = new WebBrowser();


                            ... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:



                            // WebBrowser exposes the loaded page through its Document property
                            var doc = wb.Document;
                            var elem = doc.GetElementById(elementId);
                            // DomElement is the underlying unmanaged DOM object
                            object obj = elem.DomElement;
                            System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
                            mi.Invoke(obj, new object[0]);


                            You can use similar reflection to submit forms, etc.



                            Enjoy.








































                                edited Nov 22 '11 at 9:09









                                musefan











                                answered Sep 11 '08 at 14:08









                                Alan























                                    up vote
                                    9
                                    down vote













                                    I've written some code that provides "LINQ to HTML" functionality. I thought I would share it here. It is based on Majestic 12. It takes the Majestic-12 results and produces LINQ XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:



                                    IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);

                                    foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {

                                        if (anchorTag.Attribute("href") == null)
                                            continue;

                                        Console.WriteLine(anchorTag.Attribute("href").Value);
                                    }


                                    I wanted to use Majestic-12 because I know it has a lot of built-in knowledge with regards to HTML that is found in the wild. What I've found though is that to map the Majestic-12 results to something that LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use this you will find pages that are rejected. You'll need to fix up the code to address that. When an exception is thrown, check exception.Data["source"] as it is likely set to the HTML tag that caused the exception. Handling the HTML in a nice manner is at times not trivial...
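The advice about exception.Data["source"] can be tried in isolation. The sketch below uses a hypothetical stand-in method, ConvertOrThrow, which is not part of the code in this answer and only mimics the wrap-and-tag pattern described, to show how a caller might read the offending HTML back out.

```csharp
using System;

class Program
{
    // Hypothetical stand-in: fails on purpose and tags the wrapped exception
    // with the HTML fragment that triggered the failure.
    static void ConvertOrThrow(string htmlFragment)
    {
        try
        {
            throw new FormatException("unexpected tag");
        }
        catch (Exception e)
        {
            var wrapped = new Exception("Error converting HTML, reason: " + e.Message, e);
            wrapped.Data.Add("source", htmlFragment);
            throw wrapped;
        }
    }

    static void Main()
    {
        try
        {
            ConvertOrThrow("<a href=oops");
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);

            // Data["source"] is null when no fragment was recorded, so guard the read.
            Console.WriteLine("Offending HTML: " + (e.Data["source"] ?? "(not recorded)"));
        }
    }
}
```

Logging the fragment alongside the message makes it much quicker to decide which cleanup rule needs to be added for a rejected page.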



                                    So now that expectations are realistically low, here's the code :)



                                    using System;
                                    using System.Collections.Generic;
                                    using System.Linq;
                                    using System.Text;
                                    using Majestic12;
                                    using System.IO;
                                    using System.Xml.Linq;
                                    using System.Diagnostics;
                                    using System.Text.RegularExpressions;

                                    namespace Majestic12ToXml {
                                        public class Majestic12ToXml {

                                            static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {

                                                HTMLparser parser = OpenParser();
                                                parser.Init(htmlAsBytes);

                                                XElement currentNode = new XElement("document");

                                                HTMLchunk m12chunk = null;

                                                int xmlnsAttributeIndex = 0;
                                                string originalHtml = "";

                                                while ((m12chunk = parser.ParseNext()) != null) {

                                                    try {

                                                        Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting

                                                        XNode newNode = null;
                                                        XElement newNodesParent = null;

                                                        switch (m12chunk.oType) {
                                                            case HTMLchunkType.OpenTag:

                                                                // Tags are added as a child to the current tag,
                                                                // except when the new tag implies the closure of
                                                                // some number of ancestor tags.

                                                                newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                                                                if (newNode != null) {
                                                                    currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                                                    newNodesParent = currentNode;

                                                                    newNodesParent.Add(newNode);

                                                                    currentNode = newNode as XElement;
                                                                }

                                                                break;

                                                            case HTMLchunkType.CloseTag:

                                                                if (m12chunk.bEndClosure) {

                                                                    newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                                                                    if (newNode != null) {
                                                                        currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                                                        newNodesParent = currentNode;
                                                                        newNodesParent.Add(newNode);
                                                                    }
                                                                }
                                                                else {
                                                                    XElement nodeToClose = currentNode;

                                                                    string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                                                                    while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                                                        nodeToClose = nodeToClose.Parent;

                                                                    if (nodeToClose != null)
                                                                        currentNode = nodeToClose.Parent;

                                                                    Debug.Assert(currentNode != null);
                                                                }

                                                                break;

                                                            case HTMLchunkType.Script:

                                                                newNode = new XElement("script", "REMOVED");
                                                                newNodesParent = currentNode;
                                                                newNodesParent.Add(newNode);
                                                                break;

                                                            case HTMLchunkType.Comment:

                                                                newNodesParent = currentNode;

                                                                if (m12chunk.sTag == "!--")
                                                                    newNode = new XComment(m12chunk.oHTML);
                                                                else if (m12chunk.sTag == "![CDATA[")
                                                                    newNode = new XCData(m12chunk.oHTML);
                                                                else
                                                                    throw new Exception("Unrecognized comment sTag");

                                                                newNodesParent.Add(newNode);

                                                                break;

                                                            case HTMLchunkType.Text:

                                                                currentNode.Add(m12chunk.oHTML);
                                                                break;

                                                            default:
                                                                break;
                                                        }
                                                    }
                                                    catch (Exception e) {
                                                        var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);

                                                        // the original html is copied for tracing/debugging purposes
                                                        originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                                                            .Take(m12chunk.iChunkLength)
                                                            .Select(B => (char)B).ToArray());

                                                        wrappedE.Data.Add("source", originalHtml);

                                                        throw wrappedE;
                                                    }
                                                }

                                                while (currentNode.Parent != null)
                                    currentNode = currentNode.Parent;

                                    return currentNode.Nodes();
                                    }

                                    static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {

                                    string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                                    XElement discoveredParent = null;

                                    // Get a list of all ancestors
                                    List<XElement> ancestors = new List<XElement>();
                                    XElement ancestor = nextPotentialParent;
                                    while (ancestor != null) {
                                    ancestors.Add(ancestor);
                                    ancestor = ancestor.Parent;
                                    }

                                    // Check if the new tag implies a previous tag was closed.
                                    if ("form" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("td" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .TakeWhile(XE => "tr" != XE.Name)
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("tr" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .TakeWhile(XE => !("table" == XE.Name
                                    || "thead" == XE.Name
                                    || "tbody" == XE.Name
                                    || "tfoot" == XE.Name))
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("thead" == m12chunkCleanedTag
                                    || "tbody" == m12chunkCleanedTag
                                    || "tfoot" == m12chunkCleanedTag) {


                                    discoveredParent = ancestors
                                    .TakeWhile(XE => "table" != XE.Name)
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }

                                    return discoveredParent ?? nextPotentialParent;
                                    }

                                    static string CleanupTagName(string originalName, string originalHtml) {

                                    string tagName = originalName;

                                    tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >

                                    if (tagName.Contains(':'))
                                    tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);

                                    return tagName;
                                    }

                                    static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);

                                    static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {

                                    result = null;
                                    string attributeName = originalName;

                                    if (string.IsNullOrEmpty(originalName))
                                    return false;

                                    if (_startsAsNumeric.IsMatch(originalName))
                                    return false;

                                    //
                                    // transform xmlns attributes so they don't actually create any XML namespaces
                                    //
                                    if (attributeName.ToLower().Equals("xmlns")) {

                                    attributeName = "xmlns_" + xmlnsIndex.ToString();
                                    xmlnsIndex++;
                                    }
                                    else {
                                    if (attributeName.ToLower().StartsWith("xmlns:")) {
                                    attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
                                    }

                                    //
                                    // trim trailing "
                                    //
                                    attributeName = attributeName.TrimEnd(new char[] { '"' });

                                    attributeName = attributeName.Replace(":", "_");
                                    }

                                    result = attributeName;

                                    return true;
                                    }

                                    static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
                                    static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
                                    static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"

                                    static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {

                                    if (string.IsNullOrEmpty(m12chunk.sTag)) {

                                    if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
                                    return new XElement("doctype");

                                    if (_weirdTag.IsMatch(originalHtml))
                                    return new XElement("REMOVED_weirdBlockParenthesisTag");

                                    if (_aspnetPrecompiled.IsMatch(originalHtml))
                                    return new XElement("REMOVED_ASPNET_PrecompiledDirective");

                                    if (_shortHtmlComment.IsMatch(originalHtml))
                                    return new XElement("REMOVED_ShortHtmlComment");

                                    // Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
                                    return null;
                                    }

                                    string tagName = CleanupTagName(m12chunk.sTag, originalHtml);

                                    XElement result = new XElement(tagName);

                                    List<XAttribute> attributes = new List<XAttribute>();

                                    for (int i = 0; i < m12chunk.iParams; i++) {

                                    if (m12chunk.sParams[i] == "<!--") {

                                    // An HTML comment was embedded within a tag. This comment and its contents
                                    // will be interpreted as attributes by Majestic-12... skip these attributes
                                    // until the parameter that terminates the comment.
                                    for (; i < m12chunk.iParams; i++) {

                                    if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
                                    break;
                                    }

                                    continue;
                                    }

                                    if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
                                    continue;

                                    string attributeName = m12chunk.sParams[i];

                                    if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
                                    continue;

                                    attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
                                    }

                                    // If attributes are duplicated with different values, we complain.
                                    // If attributes are duplicated with the same value, we remove all but 1.
                                    var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);

                                    foreach (var duplicatedAttribute in duplicatedAttributes) {

                                    if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
                                    throw new Exception("Attribute value was given different values");

                                    attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
                                    attributes.Add(duplicatedAttribute.First());
                                    }

                                    result.Add(attributes);

                                    return result;
                                    }

                                    static HTMLparser OpenParser() {
                                    HTMLparser oP = new HTMLparser();

                                    // The code+comments in this function are from the Majestic-12 sample documentation.

                                    // ...

                                    // This is optional, but if you want high performance then you may
                                    // want to set chunk hash mode to FALSE. This would result in tag params
                                    // being added to string arrays in HTMLchunk object called sParams and sValues, with number
                                    // of actual params being in iParams. See code below for details.
                                    //
                                    // When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
                                    oP.SetChunkHashMode(false);

                                    // if you set this to true then original parsed HTML for given chunk will be kept -
                                    // this will reduce performance somewhat, but may be desirable in some cases where
                                    // reconstruction of HTML may be necessary
                                    oP.bKeepRawHTML = false;

                                    // if set to true (it is false by default), then entities will be decoded: this is essential
                                    // if you want to get strings that contain final representation of the data in HTML, however
                                    // you should be aware that if you want to use such strings into output HTML string then you will
                                    // need to do Entity encoding or same string may fail later
                                    oP.bDecodeEntities = true;

                                    // we have option to keep most entities as is - only replace stuff like &nbsp;
                                    // this is called Mini Entities mode - it is handy when HTML will need
                                    // to be re-created after it was parsed, though in this case really
                                    // entities should not be parsed at all
                                    oP.bDecodeMiniEntities = true;

                                    if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
                                    oP.InitMiniEntities();

                                    // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
                                    // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
                                    // this only works if auto extraction is enabled
                                    oP.bAutoExtractBetweenTagsOnly = true;

                                    // if true then comments will be extracted automatically
                                    oP.bAutoKeepComments = true;

                                    // if true then scripts will be extracted automatically:
                                    oP.bAutoKeepScripts = true;

                                    // if this option is true then whitespace before start of tag will be compressed to single
                                    // space character in string: " ", if false then full whitespace before tag will be returned (slower)
                                    // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
                                    // a waste of CPU cycles
                                    oP.bCompressWhiteSpaceBeforeTag = true;

                                    // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
                                    // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
                                    // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
                                    // or open
                                    oP.bAutoMarkClosedTagsWithParamsAsOpen = false;

                                    return oP;
                                    }
                                    }
                                    }
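
To make the implied-closure rule in FindParentOfNewNode above more concrete, here is a minimal, self-contained sketch of the "td" case only. It does not use Majestic-12 at all; the class name ImpliedCloseDemo and the helpers FindParentForTd and BuildRow are hypothetical names made up for this illustration, and the input is simulated rather than parsed.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ImpliedCloseDemo {

    // Hypothetical helper mirroring the "td" branch: walk up from the
    // current node, stop at the enclosing <tr>, and if an open <td> is found
    // on the way, the new <td> becomes its sibling rather than its child.
    public static XElement FindParentForTd(XElement current) {
        XElement openTd = current
            .AncestorsAndSelf()               // self first, then parent, grandparent, ...
            .TakeWhile(e => e.Name != "tr")
            .FirstOrDefault(e => e.Name == "td");

        return openTd != null ? openTd.Parent : current;
    }

    // Simulates the open tags of "<tr><td>a<td>b" (no closing tags at all).
    public static XElement BuildRow() {
        var row = new XElement("tr");
        XElement current = row;

        foreach (string text in new[] { "a", "b" }) {
            XElement parent = FindParentForTd(current);
            var cell = new XElement("td", text);
            parent.Add(cell);
            current = cell;
        }

        return row;
    }

    public static void Main() {
        // prints <tr><td>a</td><td>b</td></tr>
        Console.WriteLine(BuildRow().ToString(SaveOptions.DisableFormatting));
    }
}
```

Without the ancestor check, the second cell would be nested inside the first, which is the kind of tree the real code is trying to avoid for form, td, tr, thead, tbody and tfoot.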




                                    • btw HtmlAgilityPack has worked well for me in the past, I just prefer LINQ.
                                      – Frank Schwieterman
                                      Mar 8 '09 at 22:21










                                    • What's the performance like when you add the LINQ conversion? Any idea how it compares with HtmlAgilityPack?
                                      – user29439
                                      Aug 3 '11 at 22:42










                                    • I never did a performance comparison. These days I use HtmlAgilityPack, much less hassle. Unfortunately the code above has lots of special cases I didn't bother to write tests for, so I can't really maintain it.
                                      – Frank Schwieterman
                                      Aug 4 '11 at 0:40















                                    I've written some code that provides "LINQ to HTML" functionality, and I thought I would share it here. It is based on Majestic-12: it takes the Majestic-12 parser's output and produces LINQ to XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:



                                        IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);

                                        foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {

                                            if (anchorTag.Attribute("href") == null)
                                                continue;

                                            Console.WriteLine(anchorTag.Attribute("href").Value);
                                        }
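
Because the converter returns ordinary LINQ to XML nodes, any LINQ to XML query can be run over the result. As a rough, self-contained illustration, the sketch below uses XElement.Parse on a small well-formed snippet as a stand-in for the converter's output (an assumption made only so the example runs without Majestic-12). HrefDemo and ExtractHrefs are hypothetical names, not part of the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class HrefDemo {

    // Collects the href of every <a> element that has one.
    public static List<string> ExtractHrefs(IEnumerable<XNode> nodes) {
        return nodes
            .OfType<XElement>()
            .DescendantsAndSelf("a")
            .Select(a => (string)a.Attribute("href")) // null when the attribute is missing
            .Where(href => href != null)
            .ToList();
    }

    public static void Main() {
        // Stand-in for ConvertNodesToXml(...): a hand-written, well-formed tree.
        XElement document = XElement.Parse(
            "<document><div><a href='/item/1'>One</a><a name='top'>Top</a></div>" +
            "<a href='/item/2'>Two</a></document>");

        foreach (string href in ExtractHrefs(document.Nodes()))
            Console.WriteLine(href);
    }
}
```

Casting the XAttribute to string yields null for a missing attribute, which folds the null check and the value lookup of the loop above into one step.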


                                    I wanted to use Majestic-12 because I know it has a lot of built-in knowledge about HTML as it is found in the wild. What I've found, though, is that mapping the Majestic-12 results to something that LINQ to XML will accept requires additional work. The code I'm including does a lot of this cleansing, but as you use it you will find pages that are rejected, and you'll need to fix up the code to handle them. When an exception is thrown, check exception.Data["source"], as it is likely set to the HTML tag that caused the exception. Handling the HTML gracefully is at times not trivial...
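
As a small sketch of that debugging step, the example below shows the wrap-and-inspect pattern on its own. ConvertOrThrow is a hypothetical stand-in that always fails, so the snippet runs without Majestic-12; only the use of Exception.Data with the "source" key follows the code in this answer.

```csharp
using System;

static class SourceDataDemo {

    // Hypothetical stand-in for the converter: wraps a failure and records
    // the offending markup under Data["source"].
    public static void ConvertOrThrow(string htmlFragment) {
        try {
            throw new FormatException("simulated parse failure");
        }
        catch (Exception e) {
            var wrapped = new Exception("Error converting HTML, reason: " + e.Message, e);
            wrapped.Data.Add("source", htmlFragment);
            throw wrapped;
        }
    }

    // Returns the recorded fragment, or null if conversion succeeded
    // or no fragment was attached to the exception.
    public static string TryGetFailingSource(string htmlFragment) {
        try {
            ConvertOrThrow(htmlFragment);
            return null;
        }
        catch (Exception e) {
            return e.Data.Contains("source") ? e.Data["source"] as string : null;
        }
    }

    public static void Main() {
        Console.WriteLine(TryGetFailingSource("<br <br>"));
    }
}
```

Checking Data.Contains before reading the key keeps the handler safe for exceptions that did not come from the converter.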



                                    So now that expectations are realistically low, here's the code :)



                                    using System;
                                    using System.Collections.Generic;
                                    using System.Linq;
                                    using System.Text;
                                    using Majestic12;
                                    using System.IO;
                                    using System.Xml.Linq;
                                    using System.Diagnostics;
                                    using System.Text.RegularExpressions;

                                    namespace Majestic12ToXml {
                                    public class Majestic12ToXml {

                                    public static IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {

                                    HTMLparser parser = OpenParser();
                                    parser.Init(htmlAsBytes);

                                    XElement currentNode = new XElement("document");

                                    HTMLchunk m12chunk = null;

                                    int xmlnsAttributeIndex = 0;
                                    string originalHtml = "";

                                    while ((m12chunk = parser.ParseNext()) != null) {

                                    try {

                                    Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting

                                    XNode newNode = null;
                                    XElement newNodesParent = null;

                                    switch (m12chunk.oType) {
                                    case HTMLchunkType.OpenTag:

                                    // Tags are added as a child to the current tag,
                                    // except when the new tag implies the closure of
                                    // some number of ancestor tags.

                                    newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                                    if (newNode != null) {
                                    currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                    newNodesParent = currentNode;

                                    newNodesParent.Add(newNode);

                                    currentNode = newNode as XElement;
                                    }

                                    break;

                                    case HTMLchunkType.CloseTag:

                                    if (m12chunk.bEndClosure) {

                                    newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                                    if (newNode != null) {
                                    currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                    newNodesParent = currentNode;
                                    newNodesParent.Add(newNode);
                                    }
                                    }
                                    else {
                                    XElement nodeToClose = currentNode;

                                    string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                                    while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                    nodeToClose = nodeToClose.Parent;

                                    if (nodeToClose != null)
                                    currentNode = nodeToClose.Parent;

                                    Debug.Assert(currentNode != null);
                                    }

                                    break;

                                    case HTMLchunkType.Script:

                                    newNode = new XElement("script", "REMOVED");
                                    newNodesParent = currentNode;
                                    newNodesParent.Add(newNode);
                                    break;

                                    case HTMLchunkType.Comment:

                                    newNodesParent = currentNode;

                                    if (m12chunk.sTag == "!--")
                                    newNode = new XComment(m12chunk.oHTML);
                                    else if (m12chunk.sTag == "![CDATA[")
                                    newNode = new XCData(m12chunk.oHTML);
                                    else
                                    throw new Exception("Unrecognized comment sTag");

                                    newNodesParent.Add(newNode);

                                    break;

                                    case HTMLchunkType.Text:

                                    currentNode.Add(m12chunk.oHTML);
                                    break;

                                    default:
                                    break;
                                    }
                                    }
                                    catch (Exception e) {
                                    var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);

                                    // the original html is copied for tracing/debugging purposes
                                    originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                                    .Take(m12chunk.iChunkLength)
                                    .Select(B => (char)B).ToArray());

                                    wrappedE.Data.Add("source", originalHtml);

                                    throw wrappedE;
                                    }
                                    }

                                    while (currentNode.Parent != null)
                                    currentNode = currentNode.Parent;

                                    return currentNode.Nodes();
                                    }

                                    static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {

                                    string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                                    XElement discoveredParent = null;

                                    // Get a list of all ancestors
                                    List<XElement> ancestors = new List<XElement>();
                                    XElement ancestor = nextPotentialParent;
                                    while (ancestor != null) {
                                    ancestors.Add(ancestor);
                                    ancestor = ancestor.Parent;
                                    }

                                    // Check if the new tag implies a previous tag was closed.
                                    if ("form" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("td" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .TakeWhile(XE => "tr" != XE.Name)
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("tr" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .TakeWhile(XE => !("table" == XE.Name
                                    || "thead" == XE.Name
                                    || "tbody" == XE.Name
                                    || "tfoot" == XE.Name))
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("thead" == m12chunkCleanedTag
                                    || "tbody" == m12chunkCleanedTag
                                    || "tfoot" == m12chunkCleanedTag) {


                                    discoveredParent = ancestors
                                    .TakeWhile(XE => "table" != XE.Name)
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }

                                    return discoveredParent ?? nextPotentialParent;
                                    }

                                    static string CleanupTagName(string originalName, string originalHtml) {

                                    string tagName = originalName;

                                    tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >

                                    if (tagName.Contains(':'))
                                    tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);

                                    return tagName;
                                    }

                                    static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);

                                    static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {

                                    result = null;
                                    string attributeName = originalName;

                                    if (string.IsNullOrEmpty(originalName))
                                    return false;

                                    if (_startsAsNumeric.IsMatch(originalName))
                                    return false;

                                    //
                                    // transform xmlns attributes so they don't actually create any XML namespaces
                                    //
                                    if (attributeName.ToLower().Equals("xmlns")) {

                                    attributeName = "xmlns_" + xmlnsIndex.ToString();
                                    xmlnsIndex++;
                                    }
                                    else {
                                    if (attributeName.ToLower().StartsWith("xmlns:")) {
                                    attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
                                    }

                                    //
                                    // trim trailing "
                                    //
                                    attributeName = attributeName.TrimEnd(new char[] { '"' });

                                    attributeName = attributeName.Replace(":", "_");
                                    }

                                    result = attributeName;

                                    return true;
                                    }

                                    static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
                                    static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
                                    static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"

                                    static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {

                                    if (string.IsNullOrEmpty(m12chunk.sTag)) {

                                    if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
                                    return new XElement("doctype");

                                    if (_weirdTag.IsMatch(originalHtml))
                                    return new XElement("REMOVED_weirdBlockParenthesisTag");

                                    if (_aspnetPrecompiled.IsMatch(originalHtml))
                                    return new XElement("REMOVED_ASPNET_PrecompiledDirective");

                                    if (_shortHtmlComment.IsMatch(originalHtml))
                                    return new XElement("REMOVED_ShortHtmlComment");

                                    // Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
                                    return null;
                                    }

                                    string tagName = CleanupTagName(m12chunk.sTag, originalHtml);

                                    XElement result = new XElement(tagName);

                                    List<XAttribute> attributes = new List<XAttribute>();

                                    for (int i = 0; i < m12chunk.iParams; i++) {

                                    if (m12chunk.sParams[i] == "<!--") {

                                    // An HTML comment was embedded within a tag. Majestic-12 will interpret the comment
                                    // and its contents as attributes, so skip ahead to the token that closes the comment.
                                    for (; i < m12chunk.iParams; i++) {

                                    if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
                                    break;
                                    }

                                    continue;
                                    }

                                    if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
                                    continue;

                                    string attributeName = m12chunk.sParams[i];

                                    if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
                                    continue;

                                    attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
                                    }

                                    // If attributes are duplicated with different values, we complain.
                                    // If attributes are duplicated with the same value, we remove all but 1.
                                    var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);

                                    foreach (var duplicatedAttribute in duplicatedAttributes) {

                                    if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
                                    throw new Exception("Attribute '" + duplicatedAttribute.Key + "' was given different values");

                                    attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
                                    attributes.Add(duplicatedAttribute.First());
                                    }

                                    result.Add(attributes);

                                    return result;
                                    }

                                    static HTMLparser OpenParser() {
                                    HTMLparser oP = new HTMLparser();

                                    // The code+comments in this function are from the Majestic-12 sample documentation.

                                    // ...

                                    // This is optional, but if you want high performance then you may
                                    // want to set chunk hash mode to FALSE. This would result in tag params
                                    // being added to string arrays in HTMLchunk object called sParams and sValues, with number
                                    // of actual params being in iParams. See code below for details.
                                    //
                                    // When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
                                    oP.SetChunkHashMode(false);

                                    // if you set this to true then original parsed HTML for given chunk will be kept -
                                    // this will reduce performance somewhat, but may be desirable in some cases where
                                    // reconstruction of HTML may be necessary
                                    oP.bKeepRawHTML = false;

                                    // if set to true (it is false by default), then entities will be decoded: this is essential
                                    // if you want to get strings that contain final representation of the data in HTML, however
                                    // you should be aware that if you want to use such strings into output HTML string then you will
                                    // need to do Entity encoding or same string may fail later
                                    oP.bDecodeEntities = true;

                                    // we have option to keep most entities as is - only replace stuff like &nbsp;
                                    // this is called Mini Entities mode - it is handy when HTML will need
                                    // to be re-created after it was parsed, though in this case really
                                    // entities should not be parsed at all
                                    oP.bDecodeMiniEntities = true;

                                    if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
                                    oP.InitMiniEntities();

                                    // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
                                    // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
                                    // this only works if auto extraction is enabled
                                    oP.bAutoExtractBetweenTagsOnly = true;

                                    // if true then comments will be extracted automatically
                                    oP.bAutoKeepComments = true;

                                    // if true then scripts will be extracted automatically:
                                    oP.bAutoKeepScripts = true;

                                    // if this option is true then whitespace before start of tag will be compressed to single
                                    // space character in string: " ", if false then full whitespace before tag will be returned (slower)
                                    // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
                                    // a waste of CPU cycles
                                    oP.bCompressWhiteSpaceBeforeTag = true;

                                    // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
                                    // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
                                    // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
                                    // or open
                                    oP.bAutoMarkClosedTagsWithParamsAsOpen = false;

                                    return oP;
                                    }
                                    }
                                    }
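A note on the `_weirdTag` pattern: in a .NET regex, unescaped square brackets form a character class, so a pattern meant to match a literal `<![ ... ]>` wrapper needs the brackets escaped. Here is a minimal sketch of the difference. The class and method names (`TagPatternDemo`, `MatchesEscaped`, `MatchesUnescaped`) are made up for illustration and are not part of the converter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternDemo {

    // Escaped brackets: matches a literal "<![" ... "]>" wrapper.
    static readonly Regex Escaped = new Regex(@"^<!\[.*\]>$");

    // Unescaped brackets: "[.*]" is a character class that matches
    // exactly one character, either '.' or '*'.
    static readonly Regex Unescaped = new Regex(@"^<![.*]>$");

    public static bool MatchesEscaped(string html) {
        return Escaped.IsMatch(html);
    }

    public static bool MatchesUnescaped(string html) {
        return Unescaped.IsMatch(html);
    }
}
```

With these, `TagPatternDemo.MatchesEscaped("<![if !supportEmptyParas]>")` is true, while the unescaped form only accepts inputs such as `<!*>` or `<!.>`.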





















                                    • btw HtmlAgilityPack has worked well for me in the past, I just prefer LINQ.
                                      – Frank Schwieterman
                                      Mar 8 '09 at 22:21










                                    • What's the performance like when you add the LINQ conversion? Any idea how it compares with HtmlAgilityPack?
                                      – user29439
                                      Aug 3 '11 at 22:42










                                    • I never did a performance comparison. These days I use HtmlAgilityPack, much less hassle. Unfortunately the code above has lots of special cases I didn't bother to write tests for, so I can't really maintain it.
                                      – Frank Schwieterman
                                      Aug 4 '11 at 0:40






































                                    I've written some code that provides "LINQ to HTML" functionality and thought I would share it here. It is based on Majestic-12: it takes the Majestic-12 parser output and produces LINQ to XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:



                                    IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);

                                    foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {

                                    if (anchorTag.Attribute("href") == null)
                                    continue;

                                    Console.WriteLine(anchorTag.Attribute("href").Value);
                                    }
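If you want to try the query side without Majestic-12 installed, here is a rough sketch of the same href extraction run against any `IEnumerable<XNode>`. The `LinkQueries` class and its `Hrefs` method are hypothetical names used only for this illustration. The input is assumed to be already well-formed XML, which real-world HTML usually is not until it has been through the converter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LinkQueries {

    // Collects the href value of every <a> element found in the given nodes
    // or their descendants, skipping anchors that have no href attribute.
    public static List<string> Hrefs(IEnumerable<XNode> nodes) {
        return nodes.OfType<XElement>()
                    .DescendantsAndSelf("a")
                    // The explicit string conversion yields null when the attribute is missing.
                    .Select(a => (string)a.Attribute("href"))
                    .Where(href => href != null)
                    .ToList();
    }
}
```

For a quick check you can feed it `XElement.Parse("<document><a href=\"/one\">1</a></document>").Nodes()` in place of the converter output.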


                                    I wanted to use Majestic-12 because I know it has a lot of built-in knowledge of HTML as it is found in the wild. What I've found, though, is that mapping the Majestic-12 results to something LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use it you will find pages that are rejected, and you'll need to fix up the code to handle them. When an exception is thrown, check exception.Data["source"], as it is likely set to the HTML tag that caused the exception. Handling the HTML gracefully is at times not trivial...
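To make the exception.Data["source"] tip concrete, here is a small self-contained sketch of the same wrap-and-tag pattern, with no Majestic-12 dependency. The names `SourceTaggedErrors`, `RunTagged` and `DescribeFailure` are invented for this example. The real converter attaches the offending chunk in its own catch block.

```csharp
using System;

public static class SourceTaggedErrors {

    // Runs the action on a fragment. On failure, rethrows a wrapper exception
    // whose Data["source"] entry holds the fragment that was being processed.
    public static void RunTagged(string fragment, Action<string> action) {
        try {
            action(fragment);
        }
        catch (Exception e) {
            var wrapped = new Exception("Error processing fragment, reason: " + e.Message, e);
            wrapped.Data.Add("source", fragment);
            throw wrapped;
        }
    }

    // Reads the offending fragment back out of a caught exception, if one was recorded.
    public static string DescribeFailure(Exception e) {
        return e.Data.Contains("source") ? (string)e.Data["source"] : "(no source recorded)";
    }
}
```

In a caller, wrap the conversion in try/catch and log `DescribeFailure(e)` alongside `e.InnerException`, so you can see both the failing markup and the original error.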



                                    So now that expectations are realistically low, here's the code :)



                                    using System;
                                    using System.Collections.Generic;
                                    using System.Linq;
                                    using System.Text;
                                    using Majestic12;
                                    using System.IO;
                                    using System.Xml.Linq;
                                    using System.Diagnostics;
                                    using System.Text.RegularExpressions;

                                    namespace Majestic12ToXml {
                                    public class Majestic12ToXml {

                                    static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {

                                    HTMLparser parser = OpenParser();
                                    parser.Init(htmlAsBytes);

                                    XElement currentNode = new XElement("document");

                                    HTMLchunk m12chunk = null;

                                    int xmlnsAttributeIndex = 0;
                                    string originalHtml = "";

                                    while ((m12chunk = parser.ParseNext()) != null) {

                                    try {

                                    Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting

                                    XNode newNode = null;
                                    XElement newNodesParent = null;

                                    switch (m12chunk.oType) {
                                    case HTMLchunkType.OpenTag:

                                    // Tags are added as a child to the current tag,
                                    // except when the new tag implies the closure of
                                    // some number of ancestor tags.

                                    newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                                    if (newNode != null) {
                                    currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                    newNodesParent = currentNode;

                                    newNodesParent.Add(newNode);

                                    currentNode = newNode as XElement;
                                    }

                                    break;

                                    case HTMLchunkType.CloseTag:

                                    if (m12chunk.bEndClosure) {

                                    newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                                    if (newNode != null) {
                                    currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                    newNodesParent = currentNode;
                                    newNodesParent.Add(newNode);
                                    }
                                    }
                                    else {
                                    XElement nodeToClose = currentNode;

                                    string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                                    while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                    nodeToClose = nodeToClose.Parent;

                                    if (nodeToClose != null)
                                    currentNode = nodeToClose.Parent;

                                    Debug.Assert(currentNode != null);
                                    }

                                    break;

                                    case HTMLchunkType.Script:

                                    newNode = new XElement("script", "REMOVED");
                                    newNodesParent = currentNode;
                                    newNodesParent.Add(newNode);
                                    break;

                                    case HTMLchunkType.Comment:

                                    newNodesParent = currentNode;

                                    if (m12chunk.sTag == "!--")
                                    newNode = new XComment(m12chunk.oHTML);
                                    else if (m12chunk.sTag == "![CDATA[")
                                    newNode = new XCData(m12chunk.oHTML);
                                    else
                                    throw new Exception("Unrecognized comment sTag");

                                    newNodesParent.Add(newNode);

                                    break;

                                    case HTMLchunkType.Text:

                                    currentNode.Add(m12chunk.oHTML);
                                    break;

                                    default:
                                    break;
                                    }
                                    }
                                    catch (Exception e) {
                                    var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);

                                    // the original html is copied for tracing/debugging purposes
                                    originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                                    .Take(m12chunk.iChunkLength)
                                    .Select(B => (char)B).ToArray());

                                    wrappedE.Data.Add("source", originalHtml);

                                    throw wrappedE;
                                    }
                                    }

                                    while (currentNode.Parent != null)
                                    currentNode = currentNode.Parent;

                                    return currentNode.Nodes();
                                    }

                                    static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {

                                    string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                                    XElement discoveredParent = null;

                                    // Get a list of all ancestors
                                    List<XElement> ancestors = new List<XElement>();
                                    XElement ancestor = nextPotentialParent;
                                    while (ancestor != null) {
                                    ancestors.Add(ancestor);
                                    ancestor = ancestor.Parent;
                                    }

                                    // Check if the new tag implies a previous tag was closed.
                                    if ("form" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("td" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .TakeWhile(XE => "tr" != XE.Name)
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("tr" == m12chunkCleanedTag) {

                                    discoveredParent = ancestors
                                    .TakeWhile(XE => !("table" == XE.Name
                                    || "thead" == XE.Name
                                    || "tbody" == XE.Name
                                    || "tfoot" == XE.Name))
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }
                                    else if ("thead" == m12chunkCleanedTag
                                    || "tbody" == m12chunkCleanedTag
                                    || "tfoot" == m12chunkCleanedTag) {


                                    discoveredParent = ancestors
                                    .TakeWhile(XE => "table" != XE.Name)
                                    .Where(XE => m12chunkCleanedTag == XE.Name)
                                    .Take(1)
                                    .Select(XE => XE.Parent)
                                    .FirstOrDefault();
                                    }

                                    return discoveredParent ?? nextPotentialParent;
                                    }

                                    static string CleanupTagName(string originalName, string originalHtml) {

                                    string tagName = originalName;

                                    tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >

                                    if (tagName.Contains(':'))
                                    tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);

                                    return tagName;
                                    }

                                    static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);

                                    static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {

                                    result = null;
                                    string attributeName = originalName;

                                    if (string.IsNullOrEmpty(originalName))
                                    return false;

                                    if (_startsAsNumeric.IsMatch(originalName))
                                    return false;

                                    //
                                    // transform xmlns attributes so they don't actually create any XML namespaces
                                    //
                                    if (attributeName.ToLower().Equals("xmlns")) {

                                    attributeName = "xmlns_" + xmlnsIndex.ToString();
                                    xmlnsIndex++;
                                    }
                                    else {
                                    if (attributeName.ToLower().StartsWith("xmlns:")) {
                                    attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
                                    }

                                    //
                                    // trim trailing "
                                    //
                                    attributeName = attributeName.TrimEnd(new char[] { '"' });

                                    attributeName = attributeName.Replace(":", "_");
                                    }

                                    result = attributeName;

                                    return true;
                                    }

                                    static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
                                    static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
                                    static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"

                                    static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {

                                    if (string.IsNullOrEmpty(m12chunk.sTag)) {

                                    if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
                                    return new XElement("doctype");

                                    if (_weirdTag.IsMatch(originalHtml))
                                    return new XElement("REMOVED_weirdBlockParenthesisTag");

                                    if (_aspnetPrecompiled.IsMatch(originalHtml))
                                    return new XElement("REMOVED_ASPNET_PrecompiledDirective");

                                    if (_shortHtmlComment.IsMatch(originalHtml))
                                    return new XElement("REMOVED_ShortHtmlComment");

                                    // Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
                                    return null;
                                    }

                                    string tagName = CleanupTagName(m12chunk.sTag, originalHtml);

                                    XElement result = new XElement(tagName);

                                    List<XAttribute> attributes = new List<XAttribute>();

                                    for (int i = 0; i < m12chunk.iParams; i++) {

                                    if (m12chunk.sParams[i] == "<!--") {

                                    // An HTML comment was embedded within a tag. The comment and its contents
                                    // will be interpreted as attributes by Majestic-12... skip these attributes
                                    // by advancing i until the parameter that ends the comment.
                                    for (; i < m12chunk.iParams; i++) {

                                    if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
                                    break;
                                    }

                                    continue;
                                    }

                                    if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
                                    continue;

                                    string attributeName = m12chunk.sParams[i];

                                    if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
                                    continue;

                                    attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
                                    }

                                    // If attributes are duplicated with different values, we complain.
                                    // If attributes are duplicated with the same value, we remove all but 1.
                                    var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);

                                    foreach (var duplicatedAttribute in duplicatedAttributes) {

                                    if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
                                    throw new Exception("Attribute value was given different values");

                                    attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
                                    attributes.Add(duplicatedAttribute.First());
                                    }

                                    result.Add(attributes);

                                    return result;
                                    }

                                    static HTMLparser OpenParser() {
                                    HTMLparser oP = new HTMLparser();

                                    // The code+comments in this function are from the Majestic-12 sample documentation.

                                    // ...

                                    // This is optional, but if you want high performance then you may
                                    // want to set chunk hash mode to FALSE. This would result in tag params
                                    // being added to string arrays in HTMLchunk object called sParams and sValues, with number
                                    // of actual params being in iParams. See code below for details.
                                    //
                                    // When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
                                    oP.SetChunkHashMode(false);

                                    // if you set this to true then original parsed HTML for given chunk will be kept -
                                    // this will reduce performance somewhat, but may be desirable in some cases where
                                    // reconstruction of HTML may be necessary
                                    oP.bKeepRawHTML = false;

                                    // if set to true (it is false by default), then entities will be decoded: this is essential
                                    // if you want to get strings that contain final representation of the data in HTML, however
                                    // you should be aware that if you want to use such strings into output HTML string then you will
                                    // need to do Entity encoding or same string may fail later
                                    oP.bDecodeEntities = true;

                                    // we have option to keep most entities as is - only replace stuff like &nbsp;
                                    // this is called Mini Entities mode - it is handy when HTML will need
                                    // to be re-created after it was parsed, though in this case really
                                    // entities should not be parsed at all
                                    oP.bDecodeMiniEntities = true;

                                    if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
                                    oP.InitMiniEntities();

                                    // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
                                    // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
                                    // this only works if auto extraction is enabled
                                    oP.bAutoExtractBetweenTagsOnly = true;

                                    // if true then comments will be extracted automatically
                                    oP.bAutoKeepComments = true;

                                    // if true then scripts will be extracted automatically:
                                    oP.bAutoKeepScripts = true;

                                    // if this option is true then whitespace before start of tag will be compressed to single
                                    // space character in string: " ", if false then full whitespace before tag will be returned (slower)
                                    // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
                                    // a waste of CPU cycles
                                    oP.bCompressWhiteSpaceBeforeTag = true;

                                    // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
                                    // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
                                    // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
                                    // or open
                                    oP.bAutoMarkClosedTagsWithParamsAsOpen = false;

                                    return oP;
                                    }
                                    }
                                    }
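
When a page is rejected, the exception.Data["source"] lookup mentioned above might be sketched as follows. This is only an illustration: ConvertOrThrow is a hypothetical stand-in that fails the same way the converter does (it wraps the error and stores the offending HTML under the "source" key), so the snippet runs without Majestic-12 referenced.

```csharp
using System;

static class SourceDataDemo
{
    // Hypothetical stand-in for Majestic12ToXml.ConvertNodesToXml: it wraps the
    // original error and records the offending HTML under the "source" key.
    static void ConvertOrThrow(string html)
    {
        try
        {
            throw new InvalidOperationException("Attribute value was given different values");
        }
        catch (Exception e)
        {
            var wrapped = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);
            wrapped.Data.Add("source", html);
            throw wrapped;
        }
    }

    static void Main()
    {
        try
        {
            ConvertOrThrow("<a href='x' href='y'>");
        }
        catch (Exception e)
        {
            // Exception.Data is an IDictionary; the indexer yields null if the key is absent.
            string source = e.Data["source"] as string;
            Console.WriteLine(e.Message);
            Console.WriteLine("Offending HTML: " + (source ?? "(not recorded)"));
        }
    }
}
```

In real use the call inside the try block would be Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(bytes), and the logged tag tells you which special case the cleanup code still needs.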















                                    answered Mar 8 '09 at 22:11









                                    Frank Schwieterman

                                    19.5k1277115












                                    • 1




                                      btw HtmlAgilityPack has worked well for me in the past, I just prefer LINQ.
                                      – Frank Schwieterman
                                      Mar 8 '09 at 22:21










                                    • What's the performance like when you add the LINQ conversion? Any idea how it compares with HtmlAgilityPack?
                                      – user29439
                                      Aug 3 '11 at 22:42










                                    • I never did a performance comparison. These days I use HtmlAgilityPack, much less hassle. Unfortunately the code above has lots of special cases I didn't bother to write tests for, so I can't really maintain it.
                                      – Frank Schwieterman
                                      Aug 4 '11 at 0:40
































                                    up vote
                                    7
                                    down vote













                                    The Html Agility Pack has been mentioned before - if you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.
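
For comparison, the link-listing example from the LINQ answer might look roughly like this with the Html Agility Pack. It is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced; the sample HTML string is made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class HapLinkDemo
{
    static void Main()
    {
        // Made-up input: one anchor has no href and should be skipped.
        string html = "<html><body><a href='/one'>one</a><a>no href</a><a href='/two'>two</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return;

        foreach (HtmlNode anchor in anchors)
            Console.WriteLine(anchor.GetAttributeValue("href", ""));
    }
}
```

The XPath filter //a[@href] plays the role of the null check on the href attribute in the LINQ to XML version.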








































                                        answered Sep 19 '08 at 8:11









                                        Grimtron

                                        5,11131928


























                                            up vote
                                            3
                                            down vote













                                            I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:



                                            SgmlReader
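
A rough sketch of how SgmlReader is typically combined with LINQ to XML follows. It assumes the SgmlReader library (namespace Sgml) is referenced; the input string is invented for illustration.

```csharp
using System;
using System.IO;
using System.Xml.Linq;
using Sgml; // assumption: the SgmlReader library is referenced

static class SgmlReaderDemo
{
    static void Main()
    {
        // Made-up, deliberately sloppy HTML: unclosed <p>, <br>, unquoted attribute.
        string html = "<html><body><p>unclosed paragraph<br><a href=/one>one</a></body></html>";

        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";                   // use the built-in HTML DTD
            reader.CaseFolding = CaseFolding.ToLower;  // normalise tag names to lower case
            reader.InputStream = new StringReader(html);

            // SgmlReader is an XmlReader, so XDocument can load from it directly.
            XDocument doc = XDocument.Load(reader);

            foreach (XElement a in doc.Descendants("a"))
                Console.WriteLine((string)a.Attribute("href"));
        }
    }
}
```

Because the reader presents the HTML as well-formed XML, the rest of the code can use the ordinary System.Xml or LINQ to XML APIs.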








































                                                answered Sep 11 '08 at 11:12









                                                Frank Krueger

                                                44.1k39143198


























                                                    up vote
                                                    2
                                                    down vote













                                                    A solution with no third-party library: the WebBrowser class, which can run in a console application and in ASP.NET.



                                                    using System;
                                                    using System.Collections.Generic;
                                                    using System.Text;
                                                    using System.Windows.Forms;
                                                    using System.Threading;

                                                    class ParseHTML
                                                    {
                                                    public ParseHTML() { }
                                                    private string ReturnString;

                                                    public string doParsing(string html)
                                                    {
                                                    Thread t = new Thread(TParseMain);
                                                    t.SetApartmentState(ApartmentState.STA); // WebBrowser requires an STA thread; must be set before Start
                                                    t.Start((object)html);
                                                    t.Join();
                                                    return ReturnString;
                                                    }

                                                    private void TParseMain(object html)
                                                    {
                                                    WebBrowser wbc = new WebBrowser();
                                                    wbc.DocumentText = "feces of a dummy"; //;magic words
                                                    HtmlDocument doc = wbc.Document.OpenNew(true);
                                                    doc.Write((string)html);
                                                    this.ReturnString = doc.Body.InnerHtml + " do here something";
                                                    return;
                                                    }
                                                    }


                                                    usage:



                                                    string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
                                                    Console.WriteLine("before:" + myhtml);
                                                    myhtml = (new ParseHTML()).doParsing(myhtml);
                                                    Console.WriteLine("after:" + myhtml);































                                                      up vote
                                                      2
                                                      down vote













                                                      No 3rd party lib, WebBrowser class solution that can run on Console, and Asp.net



                                                      using System;
                                                      using System.Collections.Generic;
                                                      using System.Text;
                                                      using System.Windows.Forms;
                                                      using System.Threading;

                                                      class ParseHTML
                                                      {
                                                      public ParseHTML() { }
                                                      private string ReturnString;

                                                      public string doParsing(string html)
                                                      {
                                                      Thread t = new Thread(TParseMain);
                                                      t.ApartmentState = ApartmentState.STA;
                                                      t.Start((object)html);
                                                      t.Join();
                                                      return ReturnString;
                                                      }

                                                      private void TParseMain(object html)
                                                      {
                                                      WebBrowser wbc = new WebBrowser();
                                                      wbc.DocumentText = "feces of a dummy"; //;magic words
                                                      HtmlDocument doc = wbc.Document.OpenNew(true);
                                                      doc.Write((string)html);
                                                      this.ReturnString = doc.Body.InnerHtml + " do here something";
                                                      return;
                                                      }
                                                      }


                                                      usage:



                                                      string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
                                                      Console.WriteLine("before:" + myhtml);
                                                      myhtml = (new ParseHTML()).doParsing(myhtml);
                                                      Console.WriteLine("after:" + myhtml);




                                                      share

























                                                        up vote
                                                        2
                                                        down vote










                                                        up vote
                                                        2
                                                        down vote









                                                        No 3rd party lib, WebBrowser class solution that can run on Console, and Asp.net



                                                        using System;
                                                        using System.Collections.Generic;
                                                        using System.Text;
                                                        using System.Windows.Forms;
                                                        using System.Threading;

                                                        class ParseHTML
                                                        {
                                                        public ParseHTML() { }
                                                        private string ReturnString;

                                                        public string doParsing(string html)
                                                        {
                                                        Thread t = new Thread(TParseMain);
                                                        t.ApartmentState = ApartmentState.STA;
                                                        t.Start((object)html);
                                                        t.Join();
                                                        return ReturnString;
                                                        }

                                                        private void TParseMain(object html)
                                                        {
                                                        WebBrowser wbc = new WebBrowser();
                                                        wbc.DocumentText = "feces of a dummy"; //;magic words
                                                        HtmlDocument doc = wbc.Document.OpenNew(true);
                                                        doc.Write((string)html);
                                                        this.ReturnString = doc.Body.InnerHtml + " do here something";
                                                        return;
                                                        }
                                                        }


                                                        usage:



                                                        string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
                                                        Console.WriteLine("before:" + myhtml);
                                                        myhtml = (new ParseHTML()).doParsing(myhtml);
                                                        Console.WriteLine("after:" + myhtml);




                                                        share














                                                        No 3rd party lib, WebBrowser class solution that can run on Console, and Asp.net



                                                        using System;
                                                        using System.Collections.Generic;
                                                        using System.Text;
                                                        using System.Windows.Forms;
                                                        using System.Threading;

                                                        class ParseHTML
                                                        {
                                                        public ParseHTML() { }
                                                        private string ReturnString;

                                                        public string doParsing(string html)
                                                        {
                                                        Thread t = new Thread(TParseMain);
                                                        t.ApartmentState = ApartmentState.STA;
                                                        t.Start((object)html);
                                                        t.Join();
                                                        return ReturnString;
                                                        }

                                                        private void TParseMain(object html)
                                                        {
                                                        WebBrowser wbc = new WebBrowser();
                                                        wbc.DocumentText = "feces of a dummy"; //;magic words
                                                        HtmlDocument doc = wbc.Document.OpenNew(true);
                                                        doc.Write((string)html);
                                                        this.ReturnString = doc.Body.InnerHtml + " do here something";
                                                        return;
                                                        }
                                                        }


                                                        usage:



                                                        string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
                                                        Console.WriteLine("before:" + myhtml);
                                                        myhtml = (new ParseHTML()).doParsing(myhtml);
                                                        Console.WriteLine("after:" + myhtml);





                                                        share













                                                        share


                                                        share








                                                        edited Jun 6 '11 at 14:46
                                                        answered Jun 5 '11 at 16:26
                                                        majmun






















                                                            up vote
                                                            1
                                                            down vote













                                                            The trouble with parsing HTML is that it isn't an exact science. If it were XHTML that you were parsing, things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.





                                                            answered Sep 11 '08 at 9:47
                                                            Mark Ingram








                                                            • 1
                                                              Isn't parsing well-formed HTML, as specified by the W3C, as exact a science as parsing XHTML?
                                                              – pupeno
                                                              Dec 8 '09 at 12:56

                                                            • It should be, but people don't do it.
                                                              – DMan
                                                              Feb 16 '10 at 3:54

                                                            • @J. Pablo Not nearly as easy though (and hence the reason for a library :p)... for instance, <p> tags do not need to be explicitly closed under HTML4/5. Yikes!
                                                              – user166390
                                                              Dec 22 '10 at 4:13
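A minimal sketch of the point about unclosed <p> tags, assuming only the standard System.Xml.Linq API: a generic XML parser accepts XHTML-style markup but rejects HTML that omits the closing </p>. The class and method names (XmlVersusHtml, IsWellFormedXml) are made up for this illustration and do not come from any answer in the thread.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class XmlVersusHtml
{
    // Hypothetical helper: true when the markup is well-formed XML,
    // false when the XML parser rejects it.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Every <p> is explicitly closed, so this is also well-formed XML.
        string xhtmlStyle = "<html><body><p>one</p><p>two</p></body></html>";

        // The <p> elements are left open, so an XML parser sees
        // mismatched start and end tags.
        string htmlStyle = "<html><body><p>one<p>two</body></html>";

        Console.WriteLine(IsWellFormedXml(xhtmlStyle));
        Console.WriteLine(IsWellFormedXml(htmlStyle));
    }
}
```

When run, this prints True for the first string and False for the second, which is why a forgiving, HTML-specific parser is usually needed for pages found in the wild.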




























                                                            up vote
                                                            1
                                                            down vote













                                                            I've used ZetaHtmlTidy in the past to load random websites and then query various parts of the content with XPath (e.g. /html/body//p[@class='textblock']). It worked well, but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.





                                                                answered Sep 19 '08 at 8:03
                                                                Rahul
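For illustration only, here is a rough sketch of the XPath step from the answer above, using the standard System.Xml classes. It assumes the tidy step has already produced well-formed markup with no XML namespace on the root element; the tidy call itself is not shown because the ZetaHtmlTidy API is not described in this thread. The sample string and the names XPathOverTidiedHtml and TextBlocks are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathOverTidiedHtml
{
    // Hypothetical helper: returns the text of every <p class="textblock">
    // found anywhere under /html/body.
    public static List<string> TextBlocks(string tidiedXhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(tidiedXhtml); // throws XmlException if the input is not well-formed

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("/html/body//p[@class='textblock']"))
        {
            result.Add(node.InnerText);
        }
        return result;
    }

    static void Main()
    {
        // Hand-written stand-in for the output of an HTML tidy step.
        string tidied =
            "<html><body><div>" +
            "<p class='textblock'>First</p>" +
            "<p>skipped</p>" +
            "<p class='textblock'>Second</p>" +
            "</div></body></html>";

        foreach (string text in TextBlocks(tidied))
        {
            Console.WriteLine(text);
        }
    }
}
```

If the tidied document declares the XHTML namespace, the same query would need an XmlNamespaceManager and prefixed element names, which this sketch leaves out.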






















                                                                    up vote
                                                                    0
                                                                    down vote













                                                                    You could use an HTML DTD and the generic XML parsing libraries.





                                                                    answered Sep 11 '08 at 9:39
                                                                    Corin Blaikie












                                                                    • Can you clarify this?
                                                                      – Luke
                                                                      Sep 11 '08 at 9:44

                                                                    • 8
                                                                      Very few real-world HTML pages will survive an XML parsing library.
                                                                      – Frank Krueger
                                                                      Sep 11 '08 at 11:07




























                                                                    up vote
                                                                    0
                                                                    down vote













                                                                    Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]





                                                                        answered Nov 12 '09 at 14:53
                                                                        Ruben Bartelink






















                                                                            up vote
                                                                            0
                                                                            down vote













                                                                            Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.





                                                                                answered Jan 3 '10 at 9:04









                                                                                Mikos




































                                                                                    Try this script.



                                                                                    http://www.biterscripting.com/SS_URLs.html



                                                                                    When I run it against this URL,



                                                                                    script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")


                                                                                    It shows me all the links on the page for this thread.



                                                                                    http://sstatic.net/so/all.css
                                                                                    http://sstatic.net/so/favicon.ico
                                                                                    http://sstatic.net/so/apple-touch-icon.png
                                                                                    .
                                                                                    .
                                                                                    .


                                                                                    You can modify the script to look for images, variables, or anything else.















                                                                                        answered Mar 22 '10 at 20:29









                                                                                        P M




































                                                                                            I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.



                                                                                            You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c.



                                                                                            There's also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.













                                                                                                edited Dec 23 '10 at 18:19

























                                                                                                answered Dec 19 '10 at 18:13









                                                                                                Jonathan Wood















