-
HTML parsing
Can anyone recommend a class for HTML parsing?
One of the key differences from XML appears to be HTML's sloppy end tags. I have tried subclassing QXmlDefaultHandler, but it dies on missing end tags. Even when I continue after intercepting fatal errors, normal event reporting is discontinued.
I have thought about an 'insert' function in a subclass of QXmlSimpleReader (to insert, on the fly, the end tags that HTML elides), but that also appears to be a major job.
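To make the 'insert' idea concrete, here is a rough sketch in plain standard C++ (no Qt) of a pre-pass that tracks open elements on a stack and emits the end tags the HTML left out, so the result could then be handed to an XML reader. The function name closeMissingEndTags and the short list of void elements are my own assumptions for illustration, not anything from Qt. It is deliberately naive: it does not apply HTML's implicit-close rules (a second <p> is nested inside the first rather than closing it), and it does not repair unquoted attributes, entities, or comments containing '>'.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of Qt): returns a copy of `html` in which
// every start tag is balanced by an end tag. Tag names are lower-cased.
std::string closeMissingEndTags(const std::string &html)
{
    // Elements that never take an end tag in HTML (partial list, assumption).
    static const std::set<std::string> voidTags = {
        "br", "hr", "img", "input", "link", "meta"};

    std::vector<std::string> open; // stack of currently open elements
    std::string out;
    std::size_t pos = 0;

    while (pos < html.size()) {
        const std::size_t lt = html.find('<', pos);
        if (lt == std::string::npos) {          // only text is left
            out.append(html, pos, std::string::npos);
            break;
        }
        out.append(html, pos, lt - pos);        // text before the tag
        const std::size_t gt = html.find('>', lt);
        if (gt == std::string::npos)            // unterminated tag: drop it
            break;
        const std::string tag = html.substr(lt, gt - lt + 1);
        pos = gt + 1;

        // Comments, doctype and processing instructions are copied as-is.
        if (tag[1] == '!' || tag[1] == '?') {
            out += tag;
            continue;
        }

        const bool isEnd = (tag[1] == '/');
        std::size_t i = isEnd ? 2 : 1;
        std::string name;                       // lower-cased element name
        while (i < tag.size() &&
               std::isalnum(static_cast<unsigned char>(tag[i]))) {
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(tag[i])));
            ++i;
        }
        if (name.empty()) {                     // not really a tag; copy it
            out += tag;
            continue;
        }

        if (isEnd) {
            if (std::find(open.rbegin(), open.rend(), name) == open.rend())
                continue;                       // stray end tag: ignore it
            while (open.back() != name) {       // close what was left open
                out += "</" + open.back() + ">";
                open.pop_back();
            }
            out += "</" + name + ">";
            open.pop_back();
            continue;
        }

        const std::string rest = tag.substr(i); // attributes plus final '>'
        const bool selfClosed =
            rest.size() >= 2 && rest[rest.size() - 2] == '/';
        if (selfClosed) {
            out += "<" + name + rest;
        } else if (voidTags.count(name)) {      // <br> becomes <br/>
            out += "<" + name + rest.substr(0, rest.size() - 1) + "/>";
        } else {
            out += "<" + name + rest;
            open.push_back(name);
        }
    }

    while (!open.empty()) {                     // close everything at EOF
        out += "</" + open.back() + ">";
        open.pop_back();
    }
    return out;
}
```

Even this toy version shows why it feels like a major job: getting from "balanced tags" to what a browser would actually build needs the full set of HTML nesting rules.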
Any suggestions on how to get a proper DOM document from an HTML source?
Enno
-
Re: HTML parsing
Suggested reading:
http://www.qtcentre.org/forum/f-qt-p...html-4698.html
Tidy can be found here:
http://tidy.sourceforge.net/
A C++ wrapper is here:
http://users.rcn.com/creitzel/tidy.html#cplusplus
Tidy can convert the HTML into well-formed XHTML, so by using it you should be able to get the data into a form that you can load with QDomDocument.
-
Re: HTML parsing
Yes, parsing real-world (broken) HTML is not an easy task. You could try HTML Tidy, but if you're already using Qt I would advise against it and suggest using something already available in Qt. Use QtWebKit and QWebElement, which is new in Qt 4.6, and you will have your DOM ready in 15 minutes.