Browsers have no problem with test.html:
<html>
<head>
<title>Title</title>
</HEAD>
<BODY>
<P>A para
<p>Another para
<p>None of it is <b>XML</p>
</Body>
Running it through the tidy command-line tool produces well-formed XHTML:
$ tidy -asxml test.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 8 column 22 - Warning: missing </b> before </p>
Info: Document content looks like XHTML 1.0 Strict
2 warnings, 0 errors were found!
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />
<title>Title</title>
</head>
<body>
<p>A para</p>
<p>Another para</p>
<p>None of it is <b>XML</b></p>
</body>
</html>
To learn more about HTML Tidy see http://tidy.sourceforge.net
Please fill bug reports and queries using the "tracker" on the Tidy web site.
Additionally, questions can be sent to html-tidy@w3.org
HTML and CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium
I am sure the equivalent is possible through the tidy library.
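Short of linking against the library, the same conversion can be driven from a program by invoking the command-line tool. The following is only a rough sketch in Python, not the library API: it assumes a `tidy` binary is on the PATH, and the function names `build_tidy_command` and `html_to_xhtml` are made up for illustration. It adds `-q` to the command shown above so that the trailing "To learn more about HTML Tidy" text is suppressed.

```python
import subprocess


def build_tidy_command(path):
    """Return the argv for `tidy -asxml`, with -q to drop the trailing info text."""
    return ["tidy", "-q", "-asxml", path]


def html_to_xhtml(path):
    """Run tidy on `path` and return (xhtml, diagnostics).

    Assumes a `tidy` executable is installed; if it is not,
    subprocess.run raises FileNotFoundError.
    """
    result = subprocess.run(
        build_tidy_command(path),
        capture_output=True,  # markup arrives on stdout, diagnostics on stderr
        text=True,
    )
    # tidy exits with 0 (clean), 1 (warnings only) or 2 (errors).
    # Warnings, such as the missing </b> above, still yield usable output.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.stdout, result.stderr
```

Calling `html_to_xhtml("test.html")` should then hand back the cleaned-up markup as a string, ready to feed to an XML parser, with the warnings returned separately rather than mixed into the output.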
Another approach might be to use the Qt WebKit Bridge to execute JavaScript in the browser to extract the elements you are after.