EDIT: Problem has been solved thanks to QXmlStreamReader.
EDIT2: QXmlStreamReader doesn't work for all the documents so the problem is still open.
EDIT3: QXmlStreamReader/pugixml don't work because the HTML documents aren't well-formed; some closing tags are missing.
I have this function that outputs the text elements from an HTML document with QWebView:
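void MainWindow::traverseElements(const QWebElement& parentElement)
{
    // Matches lines that contain only whitespace.
    QRegExp ws("^\\s+$");
    QWebElement element = parentElement.firstChild();
    while (!element.isNull()) {
        // toPlainText() includes the text of all descendant elements.
        QString text = element.toPlainText();
        // Note: QString::replace() modifies text in place, so the split below
        // operates on the escaped string.
        qDebug() << element.tagName() << element.attribute("class") << text.replace('\n', "\\n");
        QStringList sentences = text.split('\n');
        for (const QString& sentence : sentences) {
            // Skip empty and whitespace-only lines.
            if (ws.exactMatch(sentence) || sentence.isEmpty()) {
                continue;
            }
            ui->listWidget->addItem(sentence);
        }
        // Recurse into the children, then move on to the next sibling.
        traverseElements(element);
        element = element.nextSibling();
    }
}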
I call it like this (view is a QWebView):
QWebFrame *frame = view->page()->mainFrame();
QWebElement element = frame->documentElement();
traverseElements(element);
The problem is, given this HTML document:
<div class="one"><div class="two">foo<div class="three">hello</div>bar</div></div>
It outputs:
"HEAD" "" ""
"BODY" "" "foo\nhello\nbar"
"DIV" "one" "foo\nhello\nbar"
"DIV" "two" "foo\nhello\nbar"
"DIV" "three" "hello"
So when the code calls ui->listWidget->addItem(sentence), it adds "foo\nhello\nbar" three times and "hello" once.
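To make the duplication concrete, here is a minimal sketch in plain C++ (no Qt). It assumes that toPlainText() behaves like a concatenation of the text of an element and all of its descendants, which is what the output above suggests. Node, plainText, sampleTree and this free-function traverseElements are hypothetical stand-ins, not QWebElement API, and the HEAD element is left out for brevity.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node: an element has a tag,
// a text node has an empty tag and a non-empty text.
struct Node {
    std::string tag;
    std::string text;
    std::vector<Node> children;
};

// Assumed model of toPlainText(): the text of all descendants,
// one chunk per line.
std::string plainText(const Node& n) {
    if (n.tag.empty()) return n.text;
    std::string out;
    for (const Node& c : n.children) {
        std::string t = plainText(c);
        if (t.empty()) continue;
        if (!out.empty()) out += '\n';
        out += t;
    }
    return out;
}

// Same shape as the traversal in the question: visit child elements only
// (text nodes are skipped) and record each element's full text.
void traverseElements(const Node& parent, std::vector<std::string>& items) {
    for (const Node& c : parent.children) {
        if (c.tag.empty()) continue;
        items.push_back(plainText(c));
        traverseElements(c, items);
    }
}

// The sample document from the question, wrapped in html/body.
Node sampleTree() {
    Node three{"div", "", {Node{"", "hello", {}}}};
    Node two{"div", "", {Node{"", "foo", {}}, three, Node{"", "bar", {}}}};
    Node one{"div", "", {two}};
    Node body{"body", "", {one}};
    return Node{"html", "", {body}};
}
```

Calling traverseElements(sampleTree(), items) in this model records the full "foo\nhello\nbar" text once for body and once for each of the two outer divs, followed by "hello" for the innermost div, which mirrors the listing above.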
What I want is for ui->listWidget to end up with exactly three items: "foo", "hello", and "bar", in the order they appear on the web page.
Note that I can't just take the text output from the BODY element, because I need to store EVERY element, its original text value, and a translation, so that I can save the exact same HTML document but with the translations.
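To illustrate the behaviour I'm after, here is a rough sketch in plain C++ rather than Qt. It uses a hypothetical stand-in tree in which text nodes are visible as children; whether and how QtWebKit exposes text nodes is not covered here. Node, Segment, collectSegments, applyTranslations, serialize and sampleTree are all made-up names, and attributes such as class are omitted to keep it short.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node: an element has a tag,
// a text node has an empty tag and a non-empty text.
struct Node {
    std::string tag;
    std::string text;
    std::vector<Node> children;
};

// One translatable piece of text, tied to the text node it came from.
struct Segment {
    Node* node;
    std::string original;
    std::string translation;
};

// Depth-first walk that records each non-blank text node exactly once,
// in document order.
void collectSegments(Node& n, std::vector<Segment>& out) {
    if (n.tag.empty()) {
        if (n.text.find_first_not_of(" \t\r\n") != std::string::npos)
            out.push_back(Segment{&n, n.text, ""});
        return;
    }
    for (Node& c : n.children) collectSegments(c, out);
}

// Writes every non-empty translation back into its text node.
// The tree must not be restructured between collecting and applying,
// because the segments hold raw pointers into it.
void applyTranslations(std::vector<Segment>& segments) {
    for (Segment& s : segments)
        if (!s.translation.empty()) s.node->text = s.translation;
}

// Serializes the tree back to markup (tags without attributes).
std::string serialize(const Node& n) {
    if (n.tag.empty()) return n.text;
    std::string out = "<" + n.tag + ">";
    for (const Node& c : n.children) out += serialize(c);
    return out + "</" + n.tag + ">";
}

// The three nested divs from the sample document.
Node sampleTree() {
    Node three{"div", "", {Node{"", "hello", {}}}};
    Node two{"div", "", {Node{"", "foo", {}}, three, Node{"", "bar", {}}}};
    return Node{"div", "", {two}};
}
```

With this model, collectSegments on the sample tree yields the three segments "foo", "hello" and "bar" in that order. Setting a translation on one of them and calling applyTranslations changes only that text node, so serialize returns the same structure with the translated text in place.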
Thanks.