Thread: Extracting text from HTML documents

  1. #1

    Extracting text from HTML documents

    EDIT: Problem has been solved thanks to QXmlStreamReader.
    EDIT2: QXmlStreamReader doesn't work for all the documents so the problem is still open.
    EDIT3: QXmlStreamReader and pugixml don't work because the HTML documents aren't well-formed: they're missing some closing tags.

    I have this function that outputs the text elements from an HTML document with QWebView:

    Qt Code:
        void MainWindow::traverseElements(const QWebElement& parentElement)
        {
            QRegExp ws("^\\s+$");
            QWebElement element = parentElement.firstChild();
            while (!element.isNull()) {
                QString text = element.toPlainText();
                qDebug() << element.tagName() << element.attribute("class") << text.replace('\n', "\\n");
                QStringList sentences = text.split('\n');
                for(const QString& sentence : sentences) {
                    if(ws.exactMatch(sentence) || sentence.isEmpty()) {
                        continue;
                    }

                    ui->listWidget->addItem(sentence);
                }

                traverseElements(element);
                element = element.nextSibling();
            }
        }

    I call it like this (view is a QWebView):

    Qt Code:
        QWebFrame *frame = view->page()->mainFrame();
        QWebElement element = frame->documentElement();
        traverseElements(element);

    The problem is, given this HTML document:

    <div class="one"><div class="two">foo<div class="three">hello</div>bar</div></div>

    It outputs:

    "HEAD" "" ""
    "BODY" "" "foo\nhello\nbar"
    "DIV" "one" "foo\nhello\nbar"
    "DIV" "two" "foo\nhello\nbar"
    "DIV" "three" "hello"

    So when the code calls ui->listWidget->addItem(sentence), it adds "foo\nhello\nbar" three times and "hello" once.

    What I want is for ui->listWidget to end up with exactly three items: "foo", "hello", and "bar", in the same order as they appear on the web page.
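
    To make the goal concrete, here is a rough model in plain C++ with no Qt involved. The Node type and all four functions are hypothetical stand-ins, not part of QWebElement. It only contrasts "text of the whole subtree at every element" (what the output above shows) with "direct text children, visited in document order" (what I am after):

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node (not a Qt type).
// An element has a tag and children; a text node has an empty tag and some text.
struct Node {
    std::string tag;
    std::string text;
    std::vector<Node> children;
};

Node makeText(std::string s)
{
    return Node{"", std::move(s), {}};
}

Node makeElement(std::string tag, std::vector<Node> kids)
{
    return Node{std::move(tag), "", std::move(kids)};
}

// All text found anywhere below n, in document order.
void subtreeText(const Node& n, std::vector<std::string>& out)
{
    if (n.tag.empty()) {
        out.push_back(n.text);
        return;
    }
    for (const Node& c : n.children)
        subtreeText(c, out);
}

// Models the posted loop: take the subtree text of every child element,
// then recurse into it. Deeply nested text is reported once per ancestor.
void naiveTraverse(const Node& element, std::vector<std::string>& out)
{
    for (const Node& c : element.children) {
        if (c.tag.empty())
            continue;
        subtreeText(c, out);
        naiveTraverse(c, out);
    }
}

// Desired behaviour: walk the children in order, record text nodes directly
// and descend into elements, so every text run is recorded exactly once.
void collectOwnText(const Node& element, std::vector<std::string>& out)
{
    for (const Node& c : element.children) {
        if (c.tag.empty())
            out.push_back(c.text);
        else
            collectOwnText(c, out);
    }
}
```

    For a tree shaped like the sample document (body > div.one > div.two > "foo", div.three > "hello", "bar"), naiveTraverse produces seven entries with "foo", "hello", "bar" repeated, while collectOwnText produces just "foo", "hello", "bar".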

    Note that I can't just take the text output from the BODY element, because I need to store every element, its original text, and a translation, so that I can save the exact same HTML document with the translations in place.

    Thanks.
    Last edited by themagician; 25th June 2015 at 02:43. Reason: Was solved, then it wasn't

  2. #2

    Re: Extracting text from HTML documents

    What if you extract the text from a copy of the element, a copy on which you call removeAllChildren() before extraction?

    Cheers,
    _

  3. #3

    Re: Extracting text from HTML documents

    Originally posted by anda_skoa:
        What if you extract the text from a copy of the element, a copy on which you call removeAllChildren() before extraction?

        Cheers,
        _

    So I tried the following code:

    Qt Code:
        void MainWindow::traverseElements(const QWebElement& parentElement)
        {
            QRegExp ws("^\\s*$");
            QWebElement element = parentElement.firstChild();
            while (!element.isNull()) {
                QWebElement copy = element.clone();
                copy.removeAllChildren();
                QString text = copy.toPlainText();
                qDebug() << copy.tagName() << copy.attribute("class") << text.replace('\n', "\\n");
                QStringList sentences = text.split('\n');
                for(const QString& sentence : sentences) {
                    if(ws.exactMatch(sentence)) {
                        continue;
                    }

                    ui->listWidget->addItem(sentence);
                }

                traverseElements(element);
                element = element.nextSibling();
            }
        }

    And got this output:

    "HEAD" "" ""
    "BODY" "" ""
    "DIV" "one" ""
    "DIV" "two" ""
    "DIV" "three" ""

    So removeAllChildren() seems to remove the text nodes along with the child elements.

    So far I've tried QWebView and QDomDocument, but neither gives the correct output. I've also tried QXmlStreamReader, pugixml, and rapidxml, but they all fail because the HTML documents have some tags that aren't closed properly. I'm thinking of writing a simple parser myself now.
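
    A first rough sketch of what such a parser could look like, in plain C++ without Qt. It does not build a tree at all: it just splits the input into "<...>" tags and the text between them, so missing closing tags don't matter. Token, tokenize, isBlank, extractText and rebuild are names I made up for this sketch. It assumes every '<' starts a tag and the next '>' ends it, so comments, script/style contents, a '>' inside an attribute value, and character entities are all not handled:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One piece of the input: either a complete "<...>" tag or the text between tags.
struct Token {
    bool isTag;
    std::string value;
};

// Split the input into tags and text runs, in document order.
// Simplifying assumption: '<' always opens a tag and the next '>' closes it.
std::vector<Token> tokenize(const std::string& html)
{
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] == '<') {
            std::size_t end = html.find('>', pos);
            if (end == std::string::npos)
                end = html.size() - 1;  // unterminated tag: take the rest
            tokens.push_back({true, html.substr(pos, end - pos + 1)});
            pos = end + 1;
        } else {
            std::size_t end = html.find('<', pos);
            if (end == std::string::npos)
                end = html.size();
            tokens.push_back({false, html.substr(pos, end - pos)});
            pos = end;
        }
    }
    return tokens;
}

// True if s is empty or contains only whitespace.
bool isBlank(const std::string& s)
{
    return s.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Text runs in document order, skipping whitespace-only ones.
std::vector<std::string> extractText(const std::string& html)
{
    std::vector<std::string> out;
    for (const Token& t : tokenize(html))
        if (!t.isTag && !isBlank(t.value))
            out.push_back(t.value);
    return out;
}

// Reassemble the document, swapping in a translation wherever a text run
// has one. Tags and untranslated text are copied through unchanged.
std::string rebuild(const std::string& html,
                    const std::map<std::string, std::string>& translations)
{
    std::string out;
    for (const Token& t : tokenize(html)) {
        auto it = t.isTag ? translations.end() : translations.find(t.value);
        out += (it != translations.end()) ? it->second : t.value;
    }
    return out;
}
```

    On the sample document above, extractText gives "foo", "hello", "bar" in that order, and rebuild with an empty translation map returns the input unchanged, which is the round trip I need for saving the translated file. Keying translations by the text itself is only for illustration; identical strings in different places would need a position-based key instead.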
