
Thread: Webkit: extract information from HTML

  1. #1

    Webkit: extract information from HTML

    Greetings QtCentre,

    I would like to do the following:

    - Get the HTML of a web page without loading all the media and other resources.
    - Extract a given div by its id.

    Is there a way to do that?

    Thanks a lot!

    B.A.

  2. #2

    Re: Webkit: extract information from HTML

    You are looking for an HTML parser... (I don't know of any easy and reliable HTML parser for C++.)
    Anyway, you can use QWebPage for that. It's easy and reliable, but that is not its purpose, so it is neither efficient nor fast for this job.

    - Get the HTML of a web page without loading all the media and other resources.

    QWebSettings * settings = QWebSettings::globalSettings();
    settings->setAttribute(QWebSettings::AutoLoadImages, false);
    settings->setAttribute(QWebSettings::JavascriptEnabled, false);
    settings->setAttribute(QWebSettings::JavaEnabled, false);
    settings->setAttribute(QWebSettings::PluginsEnabled, false);
    settings->setAttribute(QWebSettings::PrivateBrowsingEnabled, true);

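    - Extract a given div by its id.

    // 'page' is your QWebPage, queried once its loadFinished() signal has fired.
    QWebFrame *frame = page.mainFrame();

    // findFirstElement() takes a CSS selector, which makes it extremely powerful.
    // For a div with a known id the selector would be "div#myId".
    QWebElement elem = frame->findFirstElement("div.tabContent h1");

    // Or collect every match:
    QWebElementCollection elems = frame->findAllElements("table#myId tbody tr");

    // Then read the text content:
    QString text = elem.toPlainText();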
    Last edited by javimoya; 29th December 2010 at 15:52.

  3. #3

    Re: Webkit: extract information from HTML

    Quote Originally Posted by bunjee View Post
    - Get the HTML of a web page without loading all the media and other resources.
    - Extract a given div by its id.

    Is there a way to do that?
    Use QNetworkAccessManager instead of WebKit. For parsing you can use QXmlQuery if the page is valid XML. If not, then... well... probably QWebElement wouldn't work with it anyway. You can always use QRegExp; if you're only interested in a single tag, that's probably the best choice.
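
    A rough sketch of the single-tag idea, for illustration only. It uses std::regex from the standard library instead of QRegExp so that it compiles without Qt, and it assumes the HTML has already been downloaded into a string. The helper name extractDivById is hypothetical, and the code assumes lowercase tag names and an id made only of letters, digits, '-' and '_'. It is not a real HTML parser and will be fooled by things like "<div" inside comments or scripts.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not a Qt API): copies the inner HTML of the first
// <div> whose id attribute equals `id` into `out`. Returns false when no
// such div exists or its closing tag cannot be found.
// Assumption: `id` holds only [A-Za-z0-9_-], so it needs no regex escaping.
bool extractDivById(const std::string &html, const std::string &id, std::string &out)
{
    // Opening tag: <div ... id="ID" ...>, with single or double quotes.
    const std::regex open(R"(<div\b[^>]*\sid\s*=\s*["'])" + id + R"(["'][^>]*>)");
    std::smatch m;
    if (!std::regex_search(html, m, open))
        return false;

    // The content starts right after the opening tag.
    const std::string::const_iterator start = m[0].second;

    // Walk over every following <div> / </div> and track the nesting depth,
    // so that inner divs do not end the extraction too early.
    const std::regex tag(R"(<(/?)div\b[^>]*>)");
    int depth = 1;
    for (std::sregex_iterator it(start, html.cend(), tag), end; it != end; ++it) {
        const bool isClosing = (*it)[1].length() != 0;
        depth += isClosing ? -1 : 1;
        if (depth == 0) {
            out.assign(start, (*it)[0].first);
            return true;
        }
    }
    return false; // unbalanced markup
}
```

    Calling extractDivById(html, "main", result) on a page containing <div id="main">...</div> would fill result with everything between the opening tag and its matching closing tag, nested divs included.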
    Your biological and technological distinctiveness will be added to our own. Resistance is futile.

    Please ask Qt related questions on the forum and not using private messages or visitor messages.


  4. #4

    Re: Webkit: extract information from HTML

    Quote Originally Posted by wysota View Post
    ... QWebElement wouldn't work with it anyway ...
    I disagree!
    I have used it many times and it works. It has been reliable with every HTML page I've tested.
    If a QWebView can render it, QWebElement can parse it.

  5. #5

    Re: Webkit: extract information from HTML

    Quote Originally Posted by javimoya View Post
    I disagree!
    I have used it many times and it works. It has been reliable with every HTML page I've tested.
    If a QWebView can render it, QWebElement can parse it.
    Maybe. Nevertheless, the resulting tree might be different from what you would expect.


  6. #6

    Re: Webkit: extract information from HTML

    I don't know why you want to do this, but if it's a regular thing where you want to extract some information from a web page at regular intervals, then a better choice might be Python coupled with something like Beautiful Soup. It's pretty easy to use even if you don't know Python (it took me about 15 minutes to parse a web page in exactly the way I wanted, and I had never used Python before). You can tell it (copied from the website) "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose URLs match 'foo.com'", or "Find the table heading that's got bold text, then give me that text." I find it perfect for screen scraping. It also doesn't choke on invalid XML.

  7. #7

    Re: Webkit: extract information from HTML

    I guess using QtScript (or some other JavaScript engine) with jQuery is also an option if we're talking about alternative approaches, provided QtScript understands the DOM. If not, then you have to wrap it all in QWebPage.


  8. #8

    Re: Webkit: extract information from HTML

    Wow, so many replies.

    Looks like I started a debate here.

    What I want to do is quite simple. I want to extract a "metascore" from this website: http://www.metacritic.com/

    I suspect QNetworkAccessManager is my best bet, since the parsing required is rather simple.
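
    For a parse that simple, something along these lines might be enough once the page body has been downloaded. This is only a sketch: it uses std::regex so it builds without Qt, the helper name extractScore is made up, and the markup it expects (an element whose class attribute contains the given name, wrapping a 1 to 3 digit number) is an assumption. The site's real markup would need to be checked first.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first 1-3 digit number that is the sole
// text of an element whose class attribute contains the word `cls`,
// or -1 when nothing matches.
// Assumptions: double-quoted attributes, lowercase tag names, and a `cls`
// made only of letters, digits, '-' and '_'.
int extractScore(const std::string &html, const std::string &cls)
{
    const std::regex re(R"(<[a-z0-9]+\s[^>]*class\s*=\s*"[^"]*\b)" + cls +
                        R"(\b[^"]*"[^>]*>\s*(\d{1,3})\s*<)");
    std::smatch m;
    if (!std::regex_search(html, m, re))
        return -1;
    return std::stoi(m[1].str());
}
```

    With a fragment such as <span class="metascore positive">87</span> (an invented example), extractScore(html, "metascore") would return 87.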
