
Thread: retrieve web content from url and parse it

  1. #1
    Join Date
    Jun 2012
    Posts
    38
    Thanks
    10
    Thanked 1 Time in 1 Post
    Qt products
    Qt4
    Platforms
    Unix/X11

    Default retrieve web content from url and parse it

    I am thinking about writing a Qt application that will:


    1. download an HTML page (store the source in a QString)
    2. parse the HTML in the QString and build a QList of URLs (links)
    3. download each link found, in parallel, in separate threads


    I looked at QNetworkAccessManager and QWebView for doing this, but both seem a bit overkill.

    Do I need to write this asynchronously? I mean, can't I just start the download in a separate thread and, when it is done, send the result back to the main thread?
    -will this cause the main GUI thread to lock up?

    What would be the simplest way of doing this?

  2. #2
    Join Date
    Dec 2009
    Location
    New Orleans, Louisiana
    Posts
    791
    Thanks
    13
    Thanked 153 Times in 150 Posts
    Qt products
    Qt5
    Platforms
    MacOS X

    Default Re: retrieve web content from url and parse it

    QNetworkAccessManager, QNetworkRequest, and QNetworkReply are all very easy to use in my opinion, and that's what I'd recommend you use for this task.

    It's true that the Qt networking classes above are asynchronous by nature, and you can force synchronous behaviour by using a QEventLoop, but that generally prompts the question "Why isn't using signals/slots acceptable?"

    Threading is much more complex to do correctly. Most people who wind up implementing threading in Qt cobble together something that works most of the time but isn't actually implemented correctly. If you insist on forcing synchronous behaviour, then the snippet below should get you started:

    Qt Code:
    QNetworkAccessManager nam;
    QNetworkRequest req(QUrl("http://google.com"));
    QNetworkReply *reply = nam.get(req);
    QEventLoop loop;
    QObject::connect(reply, &QNetworkReply::finished, &loop, &QEventLoop::quit);
    loop.exec();   // blocks here until the reply emits finished()
    QByteArray buffer = reply->readAll();
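
    For comparison, here's what the plain signal/slot version looks like. A minimal sketch, assuming a QObject-based class (the Downloader name and its handleReply() slot are made up for illustration):

    Qt Code:
    // One manager per class is enough; its finished() signal fires
    // once for every request it handles.
    QNetworkAccessManager *nam = new QNetworkAccessManager(this);
    connect(nam, &QNetworkAccessManager::finished,
            this, &Downloader::handleReply);
    nam->get(QNetworkRequest(QUrl("http://google.com")));

    // Slot called once per finished request; nothing blocks.
    void Downloader::handleReply(QNetworkReply *reply)
    {
        QByteArray buffer = reply->readAll();
        reply->deleteLater();
        // ... parse 'buffer' and queue follow-up requests here ...
    }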

  3. #3
    Join Date
    Mar 2015
    Posts
    24
    Qt products
    Qt5
    Platforms
    Windows

    Default Re: retrieve web content from url and parse it

    You can send a GET request to the web server using QTcpSocket and receive the HTML code in return.
    Then do your own parsing.
    Just open up a socket, connect to the website (example.com), and then send:
    Qt Code:
    GET /dir.html HTTP/1.1\r\nHost: example.com\r\n\r\n

    Then you'll receive the HTTP headers back, along with the HTML code.
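
    A rough blocking sketch of that idea (the Connection: close header is my addition so the server closes the socket once the response is complete; the waitFor* calls are fine in a worker thread but would freeze a GUI thread):

    Qt Code:
    #include <QTcpSocket>

    QTcpSocket socket;
    socket.connectToHost("example.com", 80);
    if (socket.waitForConnected(5000)) {
        socket.write("GET /dir.html HTTP/1.1\r\n"
                     "Host: example.com\r\n"
                     "Connection: close\r\n\r\n");
        socket.waitForBytesWritten();
        QByteArray response;
        while (socket.waitForReadyRead(5000))
            response.append(socket.readAll());
        // 'response' now holds the HTTP headers followed by the HTML body
    }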

  4. #4
    Join Date
    Jun 2012
    Posts
    38
    Thanks
    10
    Thanked 1 Time in 1 Post
    Qt products
    Qt4
    Platforms
    Unix/X11

    Default Re: retrieve web content from url and parse it

    Thank you all.

    I have decided to go for the QNetworkAccessManager with the signal/slot option. I still think it is overkill, but at least I learn by using it :-)

    QTcpSocket was not feasible for what I wanted to do, as not all sites would return a header..


    Once again, thanks for taking your time.

  5. #5
    Join Date
    Mar 2009
    Location
    Brisbane, Australia
    Posts
    7,729
    Thanks
    13
    Thanked 1,610 Times in 1,537 Posts
    Qt products
    Qt4 Qt5
    Platforms
    Unix/X11 Windows
    Wiki edits
    17

    Default Re: retrieve web content from url and parse it

    QNetworkAccessManager is almost the least you could do to handle connections and their failures. There is no point reinventing that wheel. If you use this approach then you still need to parse whatever marginally compliant HTML is returned (if indeed the returned content is HTML at all). That is by far the least trivial part of this exercise to do properly from first principles. There is a reason that browsers are beasts.

    You can use the very capable QWebPage and QWebFrame::findAllElements() together to do the link extraction work without having to worry so much about the networking level. The example in the QWebPage detailed description shows how to fetch the content and where to parse it.

    BTW: every HTTP server must return a header before the content.
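
    A rough sketch of that approach, assuming the QtWebKit module is available; the "a[href]" CSS selector (all anchor elements with a link) and the example.com URL are placeholders:

    Qt Code:
    #include <QWebPage>
    #include <QWebFrame>
    #include <QWebElement>
    #include <QDebug>

    QWebPage page;
    QObject::connect(&page, &QWebPage::loadFinished, [&page](bool ok) {
        if (!ok)
            return;
        // Runs after the page, including its JavaScript, has loaded.
        const QWebElementCollection anchors =
            page.mainFrame()->findAllElements("a[href]");
        for (const QWebElement &a : anchors)
            qDebug() << a.attribute("href");
    });
    page.mainFrame()->load(QUrl("http://example.com"));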

  6. #6
    Join Date
    Jun 2012
    Posts
    38
    Thanks
    10
    Thanked 1 Time in 1 Post
    Qt products
    Qt4
    Platforms
    Unix/X11

    Default Re: retrieve web content from url and parse it

    Hi again,

    I finally had time to look into this. For the time being, I chose to try out QNetworkAccessManager. It works, but I have a problem which might not be related to Qt at all; it would be great if someone here could confirm.

    Example: using a Firefox browser, I go to Google Images and search for anything. If I right-click and pick "View Page Source", I see lots of URLs, some looking like these:

    ....imgurl=http://www.test.de/1.jpg&..... and so on.

    If I open Google Images with the same search conditions in Qt via QNetworkAccessManager and print the data of the QByteArray, the source is different. There are no links with the keyword imgurl.


    I am not an expert, but I believe the Google Images site is using some JavaScript that hides the links, or that needs to be executed. Or?

    I basically need to trick the web page into believing I am a Firefox browser :-) but this apparently is not easy, at least not with QNetworkAccessManager. Will QWebView be any better at this?

    I know this is probably too much to ask, but it would be really nice for my soul to get someone else's opinion on this.

    thanks.

  7. #7
    Join Date
    Dec 2009
    Location
    New Orleans, Louisiana
    Posts
    791
    Thanks
    13
    Thanked 153 Times in 150 Posts
    Qt products
    Qt5
    Platforms
    MacOS X

    Default Re: retrieve web content from url and parse it

    Have you tried setting the "User-Agent" header to the same value used by your browser? You may need other headers like "Accept" as well (and possibly others).

    Edit: added the hyphen to User-Agent. As an example, here's the User-Agent from my Firefox browser on the Mac:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Firefox/31.0

  8. #8
    Join Date
    Nov 2012
    Posts
    47
    Thanks
    5
    Thanked 1 Time in 1 Post
    Qt products
    Qt4 Qt5
    Platforms
    Unix/X11 Windows Android

    Default Re: retrieve web content from url and parse it

    Quote Originally Posted by gig-raf View Post
    If I open Google Images with the same search conditions in Qt via QNetworkAccessManager and print the data of the QByteArray, the source is different. There are no links with the keyword imgurl. [...] Will QWebView be any better at this?
    What source are you getting?
    I'm a newbie. Don't trust me

  9. #9
    Join Date
    Jun 2012
    Posts
    38
    Thanks
    10
    Thanked 1 Time in 1 Post
    Qt products
    Qt4
    Platforms
    Unix/X11

    Default Re: retrieve web content from url and parse it

    I will try setting the header and will let you know. I did some tests with the fancybrowser example (QWebView); there the pages are presented correctly, as the JavaScript is executed. So another way would be to use that class instead.

    I will let you know what I come up with. I tried to do this in Python some years ago and never succeeded; Google does what it can to make it hard to scrape its images.

    But I don't want to give up. I know it is possible, and I want to learn how to do it. :-)

  10. #10
    Join Date
    Jun 2012
    Posts
    38
    Thanks
    10
    Thanked 1 Time in 1 Post
    Qt products
    Qt4
    Platforms
    Unix/X11

    Default Re: retrieve web content from url and parse it

    Thank you so much!!!

    After setting the User-Agent I got what I wanted!! I am so happy!

    Qt Code:
    QNetworkRequest request;
    request.setUrl(QUrl(url));
    request.setRawHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Firefox/31.0");
    reply = qnam.get(request);
    ..
    ..
