Hi,

I would like to write a small program that extracts some information from a website,
such as meta definitions, headlines, or links.

I started by finding the a-tags and checking whether they reference internal or external sites.
The HTML code had been saved to a temp file on my disk beforehand.

At first it worked really well. Later I tested it with some other sites ...
sometimes there are results and sometimes nothing.
So I compared the sources line by line and found the problem.

If the website references external resources such as JavaScript or pictures, my code doesn't work.
For example:
<link href='http://fonts.googleapis.com/css?family=Questrial' rel='stylesheet' type='text/css'>
<img src="http://www.website.de/images/slider/apparatur.jpg" title="Zahntechnisches Labor" alt="Zahntechnisches Labor">

So I deleted all these external resources from the source code in my temp file, and then it worked.
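
The manual cleanup described above could presumably be automated before the HTML is handed to the parser. The following is only a minimal sketch in plain C++ with std::regex, assuming that removing link tags, img tags, and empty script tags with a src attribute is enough. The helper name stripExternalTags is hypothetical, and a regex is a crude substitute for a real HTML parser (for example, it breaks on attribute values that contain '>').

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Qt): removes tags that reference external
// resources from raw HTML. Regex-based and deliberately crude; assumes that
// attribute values do not contain '>'.
std::string stripExternalTags(const std::string& html)
{
    // External scripts: <script ... src=...></script> with an empty body.
    // Inline scripts (without src) are left untouched.
    static const std::regex externalScript(
        R"(<script\b[^>]*\bsrc\b[^>]*>\s*</script\s*>)", std::regex::icase);

    // Void elements: <link ...> and <img ...>.
    static const std::regex voidTags(
        R"(<(link|img)\b[^>]*>)", std::regex::icase);

    const std::string withoutScripts = std::regex_replace(html, externalScript, "");
    return std::regex_replace(withoutScripts, voidTags, "");
}
```

The cleaned string could then be passed to setHtml in place of the raw file content. Whether this actually avoids the empty result is untested here.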

Therefore my question is how I can suppress such external accesses. For my small console application,
graphical representation doesn't matter.

Qt 5.2 code:

#include <QDebug>
#include <QFile>
#include <QWebElement>   // also declares QWebElementCollection
#include <QWebFrame>
#include <QWebPage>

void Crawler::crawl_Page()
{
    QWebPage frame;

    // Previously saved HTML source; a stack object avoids leaking the QFile
    QFile file("D:/temp.html");

    if (file.open(QIODevice::ReadOnly | QIODevice::Text))
    {
        qDebug() << "Open tempfile ";

        QString htmlContent = file.readAll();
        qDebug() << "Count html :: " << htmlContent.count();

        // Hand the HTML to the page's main frame
        frame.mainFrame()->setHtml(htmlContent);
        qDebug() << "Mainframe size :: " << frame.mainFrame()->contentsSize();

        QWebElement doc = frame.mainFrame()->documentElement();

        // Collect all a-tags and print their href attributes
        QWebElementCollection linkCollection = doc.findAll("a");
        qDebug() << "Found " << linkCollection.count() << " links";

        foreach (QWebElement link, linkCollection) {
            qDebug() << "found link " << link.attribute("href");
        }
    }
}

Results if it works:
result.jpg

There is no error message if it doesn't work (sites with external resources),
only qDebug ... "Found 0 links" ...

Thx