Hi,
I would like to write a small program that extracts some information from a website,
for example meta definitions, headlines or links.
I started by finding the a-tags and checking whether they refer to internal or external sites.
The HTML code had been saved to a temp file on my disk beforehand.
At first it worked really well. Later I tested it with some other sites ...
sometimes there were results and sometimes nothing.
So I compared the sources line by line and found the problem:
if the website references external resources such as JavaScript or pictures, my code doesn't work.
For example:
<link href='http://fonts.googleapis.com/css?family=Questrial' rel='stylesheet' type='text/css'>
<img src="http://www.website.de/images/slider/apparatur.jpg" title="Zahntechnisches Labor" alt="Zahntechnisches Labor">
So I deleted all these external resources from the source code in my temp file, and then it worked.
So my question is: how can I suppress such external accesses? For my small console application
the graphical representation doesn't matter.
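The manual cleanup described above could probably be automated before the HTML is handed to setHtml(). Below is a rough sketch in plain standard C++ (no Qt), assuming it is acceptable to drop every script block and every link/img tag. The helper name stripExternalResources is made up for illustration, and regex-based HTML editing is fragile, so this is a pre-filter for experimenting rather than a proper way to stop QtWebKit from loading resources.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Qt): removes <script>...</script> blocks
// and <link ...> / <img ...> tags so the remaining HTML no longer references
// those resources. Regexes do not understand HTML, so unusual markup
// (for example a '>' inside an attribute value) will not be handled.
std::string stripExternalResources(const std::string& html)
{
    // Script blocks, including any inline content between the tags.
    static const std::regex scriptBlock(R"(<script\b[^>]*>[\s\S]*?</script\s*>)",
                                        std::regex::icase);
    // Tags without a closing counterpart that pull in stylesheets or images.
    static const std::regex voidTag(R"(<(link|img)\b[^>]*>)",
                                    std::regex::icase);

    const std::string withoutScripts = std::regex_replace(html, scriptBlock, "");
    return std::regex_replace(withoutScripts, voidTag, "");
}
```

The filtered text could then be converted with QString::fromStdString() and passed to setHtml(). Note that this also removes link and img tags that point to local files, and that std::regex may be slow on very large pages.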
Qt 5.2 code:
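#include <QDebug>
#include <QFile>
#include <QWebElement>
#include <QWebFrame>
#include <QWebPage>

void Crawler::crawl_Page()
{
    QWebPage frame;
    // Stack object instead of new: no leak, the file is closed automatically.
    QFile file("D:/temp.html");
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        qDebug() << "Cannot open tempfile:" << file.errorString();
        return;
    }
    qDebug() << "Open tempfile";
    // Assumes the saved page is UTF-8 encoded.
    const QString htmlContent = QString::fromUtf8(file.readAll());
    qDebug() << "Count html ::" << htmlContent.count();
    frame.mainFrame()->setHtml(htmlContent);
    qDebug() << "Mainframe size ::" << frame.mainFrame()->contentsSize();
    // Collect all a-tags and print their href attribute.
    QWebElement doc = frame.mainFrame()->documentElement();
    QWebElementCollection linkCollection = doc.findAll("a");
    qDebug() << "Found" << linkCollection.count() << "links";
    foreach (const QWebElement &link, linkCollection) {
        qDebug() << "found link" << link.attribute("href");
    }
}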
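The loop above only prints each href. The internal/external distinction mentioned at the beginning could look roughly like the following sketch in plain standard C++. The helper names hostOf and isInternalLink are invented for illustration. It assumes that "internal" means "relative, or same host as the page", it ignores ports and user info, and it does not treat schemes like mailto: specially (they end up as internal). In a Qt program QUrl would likely be the more robust tool for this.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns the lower-cased host of an absolute or
// protocol-relative URL, or an empty string for relative URLs.
std::string hostOf(const std::string& url)
{
    std::string::size_type hostStart = std::string::npos;
    if (url.compare(0, 2, "//") == 0) {
        hostStart = 2;  // protocol-relative form: //host/path
    } else {
        const std::string::size_type sep = url.find("://");
        // Accept "://" only if it comes before any path, query or fragment,
        // so "/redirect?u=http://x" is still treated as relative.
        if (sep != std::string::npos && url.find_first_of("/?#") > sep)
            hostStart = sep + 3;
    }
    if (hostStart == std::string::npos)
        return std::string();

    const std::string::size_type hostEnd = url.find_first_of("/?#", hostStart);
    std::string host = (hostEnd == std::string::npos)
                           ? url.substr(hostStart)
                           : url.substr(hostStart, hostEnd - hostStart);
    std::transform(host.begin(), host.end(), host.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return host;
}

// Hypothetical helper: a link counts as internal when it has no host
// (relative link) or when its host equals the host of the page URL.
bool isInternalLink(const std::string& href, const std::string& pageUrl)
{
    const std::string host = hostOf(href);
    return host.empty() || host == hostOf(pageUrl);
}
```

Inside the foreach loop this could be called with link.attribute("href").toStdString() and the URL the temp file was originally downloaded from.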
Results if it works:
result.jpg
There is no error message when it doesn't work (sites with external resources);
qDebug only prints "Found 0 links".
Thx