Parsing HTML with libxml2

Mr Bean

Hello,

I would like to parse a website using libxml2 and libcurl. Fetching the page already works fine. I can also build an HTML tree with libxml2 and print the individual tags and their attributes. However, I can't get at the content between the tags. Can you help me out here?
Here is my code so far:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

#include <libxml/HTMLparser.h>

#include "dataMiner.h"

// Define our struct for accepting LC's output
struct BufferStruct
{
	char * buffer;
	size_t size;
};

// Forward declarations: both functions are defined after main()
static size_t WriteMemoryCallback (void *ptr, size_t size, size_t nmemb, void *data);
void walkTree(xmlNode * a_node);

int main()
{
	curl_global_init( CURL_GLOBAL_ALL );
	CURL * myHandle;
	CURLcode result;							// We'll store the result of CURL's webpage retrieval, for simple error checking.
	struct BufferStruct output;

	htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);
	//htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);

	output.buffer = NULL;
	output.size = 0;

	myHandle = curl_easy_init ( ) ;				// Notice the lack of major error checking, for brevity

	curl_easy_setopt(myHandle, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);		// Passing the function pointer to LC
	curl_easy_setopt(myHandle, CURLOPT_WRITEDATA, (void *)&output); 			// Passing our BufferStruct to LC
	curl_easy_setopt(myHandle, CURLOPT_URL, URL_DEF);
	result = curl_easy_perform( myHandle );
	if(result != CURLE_OK)
	{
		printf("curl_easy_perform error result: %d\n", result);
	}
	curl_easy_cleanup( myHandle );
	//printf("received page: %s", output.buffer);
	printf("received bytes: %zu\n", output.size);
	htmlParseChunk(parser, output.buffer, (int)output.size, 0);
	htmlParseChunk(parser, NULL, 0, 1);			// terminate parsing: no data, so the size is 0 (an int), not NULL

	htmlDocPtr doc = parser -> myDoc;
	walkTree(xmlDocGetRootElement(doc));

	// Release everything that was allocated above
	htmlFreeParserCtxt(parser);					// does not free the document itself
	xmlFreeDoc(doc);
	free(output.buffer);
	curl_global_cleanup();

	return 0;
}

// This is the function we pass to LC, which writes the output to a BufferStruct
static size_t WriteMemoryCallback (void *ptr, size_t size, size_t nmemb, void *data)
{
	size_t realsize = size * nmemb;

	struct BufferStruct * mem = (struct BufferStruct *) data;

	// Keep the old pointer until realloc has succeeded, so it is not lost on failure
	char * grown = realloc(mem->buffer, mem->size + realsize + 1);

	if ( !grown )
	{
		return 0;	// returning less than realsize makes libcurl abort the transfer
	}
	mem->buffer = grown;
	memcpy( &( mem->buffer[ mem->size ] ), ptr, realsize );
	mem->size += realsize;
	mem->buffer[ mem->size ] = 0;
	return realsize;
}

void walkTree(xmlNode * a_node)
{
	xmlNode *cur_node = NULL;
	xmlAttr *cur_attr = NULL;

	for (cur_node = a_node; cur_node; cur_node = cur_node->next) // do something with that node information, like... printing the tag's name and attributes
	{	
		printf("Got tag : %s\n", cur_node->name);
		for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next) 
		{
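			// xmlNode has no textContent member; xmlGetProp returns a newly allocated copy of the attribute value
			xmlChar *value = xmlGetProp(cur_node, cur_attr->name);
			printf("  -> with attribute : %s\n", cur_attr->name);
			printf("  -> and value: %s\n", value ? (const char *)value : "");
			xmlFree(value);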
		}
		walkTree(cur_node->children);
	}
}

dataMiner.h really only contains the definition of URL_DEF.

How do I proceed from here?

Thanks in advance!

Regards
Bean

?
Why do you want to parse HTML with libxml?

xml != http

Or did I miss something here?

libxxxml wrote:

?
Why do you want to parse HTML with libxml?

xml != http

Or did I miss something here?

I meant xml != html, of course.

Mr Bean

True, the library is really meant for XML. But HTML and XML are similar, and libxml2 ships its own HTML parser.

http://www.xmlsoft.org/html/libxml-HTMLparser.html

Do you have a better idea for how I can get at the content of an HTML page? I'm not tied to libxml; it's just what I have found so far.

Regards

Bean

Wutz

Nodes don't have content in that sense; they have children, and those in turn consist of tags and text.
Starting at your node, you can walk downward and collect the content of the children with xmlNodeGetContent. It returns an xmlChar * (effectively a char *), which is presumably what you are after. Roughly like this:

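void walkTree(xmlNode * a_node)
{
    xmlNode *cur_node;

    for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
        /* xmlStrcasecmp replaces the non-standard stricmp and takes xmlChar * directly */
        if (cur_node->type == XML_ELEMENT_NODE && !xmlStrcasecmp(cur_node->name, BAD_CAST "body")) {
            xmlChar *text = xmlNodeGetContent(cur_node); /* newly allocated copy of all text below <body> */
            if (text) {
                puts((const char *)text);
                xmlFree(text); /* the caller owns the returned string and has to release it */
            }
            return;
        }

        walkTree(cur_node->children);
    }

}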
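
To illustrate why the text is not stored on the element itself, here is a small self-contained sketch. It deliberately does not use libxml2: struct node and node_text are hypothetical stand-ins made up for this example, and they only mirror the type, content, children and next fields of xmlNode. The recursion is roughly what xmlNodeGetContent does for an element: it concatenates the content of all text nodes below it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a parsed tree; this is NOT libxml2's xmlNode. */
enum node_type { ELEMENT_NODE, TEXT_NODE };

struct node {
    enum node_type type;
    const char *name;       /* tag name for elements, "text" for text nodes */
    const char *content;    /* only set on text nodes, NULL on elements */
    struct node *children;  /* first child, NULL if there is none */
    struct node *next;      /* next sibling, NULL at the end of the list */
};

/*
 * Append the text found at or below n to out.
 * out must already hold a NUL-terminated string and have room for cap bytes.
 * Text that does not fit is silently truncated.
 */
static void node_text(const struct node *n, char *out, size_t cap)
{
    assert(n != NULL && out != NULL && cap > 0);

    if (n->type == TEXT_NODE) {
        /* Only text nodes carry content of their own. */
        if (n->content)
            strncat(out, n->content, cap - strlen(out) - 1);
        return;
    }

    /* An element contributes nothing itself; descend into its children. */
    for (const struct node *c = n->children; c; c = c->next)
        node_text(c, out, cap);
}
```

For a tree that represents <p>Hello <b>world</b>!</p>, the p element has three children: the text node "Hello ", the element b (which has the text node "world" as its only child), and the text node "!". Calling node_text on p with an empty buffer fills it with "Hello world!", even though p itself has no content.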

Mr Bean

Hello Wutz!

Thanks for the help! That got me further.
Which memory do I still need to release with free?

Regards

Bean