I have successfully created my first C Python module!
The purpose of the module is to extract just the text from an HTML document. The problem with existing solutions is that they are all based on semantically parsing the HTML into some kind of tree structure. This is much more complicated than necessary, and also brittle: an unclosed tag can cause the parsing to fail, even though it has no impact on extracting the text.
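So what this module does is simply walk the string from left to right, copying every character that is not part of a tag toward the beginning of the same string. The result is an extremely efficient in-place “compaction”: no second buffer is allocated, and each character is read once and written at most once.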
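To make the compaction idea concrete before the full listing, here is a minimal sketch of just the two-index copy. It assumes that every `<...>` run is a tag and ignores the `<body>` and `<script>` handling that the real function needs; the name `strip_tags_inplace` is made up for this illustration and is not part of the module.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper for illustration only: removes every <...> run
   from s in place and returns the new length (excluding the terminator). */
size_t strip_tags_inplace(wchar_t *s)
{
    size_t from;        /* read index into s */
    size_t to = 0;      /* write index into the same buffer */
    int in_tag = 0;     /* nonzero while between '<' and '>' */

    for (from = 0; s[from] != L'\0'; from++) {
        if (s[from] == L'<')
            in_tag = 1;             /* a tag opens: stop copying */
        else if (s[from] == L'>' && in_tag)
            in_tag = 0;             /* the tag closes: resume with the next character */
        else if (!in_tag)
            s[to++] = s[from];      /* to <= from, so unread characters are never overwritten */
    }
    s[to] = L'\0';
    return to;
}
```

Because the write index can never pass the read index, the copy is safe within a single buffer. An unclosed tag at the end of the input is simply dropped rather than causing a failure.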
I thought about implementing this method in Python, but the problem I could not get around was the immutability of Python strings. Because a string cannot be modified in place, every change produces a new string, which seems like it would require far too many copies for C-style character-by-character processing.
Here is the core algorithm. As always, feel free to include this in whatever you are working on.
#include <wchar.h>
#include <wctype.h>
//
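// does what the name implies: case-insensitive wide-character string comparison out to the count'th character
// returns 1 if the first count characters match and 0 otherwise (note: the opposite sense of strncmp)
//
int wcsincmp(const wchar_t* str1, const wchar_t* str2, size_t count)
{
    size_t i; //same type as count, avoids a signed/unsigned comparison
    for(i=0; i<count; i++)
        if(towupper(str1[i]) != towupper(str2[i]))
            return 0;
    return 1;
}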
//
// crushes the HTML string down in place, leaving only the body text behind; returns the length of the new string (minus null terminator, as standard wcslen does)
// if no <body tag is found, the result is an empty string
//
size_t clean(wchar_t* html)
{
    size_t copy_to = 0; //the location inside string html that characters are being copied to
    size_t copy_from = 0; //the location inside string html that characters are being copied from
    int intext = 0; //are we currently looking at text (1) or at a tag and its attributes (0)?
    int inscript = 0; //are we currently inside a <script> element?
    for( ; html[copy_from] != L'\0' && !wcsincmp(html+copy_from, L"<body", wcslen(L"<body")); copy_from++); //find the start of the body
    for( ; html[copy_from] != L'\0' && html[copy_from] != L'>'; copy_from++); //find the end of the body tag
    if(copy_from == 0)
        return 0; //empty input; loop below requires that copy_from has advanced at least 1 character
    for( ; html[copy_from] != L'\0' && !wcsincmp(html+copy_from, L"</body", wcslen(L"</body")); copy_from++)
    { //finally, go through the body until we find the end of body tag or the end of the string
        if(wcsincmp(html+copy_from, L"<script", wcslen(L"<script")))
        {
            inscript = 1;
            intext = 0; //stop copying, otherwise the tail of the closing </script> tag leaks into the text
        }
        if(inscript)
        { //inside a script, just look for the end of the script
            if(wcsincmp(html+copy_from, L"</script>", wcslen(L"</script>")))
                inscript = 0;
        }
        else
        { //otherwise, are we in a tag or not in a tag?
            if(html[copy_from-1] == L'>')
                intext = 1; //if LAST character is '>', start copying again
            if(html[copy_from] == L'<')
                intext = 0; //if CURRENT character is '<', stop copying
            if(intext)
                html[copy_to++] = html[copy_from]; //copy, then increment index for next time
        }
    }
    html[copy_to] = L'\0'; //index was left one past the last valid character by loop
    return copy_to;
}