I’ve fixed a bug in the previous HTML crushing function, and greatly improved the performance. Also, I will explain the way the algorithm works in a bit more detail.
The basic concept of the algorithm is to do a single (destructive) pass over the HTML string, without allocating any new memory. To achieve this, it keeps track of two offsets inside the HTML string: copy_from and copy_to. Each run through the loop, copy_from is incremented. Then, there is a bunch of code for checking the current syntactic state of the copy_from location within the HTML (and set some flags depending on that state). Depending on how those flags have been set, the character at offset copy_from may or may not be copied to offset copy_to.
This operation can be visualized as “crushing” the string, by moving all of the text parts of the HTML next to each other on the left side of the string. The final result is that the shorter string consisting of only the text part of the HTML is written over the original HTML string.
The following improvements were made for this version of the code:
- If no <body> is present, will simply grab everything (useful for snippets of HTML instead of only whole pages).
- Only checks for special tags (<script>, <body>, <style>, </script>, </body> </style>) in the case that a < character has been detected. This reduces the number of comparisons that must be made on every character down to three.
- Fixed a major bug where it checked for the closure of the previous tag after it checked for the opening of the current tag. If the end of one tag touched the beginning of the other, all of the second tag would be included as if it were text. (e.g. "</p></b>" would be crushed to "/b>" instead of "")
// does what name implies: case insensitive wide-character string comparison out to the count’th character
// NOTE: wcsincmp is safe to call even if one or the other string terminates before count, but is not safe
// to call if BOTH strings may terminate before count
//
int wcsincmp(const wchar_t* str1, const wchar_t* str2, size_t count)
{
int i;
for(i=0; i<count; i++)
if(towupper(str1[i]) != towupper(str2[i]))
return 0;
return 1;
}
//
// crushes the HTML string down, leaving only text behind; returns the length of the new string
//
size_t clean(wchar_t* html)
{
size_t copy_to = 0; //the location inside string html that characters are being copied to
size_t copy_from = 0; //the location inside string html that characters are being copied from
int intext = 1; //are we currently looking at text or attributes inside a tag
int inscript = 0; //are we currently inside a javascript tag?
int instyle = 0; //are we currently inside a css style tag?
int found_body = 0;
//keep going until we reach the end of the string (some conditions inside may cause early termination of the loop)
for( ; html[copy_from] != L’\0′; copy_from++)
{
if(copy_from && html[copy_from-1] == L’>’) // skips this one if copy_from is zero
{
intext = 1; //if LAST character is ’>’, we are out of a tag
indiv_attr = 0; //if we are in text, we are not in the attribute section of a div
}
if(html[copy_from] == L’<’)
{
intext = 0; //if CURRENT character is ’<’, we are back in a tag
//since we are entering a tag, check to see if it is one of the special tags
if(wcsincmp(html+copy_from, L"<script", wcslen(L"<script"))) //NOTE: decent compiler will move wcslen call out of loop
inscript = 1;
if(wcsincmp(html+copy_from, L"<style", wcslen(L"<style")))
instyle = 1;
if(wcsincmp(html+copy_from, L"<body", wcslen(L"<body")))
{
copy_to = 0; //if we find an opening body tag, reset to the beginning of the string
found_body = 1;
}
if(found_body && wcsincmp(html+copy_from, L"</body>", wcslen(L"</body")))
break; //if we find the end body tag, we are done; exit the loop
if(inscript)
{ //inside a script, look for the end of the script
if(wcsincmp(html+copy_from, L"</script>", wcslen(L"</script>")))
inscript = 0;
}
if(instyle)
{ //inside a style, look for the end of the style
if(wcsincmp(html+copy_from, L"</style>", wcslen(L"</style>")))
instyle = 0;
}
}
if(intext && !inscript && !instyle) //skip non-text, script, and style
html[copy_to++] = html[copy_from];//copy, then increment pointer for next time
}
html[copy_to] = L’\0′; //pointer was left one past the last valid character by loop
return copy_to; //length is equal to the last index + 1, aka the index of the null terminator
}