Archive for June, 2009

Extending Python (improved)…

June 26th, 2009

I’ve fixed a bug in the previous HTML crushing function, and greatly improved the performance. Also, I will explain the way the algorithm works in a bit more detail.

The basic concept of the algorithm is to do a single (destructive) pass over the HTML string, without allocating any new memory. To achieve this, it keeps track of two offsets inside the HTML string: copy_from and copy_to. Each run through the loop, copy_from is incremented. Then, there is a bunch of code for checking the current syntactic state of the copy_from location within the HTML (and set some flags depending on that state). Depending on how those flags have been set, the character at offset copy_from may or may not be copied to offset copy_to.

This operation can be visualized as “crushing” the string, by moving all of the text parts of the HTML next to each other on the left side of the string. The final result is that the shorter string consisting of only the text part of the HTML is written over the original HTML string.

The following improvements were made for this version of the code:

  • If no <body> is present, will simply grab everything (useful for snippets of HTML instead of only whole pages).
  • Only checks for special tags (<script>, <body>, <style>, </script>, </body> </style>) in the case that a < character has been detected. This reduces the number of comparisons that must be made on every character down to three.
  • Fixed a major bug where it checked for the closure of the previous tag after it checked for the opening of the current tag. If the end of one tag touched the beginning of the other, all of the second tag would be included as if it were text. (e.g. "</p></b>" would be crushed to "/b>" instead of "")
//
// does what name implies: case insensitive wide-character string comparison out to the count’th character
// NOTE: wcsincmp is safe to call even if one or the other string terminates before count, but is not safe
// to call if BOTH strings  may terminate before count
//
int wcsincmp(const wchar_t* str1, const wchar_t* str2, size_t count)
{
    int i;
    for(i=0; i<count; i++)
       if(towupper(str1[i]) != towupper(str2[i]))
          return 0;
    return 1;
}
 
 
//
// crushes the HTML string down, leaving only text behind; returns the length of the new string
//
size_t clean(wchar_t* html)
{
   size_t copy_to = 0; //the location inside string html that characters are being copied to
   size_t copy_from = 0; //the location inside string html that characters are being copied from
   int intext = 1; //are we currently looking at text or attributes inside a tag
   int inscript = 0; //are we currently inside a javascript tag?
   int instyle = 0; //are we currently inside a css style tag?
   int found_body = 0;
   //keep going until we reach the end of the string (some conditions inside may cause early termination of the loop)
   for( ; html[copy_from] != L’\0′; copy_from++)
   { 
      if(copy_from && html[copy_from-1] == L’>’) // skips this one if copy_from is zero
      {
         intext = 1; //if LAST character is ’>’, we are out of a tag
         indiv_attr = 0; //if we are in text, we are not in the attribute section of a div
      }
      if(html[copy_from] == L’<’)
      {
         intext = 0; //if CURRENT character is ’<’, we are back in a tag
         //since we are entering a tag, check to see if it is one of the special tags
         if(wcsincmp(html+copy_from, L"<script", wcslen(L"<script"))) //NOTE: decent compiler will move wcslen call out of loop
            inscript = 1;                                             
         if(wcsincmp(html+copy_from, L"<style", wcslen(L"<style")))
            instyle = 1;
         if(wcsincmp(html+copy_from, L"<body", wcslen(L"<body")))
         {
            copy_to = 0; //if we find an opening body tag, reset to the beginning of the string
            found_body = 1;
         }
         if(found_body && wcsincmp(html+copy_from, L"</body>",  wcslen(L"</body")))
            break; //if we find the end body tag, we are done; exit the loop
         if(inscript)
         { //inside a script, look for the end of the script
            if(wcsincmp(html+copy_from, L"</script>", wcslen(L"</script>")))
               inscript = 0;
         }
         if(instyle)
         { //inside a style, look for the end of the style
            if(wcsincmp(html+copy_from, L"</style>", wcslen(L"</style>")))
               instyle = 0;
         }
      }
      if(intext && !inscript && !instyle) //skip non-text, script,&nbspand style
         html[copy_to++] = html[copy_from];//copy, then increment pointer for next time
   }
   html[copy_to] = L’\0′; //pointer was left one past the last valid character by loop
   return copy_to; //length is equal to the last index + 1, aka the index of the null terminator
}

Posted in Uncategorized | Comments (0)

Extending Python…

June 15th, 2009

I have successfully created my first C Python module!

The purpose of the module is to extract just the text from an HTML document. The problem with existing solutions, is that they are all based on semantically parsing the HTML into some kind of tree structure. This is much more complicated than necessary, and also brittle (i.e. an unclosed tag can cause the parsing to fail — even though this has no impact on extracting the text.)

So, what this module does is simply go left to right down the string and starts copying everything that is not a tag to the beginning of the string. This does an extremely efficient in-place “compaction”.

I thought about implementing this method in Python, however the problem that I could not get around was the immutability of strings in Python. The fact that strings must all be maintained seems like it would require far too many copies for a C-style character by character processing.

Here is the core algorithm. As always, feel free to include this in whatever you are working on.

 
#include <wchar.h>
#include <wctype.h>
 
//
// does what name implies: case insensitive wide-character string comparison out to the count’th character
//
int wcsincmp(const wchar_t* str1, const wchar_t* str2, size_t count)
{
        int i;
   for(i=0; i<count; i++)
   if(towupper(str1[i]) != towupper(str2[i]))  
      return 0;  
   return 1;
}
 
//
// crushes the HTML string down, leaving only text behind; returns the length of the new string (minus null terminator, as standard strlen does)
//
size_t clean(wchar_t* html)
{
   size_t copy_to = 0; //the location inside string html that characters are being copied to
   size_t copy_from = 0; //the location inside string html that characters are being copied from
   int intext = 0; //are we currently looking at text or attributes inside a tag
   int inscript = 0; //are we currently inside a javascript tag?
   for( ; html[copy_from] != L’\0′ && !wcsincmp(html+copy_from, L"<body", wcslen(L"<body")); copy_from++); //find the start of the body
   for( ; html[copy_from] != L’\0′ && html[copy_from] != ’>’; copy_from++); //find the end of the body tag
   if(copy_from == 0)
      return html; //loop below requires that copy_from has advanced at least 1 character
   for( ; html[copy_from] != ’\0′ && !wcsincmp(html+copy_from, L"</body>", wcslen(L"</body")); copy_from++)
   { //finally, go through the body until we find the end of body tag or the end of the file
      if(wcsincmp(html+copy_from, L"<script", wcslen(L"script")))
         inscript = 1;
      if(inscript)
      { //inside a script, just look for the end of the script
         if(wcsincmp(html+copy_from, L"</script>", wcslen(L"</script>")))
            inscript = 0;
      }
      else
      { //otherwise, are we in a tag or not in a tag?
         if(html[copy_from-1] == L’>’)
            intext = 1; //if LAST character is ’>’, start copying again
         if(html[copy_from] == L’<’)
            intext = 0; //if CURRENT character is ’<’, stop copying
         if(intext)
            html[copy_to++] = html[copy_from]; //copy, then increment pointer for next time           
      }
   }
   html[copy_to] = L’\0′; //pointer was left one past the last valid character by loop
   return copy_to; 
}

Posted in Uncategorized | Comments (0)