Extending Python (improved)…

June 26th, 2009
by Kurt

I’ve fixed a bug in the previous HTML crushing function, and greatly improved the performance. Also, I will explain the way the algorithm works in a bit more detail.

The basic concept of the algorithm is to do a single (destructive) pass over the HTML string, without allocating any new memory. To achieve this, it keeps track of two offsets inside the HTML string: copy_from and copy_to. Each run through the loop, copy_from is incremented. Then, there is a bunch of code for checking the current syntactic state of the copy_from location within the HTML (and set some flags depending on that state). Depending on how those flags have been set, the character at offset copy_from may or may not be copied to offset copy_to.

This operation can be visualized as “crushing” the string, by moving all of the text parts of the HTML next to each other on the left side of the string. The final result is that the shorter string consisting of only the text part of the HTML is written over the original HTML string.

The following improvements were made for this version of the code:

  • If no <body> is present, will simply grab everything (useful for snippets of HTML instead of only whole pages).
  • Only checks for special tags (<script>, <body>, <style>, </script>, </body> </style>) in the case that a < character has been detected. This reduces the number of comparisons that must be made on every character down to three.
  • Fixed a major bug where it checked for the closure of the previous tag after it checked for the opening of the current tag. If the end of one tag touched the beginning of the other, all of the second tag would be included as if it were text. (e.g. "</p></b>" would be crushed to "/b>" instead of "")
//
// does what name implies: case insensitive wide-character string comparison out to the count’th character
// NOTE: wcsincmp is safe to call even if one or the other string terminates before count, but is not safe
// to call if BOTH strings  may terminate before count
//
int wcsincmp(const wchar_t* str1, const wchar_t* str2, size_t count)
{
    int i;
    for(i=0; i<count; i++)
       if(towupper(str1[i]) != towupper(str2[i]))
          return 0;
    return 1;
}
 
 
//
// crushes the HTML string down, leaving only text behind; returns the length of the new string
//
size_t clean(wchar_t* html)
{
   size_t copy_to = 0; //the location inside string html that characters are being copied to
   size_t copy_from = 0; //the location inside string html that characters are being copied from
   int intext = 1; //are we currently looking at text or attributes inside a tag
   int inscript = 0; //are we currently inside a javascript tag?
   int instyle = 0; //are we currently inside a css style tag?
   int found_body = 0;
   //keep going until we reach the end of the string (some conditions inside may cause early termination of the loop)
   for( ; html[copy_from] != L’\0′; copy_from++)
   { 
      if(copy_from && html[copy_from-1] == L’>’) // skips this one if copy_from is zero
      {
         intext = 1; //if LAST character is ’>’, we are out of a tag
         indiv_attr = 0; //if we are in text, we are not in the attribute section of a div
      }
      if(html[copy_from] == L’<’)
      {
         intext = 0; //if CURRENT character is ’<’, we are back in a tag
         //since we are entering a tag, check to see if it is one of the special tags
         if(wcsincmp(html+copy_from, L"<script", wcslen(L"<script"))) //NOTE: decent compiler will move wcslen call out of loop
            inscript = 1;                                             
         if(wcsincmp(html+copy_from, L"<style", wcslen(L"<style")))
            instyle = 1;
         if(wcsincmp(html+copy_from, L"<body", wcslen(L"<body")))
         {
            copy_to = 0; //if we find an opening body tag, reset to the beginning of the string
            found_body = 1;
         }
         if(found_body && wcsincmp(html+copy_from, L"</body>",  wcslen(L"</body")))
            break; //if we find the end body tag, we are done; exit the loop
         if(inscript)
         { //inside a script, look for the end of the script
            if(wcsincmp(html+copy_from, L"</script>", wcslen(L"</script>")))
               inscript = 0;
         }
         if(instyle)
         { //inside a style, look for the end of the style
            if(wcsincmp(html+copy_from, L"</style>", wcslen(L"</style>")))
               instyle = 0;
         }
      }
      if(intext && !inscript && !instyle) //skip non-text, script,&nbspand style
         html[copy_to++] = html[copy_from];//copy, then increment pointer for next time
   }
   html[copy_to] = L’\0′; //pointer was left one past the last valid character by loop
   return copy_to; //length is equal to the last index + 1, aka the index of the null terminator
}

Posted in Uncategorized | Comments (0)

Extending Python…

June 15th, 2009
by Kurt

I have successfully created my first C Python module!

The purpose of the module is to extract just the text from an HTML document. The problem with existing solutions, is that they are all based on semantically parsing the HTML into some kind of tree structure. This is much more complicated than necessary, and also brittle (i.e. an unclosed tag can cause the parsing to fail — even though this has no impact on extracting the text.)

So, what this module does is simply go left to right down the string and starts copying everything that is not a tag to the beginning of the string. This does an extremely efficient in-place “compaction”.

I thought about implementing this method in Python, however the problem that I could not get around was the immutability of strings in Python. The fact that strings must all be maintained seems like it would require far too many copies for a C-style character by character processing.

Here is the core algorithm. As always, feel free to include this in whatever you are working on.

 
#include <wchar.h>
#include <wctype.h>
 
//
// does what name implies: case insensitive wide-character string comparison out to the count’th character
//
int wcsincmp(const wchar_t* str1, const wchar_t* str2, size_t count)
{
        int i;
   for(i=0; i<count; i++)
   if(towupper(str1[i]) != towupper(str2[i]))  
      return 0;  
   return 1;
}
 
//
// crushes the HTML string down, leaving only text behind; returns the length of the new string (minus null terminator, as standard strlen does)
//
size_t clean(wchar_t* html)
{
   size_t copy_to = 0; //the location inside string html that characters are being copied to
   size_t copy_from = 0; //the location inside string html that characters are being copied from
   int intext = 0; //are we currently looking at text or attributes inside a tag
   int inscript = 0; //are we currently inside a javascript tag?
   for( ; html[copy_from] != L’\0′ && !wcsincmp(html+copy_from, L"<body", wcslen(L"<body")); copy_from++); //find the start of the body
   for( ; html[copy_from] != L’\0′ && html[copy_from] != ’>’; copy_from++); //find the end of the body tag
   if(copy_from == 0)
      return html; //loop below requires that copy_from has advanced at least 1 character
   for( ; html[copy_from] != ’\0′ && !wcsincmp(html+copy_from, L"</body>", wcslen(L"</body")); copy_from++)
   { //finally, go through the body until we find the end of body tag or the end of the file
      if(wcsincmp(html+copy_from, L"<script", wcslen(L"script")))
         inscript = 1;
      if(inscript)
      { //inside a script, just look for the end of the script
         if(wcsincmp(html+copy_from, L"</script>", wcslen(L"</script>")))
            inscript = 0;
      }
      else
      { //otherwise, are we in a tag or not in a tag?
         if(html[copy_from-1] == L’>’)
            intext = 1; //if LAST character is ’>’, start copying again
         if(html[copy_from] == L’<’)
            intext = 0; //if CURRENT character is ’<’, stop copying
         if(intext)
            html[copy_to++] = html[copy_from]; //copy, then increment pointer for next time           
      }
   }
   html[copy_to] = L’\0′; //pointer was left one past the last valid character by loop
   return copy_to; 
}

Posted in Uncategorized | Comments (0)

A better way to get random numbers

May 30th, 2009
by Kurt

First, the punchline. There is a method to generate a sequence of pseudo-random numbers six times faster than the standard C rand(). (138 million random integers per second over 26 million on my machine.) Even better, this method only requires seven lines of code.


int lfsr() //linear feedback shift register
{
   const int taps = 0×8000000C; //this is a magic number
   static int rnd_num = 1;
   if(0×80000000 & rnd_num)
      rnd_num = (rnd_num << 1) ^ taps;
   else
      rnd_num = rnd_num << 1;
   return rnd_num;
}

A Linear Feedback Shift Register is a method used by hardware engineers to generate pseudo random numbers at extremely high speed.  As in 3Gbps speed.  When implemented in hardware, an LFSR actually takes fewer transistors than a counter. I’m not going to go into the theory of operation.  If you’d like to explore that, wikipedia has a pretty good article.

Feel free to cut and paste the above function into whatever project you are working on.

The only downside of an LFSR is that it is totally unsuited to cryptography. A maximal period LFSR (like the one I provided) will go through every possible 32 bit integer. This means if you know the taps and the last number, you know all future numbers that will be generated. The more common function used for rand()–the Mersenne Twister algorithm–has more “hidden state” and one has to observe hundreds of outputs before being able to make this prediction.

So, don’t use this method if your purpose is cryptography. However, if you are trying to do a Montecarlo simulation or generating random inputs for unit tests, a Linear Feedback Shift Register is superior to the default C rand().

Edit:
You should only do this substitution in C. If you are working in a higher level language, the language overhead will probably cost more than using the built-in call to the default C implementation.

Posted in Uncategorized | Comments (0)

Javascript Execution Environment

January 23rd, 2009
by Kurt

I’ve set up a Javascript execution environment.

It isn’t much — there are two elements in the page. The bottom part is a Javascript command line which executes whatever single line of Javascript code is entered. The top part is an edit area to write larger pieces of Javascript code. The larger piece of code can be executed by calling the “run()” function from the command line.

Also, a command history is maintained on the command line. Use the up and down arrows to scroll through previously entered commands. The contents of the top editor area can be stored by using the File->Save Cookie and File->Load Cookie functions.

Go ahead and try it, you can start with alert(”Hello World!”) in the command line.

Posted in Uncategorized | Comments (0)

Learning Linux: syntax highlighting in nano

January 21st, 2009
by Kurt

In case you are not familiar with it, nano is a basic text editor that comes with many Linux distributions.  It has a lower learning curve than vi or emacs, but also fewer features.

However, you can spruce nano up a bit by enabling features in its configuration file, “.nanorc” under your home directory.  Specifically, you can provide regular expression based syntax highlighting rules.

For your entertainment, I present meta-syntax highlighting rules for nano. These syntax rules, when put into your .nanrc file, will cause nano to syntax highlight the .nanorc file.

syntax "nanorc" "\.nanorc$"
color brightred "(color|syntax|set|start=|end=)"
color green "(red|brightred|green|brightgreen|blue|brightblue|cyan|brightcyan|white|brightwhite|yellow|brightyellow)"
color brightcyan "#+(.*)"
color brightyellow "\"(([^']|\\\")*|\[(^]]|\\\]|\^\])*\])*\""
color brightblue "[^\\\^]\[(\^\])?([^]]|[\\]{1,3,5,7,9}\])*\]"
color brightgreen "\(" "\)" "[|]"
color brightred "\^"
color red "\\."

Here is what this looks like when applied (to itself):

Posted in Uncategorized | Comments (0)

How to make a cool photo quality 3′x3′ Hubble image poster for under $5…

August 13th, 2008
by Kurt
  1. Download a high quality image from here: Best of Hubble Large Images.
  2. Cut the picture into 1800×1200 pixel chunks (this will look good printed out on standard 4×6 photos). There are probably many ways to do this, the easiest (free) method I have found so far is to use Irfanview. Use “Edit->Create Custom Crop Selection(Shift+C)”. This will let you select an exact rectangle of pixels. Then cut and paste your selection into a separate image file and save that one. Work your way across the image 1800 pixels at a time till you reach the end, then go 1200 pixels down and repeat.
  3. Upload the set of smaller images to a professional photo printer. The best I have found so far is Clark Color. Printing the photos shouldn’t cost more than $0.08 each plus shipping.
  4. Tape the photos up to a flat, clean wall. (Or you could get fancy with glue and frames, but I promised this would be under $5).

Posted in Uncategorized | Comments (0)

Photonics talk

August 8th, 2008
by Kurt

I went to a talk put on by IEEE on Tuesday.
Abstract
Research Group Info

That has a lot of information, but assumes you are already basically familiar with the field. Here is a layman’s summary: using lithographic techniques, one can carve laser devices out of a thin dielectric (non-conducting) surface. The reason this is possible is that modern lithography technology can etch features smaller than visible light wavelengths.

Image Credit USC Viterbi School of Engineering

If the structure is carved out in just the right shape, then sort of like a pendulum it will have a natural oscillation frequency. If the film is excited with energy, it will radiate coherent light at a frequency determined by its geometry. Currently, that energy is in the form of light but hopefully in the future it will be an electric current.

The complicated part is figuring out what geometry will get the light to come out efficiently. That is, in the desired direction and frequency. This takes super computer simulations and intuition to determine.

Fascinating stuff, some of the people in the audience asked some great questions. Dr. O’Brien was mostly interested in getting the technology to the point of being able to integrate with semiconductors.

Tags: ,
Posted in Technical Talks, Uncategorized | Comments (0)

Hello World!

August 6th, 2008
by Kurt

Hello anyone who is actually reading this right now.

I’m not sure if I’m going to use this for anything, but it was neat to set up.

Posted in Uncategorized | Comments (0)

Hello World!

August 6th, 2008
by admin

Hello friends and family that would actually be reading this.

Posted in Uncategorized | Comments (0)