Intro to Event-Driven Programming: Using a SAX Parser

January 26, 2010 — Code

Event-driven programming is not a new paradigm: the concepts were developed in the late 1970s and most UI programming is event-driven. Until recently, however, it has not been a major topic in the back-end web programming world. Now, thanks in part to the proliferation of AJAX and excellent libraries like Twisted and node.js, it is beginning to enter the consciousness of Ruby, Python, and PHP developers. As an introduction to event-driven programming outside of a UI context I will present a solution to a common problem: parsing a large XML file without running out of memory.

The Conventional Method

What you’ve probably been doing is basically batch programming. In batch programming, the flow of execution is completely determined by the programmer via method calls, object instantiation, etc. If you write a batch script to parse an XML file, the sequence of events is something like this:

  1. Load XML file (preferably with a file/IO object).
  2. Parse file, building a huge data structure in memory.
  3. Iterate through each element in the structure and do your business.

If your XML file is very large, this approach is going to fail (or at least become very slow) at step #2 when the data structure won’t fit in memory. In Ruby code, using the Nokogiri gem, the code might look like this:


# Load file.
file = File.new("users.xml")

# Parse file.
doc = Nokogiri::XML(file)

# Do some business with each element.
doc.css("user").each do |user|
  id = user.attribute("Id").to_s
  # more business...
end

The parser used here uses the DOM parsing method, in which the entire Document Object Model is loaded into memory in one step. Fortunately there is another technique for reading an XML document, known as SAX parsing (Simple API for XML), which allows us to deal with elements in the document as they come off the grill, rather than waiting for the whole enchilada to finish cooking.

The Event-Driven Solution

The event-driven solution, using a SAX parser, looks something like this:

  1. Create event handlers which take parsed XML elements and perform business on them.
  2. Give your parser the event handler and the XML file, and set it in motion.

Instead of creating an enormous data structure, the parser reads the file from disk and passes each XML element to the event handler as it is encountered. The event handler deals with the data and then becomes idle again, waiting for the next event.

Program execution is therefore somewhat different than what you are probably used to: the parser runs linearly, but the event handler (which contains the important business logic) is being called willy nilly as the parser goes through the file. Here’s some code demonstrating this technique with Nokogiri’s SAX parser:


##
# Class representing a parse-able XML document
# with a known set of elements.
#
class UsersData < Nokogiri::XML::SAX::Document
  
  ##
  # Handle the start_element event:
  # process element if it's a <user>.
  #
  def start_element(element, attributes)
    if element == 'user'
      a = Hash[*attributes]
      # more business...
    end
  end
end

# Create a parser and give it an event handler object.
parser = Nokogiri::XML::SAX::Parser.new(UsersData.new)

# Turn the parser loose on the data file.
parser.parse_file(File.join("users.xml"))

This design is a little counter-intuitive at first but, in addition to being memory-efficient, it can help organize your code. For example, as an OOP programmer, I like the way a class is created to model the XML document. Once you’re comfortable with the idea of an event it makes a lot of sense to do it this way, especially when parsing multiple files of the same format. In fact, I would argue that creating a class representing the document is a good design even when parsing with the traditional approach.

I haven’t gone into the theory behind event-driven programming, but hopefully this has helped you get your feet wet and made you feel a little more comfortable with event-driven design.


comments powered by Disqus