|Last Update:||2009-12-19 19:29:28 -0600|
For this release, I added a parser called TagSoup. The name is taken from the jargon term used for HTML written without any regard to the rules of HTML structure, i.e. HTML with many common authoring mistakes in.
TagSoup is a very small and very dumb parser which implements the stream API of REXML. The test code compares it against REXML for some simple example XML and makes sure it calls the same callbacks in the same order with the same parameters.
Note that hacking together your own XML parser is, generally speaking, the wrong thing to do. Using TagSoup as a general replacement for REXML is very definitely the wrong thing to do. Please don’t do it.
A real XML parser does all kinds of things that TagSoup doesn’t, like pay attention to DTDs, handle quoted special characters in element attributes, handle whitespace in a documented standard way, and so on. The fact that TagSoup is defective in many areas is intentional. It’s designed to be used as a last resort, for parsing web syndication feeds which are invalid.
As discussed in the README, this is really my fourth attempt at writing RSS parsing code. For the record, I thought I’d list the approaches I tried and abandoned. In a way, that’s more interesting than the one I picked...
First I used hashes for storage and just looked for matching tags. That
approach works, kinda, but it doesn’t really understand nested
elements at all. As a result, it becomes really hard to deal with Atom
feeds, where an
Next I wrote a classic stack-based parser, with a container stack and a text buffer stack. That worked well for RSS; I got it parsing every RSS variant, and even went as far as a test suite. However, as I tried extending it to deal with Atom, I realized that the parser code was becoming hard to follow, as the state machine gained more and more special cases.
For a third iteration, I tried to generalize the knowledge represented by the state machine, by placing it in the context stack. That is, I would have a smart stack that knew which XML elements could go inside other elements. Actually, there would have been four context stacks, for containers, attributes, tags and textual data.
That design never made it past the paper stage, because I realized that I could move all the knowledge into the classes used to create the objects of the final parse tree. With the new model—the one used in this code—the parser really doesn’t know anything about Atom or RSS. It just forwards events to a tree of objects, which construct child objects as appropriate to grow the tree and represent the feed.