More (dirty) XML Secrets

As we all know, XML can be easily parsed with an XML parser. Right? Right… So what happens when XML is not really XML? Well, as we all know, when XML is not XML you resort to text and regular expressions. It’s one of the dirty secrets of XML. And hey, I’m not the only one who uses regex to parse XML. There’s also the speed/memory issue, but right now I’m just concerned with the not-really-XML part of it.

The Universal Feed Parser tries to use XML, and if that fails, does the regex dance.

If XML parsing fails due to well-formedness errors in the feed… …it will automatically fall back to the 2.x-style parser based on regular expressions.
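To make the “try XML first, then do the regex dance” idea concrete, here is a minimal Python sketch. It is not how the Universal Feed Parser actually does it; the names `item_titles` and `TITLE_RE` are made up for illustration, and it only pulls item titles out of an RSS-ish document using the standard library.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fallback pattern: grabs the first <title> inside each <item>.
# Crude on purpose; it has no idea about CDATA, comments, or nesting.
TITLE_RE = re.compile(
    r"<item\b[^>]*>.*?<title\b[^>]*>(.*?)</title>",
    re.DOTALL | re.IGNORECASE,
)

def item_titles(feed_text):
    """Return (titles, how), where how is "xml" or "regex"."""
    try:
        # The strict path: a real XML parser, which rejects anything
        # that is not well-formed.
        root = ET.fromstring(feed_text)
    except ET.ParseError:
        # The dirty path: treat the feed as plain text.
        titles = [t.strip() for t in TITLE_RE.findall(feed_text)]
        return titles, "regex"
    titles = [(item.findtext("title") or "").strip() for item in root.iter("item")]
    return titles, "xml"
```

Note that the regex path hands back the raw text between the tags, so entities are not decoded there, while the XML path decodes them. Anything built on a fallback like this has to live with that kind of inconsistency.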

If you’ve processed a form of XML commonly known as RSS, you might have run into these issues before, because there are feeds that are not well-formed, and therefore invalid, and if you want to be picky, they aren’t really XML… Perl needs a module that does the “try it as XML, and fall back on regex if it ain’t” thing. Why? Because once again I figured I could just use something like XML::DOM to deal with an RSS file, which is supposed to be XML, but when you’ve got a bare & instead of an &amp; in there, it all blows up. (Hmmm, perhaps we should go the other way around: create a pre-filter that takes in the would-be XML, fixes the errors so that it is well-formed, and then passes it on to the XML parser! Could this be done?)
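For the one error named above, the unescaped ampersand, a pre-filter is at least sketchable. The following Python is a rough illustration only, with hypothetical names (`BARE_AMP_RE`, `escape_bare_ampersands`, `parse_leniently`). It assumes the only thing wrong with the feed is bare ampersands, and it will wrongly escape ampersands inside CDATA sections and do nothing about undefined entities such as `&nbsp;`.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does NOT start something shaped like an entity or
# character reference (&name; or &#123; or &#x1F;).
BARE_AMP_RE = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    """Rewrite each bare & as &amp; and leave existing references alone."""
    return BARE_AMP_RE.sub("&amp;", text)

def parse_leniently(feed_text):
    """Pre-filter the text, then hand it to the strict XML parser."""
    return ET.fromstring(escape_bare_ampersands(feed_text))

# Usage: the bare ampersand no longer blows up the parser.
root = parse_leniently("<title>Q&A</title>")
print(root.text)
```

So it can be done for this case, but every other kind of breakage (unclosed tags, bad encodings, stray control characters) would need its own fix-up rule, which is probably why the fall-back-to-regex approach exists in the first place.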

I guess I’ll blame the developers writing the software that creates the invalid XML/RSS. Want more secrets? I’m probably one of them. Most of the code that creates my RSS feeds and my Atom feed is a bunch of Perl with home-brewed templates and regular expressions… Why? Why don’t I use the proper tools? Laziness, lack of… whatever, it doesn’t matter. People are going to do it this way, and even though you would think RSS is simple enough that anyone could create valid markup, we don’t always do that. Sure, I’ve built feed checking into my system, as I don’t want to be a wonk that outputs garbage, but I still have to deal with the garbage out there, and damn is it frustrating.

To rephrase “Be liberal in what you accept, and conservative in what you send” I’d say: “Garbage in” is bad but “garbage out” is worse…

Is there hope? Well, there’s always hope, right? Will Atom save the day, doing what RSS can’t always do? It would be nice, but I’m just not sure… Should we rely on software that requires well-formed XML, and can fall back on plain old regular expressions if needed? I don’t know… I tend to think that’s a hack we shouldn’t need, but only time will tell…