Preface
Well… XML has been around for quite a while, since 1998 to be precise. Yet I still find quite a few folks in the software industry who don’t get it. Why XML? What really made it such a revolution for the web, one we still relish today? One comment I heard recently was over the top: “Let’s do it in XML, because it is easy for the computer to parse”. Now, XML is possibly the single worst format for a computer to parse. Even a plain comma-separated file is more efficient. If efficiency of the data format is the most sought-after feature, nothing beats binary. However, raw binary carries no structure of its own and is commonly referred to as a BLOB. Usually a description of the data, a.k.a. metadata, is kept in a fixed-size header, or maybe a footer, of the file/BLOB, and the layout of that header is usually found in some documentation.
So far this is all common knowledge. XML is a descriptive language, and it lets us easily format structured data while keeping data and metadata tightly together. However, this alone is only part of what makes XML so great. To summarize, the reasons for a developer to consider XML are as follows:
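To make the binary case concrete, here is a minimal sketch of a BLOB with a fixed-size header. The layout (a 4-byte magic string, a version, a record count, then id/price records) is invented for illustration and is not any real file format; the point is that nothing in the bytes themselves tells a reader what they mean.

```python
import struct

# Hypothetical layout, which would have to be documented somewhere outside the file:
# header = 4-byte magic, 2-byte version, 4-byte record count (little-endian, no padding)
HEADER = struct.Struct("<4sHI")
# each record = 4-byte unsigned id followed by an 8-byte float price
RECORD = struct.Struct("<Id")


def pack(records):
    """Serialize (id, price) pairs into a header followed by fixed-size records."""
    header = HEADER.pack(b"DEMO", 1, len(records))
    return header + b"".join(RECORD.pack(rec_id, price) for rec_id, price in records)


def unpack(blob):
    """Read the records back; only possible if you already know the layout."""
    magic, version, count = HEADER.unpack_from(blob, 0)
    return [
        RECORD.unpack_from(blob, HEADER.size + n * RECORD.size)
        for n in range(count)
    ]


blob = pack([(1, 9.5), (2, 12.0)])
records = unpack(blob)
```

Very compact, but without the two `Struct` definitions the blob is just opaque bytes.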
- open standard specification
- structured data
- the XML stack (XML, XSD, XSLT)
The real power of XML comes from the third bullet point: the XML stack, or more precisely XML plus two other technologies, the schema definition language XSD and the transformation language XSLT. Unless you are planning on using XML with either or both of them, you might be much better off using JSON: it is more efficient, less verbose, and serves the need for structured data alone nicely.
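As a rough illustration of the verbosity gap, the sketch below serializes the same tiny record both ways using only the Python standard library. The record and the `person` wrapper element are made up for the example, and a two-field record says little about real payloads, so treat the numbers as a hint rather than a benchmark.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample record.
record = {"name": "Ada", "city": "London"}

# XML: every field name appears twice, once in the opening and once in the closing tag.
root = ET.Element("person")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

# JSON: every field name appears once; compact separators drop optional spaces.
as_json = json.dumps(record, separators=(",", ":"))

print(len(as_xml), as_xml)
print(len(as_json), as_json)
```

Note that the JSON version has no equivalent of the `person` wrapper, which is part of why it comes out shorter.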
The XML Stack
The XSD schema language is where you describe a particular XML format for your application. The benefit is that you can validate any XML document against the XSD schema, usually with one line of code. That feature alone keeps classic SOAP Web Services common today, since any consumer of the service can validate its input data. Combine that with the possibility to transform XML from one format to another using XSLT, and you end up with quite a powerful stack.
The fact that XML is standardized, has been around for more than a decade and has gone through several revisions makes it mature and quite well adopted. There are stable libraries available for almost any platform you can think of.
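Here is a minimal sketch of both halves of the stack, validation and transformation. It assumes the third-party lxml library is installed (the Python standard library does not validate against XSD), and the `order` schema and the stylesheet are made up for the example.

```python
from lxml import etree  # third-party dependency, assumed installed (pip install lxml)

# Made-up schema: an <order> holds an <item> and a positive integer <quantity>.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string"/>
        <xs:element name="quantity" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Made-up stylesheet: map <order> onto a differently shaped <line> element.
XSLT = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/order">
    <line product="{item}" count="{quantity}"/>
  </xsl:template>
</xsl:stylesheet>"""

schema = etree.XMLSchema(etree.fromstring(XSD))
transform = etree.XSLT(etree.fromstring(XSLT))

good = etree.fromstring(b"<order><item>widget</item><quantity>3</quantity></order>")
bad = etree.fromstring(b"<order><item>widget</item><quantity>many</quantity></order>")

good_ok = schema.validate(good)  # the "one line of code"
bad_ok = schema.validate(bad)    # "many" is not a positive integer

line = transform(good).getroot()  # the same data in the target format
```

Once the schema and stylesheet objects exist, checking a document and reshaping it are one call each, which is the whole appeal.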
Good Intermediate Format
Once you have realised the power of the full XML stack, you can easily see that it helps you create CORRECT solutions to a common scenario: data shuffling and mapping, which often makes up the bulk of most IT solutions out there.
It is easy to derive a domain-specific dialect, which makes it extremely easy to create open specifications. The most famous example I can think of is XHTML, the XML reformulation of HTML. (HTML itself grew out of SGML, the older and larger standard that XML is a simplified subset of.)
Although the XML stack helps derive a correct solution, it is not necessarily an efficient one.
Optimizing XML
The most common drawbacks of using XML boil down to the following points:
- too verbose
- size overhead
- XSLTs are hard
- XSDs are limited
I will cover only the second bullet point, size overhead. The rest I consider more or less “it is what it is”: leave it or live with it. Still, the size overhead mostly has to do with how verbose XML is.
There are plenty of things one can do to keep an XML document at a reasonable size, from the least intrusive, switching on compression at the HTTP transport level, to the more intrusive, like creating a lookup table of tags. Let me elaborate a bit on the second option. The idea is to have the same lookup table on both ends of the communication, mapping verbose tag names to short hex codes and vice versa. Most cases will need a dictionary of fewer than 256 tags, so a one-byte code in the range 00 – FF will suffice.
Keep in mind that I have neither seen this second approach used to mitigate hefty XML sizes nor attempted it myself. I guess the first approach is easy and good enough.
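For the curious, both options can be sketched with the Python standard library. This is an untested-in-production illustration, not a recommendation: the tag names and the table are hypothetical, and the codes carry a `t` prefix because an XML element name cannot start with a digit, so a bare `00` would not be well-formed.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical dictionary shared by both ends of the communication.
TAG_TO_CODE = {"customerOrder": "t00", "shippingAddress": "t01", "postalCode": "t02"}
CODE_TO_TAG = {code: tag for tag, code in TAG_TO_CODE.items()}


def rename_tags(xml_text, table):
    """Rename every element found in the table; unknown tags pass through unchanged."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = table.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


verbose = (
    "<customerOrder><shippingAddress>"
    "<postalCode>1000</postalCode>"
    "</shippingAddress></customerOrder>"
)

# Option 2: lookup table of tags, applied by the sender and reversed by the receiver.
compact = rename_tags(verbose, TAG_TO_CODE)
restored = rename_tags(compact, CODE_TO_TAG)

# Option 1: generic compression, roughly what HTTP-level gzip does to the body.
gzipped = gzip.compress(verbose.encode("utf-8"))
```

The lookup table needs both sides to agree on the dictionary and keep it in sync, while compression needs no agreement beyond the transport setting. On a document this small gzip has little to work with; it tends to pay off on larger documents, where tag names repeat many times.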
Structured Binary Data
If you want a binary format but interoperability of the data is still a concern, there is the public BSON format, short for binary JSON. It is lightweight, traversable and efficient. http://bsonspec.org/
Another alternative is Google’s Protocol Buffers, http://code.google.com/p/protobuf/
Conclusions
To wrap it up, here is the moral of the story. The format of the data is as important as the data itself, if not more. I’ve seen over and over again developers make that decision quickly, on the fly. Often the wrong one!







