
A post about generating stream reading and writing libraries from specifications

This post is about generating a file reading/writing library from a specification, in this case an XML file that details the internal binary structure of the file.

Using this approach, we can define the format in an XML file and build code-generating tools that write libraries for reading and writing that format in a given language, such as C++ or Python. These tools also create the model classes, in the respective programming language, that the generated library populates when reading and serializes when writing.

The advantage is that when the format changes, we just run the tool again with the updated XML file. The code for reading the format is regenerated, and all we need to do is recompile it. It also allows people who are not familiar with programming to edit the format without writing code.

I am going to use an open source project called niflib, which reads and writes 3D models stored in files for games like The Elder Scrolls V: Skyrim. You can find it on GitHub: https://github.com/Alecu100/niflib.

The format stores all the information in separate blocks inside a file, with links between them. Each block has a type and an index inside the file. The index is used to link one block to another: a block can have a field containing a number, which is the index of another block that the current block uses. A block can also contain normal fields, which represent various properties stored in the file, and a field can even be an array. The binary layout of the fields is defined in the XML file.
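To make the idea concrete, here is a minimal Python sketch that interprets such a layout directly instead of generating code from it. The XML element and attribute names, the `Node` block, and the little-endian encoding are all assumptions made up for this illustration; they are not taken from the real niflib specification.

```python
import io
import struct
import xml.etree.ElementTree as ET

# Hypothetical spec: element and attribute names are invented for this sketch.
SPEC = """
<format>
  <block name="Node">
    <field name="flags" type="uint16"/>
    <field name="scale" type="float"/>
    <field name="num_children" type="uint32"/>
    <field name="children" type="ref" count="num_children"/>
  </block>
</format>
"""

# struct codes for little-endian primitives (assumed encoding).
# A link ("ref") is stored as the signed index of the target block.
PRIMITIVES = {"uint16": "<H", "uint32": "<I", "float": "<f", "ref": "<i"}


def load_layouts(spec_xml):
    """Return {block name: [(field name, type, count field or None), ...]}."""
    layouts = {}
    for block in ET.fromstring(spec_xml).iter("block"):
        layouts[block.get("name")] = [
            (f.get("name"), f.get("type"), f.get("count"))
            for f in block.iter("field")
        ]
    return layouts


def read_value(stream, type_name):
    """Read one primitive value of the given type from a binary stream."""
    fmt = PRIMITIVES[type_name]
    (value,) = struct.unpack(fmt, stream.read(struct.calcsize(fmt)))
    return value


def read_block(stream, layout):
    """Read fields in declaration order; array lengths come from earlier fields."""
    values = {}
    for name, type_name, count_field in layout:
        if count_field is None:
            values[name] = read_value(stream, type_name)
        else:
            length = values[count_field]
            values[name] = [read_value(stream, type_name) for _ in range(length)]
    return values


layouts = load_layouts(SPEC)
# Build a fake binary block: flags=3, scale=1.5, two child links (indices 4 and 7).
data = struct.pack("<HfIii", 3, 1.5, 2, 4, 7)
node = read_block(io.BytesIO(data), layouts["Node"])
print(node)
```

A code generator works from the same field list, but instead of looping over it at run time it emits one read statement per field into the target language.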

Some fields of a block can be grouped into a compound type. In this case the compound appears as a single field of the block. A compound cannot be referenced by other blocks; it is always contained in a block.

A definition of a block with its fields looks like this in the XML file:


What is special about the format is that counted references between blocks are never circular. There are actually two types of links between blocks: references, which are counted for garbage collection, and direct links, which are not. The link types are defined in the format so that counted references cannot form a cycle. Only direct links can be circular, since garbage collection ignores them.

Inside the code we handle these blocks through references. When a reference goes out of scope, it is destroyed and the reference counter of the corresponding block is decremented. When the counter reaches 0, the block can be safely deleted.

Initially, when we read from a file, we get a single reference to the root block. This root block contains downward references to its child blocks, so the references form a tree-like structure. We hold only one reference, to the root node, and through it we reach the rest of the blocks. If we stop referencing the root node, it is deleted, since its reference count drops to 0. When the root node is deleted, its references to the child nodes are deleted too. Since the child nodes are only referenced through the root node, their reference count goes from 1 to 0, so they are deleted as well.
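The cascade described above can be sketched in a few lines of Python with an explicit counter. This is only an illustration of the idea, not the generated C++ base type: the `Block` class, its method names, and the `deleted` log are assumptions made for the example.

```python
class Block:
    """A block whose lifetime is tracked with an explicit reference counter."""

    def __init__(self, name, deleted_log):
        self.name = name
        self.ref_count = 0
        self.children = []   # counted references, pointing downward
        self.parent = None   # direct link, pointing upward, never counted
        self._deleted_log = deleted_log

    def add_ref(self):
        self.ref_count += 1

    def release(self):
        self.ref_count -= 1
        if self.ref_count == 0:
            self._delete()

    def add_child(self, child):
        child.add_ref()       # the parent holds a counted reference
        child.parent = self   # the back link does not touch the counter
        self.children.append(child)

    def _delete(self):
        # Record the deletion, then drop the counted references we hold.
        self._deleted_log.append(self.name)
        for child in self.children:
            child.parent = None
            child.release()
        self.children.clear()


deleted = []
root = Block("root", deleted)
mesh = Block("mesh", deleted)
material = Block("material", deleted)
root.add_child(mesh)
mesh.add_child(material)

root.add_ref()   # the single reference handed out after reading the file
root.release()   # dropping it cascades through the whole tree
print(deleted)   # ['root', 'mesh', 'material']
```

Because the upward `parent` link is never counted, the parent/child cycle does not keep any block alive, which is the role direct links play in the format.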

The tool which generates the code is written in Python. It has two main scripts: one that reads the XML specification of the format, and a helper script that represents a code file, with functions for writing specific elements into that file, such as methods or fields of classes.
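A helper of that kind might look roughly like the sketch below. The `CodeFile` class, its method names, the `CPP_TYPES` table, and the `BaseBlock` base class name are all hypothetical, chosen only to show the shape of such a tool; the real scripts in the project are organized differently.

```python
class CodeFile:
    """Accumulates lines of generated C++ source and tracks indentation."""

    def __init__(self):
        self.lines = []
        self.indent = 0

    def write(self, text=""):
        # Empty lines are kept empty instead of being padded with spaces.
        self.lines.append("    " * self.indent + text if text else "")

    def begin_class(self, name, base=None):
        header = f"class {name} : public {base} {{" if base else f"class {name} {{"
        self.write(header)
        self.write("public:")
        self.indent += 1

    def end_class(self):
        self.indent -= 1
        self.write("};")

    def field(self, cpp_type, name):
        self.write(f"{cpp_type} {name};")

    def method_decl(self, return_type, name, params=""):
        self.write(f"{return_type} {name}({params});")

    def render(self):
        return "\n".join(self.lines) + "\n"


# Assumed mapping from spec type names to C++ types.
CPP_TYPES = {"uint16": "unsigned short", "uint32": "unsigned int", "float": "float"}


def generate_block_header(block_name, fields, base="BaseBlock"):
    """Emit a C++ class declaration for one block described as (name, type) pairs."""
    out = CodeFile()
    out.begin_class(block_name, base)
    out.method_decl("void", "Read", "std::istream& in")
    out.method_decl("void", "Write", "std::ostream& out")
    for name, type_name in fields:
        out.field(CPP_TYPES[type_name], name)
    out.end_class()
    return out.render()


header = generate_block_header("Node", [("flags", "uint16"), ("scale", "float")])
print(header)
```

Keeping the formatting logic in one helper means the script that walks the specification only has to say what to emit, not how to lay it out.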

This is just a small part of a Python method that generates the methods for reading, writing, etc. of a block:

Below you can see the generated code for the base type used for all blocks, which provides reference counting:


And below is the partial definition of a block. You can see some special comments, "//--BEGIN MISC CUSTOM CODE--//" and "//--END MISC CUSTOM CODE--//", that delimit custom user-written code which won't be modified when the format changes:
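How the generator preserves those sections is not shown in this post. One way to do it, sketched here under the assumption that each file has a single pair of these markers, is to copy the text between them from the previous version of the file into the freshly generated one. The function name `merge_custom_code` and the sample sources are hypothetical.

```python
import re

BEGIN = "//--BEGIN MISC CUSTOM CODE--//"
END = "//--END MISC CUSTOM CODE--//"

# Non-greedy match of everything between the two markers, across lines.
_SECTION = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)


def merge_custom_code(old_source, new_source):
    """Copy the text between the markers from old_source into new_source."""
    match = _SECTION.search(old_source)
    if match is None:
        return new_source  # nothing to preserve
    custom = match.group(1)
    # A function replacement keeps re.sub from interpreting backslashes
    # that may appear in the user's code.
    return _SECTION.sub(lambda _m: BEGIN + custom + END, new_source, count=1)


old_file = (
    "class Node {\n"
    "//--BEGIN MISC CUSTOM CODE--//\n"
    "int cachedValue;\n"
    "//--END MISC CUSTOM CODE--//\n"
    "float scale;\n"
    "};\n"
)
regenerated = (
    "class Node {\n"
    "//--BEGIN MISC CUSTOM CODE--//\n"
    "//--END MISC CUSTOM CODE--//\n"
    "float scale;\n"
    "unsigned int flags;\n"
    "};\n"
)
merged = merge_custom_code(old_file, regenerated)
print(merged)
```

The merged file keeps the hand-written `cachedValue` line while picking up the new `flags` field from the regenerated code.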

The example I presented reads from and writes to a file, but you can extend the approach to read from a network source instead. This would work really well for big, complicated distributed systems with many components that communicate with each other but are written in different programming languages.
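If the reading code only depends on a generic binary stream, the same function can serve both cases. The sketch below assumes a made-up `read_vector3` reader for three little-endian floats and uses a local socket pair to stand in for a real network connection.

```python
import io
import socket
import struct


def read_exact(stream, size):
    """Read exactly size bytes or fail, for any object with a read() method."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError(f"expected {size} bytes, got {len(data)}")
    return data


def read_vector3(stream):
    """Hypothetical reader for a compound of three little-endian floats."""
    return struct.unpack("<3f", read_exact(stream, 12))


payload = struct.pack("<3f", 1.0, 2.0, 0.5)

# Source 1: bytes that came from a file (here an in-memory stand-in).
from_file = read_vector3(io.BytesIO(payload))

# Source 2: the same bytes arriving over a socket.
left, right = socket.socketpair()
with left, right:
    left.sendall(payload)
    with right.makefile("rb") as net_stream:
        from_network = read_vector3(net_stream)

print(from_file, from_network)
```

The reader never learns where the bytes came from, so the generated library would not need a second code path for the network case.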

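Google's "Protocol Buffers" follows a similar idea: message formats are described in .proto files, and a compiler generates the reading and writing code for several programming languages. It is commonly used for communication over a network.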

That's about it. Sorry if the code examples are really long, but I could not find anything shorter and I wanted to show real-world examples.
