This post is about generating a file reading/writing library from a specification, in this case an XML file that describes the internal binary structure of the file.
Using this approach, we define the format in an XML file and build code-generation tools that write libraries for reading and writing that format in a given language such as C++ or Python. The tools also generate model classes in the target language, which the generated library populates when it reads a file.
The advantage is that when the format changes, we just run the tool again against the updated XML file. The code for reading the format is regenerated and all we need to do is recompile it. It also lets people who are not familiar with programming edit the format without writing code.
I am going to use an open source project called niflib, which reads and writes 3D model files for games like Elder Scrolls Skyrim. You can find it on GitHub: https://github.com/Alecu100/niflib.
The format stores all the information in separate blocks inside a file, with links between them. Each block has a type and an index inside the file. The index is used to link one block to another: a block can have a field containing a number, and that number is the index of another block used by the current block. A block can also contain normal fields, which represent various properties stored in the file. A field can even be an array. The binary layout of the fields is defined in the XML file.
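To make the index-based linking concrete, here is a minimal sketch in Python. It is not niflib's code: the `Block` class, its field names, the `resolve_links` helper and the block type names are all assumptions chosen for illustration. The idea is that links are first read as plain numbers and only turned into object references once every block has been loaded.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    # Hypothetical in-memory model of one block; not niflib's actual classes.
    type_name: str
    index: int                                         # position of the block inside the file
    properties: dict = field(default_factory=dict)     # normal fields
    link_indices: list = field(default_factory=list)   # raw indices as stored in the file
    links: list = field(default_factory=list)          # filled in by resolve_links()


def resolve_links(blocks):
    """Replace the stored indices with the block objects they point to."""
    by_index = {block.index: block for block in blocks}
    for block in blocks:
        block.links = [by_index[i] for i in block.link_indices]
    return blocks


# Illustrative data only: three blocks, where block 0 uses blocks 1 and 2.
blocks = resolve_links([
    Block("NiNode", 0, {"name": "Scene Root"}, [1, 2]),
    Block("NiTriShape", 1, {"name": "Body"}, [2]),
    Block("NiMaterialProperty", 2, {"glossiness": 10.0}),
])

for block in blocks:
    print(block.type_name, "->", [linked.type_name for linked in block.links])
```

Resolving in a second pass means a block can point at another block that appears later in the file.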
Some fields of a block can be grouped into a compound type. In this case the compound type appears as a single field in the block. A compound type cannot be referenced by other blocks; it is always contained in a block.
A definition of a block with its fields looks like this in the XML file:
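As an illustration only, the snippet below embeds a small, made-up definition and loads it with Python's standard `xml.etree.ElementTree`. The element names (`format`, `compound`, `block`, `field`) and the attributes (`type`, `target`, `array`) are assumptions for this sketch and are not copied from the project's XML file. `Ref` stands for a counted reference and `Ptr` for a direct link.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification; the real file uses its own element and attribute names.
SPEC = """
<format>
  <compound name="Vector3">
    <field name="x" type="float"/>
    <field name="y" type="float"/>
    <field name="z" type="float"/>
  </compound>
  <block name="ExampleNode">
    <field name="Name" type="string"/>
    <field name="Translation" type="Vector3"/>
    <field name="Num Children" type="uint"/>
    <field name="Children" type="Ref" target="ExampleNode" array="Num Children"/>
    <field name="Parent" type="Ptr" target="ExampleNode"/>
  </block>
</format>
"""


def load_blocks(xml_text):
    """Return {block name: [field attribute dicts]} for every <block> element."""
    root = ET.fromstring(xml_text.strip())
    return {
        block.get("name"): [dict(f.attrib) for f in block.findall("field")]
        for block in root.findall("block")
    }


blocks = load_blocks(SPEC)
for f in blocks["ExampleNode"]:
    print(f["name"], "->", f["type"])
```

A generator can walk this dictionary to decide which members, read calls and write calls to emit for each block.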
What is special about the format is that there are no circular references between blocks. There are actually two types of links between blocks: references, which are counted for garbage collection, and direct links, which are not. The link types are defined in the format so that counted references can never form a cycle. Only direct links can be circular, since they are not used by garbage collection.
Inside the code we handle these blocks through references. When a reference goes out of scope, it is destroyed and the reference counter of the corresponding block is decremented. When the counter reaches 0, the block can be safely deleted.
Initially, when we read from a file, we get a single reference to the root block. This root block contains downward references to child blocks, so the references form a tree-like structure. We hold only one reference, to the root node, and through it we reach the rest of the blocks. If we stop referencing the root node, it is deleted because its reference count drops to 0. When the root node is deleted, its references to child nodes are deleted as well. Since the child nodes are referenced only through the root node, their reference counts go from 1 to 0, so they are deleted too.
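The cascade described above can be sketched in a few lines of Python. The counters are modelled by hand here, because Python has its own garbage collector; the `Block` class, `add_child` and `release` are hypothetical names and not part of niflib. Note that the upward `parent` link is a direct link and never touches the counter, which is why it cannot keep a block alive.

```python
class Block:
    """Hypothetical block with a manual reference counter."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 0
        self.refs = []       # counted references, pointing down to children
        self.parent = None   # direct link pointing up, never counted

    def add_child(self, child):
        child.ref_count += 1
        child.parent = self
        self.refs.append(child)


deleted = []  # records the order in which blocks are "deleted"


def release(block):
    """Drop one counted reference; delete the block when the count reaches 0."""
    block.ref_count -= 1
    if block.ref_count == 0:
        deleted.append(block.name)
        for child in block.refs:
            release(child)
        block.refs.clear()


root = Block("root")
root.ref_count = 1           # the single reference handed back by the reader
mesh = Block("mesh")
material = Block("material")
root.add_child(mesh)
mesh.add_child(material)

release(root)                # we stop referencing the root
print(deleted)               # ['root', 'mesh', 'material']
```

If the upward `parent` link were counted as well, `root` and `mesh` would keep each other alive and nothing would ever be released.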
The tool that generates the code is written in Python. It consists of two main scripts: one that reads the XML specification of the format, and a helper that represents a code file and provides functions for writing specific elements into that file, such as methods or fields of classes.
This is just a small part of a Python method that generates the methods for reading, writing, etc. of a block:
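As a much smaller, hypothetical sketch of the same idea (not the project's generator), the function below turns a list of field descriptions into the text of a C++ `Read` method. The `CPP_TYPES` table, the function name and the shape of the emitted C++ are all assumptions made for this example.

```python
# Hypothetical mapping from format type names to C++ types.
CPP_TYPES = {"uint": "unsigned int", "float": "float"}


def generate_read_method(block_name, fields):
    """Return C++ source text that reads each field in declaration order."""
    lines = [f"void {block_name}::Read(std::istream& in) {{"]
    for name, type_name in fields:
        cpp_type = CPP_TYPES[type_name]
        lines.append(
            f"    in.read(reinterpret_cast<char*>(&{name}), sizeof({cpp_type}));"
        )
    lines.append("}")
    return "\n".join(lines)


code = generate_read_method(
    "ExampleBlock", [("numChildren", "uint"), ("scale", "float")]
)
print(code)
```

A matching `generate_write_method` would follow the same pattern with `out.write`, which is why keeping the field list in one place (the XML file) keeps the reader and the writer in sync.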
Below you can see the generated code for the base type used by all blocks, which provides reference counting:
And below is the partial definition of a block. You can see the special comments "//--BEGIN MISC CUSTOM CODE--//" and "//--END MISC CUSTOM CODE--//", which delimit hand-written user code that is not modified when the format changes:
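One way a generator could preserve such a section when it regenerates a file is sketched below. This is an assumption about the mechanism, not the project's actual code: the function name is hypothetical, and only a single pair of the two markers quoted above is handled.

```python
import re

BEGIN = "//--BEGIN MISC CUSTOM CODE--//"
END = "//--END MISC CUSTOM CODE--//"
# Non-greedy match of everything between the two markers, across lines.
SECTION = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)


def carry_over_custom_code(old_source, new_source):
    """Copy the custom section of the old file into the regenerated one."""
    match = SECTION.search(old_source)
    if match is None:
        return new_source  # nothing to preserve
    custom = match.group(1)
    # A function is used as the replacement so backslashes in the
    # custom code are not interpreted as regex escapes.
    return SECTION.sub(lambda _m: BEGIN + custom + END, new_source, count=1)


old = "class A {\n" + BEGIN + "\n  int extra;\n" + END + "\n};\n"
new = "class A {\n  int added;\n" + BEGIN + "\n" + END + "\n};\n"
merged = carry_over_custom_code(old, new)
print(merged)
```

The regenerated members (`int added;`) and the hand-written ones (`int extra;`) both end up in the merged file.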
The example I presented reads from and writes to a file, but you can also extend the approach to read from a network source instead. This would work really well for big, complicated distributed systems with many components that communicate with each other but are written in different programming languages.
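A small sketch of why this extension is cheap: if the reading code only depends on a generic binary stream, the same function works for a file, an in-memory buffer or a socket. The header layout below (two little-endian 32-bit integers named `type_id` and `index`) is invented for the example and is not the layout used by the format discussed in this post.

```python
import io
import struct


def read_uint(stream):
    """Read one little-endian unsigned 32-bit integer from any binary stream."""
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("stream ended in the middle of a field")
    return struct.unpack("<I", data)[0]


def read_block_header(stream):
    # Hypothetical header: a type id followed by the block index.
    return {"type_id": read_uint(stream), "index": read_uint(stream)}


# io.BytesIO stands in for a file or a network connection here.
payload = struct.pack("<II", 7, 3)
header = read_block_header(io.BytesIO(payload))
print(header)  # {'type_id': 7, 'index': 3}
```

With a real connection, the file-like object returned by `socket.makefile("rb")` could be passed to `read_block_header` in place of the `BytesIO` buffer.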
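Google's "Protocol Buffers" works along similar lines: message formats are described in .proto files, and a compiler generates the reading and writing code for several programming languages, which is then commonly used for communication over a network.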
That's about it. Sorry if the code examples are really long, but I could not find anything shorter and I wanted to show real-world examples.