Skip to main content

Part 3 of writting an experimental code analyzer and sort of interpreter for C#

This is the final part of the series about the C# code analyzer and interpreter that I created a while back. The actual execution and interpretation of the C# code are explained in this post. It is the last post in this series.

To begin with, in very simple terms each word in the C# code can have an associated action that needs to be executed when the code is read by the analyzer in the proper order. For example, a word in the code might be a variable name. In this case, the reference corresponding to that variable is located and stored in a temporary buffer once the word is read. After that, it can be used from the buffer for another operation like invoking a method on the object/objects that the reference references. In case it has to call a method, it needs to locate the body of the method and jump to the method implementation and start reading it. With constructors, it is pretty much the same thing with the only difference being the fact that it always returns a single object and that it always has the same of the object's type. So all in all, the interpreter just reads the code line by line in a specific order and for each word, it executes an action.

Still, it's not enough to execute actions. The interpreter needs to store the program state somewhere which contains the results of these actions. This is where the heap and stack come into play. In general the stack stores only the local results corresponding to the current execution context and the heap stores the global results available in all contexts.

Now the stack is just an order list of stack frames. Each stack frame corresponds to a method call which can be normal methods, constructor or property getter/setter. The actual code for the stack frame can be found bellow:



You can see some important properties in the stack frame. Firstly there is the "ThisReference" reference which stores the object that the current method is being called on. Secondly, there is another property "LocalReferences" which stores all the references to the current local variables. These are scanned when an indentifier is found in the code to see if there is a local variable which has the same name as the identifier. An identifier can be anything such as a variable name, a method name, property name and so on.

Additionally, there is a special property called "MemberAccessReference". This is used as a buffer to store the results of actions such as fetching a local variable when an identifier with the same name is found in the code. Or storing the result of a method call. This special property is also scanned when another identifier is found in the code. An identifier can be anything which has a name, a variable name, a method name and so on.

So there are actually 3 places where the interpreter looks for possible variables, methods or anything else when it finds an identifier in the code:

- "MemberAccessReference": This is the first place that it is always scanned and more like a buffer. Thus it has the highest priority. It looks inside the objects stored here and checks to see if they have any fields, properties or methods with the same name as the identifier it found in the code. It is also used to store the result after getting an object using an identifier in the code. It can actually store anything like a method or property references. It is also used to store the results of access operations or method invocation which are objects themselves. So the values and objects stored here are always overwritten by various operations. This allows to chain operations together, like method chaining for example. Or another example would be accessing a field from an object stored in another field of another object.

- "LocalReferences": As mentioned earlier this contains all the local variables, including their name. The interpreter looks here after looking in the buffer mentioned above if it doesn't find anything with the same name as the identifier.

- "ThisReference": This is the last place the interpreter looks if it doesn't find anything in the places mentioned above. It looks for fields, methods or anything else which has the same name as the identifier found in the code.

Now when the interpreter runs into a method call, it will create a new stack frame and push it on the stack. Actually this is how the CLR works and in general any stack-based language like C, C#, Java and so on. Bellow is the code for the stack:




The "_staticWorkflowEvaluatorExecutionFrames" is pretty much the most important field. It stores all the stack frames. When a method is finished, the interpreter reads and executes the last line in the method's body, it's corresponding stack frame will be removed from this list. The latest stack frame is always on top of the stack so it will remove the stack frame from the top.

Still, it's not enough just to add or remove stack frames from the stack. They somehow must communicate among themselves. For example, when calling a method, the arguments to the method call must be passed to its stack frame. Or for example, when a method finishes executing, the return values must be passed to the previous stack frame in the stack so that the returned results can be further used in the parent context. This is where the properties "PassedMethodParameters" and "ReturningMethodParameters" from the stack frame class come in handy allowing stack frame to communicated between them.

Moving forward, for the interpreter to actually read the code and execute various actions, it needs to understand it and see what it actually means. For example, when it reads the "class" keyword it must know that a class declaration will follow. In order to do this, it relies on the C# compiler with its toolset called Roslyn to analyze the actual structure of the code. They are open source and can be downloaded from the internet if you search for "Microsoft Compiler Tools".

For each code file, a data structure is made by the C# compiler which contains information about the structure of the code. The data structure is actually a syntax tree which is a concept used a lot in compilers. It contains a series of interconnected nodes which actually are part of the syntax tree. For each structure inside the code, a corresponding node is added in the syntax tree. For example, if the code file contains a class declaration then in the syntax tree it will add a node of the type "ClassDeclarationSyntax" which is just a class/type declared in the C# compiler library/package. If it finds a method, then it creates a node of type "MethodDeclarationSyntax" in the tree. Absolutely every coding structure in the code file with have a corresponding node in the syntax tree.

These nodes are actually organized and there is a strict relationship between them. For example, a node of type "ClassDeclarationSyntax" has a list of nodes of type "MethodDeclarationSyntax" which represent all the methods declared inside the class. Or for example, the node which corresponds to the body of a method is of type "BlockSyntax" has a list of individual nodes of type "StatementSyntax" which are the actual statements contained in that block of code. When it executes that method, it just goes in order through all the statements in that list and executes an action for each of them.

If you need further explanations about syntax trees and nodes here is a good tutorial: https://github.com/dotnet/roslyn/wiki/Getting-Started-C%23-Syntax-Analysis

But still, it can't interpret and execute the code in just one read. First, it reads all the code in all the files and builds a table and detailed type information for all the classes, methods, properties and all other things. The actual class is called "EvaluatedTypesInfoTable" and it has a bit of lines of code so I won't be including it here. You can find it here: https://github.com/Alecu100/CodeAnalyzer/blob/master/CodeEvaluator.Evaluation/Members/EvaluatedTypesInfoTable.cs



After it finishes creating the detailed interpreted structure of the code, it then can begin to read and interpret the code on the second run. For each syntax node type that I mentioned earlier, a corresponding action exists that is executed when it reads the node. For example, if it reads a method call node, "InvocationExpressionSyntax", it identifies all the overloaded methods. Then it resolves the arguments of the call and based on the arguments, it selects a corresponding method from the list of overloaded methods. Then it creates a new stack frame for that methods, passes the arguments to the stack frame and the method. And finally, after that, it starts to read the body of the method line by line and execute further actions. The actual code and class which is responsible for calling methods is quite long and can be found below:



All the other actions which are executed for various syntax nodes can be found at the following address: https://github.com/Alecu100/CodeAnalyzer/tree/master/CodeEvaluator.Evaluation/Evaluators

These actions that are executed for each node can trigger other "child" actions to be executed for the child nodes. For example, an if statement has a couple of child nodes that need to have actions executed for them. These child nodes are the actual condition in the if statement, the main block that is executed if the condition is true and the optional block which corresponds to the "else" branch.

That's about it. In the end, everything boils down to creating and understanding the structure of the code and then moving step by step in the created structure, executing various actions at each step.

Comments

Popular posts from this blog

Some software development common sense ideas

 I haven't really written here in a long time so it's time to write some new things. These are just some random common sense things that seemed to short to write about individually but seem really interesting and useful to me or other people 1. There is nothing fixed in software development, all things vary by circumstances and time Remember the documentation that didn't seem that important when you started the project, well after a couple of years one the application has grown and become really complicated, no one actually knows everything about the application anymore. So now you really need that documentation. What happens if you suddenly need much more people to develop the application because of some explosive growth? Without documentation, new developers will just look at the application like they look at a painting. This actually happened to me. Maybe in the beginning of a project, a technology really helped you a lot but as the project grew, it started making things...

Some things which are often blindly applied and followed by developers which are not always good

This is probably one of the most controversial things that I have written so far but I am just tired of hearing about these things and discussing these things. Other developers that I know share part of my feelings. I would rather hear more about how people built things, overcame challenges or what new interesting ideas and concepts they implemented. Those things are really interesting and innovative, not hearing about the same theoretical things over and over again. I can just read and learn those things from 100 sources on the internet. Firstly, one of the most discussed and promoted things is agile/scrum development. I think I have been through 5-8 workshops about agile development methodology. And each time, some things differed. There is no 100% standard approach to this. Everyone uses their own version of this development methodology and seem to argue a lot that their approach is right and everyone else is doing it wrong. You go to an interview, this will be one of the first 10 t...

Protected variations in software engineering explained and extended beyond the common usages

While digging through some standard programming principles like low coupling and high cohesion I stumbled upon the fact that they are part of a larger series of principles called "GRASP" principles. After reading a bit about them, they seem just as important if not more important than the "SOLID" principles And one particular principle from that series stuck with me: protected variations. According to this principle, variations and changes in parts of the application should be contained only in them and not trigger further changes in the application. In general terms points of inflection should be established between the parts that change and the rest of the application which act like a boundary and stop additional changes from propagating to the rest of the application. For example, one of the most common parts that might change in an application, is the data access and storage methods. For example instead of using direct sql to read and write to a database, an...