This is the final part of the series about the C# code analyzer and interpreter that I created a while back. The actual execution and interpretation of the C# code are explained in this post. It is the last post in this series.
To begin with, in very simple terms each word in the C# code can have an associated action that needs to be executed when the code is read by the analyzer in the proper order. For example, a word in the code might be a variable name. In this case, the reference corresponding to that variable is located and stored in a temporary buffer once the word is read. After that, it can be used from the buffer for another operation like invoking a method on the object/objects that the reference references. In case it has to call a method, it needs to locate the body of the method and jump to the method implementation and start reading it. With constructors, it is pretty much the same thing with the only difference being the fact that it always returns a single object and that it always has the same of the object's type. So all in all, the interpreter just reads the code line by line in a specific order and for each word, it executes an action.
Still, it's not enough to execute actions. The interpreter needs to store the program state somewhere which contains the results of these actions. This is where the heap and stack come into play. In general the stack stores only the local results corresponding to the current execution context and the heap stores the global results available in all contexts.
Now the stack is just an order list of stack frames. Each stack frame corresponds to a method call which can be normal methods, constructor or property getter/setter. The actual code for the stack frame can be found bellow:
You can see some important properties in the stack frame. Firstly there is the "ThisReference" reference which stores the object that the current method is being called on. Secondly, there is another property "LocalReferences" which stores all the references to the current local variables. These are scanned when an indentifier is found in the code to see if there is a local variable which has the same name as the identifier. An identifier can be anything such as a variable name, a method name, property name and so on.
Additionally, there is a special property called "MemberAccessReference". This is used as a buffer to store the results of actions such as fetching a local variable when an identifier with the same name is found in the code. Or storing the result of a method call. This special property is also scanned when another identifier is found in the code. An identifier can be anything which has a name, a variable name, a method name and so on.
So there are actually 3 places where the interpreter looks for possible variables, methods or anything else when it finds an identifier in the code:
- "MemberAccessReference": This is the first place that it is always scanned and more like a buffer. Thus it has the highest priority. It looks inside the objects stored here and checks to see if they have any fields, properties or methods with the same name as the identifier it found in the code. It is also used to store the result after getting an object using an identifier in the code. It can actually store anything like a method or property references. It is also used to store the results of access operations or method invocation which are objects themselves. So the values and objects stored here are always overwritten by various operations. This allows to chain operations together, like method chaining for example. Or another example would be accessing a field from an object stored in another field of another object.
- "LocalReferences": As mentioned earlier this contains all the local variables, including their name. The interpreter looks here after looking in the buffer mentioned above if it doesn't find anything with the same name as the identifier.
- "ThisReference": This is the last place the interpreter looks if it doesn't find anything in the places mentioned above. It looks for fields, methods or anything else which has the same name as the identifier found in the code.
Now when the interpreter runs into a method call, it will create a new stack frame and push it on the stack. Actually this is how the CLR works and in general any stack-based language like C, C#, Java and so on. Bellow is the code for the stack:
The "_staticWorkflowEvaluatorExecutionFrames" is pretty much the most important field. It stores all the stack frames. When a method is finished, the interpreter reads and executes the last line in the method's body, it's corresponding stack frame will be removed from this list. The latest stack frame is always on top of the stack so it will remove the stack frame from the top.
Still, it's not enough just to add or remove stack frames from the stack. They somehow must communicate among themselves. For example, when calling a method, the arguments to the method call must be passed to its stack frame. Or for example, when a method finishes executing, the return values must be passed to the previous stack frame in the stack so that the returned results can be further used in the parent context. This is where the properties "PassedMethodParameters" and "ReturningMethodParameters" from the stack frame class come in handy allowing stack frame to communicated between them.
Moving forward, for the interpreter to actually read the code and execute various actions, it needs to understand it and see what it actually means. For example, when it reads the "class" keyword it must know that a class declaration will follow. In order to do this, it relies on the C# compiler with its toolset called Roslyn to analyze the actual structure of the code. They are open source and can be downloaded from the internet if you search for "Microsoft Compiler Tools".
For each code file, a data structure is made by the C# compiler which contains information about the structure of the code. The data structure is actually a syntax tree which is a concept used a lot in compilers. It contains a series of interconnected nodes which actually are part of the syntax tree. For each structure inside the code, a corresponding node is added in the syntax tree. For example, if the code file contains a class declaration then in the syntax tree it will add a node of the type "ClassDeclarationSyntax" which is just a class/type declared in the C# compiler library/package. If it finds a method, then it creates a node of type "MethodDeclarationSyntax" in the tree. Absolutely every coding structure in the code file with have a corresponding node in the syntax tree.
These nodes are actually organized and there is a strict relationship between them. For example, a node of type "ClassDeclarationSyntax" has a list of nodes of type "MethodDeclarationSyntax" which represent all the methods declared inside the class. Or for example, the node which corresponds to the body of a method is of type "BlockSyntax" has a list of individual nodes of type "StatementSyntax" which are the actual statements contained in that block of code. When it executes that method, it just goes in order through all the statements in that list and executes an action for each of them.
If you need further explanations about syntax trees and nodes here is a good tutorial: https://github.com/dotnet/roslyn/wiki/Getting-Started-C%23-Syntax-Analysis
But still, it can't interpret and execute the code in just one read. First, it reads all the code in all the files and builds a table and detailed type information for all the classes, methods, properties and all other things. The actual class is called "EvaluatedTypesInfoTable" and it has a bit of lines of code so I won't be including it here. You can find it here: https://github.com/Alecu100/CodeAnalyzer/blob/master/CodeEvaluator.Evaluation/Members/EvaluatedTypesInfoTable.cs
After it finishes creating the detailed interpreted structure of the code, it then can begin to read and interpret the code on the second run. For each syntax node type that I mentioned earlier, a corresponding action exists that is executed when it reads the node. For example, if it reads a method call node, "InvocationExpressionSyntax", it identifies all the overloaded methods. Then it resolves the arguments of the call and based on the arguments, it selects a corresponding method from the list of overloaded methods. Then it creates a new stack frame for that methods, passes the arguments to the stack frame and the method. And finally, after that, it starts to read the body of the method line by line and execute further actions. The actual code and class which is responsible for calling methods is quite long and can be found below:
All the other actions which are executed for various syntax nodes can be found at the following address: https://github.com/Alecu100/CodeAnalyzer/tree/master/CodeEvaluator.Evaluation/Evaluators
These actions that are executed for each node can trigger other "child" actions to be executed for the child nodes. For example, an if statement has a couple of child nodes that need to have actions executed for them. These child nodes are the actual condition in the if statement, the main block that is executed if the condition is true and the optional block which corresponds to the "else" branch.
That's about it. In the end, everything boils down to creating and understanding the structure of the code and then moving step by step in the created structure, executing various actions at each step.
To begin with, in very simple terms each word in the C# code can have an associated action that needs to be executed when the code is read by the analyzer in the proper order. For example, a word in the code might be a variable name. In this case, the reference corresponding to that variable is located and stored in a temporary buffer once the word is read. After that, it can be used from the buffer for another operation like invoking a method on the object/objects that the reference references. In case it has to call a method, it needs to locate the body of the method and jump to the method implementation and start reading it. With constructors, it is pretty much the same thing with the only difference being the fact that it always returns a single object and that it always has the same of the object's type. So all in all, the interpreter just reads the code line by line in a specific order and for each word, it executes an action.
Still, it's not enough to execute actions. The interpreter needs to store the program state somewhere which contains the results of these actions. This is where the heap and stack come into play. In general the stack stores only the local results corresponding to the current execution context and the heap stores the global results available in all contexts.
Now the stack is just an order list of stack frames. Each stack frame corresponds to a method call which can be normal methods, constructor or property getter/setter. The actual code for the stack frame can be found bellow:
You can see some important properties in the stack frame. Firstly there is the "ThisReference" reference which stores the object that the current method is being called on. Secondly, there is another property "LocalReferences" which stores all the references to the current local variables. These are scanned when an indentifier is found in the code to see if there is a local variable which has the same name as the identifier. An identifier can be anything such as a variable name, a method name, property name and so on.
Additionally, there is a special property called "MemberAccessReference". This is used as a buffer to store the results of actions such as fetching a local variable when an identifier with the same name is found in the code. Or storing the result of a method call. This special property is also scanned when another identifier is found in the code. An identifier can be anything which has a name, a variable name, a method name and so on.
So there are actually 3 places where the interpreter looks for possible variables, methods or anything else when it finds an identifier in the code:
- "MemberAccessReference": This is the first place that it is always scanned and more like a buffer. Thus it has the highest priority. It looks inside the objects stored here and checks to see if they have any fields, properties or methods with the same name as the identifier it found in the code. It is also used to store the result after getting an object using an identifier in the code. It can actually store anything like a method or property references. It is also used to store the results of access operations or method invocation which are objects themselves. So the values and objects stored here are always overwritten by various operations. This allows to chain operations together, like method chaining for example. Or another example would be accessing a field from an object stored in another field of another object.
- "LocalReferences": As mentioned earlier this contains all the local variables, including their name. The interpreter looks here after looking in the buffer mentioned above if it doesn't find anything with the same name as the identifier.
- "ThisReference": This is the last place the interpreter looks if it doesn't find anything in the places mentioned above. It looks for fields, methods or anything else which has the same name as the identifier found in the code.
Now when the interpreter runs into a method call, it will create a new stack frame and push it on the stack. Actually this is how the CLR works and in general any stack-based language like C, C#, Java and so on. Bellow is the code for the stack:
The "_staticWorkflowEvaluatorExecutionFrames" is pretty much the most important field. It stores all the stack frames. When a method is finished, the interpreter reads and executes the last line in the method's body, it's corresponding stack frame will be removed from this list. The latest stack frame is always on top of the stack so it will remove the stack frame from the top.
Still, it's not enough just to add or remove stack frames from the stack. They somehow must communicate among themselves. For example, when calling a method, the arguments to the method call must be passed to its stack frame. Or for example, when a method finishes executing, the return values must be passed to the previous stack frame in the stack so that the returned results can be further used in the parent context. This is where the properties "PassedMethodParameters" and "ReturningMethodParameters" from the stack frame class come in handy allowing stack frame to communicated between them.
Moving forward, for the interpreter to actually read the code and execute various actions, it needs to understand it and see what it actually means. For example, when it reads the "class" keyword it must know that a class declaration will follow. In order to do this, it relies on the C# compiler with its toolset called Roslyn to analyze the actual structure of the code. They are open source and can be downloaded from the internet if you search for "Microsoft Compiler Tools".
For each code file, a data structure is made by the C# compiler which contains information about the structure of the code. The data structure is actually a syntax tree which is a concept used a lot in compilers. It contains a series of interconnected nodes which actually are part of the syntax tree. For each structure inside the code, a corresponding node is added in the syntax tree. For example, if the code file contains a class declaration then in the syntax tree it will add a node of the type "ClassDeclarationSyntax" which is just a class/type declared in the C# compiler library/package. If it finds a method, then it creates a node of type "MethodDeclarationSyntax" in the tree. Absolutely every coding structure in the code file with have a corresponding node in the syntax tree.
These nodes are actually organized and there is a strict relationship between them. For example, a node of type "ClassDeclarationSyntax" has a list of nodes of type "MethodDeclarationSyntax" which represent all the methods declared inside the class. Or for example, the node which corresponds to the body of a method is of type "BlockSyntax" has a list of individual nodes of type "StatementSyntax" which are the actual statements contained in that block of code. When it executes that method, it just goes in order through all the statements in that list and executes an action for each of them.
If you need further explanations about syntax trees and nodes here is a good tutorial: https://github.com/dotnet/roslyn/wiki/Getting-Started-C%23-Syntax-Analysis
But still, it can't interpret and execute the code in just one read. First, it reads all the code in all the files and builds a table and detailed type information for all the classes, methods, properties and all other things. The actual class is called "EvaluatedTypesInfoTable" and it has a bit of lines of code so I won't be including it here. You can find it here: https://github.com/Alecu100/CodeAnalyzer/blob/master/CodeEvaluator.Evaluation/Members/EvaluatedTypesInfoTable.cs
After it finishes creating the detailed interpreted structure of the code, it then can begin to read and interpret the code on the second run. For each syntax node type that I mentioned earlier, a corresponding action exists that is executed when it reads the node. For example, if it reads a method call node, "InvocationExpressionSyntax", it identifies all the overloaded methods. Then it resolves the arguments of the call and based on the arguments, it selects a corresponding method from the list of overloaded methods. Then it creates a new stack frame for that methods, passes the arguments to the stack frame and the method. And finally, after that, it starts to read the body of the method line by line and execute further actions. The actual code and class which is responsible for calling methods is quite long and can be found below:
All the other actions which are executed for various syntax nodes can be found at the following address: https://github.com/Alecu100/CodeAnalyzer/tree/master/CodeEvaluator.Evaluation/Evaluators
These actions that are executed for each node can trigger other "child" actions to be executed for the child nodes. For example, an if statement has a couple of child nodes that need to have actions executed for them. These child nodes are the actual condition in the if statement, the main block that is executed if the condition is true and the optional block which corresponds to the "else" branch.
That's about it. In the end, everything boils down to creating and understanding the structure of the code and then moving step by step in the created structure, executing various actions at each step.
Comments
Post a Comment