How to get the CoreCLR source code, compile it and some really interesting things found inside

A couple of months ago I had the chance to look into the famous CoreCLR source code and actually compile it. It took me a while to understand it and how to use it. This is fairly advanced material that most of the time you don't need to know. CoreCLR is composed of 2 main parts. One is the actual virtual machine, which takes and runs the intermediate language code. The other is the core .NET library called "System.Private.CoreLib.dll", with the base types such as the string type or the numeric types. These 2 parts are tightly coupled. They need to have the same version and build type, release or debug.

The second part, which contains the source code for the core types in .NET, is a separate assembly written in C#. The unusual thing here is that people expect these types to be in "System.Core.dll", but they are not. That assembly is just a stub for the real assembly containing the actual implementation of the base types. It just forwards the type declarations to "System.Private.CoreLib.dll", telling the runtime to look there for them instead.
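To make the forwarding idea concrete, here is a minimal sketch in Python. It is only a toy model, not how the runtime really resolves types: the "Assembly" class and the "resolve" function are names I made up for illustration, and only the two assembly names come from the text above.

```python
class Assembly:
    """Hypothetical stand-in for a loaded assembly."""

    def __init__(self, name, types=None, forwards=None):
        self.name = name
        self.types = set(types or [])        # types implemented in this assembly
        self.forwards = dict(forwards or {})  # type name -> assembly to look in instead


def resolve(type_name, assembly_name, assemblies):
    """Follow type forwards until an assembly that implements the type is found."""
    seen = set()
    while assembly_name not in seen:
        seen.add(assembly_name)  # guards against forwarding cycles
        assembly = assemblies[assembly_name]
        if type_name in assembly.types:
            return assembly.name
        if type_name not in assembly.forwards:
            break
        assembly_name = assembly.forwards[type_name]
    raise LookupError(f"{type_name} could not be resolved")


assemblies = {
    "System.Core.dll": Assembly(
        "System.Core.dll",
        forwards={"System.String": "System.Private.CoreLib.dll"},
    ),
    "System.Private.CoreLib.dll": Assembly(
        "System.Private.CoreLib.dll",
        types=["System.String"],
    ),
}
```

Asking the stub for "System.String" ends up in "System.Private.CoreLib.dll", which is the behaviour described above.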

Now on to the source code.

The main source code can be found at the address: https://github.com/dotnet/coreclr

You can clone the git repository locally on your machine using the git client, or you can download the code of an existing release from the address: https://github.com/dotnet/coreclr/releases

I recommend downloading the code of a released version, because you need Visual Studio to develop applications for it. The latest unreleased versions don't have support in Visual Studio. They might also be buggy, or they might not compile on all platforms.

After you download the source code, you need some additional applications. First of all you should download Python 2.7 and install it. Be careful to tick the option that adds Python to the PATH environment variable. The build uses this to detect that Python is installed and where it is installed. Python is really popular, and a Google search will give the download link as the first result.

Another essential program to install is CMake. This application is used to generate projects for various code editors, like Visual Studio, from a general configuration file. It is really useful when multiple people work on the same codebase but with different IDEs. Again, searching for it with Google or Bing will give the download link as the first result.

Finally you need the latest Windows SDK for your specific Windows version. With Windows 10 this is extra buggy and complicated, because you might have a newer version of Windows 10 that doesn't have an SDK released yet, or your SDK might be older than the Windows version. The version of Windows should match the version of the SDK. You should also have only one Windows SDK installed, and it must be properly set in the Windows PATH environment variable.

In the end the path variable in your system should look like this:

You see in the screenshots above entries for the programs mentioned earlier, CMake, Python and Windows SDK.

Once you download the source code, in the root of the source code directory there should be a file called "build.cmd". It's a script to build the CoreCLR and generate the Visual Studio solution with the projects. You can double click to run it. It will take a while for it to finish, maybe an hour or even more. Also you need at least 20 GB of free space on your disk.

Once the build is finished, a solution with all the projects will be generated. In general it is located in "source code directory\bin\obj\Windows_NT.x64.Debug" and it is called "CoreCLR.sln". You can actually build the CoreCLR from the solution too, but I would recommend using the build script.

The resulting Visual Studio solution will be pretty big and will look something like this:

The most important project is the ALL_BUILD project. When you compile that project it will build the whole CoreCLR.

Once you compile the CLR virtual machine, you need to make a program which uses it so that you can debug it. You can't debug the virtual machine directly. You need to run a .NET Core application with it, and instead of debugging the application you will debug the CLR itself. So you need to create a .NET Core project in Visual Studio. The thing is that you need to create the project with the same version as the CoreCLR that you compiled.

After creating the project you will need to change some settings in it. Edit the created project file within Visual Studio and add an extra line to it, like below, under the ItemGroup node:

This changes the way the application is packaged, so that it comes bundled with the CoreCLR runtime and virtual machine. We will then replace the bundled ones with our own compiled versions. When we compiled the source code, some packages were generated with the runtime and virtual machine. We can reference these NuGet packages in our application project. When the project is compiled, it will fetch our own custom build and bundle everything in an output directory. The resulting application will be self-contained and won't need the .NET Framework.

To do this, you need to add a new package source reference and use the "runtime.win-x64.Microsoft.NETCore.Runtime.CoreCLR" package. The name of this package might change with each release of the .NET Core distribution. You just need to right click on the project of your .NET application and select the Manage NuGet Packages option, and a window will appear like below. You can also see the package sources there. In this case we need to configure a custom package source, which will be local: the folder where the CoreCLR build output resides as a package.

But before we can run our application inside our own custom virtual machine, we also need to replace the core library that contains the core types, "System.Private.CoreLib.dll". Otherwise it won't work. Both parts of the CLR need to have exactly the same version and build type. You can find the copy to replace, after you build and publish the application, in the "project folder\bin\Release\PublishOutput" directory. The version of this dll that we just compiled can be found here: "coreclr repository\bin\obj\Windows_NT.x64.Debug\System.Private.CoreLib".

To debug the CoreCLR you need to set the target of the debug of the ALL_BUILD project to the CoreCLR application that we just made earlier:

Once you navigate in the code you will come across some interesting design concepts. I never really studied a complex piece of software like the CoreCLR and I found some surprising things.

Firstly, every thread has a special data structure called the stack. In the code, the stack is composed of multiple stack frames, and there is a C++ class called "Frame" which all the stack frame types inherit from. You can do a solution-wide search for "Frame" and you will come across the class definition in the search results for sure. Or there might be a "Frame" code file. Each frame corresponds to a method call, so each frame stores the local variable references. When you assign a new value to a reference type variable, the structure corresponding to the currently called method is updated. It may actually be a list of pointers in which the corresponding pointer is updated with the new value. This structure is also used for garbage collection, providing a way for the garbage collector to access the references. We can see that this structure is updated and used by both the compiled managed code and the native C++ code of the virtual machine. I think that when the binary code is emitted for a method, the intermediate managed code is compiled against this structure. I am not 100% sure about these things though.
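Here is a rough sketch, in Python, of how I picture this: a chain of frames, one per method call, which the garbage collector can walk to find the references that are still in use. This is a toy model under my own assumptions. The class names, the "local_refs" dictionary and the "gc_roots" method are made up for illustration and are not the real C++ "Frame" class.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Frame:
    """Toy stand-in for one stack frame (one per method call)."""
    method: str
    local_refs: Dict[str, object] = field(default_factory=dict)  # local name -> object
    caller: Optional["Frame"] = None  # the frame of the calling method


class ThreadStack:
    """Hypothetical per-thread stack, kept as a linked chain of frames."""

    def __init__(self) -> None:
        self.top: Optional[Frame] = None

    def push(self, method: str) -> Frame:
        # Calling a method puts a new frame on top of the caller's frame.
        self.top = Frame(method, caller=self.top)
        return self.top

    def pop(self) -> Frame:
        # Returning from a method removes its frame.
        frame = self.top
        if frame is None:
            raise IndexError("stack is empty")
        self.top = frame.caller
        return frame

    def frames(self) -> Iterator[Frame]:
        frame = self.top
        while frame is not None:
            yield frame
            frame = frame.caller

    def gc_roots(self) -> List[object]:
        """Objects still referenced by a local variable in any live frame."""
        return [
            obj
            for frame in self.frames()
            for obj in frame.local_refs.values()
            if obj is not None
        ]
```

Assigning to "frame.local_refs[...]" plays the role of updating a local reference, and "gc_roots" is the walk a collector would do to find out which objects are still reachable from the stack.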

Secondly the frame structures are used by the runtime to handle exceptions. When an exception is thrown inside the code then the runtime has to delete all the stack frames until a method with a stack frame that can handle that exception is found. While doing this all the references from the deleted stack frames are removed too so that the garbage collector can know some objects are not used anymore and need to be deleted. I am not too sure about this but as far as I know this process is called stack unwinding. And it has a lot of lines of code and complex logic, more than 8 thousand lines of code which take a while to execute. This is why you should avoid exceptions and only use them in exceptional cases.
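The basic idea of unwinding can be sketched in a few lines of Python. This is a heavily simplified model and certainly not what the thousands of lines in the runtime do: the "handles" set and the "unwind" function are hypothetical names, and exceptions are just strings here.

```python
from typing import Iterable, List, Optional


class Frame:
    """Toy stack frame that knows which exception types its method can catch."""

    def __init__(self, method: str, handles: Iterable[str] = ()) -> None:
        self.method = method
        self.handles = set(handles)
        self.local_refs = {}  # local name -> referenced object


def unwind(stack: List[Frame], exception: str) -> Optional[Frame]:
    """Pop frames from the top (end of the list) until one can handle `exception`.

    Returns the handling frame, or None if the whole stack was unwound
    without finding a handler.
    """
    while stack:
        frame = stack[-1]
        if exception in frame.handles:
            return frame
        # The frame is discarded, so its references no longer keep objects alive.
        frame.local_refs.clear()
        stack.pop()
    return None
```

For a call chain Main -> ReadFile -> Parse where only Main catches "IOException", unwinding removes the Parse and ReadFile frames and stops at Main. Even in this toy version the cost grows with the depth of the stack, which fits the advice to keep exceptions for exceptional cases.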

Thirdly, there is a base class for objects called "Object", different from the object class in the .NET Framework. I don't quite understand how this is used. Maybe it's just a façade that lets the native code easily access and interpret the data stored in the managed objects. I don't know. There are a couple of other types derived from it for arrays and strings. All these classes are contained inside a special folder called "vm", which contains constructs that belong to the virtual machine side of the CoreCLR environment. There are plenty of interesting classes in that folder, like the "BaseAppDomain" class. It turns out that every CoreCLR application has an extra application domain called a "SystemDomain", of which there is only one per process. I think internal things that need to be shared across all application domains in the same process are stored here, but I am not too sure about this.

The way intermediate language is compiled to binary code is also pretty interesting. It turns out that before being compiled, an actual tree or graph is built, with the nodes being the instructions in the intermediate language and the edges connecting them. Each of the edges has an estimated cost. For example, a function call inside a loop will probably have a higher cost than a simple function call that is done once. In the first case the compiler might inline the method directly in the loop, to avoid having to pass arguments to the function each time the loop body is executed. There is also a special optimization called "folding", which refers to getting rid of redundant instructions. For example, if a piece of code adds several values together and two of them are constants, the compiler adds the two constants itself and replaces them with their sum, removing a redundant addition. All these functionalities are located in the "jit" folder. The nodes in the graph mentioned earlier are represented by "GenTree" types.
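To show what folding means on a tree, here is a minimal sketch in Python. It is only an illustration of the technique on a tiny expression tree. The node classes "Const", "Var" and "Add" and the "fold" function are names I made up, and the real "GenTree" nodes are far richer than this.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Node"
    right: "Node"


Node = Union[Const, Var, Add]


def fold(node: Node) -> Node:
    """Replace every addition of two constants with a single constant."""
    if isinstance(node, Add):
        # Fold the children first, so nested constant additions collapse bottom-up.
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    return node


# x + (2 + 3) becomes x + 5: one addition is removed at compile time.
folded = fold(Add(Var("x"), Add(Const(2), Const(3))))
```

Note that this simple version only folds constants that are direct siblings in the tree. Something like (x + 2) + 3 would need the additions to be reordered first, which this sketch does not attempt.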

The advantage of the approach above is that the optimizations apply regardless of the language the application was originally written in. Whether it was C# or Visual Basic .NET, the result is the same intermediate language instructions, which then get handled by the same optimization mechanism.

There is still a lot of stuff to be said. The class system in CoreCLR seems to have an interesting optimization. The type system seems to be separated into 2 major components: class type definitions and class type instances. I think this is for the generics mechanism implemented in .NET. When a generic instance is created, the common type info is shared across all the generic type instances, by using class type instances behind the scenes which refer to the same class type definition. I know there was an issue with generics in .NET in which too many generic types were generated, which slowed the application down.
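My understanding of the definition/instance split can be sketched like this in Python. All the names here ("TypeDefinition", "TypeInstance", "TypeLoader") are hypothetical and only model the idea of sharing. They are not the actual runtime classes.

```python
class TypeDefinition:
    """Shared information for a generic type such as List<T>."""

    def __init__(self, name, methods):
        self.name = name
        self.methods = list(methods)


class TypeInstance:
    """One constructed type such as List<string>; it only points at the definition."""

    def __init__(self, definition, type_args):
        self.definition = definition
        self.type_args = tuple(type_args)

    @property
    def full_name(self):
        return f"{self.definition.name}<{', '.join(self.type_args)}>"


class TypeLoader:
    """Creates each constructed type once and reuses it afterwards."""

    def __init__(self):
        self._cache = {}

    def instantiate(self, definition, *type_args):
        key = (definition.name, type_args)
        if key not in self._cache:
            self._cache[key] = TypeInstance(definition, type_args)
        return self._cache[key]


list_definition = TypeDefinition("List", ["Add", "Clear"])
loader = TypeLoader()
list_of_string = loader.instantiate(list_definition, "string")
list_of_object = loader.instantiate(list_definition, "object")
```

The two instances are different objects, but both refer to the same definition, so the method list exists only once no matter how many instantiations are created.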

On a higher level there is a central important functionality called the execution engine. This is pretty much responsible for running the application. It can suspend the application so that the garbage collector can run. I think it also manages the threads. The garbage collector has a special interface with which it communicates with the execution engine called "IGCToCLR".

The garbage collector itself is pretty interesting. The runtime is separated from the garbage collector, and when it is initialized it tries to locate and load a garbage collector. There is an interface which the garbage collector has to implement. Yes, you can actually implement your own garbage collector if you are really hardcore. I think the garbage collector can be in a separate dll which is loaded dynamically at runtime. Also, something really cool is that there are 2 different garbage collection methods. One method is a bit faster and only looks at recently created objects, which are in the first generation. The second method is slower and more general, and looks at all the objects. The method which does the first way of garbage collecting is called "gc1".
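The two ideas from this paragraph, a collector behind an interface and a cheap collection versus a full one, can be sketched together in Python. This is a toy model under my own assumptions: the "GarbageCollector" interface, the "ToyCollector" and the "Runtime" class are invented for illustration and have nothing to do with the real interface or with "gc1".

```python
from abc import ABC, abstractmethod


class Obj:
    """Toy heap object with a generation and outgoing references."""

    def __init__(self, name):
        self.name = name
        self.generation = 0
        self.refs = []


class GarbageCollector(ABC):
    """Hypothetical interface that the runtime expects a collector to implement."""

    @abstractmethod
    def collect(self, heap, roots, full):
        """Return the list of objects that survive the collection."""


class ToyCollector(GarbageCollector):
    def collect(self, heap, roots, full):
        pending = list(roots)
        if not full:
            # Young collection: older objects are not examined, just assumed alive.
            pending += [obj for obj in heap if obj.generation > 0]
        live = set()
        while pending:  # mark everything reachable from the starting set
            obj = pending.pop()
            if id(obj) not in live:
                live.add(id(obj))
                pending.extend(obj.refs)
        survivors = [obj for obj in heap if id(obj) in live]
        for obj in survivors:
            if obj.generation == 0:
                obj.generation = 1  # survivors of a collection are promoted
        return survivors


class Runtime:
    """Works with whatever collector it is given at start-up."""

    def __init__(self, collector):
        self.collector = collector
        self.heap = []
        self.roots = []

    def allocate(self, name):
        obj = Obj(name)
        self.heap.append(obj)
        return obj

    def gc(self, full=False):
        self.heap = self.collector.collect(self.heap, self.roots, full)
```

Swapping in another collector only means passing a different "GarbageCollector" implementation to "Runtime". The quick collection frees dead new objects but leaves old garbage in place until a full collection runs, which is the trade-off between the two methods.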

There are still a lot of things left to be said but this post is already pretty long and loaded with information so I will cut it here. Maybe I will update this in the future with more information.
