Two compilers, one cup?

Started by fernewelten, Wed 25/07/2018 01:12:35

Previous topic - Next topic

fernewelten

At first blush, this project seems to have two compilers:

  • The one in Editor/AGS.CScript.Compiler
  • The one in Compiler/script/cs_parser.cpp (the main entry seems to be the function __cc_compile_file())

How are they related? Which should I delve into first?

Crimson Wizard

#1
The one in C# was not finished by Chris Jones, and AFAIK the library is only used for limited parsing operations such as macros preprocessing (IPreprocessor interface) and autocomplete.

Previously discussed, one of the potential future paths is to fully move the compiler to C#, which would also let port it to Mono as a part of the Editor (since Mono does not allow having mixed assemblies).

On the other hand, another route is to implement stand-alone script compiler, launched as a separate process. This solution might give more advantage in the end. In this case the choice of language is no longer restricted. But even then it would be very much desired to rewrite/refactor the old compiler, because currently its code is perhaps the messiest of all AGS.

For the reference, the people who worked with the compiler latest were Gurok and Nick Sonneveld. Personally I know only the script format, since I was dealing with its runtime interpreter, but have very little idea about compiler's code.

We have a description of the script format in the repository's wiki:
https://github.com/adventuregamestudio/ags/wiki/Script-memory
https://github.com/adventuregamestudio/ags/wiki/Script-execution
These are written from the runtime POV, but may give general idea of how script is created.

fernewelten

#2
Thanks for the fast response!

Quote from: Crimson Wizard on Wed 25/07/2018 01:32:04
because currently [the] code [of the compiler] is perhaps the messiest of all AGS.

Er, I've noticed that. :-D

I'm new to the project and don't know any details, so I can't give an informed opinion. But in general, I would prefer refactoring the compiler first and porting it to C# next: Usually, messy code isn't written that way from the get-go. It usually starts out with a clear, understandable concept. But life is complicated, and programmers learn the little snags and details step by step as they show up. Thus, the tangles come in and accumulate more and more. They've all been bought with sweat and tears, one by one.

In such a situation, if the code is simply rewritten from scratch, we do regain a nice-looking, structured code. But it is impossible to understand the tangled old one, so it is impossible to port all the bells and whistles of the old code into the new one, and tons of information get lost.

Trouble is, life still is complicated, and all the lost bells and whistles will have to be re-bought a second time with sweat and tears, one by one. This might be much more than estimated originally.

So it might be worth the pain to refactor the code and keep the bells and whistles in. When the code is in shape, it shouldn't be too hard to port it to C# in a second step.

This is just a general opinion; I don't know whether it really applies here. You've probably had lots of discussions about it already in the team.

The compiler already has a rudimentary test suite, which is a good thing. As far as I can see, it is fully functional (except for that one test): You just compile it, link it against the compiler object file and let it run.

This could be extended into a safety net for refactoring: After each rewriting step, the tests could be run "at the push of a button" to ensure that the new output is exactly like the old one. Thus any errors or omissions that creep in would be caught as early as possible. With luck, the compiler would never leave a state where it is fully usable. This would be very desirable IMO since it is a central core component.


Quote from: Crimson Wizard on Wed 25/07/2018 01:32:04
We have a description of the script format in the repository's wiki:
https://github.com/adventuregamestudio/ags/wiki/Script-memory
https://github.com/adventuregamestudio/ags/wiki/Script-execution

I've looked at them and found them quite helpful in understanding the code.

I've also seen that there's a "fixup" step involved. There seem to be two parallel arrays, one for the bytecode proper and another for "fixups". Each bytecode has an associated 8-bit field in that other array that denotes whether the integer is left as is or whether it is translated in one of several ways. I've preliminary decided that this must be some sort of one-pass linking.

Crimson Wizard

#3
Quote from: fernewelten on Wed 25/07/2018 03:16:10
I've also seen that there's a "fixup" step involved. There seem to be two parallel arrays, one for the bytecode proper and another for "fixups". Each bytecode has an associated 8-bit field in that other array that denotes whether the integer is left as is or whether it is translated in one of several ways. I've preliminary decided that this must be some sort of one-pass linking.

Right, this is something that was not documented in the wiki for some reason (I could swear I've seen it there, but that may be a false memory).
The linking is done at the runtime when the engine loads up the scripts. It then resolves global variables, string literals and script exports.
This is where "Script link" errors may also come from (most commonly when some plugin is missing or room script was not recompiled in sync with the script modules).

fernewelten

So, I've checked with Gurok, and he doesn't think it probable either that I'd step on another programmer's toes.

So I'm having a crack at refactoring cs_parser.cpp, at first trying my hardest to not change anything except for the "tangledness". It's a strictly localized effort -- 1 file --, so it will be easy to roll back or ignore if it doesn't work out. I'll try to stress-test that continually as much as possible.

If it does work out, the next step might be to try and make the code more "c++"sy. That is, replacing homegrown structures by established ones wherever they exist while still not changing anything in the behaviour or output. Perhaps better object boundaries emerge, perhaps concerns can be better separated.

If that works out, too, then we can see where we might take the code from there.

Crimson Wizard

#5
I wanted to say this earlier, but since you've already made up your mind on the refactoring, I'll put it here only for the sake of argument.

Quote from: fernewelten on Wed 25/07/2018 03:16:10
In such a situation, if the code is simply rewritten from scratch, we do regain a nice-looking, structured code. But it is impossible to understand the tangled old one, so it is impossible to port all the bells and whistles of the old code into the new one, and tons of information get lost.

Trouble is, life still is complicated, and all the lost bells and whistles will have to be re-bought a second time with sweat and tears, one by one. This might be much more than estimated originally.

I tend to in some way disagree with this in relation to the script compiler. When it comes to the engine and the editor it is difficult to track down and document all the big and small features added over the years, and this is why carefully untangling the old code is a must if we aim fully keeping their behavior.

With compiler, imho, you need to follow strict specification, and we may have one since we know both script syntax and compiled format very well. All the deviating behavior comes most certainly from mistakes, oversights and lack of optimization in the old code.

Of course it may be easier to use existing code than write new one from the scratch, but my main point here is that it's not really wanted to keep "non-standard" behavior for its own sake, and may not be neccessary to care about all the nuances of the old code so long as you have a clear idea of how to do parsing and create proper compiled format.

fernewelten

You're quite right: It might easily be that this tangle hides something strange somewhere - some construct that the code does NOT forbid although it ought to be forbidden, or the opposite. Also, we might find out that some parts of the interface have strange quirks in special cases, for instance that some externs interact in unwanted ways in special cases. When we find out how some code is generated, we might not like that generated code.

Whenever that happens we can still decide to flag those user inputs as errors or to change the interface or at least to provide some additional safeguarding or to change the code generation.

This is just meant as an intermediate step to make the code easier to work with and to make any weak points that need fixing more conspicuous to see.

fernewelten

#7
Well, speak of the devil and he'll come along. Here's a prime example:

Bug: Messing with the compiler line numbering

Seems that the tokenizer is written in such a way that any constant string of the form "__NEWSCRIPTSTART_" anywhere in any context whatsoever (except for comments) will reset the line numbering of the compiler. :-/ I haven't definitely ruled out yet that the parser _needs_ that for some strange reason, but preliminary analysis says it doesn't. There's no place in all the AGS code where either that string or the #define for it is referenced a second time.  It's only in the tokenizer. It's never automatically generated.

Crimson Wizard

#8
AFAIK the "__NEWSCRIPTSTART_" is used to divide dialog scripts, which are compiled in one bunch. See DialogScriptConverter.


I have one general request, could you please open these issues in github repository, since you are working with that anyway?

This is a problem to have two trackers. The issue tracker was created on forums even before source was opened by Chris Jones. At first we were filling it extensively, but now I think it is convenient to keep bug reports on github, because this is where developers look in first.
Today the board tracker is used very rarely, regular users usually just open a forum topic or post in release thread. I recall some people suggested to move everything to github eventually. Maybe that was a mistake to keep it so long.

SMF spam blocked by CleanTalk