How is the "proper" right-to-left text written, and wrapped?

Started by Crimson Wizard, Sun 09/04/2023 01:42:22


Crimson Wizard

This question is partially about syntax and partially about a technical issue in AGS.

There are a number of languages that are written right-to-left (Arabic, Persian, Hebrew, and a few others).

AGS has a built-in "Right to left" text mode, but it looks like it was meant strictly for ASCII text and is not fully usable with Unicode text. The thing is that Unicode text may already be written right-to-left (it may contain control characters for this, if I understand correctly).

The problem with AGS, though, is that the way it wraps text assumes the text was originally written left-to-right (in script or translation files). In Right-to-left mode the engine wraps the text the same way as for left-to-right, and then reverses each line separately.

This does not work at all with "proper" R-to-L Unicode texts. As they are already reversed, the line splitting may result in the beginning of the sentence appearing on the last line.
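The failure described above can be reproduced with a small Python sketch. It is only an illustration and not the engine's code: `wrap_forward` is a hypothetical greedy word wrapper, width is counted in characters rather than measured in pixels with the game font, and Latin words stand in for RTL text so that the ordering is easy to follow.

```python
def wrap_forward(text, width):
    """Greedy word wrap, scanning the stored string from index 0 onwards."""
    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if current and len(candidate) > width:
            lines.append(current)   # current line is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


logical = "one two three four"   # characters stored in reading order
pre_reversed = logical[::-1]     # "ruof eerht owt eno", stored in display order

# Existing Right-to-left mode: wrap forward, then reverse each line separately.
rtl_mode_lines = [line[::-1] for line in wrap_forward(logical, 9)]

# The same forward wrap applied to pre-reversed text that is drawn as-is.
broken_lines = wrap_forward(pre_reversed, 9)

print(rtl_mode_lines)  # ['owt eno', 'eerht', 'ruof']
print(broken_lines)    # ['ruof', 'eerht owt', 'eno']
```

In the second result the start of the sentence ("eno", that is "one") ends up on the last line, which matches the symptom reported here.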

My first thought was that the fix should simply be to reverse the order of the split lines instead.
Or, more correctly: do the line splitting by scanning the text in reverse.
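A rough Python sketch of why scanning in reverse is the more correct of these two ideas. The helpers `wrap_forward` and `wrap_backward` are hypothetical, width is a character count (an assumption, the engine would measure pixels), and the sample is the reversed form of "one two three four".

```python
def wrap_forward(text, width):
    """Greedy word wrap scanning the stored string front to back."""
    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if current and len(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def wrap_backward(text, width):
    """Greedy word wrap scanning the stored string back to front.

    Each line is built by putting new words in front of what is already
    there, so a finished line keeps the stored (display) order.
    """
    lines, current = [], ""
    for word in reversed(text.split(" ")):
        candidate = word if not current else word + " " + current
        if current and len(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


pre_reversed = "ruof eerht owt eno"

flipped = wrap_forward(pre_reversed, 9)[::-1]   # first idea: flip the line order
scanned = wrap_backward(pre_reversed, 9)        # second idea: scan in reverse

print(flipped)  # ['eno', 'eerht owt', 'ruof']
print(scanned)  # ['owt eno', 'eerht', 'ruof']
```

Flipping the line order does put the first word on the top line, but the lines are filled from the wrong end of the sentence. Scanning in reverse fills the top line with as many of the first words as fit, which is what a reader would expect.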

But then I started considering things like multiple sentences in the same string, or even multiple paragraphs in the same string, and got very confused.

My question is: suppose you have 2 sentences in, say, Arabic or Persian. How are those 2 sentences normally written? Is their order also reversed? If yes, then the above fix will work. If not, it won't, because the first sentence has to be arranged first (which may take multiple lines), then the second, and so on.

And then there's a case when you have a number of paragraphs in one string. Obviously, the order of paragraphs must not be reversed. But if a single sentence inside a paragraph is wrapped, then wrapping of that particular sentence will have to be reversed...
In other words, there may be groups of lines wrapped in reverse order, but groups themselves have to be arranged in original order...
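The "groups of lines" idea could look roughly like the Python sketch below. It rests on a loud assumption that the thread has not confirmed: each paragraph is stored reversed on its own, while the paragraphs themselves stay in their original order, separated by "\n". The helper names are hypothetical and width is counted in characters.

```python
def wrap_backward(text, width):
    """Greedy word wrap scanning one paragraph from its end to its start."""
    lines, current = [], ""
    for word in reversed(text.split(" ")):
        candidate = word if not current else word + " " + current
        if current and len(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def wrap_paragraphs(text, width):
    """Keep paragraph order, but wrap every paragraph back to front."""
    lines = []
    for paragraph in text.split("\n"):
        # An empty paragraph still produces one blank line.
        lines.extend(wrap_backward(paragraph, width) or [""])
    return lines


# Paragraph 1 is "one two three" reversed, paragraph 2 is "four five six" reversed.
stored = "eerht owt eno\nxis evif ruof"
print(wrap_paragraphs(stored, 9))
# ['owt eno', 'eerht', 'evif ruof', 'xis']
```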

So, thinking about this, I was coming to an idea that there's no trivial generic solution here.

Or rather, the only "trivial" generic solution is to write R-to-L languages completely left-to-right in the source texts, and let the engine reverse them after splitting into lines.

Without this, the engine would have to parse the text's syntax using punctuation, finding where the sentences are (separated by full stops) and where the paragraphs are (separated by manual linebreaks).
Well, this is not impossible, but at the same time things like that have not been done in the AGS engine before.

Does anybody have ideas on this?

Snarky

I think you've confused yourself there. I am 99% sure that in RTL scripts, everything works about the same as in LTR scripts, except mirrored.

Crimson Wizard

I was discussing this late at night on Discord, so may have had strange ideas. I will try to explain my confusion with the images.

If you type Arabic text, using google translation for example, this is how the text appears:


In English this says "I love beaches. The water is warm."
Notice there are 2 sentences. Not only are the words and letters inside a sentence reversed R->L, but the sentences themselves are also in reverse R->L order.

Now, if I put this full text into the AGS script as-is, it will look like this:


Is this how it's supposed to be in script and translation file?

OR, should we require the sentence in the L->R order, and only text R->L? Like:



Now, let's assume that there are paragraphs in this string.

In English this says: "I love beaches. The water is warm. \n The sun is high. \n"

How do we treat these sections separated by \n? In R->L or L->R order?





In regards to the line-splitting and text wrapping. If everything is treated R->L (sentences are R->L and paragraphs are R->L), then we basically just need to have a reverse splitting algorithm, that is: run the string from back to front and split like that.
That's the easiest solution.



After this there will still be a logical problem of string concatenation, e.g.:
Code: ags
String s ="احب الشواطئ."; // I love beaches.
s = s.Append("الماء الحار."); // The water is warm.
s = s.Append("الشمس عالية."); // The sun is high.
This will be displayed in the wrong order on screen.
But I guess we cannot do anything about it, unless we introduce an R->L-aware "String.Concatenate" function?

Snarky

These sound like problems with the way AGS and the AGS editor treat RTL strings. To me it seems like the correct behavior is quite clear in each case:

1. A RTL string has its beginning on the right and its end on the left, so obviously the first sentence goes first (that is, on the right).

2. And it is processed from beginning (right) to end (left), just like a LTR string is (from beginning to end, so in that case left to right). OK, so there is a question of how the '\n' separator is parsed, but the way you have written it here, I would guess that there must be an invisible control code that switches from RTL to LTR for that bit, so that it's read:

[5:LTR "\n"],[4:RTL "The sun is high."],[3:LTR "\n"],[2:RTL "The water is warm."],[1:RTL "I love beaches."]

If there is no such control character, and it's all part of the RTL string, I would guess you'd type '\' and then 'n' as normal, but your cursor would be moving "backwards" so that it would appear onscreen as "n\".

3. Here I think you're making the incorrect assumption that a RTL string still begins on the left and ends on the right, and that its internal representation has simply been reversed, so that something concatenated to it is added to its right side. I believe if the strings have been correctly set to RTL, concatenating them will make the added part show up on the left. Similarly, if you took s.Chars[0] in your first example it should return the rightmost character (the one that looks like a bar, '|').
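Point 3 can be illustrated with a short Python sketch. It is an illustration only: both function names are hypothetical (no such AGS API is implied), and Latin text stands in for the Arabic sentences. When strings are stored in reading order, plain appending is already correct. Only strings stored pre-reversed would need the new part added at the front.

```python
def concat_logical(first, second):
    """Strings stored in reading order: appending keeps the reading order."""
    return first + second


def concat_pre_reversed(first, second):
    """Hypothetical helper for strings stored reversed (display order):
    the part that is read second has to go in front of the stored data."""
    return second + first


a, b = "one. ", "two."

logical = concat_logical(a, b)
stored = concat_pre_reversed(a[::-1], b[::-1])

print(logical)  # one. two.
print(stored)   # .owt .eno

# The reversed storage of the joined text matches what the helper produced.
print(stored == logical[::-1])  # True
```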

Now, it could very well be that the AGS editor doesn't properly support RTL strings in string literals, so that this bears little resemblance to how it actually works in AGS, but intuitively this is how I think it should treat them. Though it's probably worth verifying that with someone who actually uses a RTL script and is familiar with how it is embedded and edited in otherwise LTR text.

Crimson Wizard

So the important question is how the characters of these strings are positioned in the char array.

Snarky

OK, so I've done a minimal amount of research to check my intuitions, and it does look like I was right.

The correct order in the array is always the reading order. String.Chars[0] is the start of the text, and whether that first symbol is displayed on the left or right end of the string depends on the script. If the text is totally RTL, this is no more complex than in LTR: you can use the same layout logic only mirrored. It does get more complex if you have a mix of text chunks going in different directions (bidirectional text), such as if RTL strings are embedded in LTR code. Unicode has an algorithm for determining the layout in those cases: http://unicode.org/faq/bidi.html

If we hack RTL by just reversing the strings so that .Chars[0] is the end of the text, that is going to break ligatures etc., as @Mehrdad reports.
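One concrete effect of such a reversal can be shown with Python's standard `unicodedata` module. This sketch covers combining marks only, as an assumption-light stand-in for the "etc." above; real ligature shaping depends on the font engine and is not modelled here. The helper `mark_attachments` is hypothetical.

```python
import unicodedata


def mark_attachments(text):
    """Pair each combining mark with the character stored right before it."""
    pairs = []
    for i, ch in enumerate(text):
        if i > 0 and unicodedata.combining(ch):
            pairs.append((unicodedata.name(text[i - 1]), unicodedata.name(ch)))
    return pairs


# BEH, FATHA (a combining vowel mark), ALEF, stored in reading order.
text = "\u0628\u064E\u0627"

print(mark_attachments(text))
# [('ARABIC LETTER BEH', 'ARABIC FATHA')]

print(mark_attachments(text[::-1]))
# [('ARABIC LETTER ALEF', 'ARABIC FATHA')]
```

After a naive code point reversal the vowel mark follows a different letter, so a renderer that attaches marks to the preceding character would draw it in the wrong place.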

Crimson Wizard

EDIT: scrap this, I need to check something out when I have more spare time.

Snarky

Quote from: Snarky on Mon 10/04/2023 08:31:14The correct order in the array is always the reading order.

Before you deleted it, @Crimson Wizard, you asked what this was based on and if I had tested it in AGS. I have not tested in AGS because I'm talking generally about how it is meant to work (specifically in Unicode), whether or not that is how it is currently implemented in AGS.

The Unicode document I linked to states it explicitly or implicitly in a few places (e.g. "No matter how the layout is resolved the order of characters in memory essentially follows the order they are typed."), but I also think it's kind of true more or less by definition: A text string starts at the beginning and ends at the end, so that the first symbol (apart from any metadata like control characters) is the one at index 0. The layout of the string for display is a secondary consideration or formatting convention that is not strictly part of the string itself, comparable to things like font face and line-wrapping. (And like for line-wrapping, it is sometimes necessary or desirable to embed hints/overrides within the string representation to get the correct/preferred appearance, e.g. non-breaking spaces, optional hyphens, explicit line breaks.)

Now, that said, for all I know it is possible that some tools we want to support (perhaps indeed the Scintilla editor) do provide "RTL" text by just reversing the string and treating that as a LTR string. If so, I think some (compile-time?) process to convert it back to the correct representation is required, if we want to do it right.

BTW, does TextBox support RTL input?

Crimson Wizard

@Snarky, I did more actual tests and you are correct. When I copy a "proper" Unicode RTL text into the editor, it's saved LTR; that is, the first "syntactical" character appears in Chars[0], and the rest follow. In other words, it only looks RTL when drawn in the application, but the data is LTR.

I did the test by simply printing Chars[0] on a separate label.

What this means is that if you print a "proper Unicode" RTL text, then everything just works, so long as you set "Right-to-left" mode in the settings.
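The same kind of check can be done outside AGS. The Python sketch below is not the test that was run in the engine; it only shows the idea of inspecting index 0, using the first word of the script example from earlier in the thread, written with escapes so that editor display order cannot confuse the result.

```python
import unicodedata

# First word of the earlier script example ("I love"), as code points
# in the order they are typed: ALEF, HAH, BEH.
word = "\u0627\u062D\u0628"

print(unicodedata.name(word[0]))   # ARABIC LETTER ALEF
print(unicodedata.name(word[-1]))  # ARABIC LETTER BEH
```

ALEF is the bar-shaped letter that appears rightmost on screen, so index 0 holds the first letter in reading order even though it is drawn at the right end.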

There's still a letter-linking problem, but that's an issue on its own.



The letter linking may be solved by a program called "Parsi Negar" that comes with its own fonts, suited for this particular purpose.
https://leomoon.com/downloads/desktop-apps/leomoon-parsinegar/
This is pointed out by @Mehrdad.

This program somehow converts the text. I am not 100% certain, but I think that it merges the linking letters into a single glyph with special number, and in combination with the special fonts it allows the Persian (and maybe Arabic?) text to be drawn visually correctly in AGS.
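Unicode does define separate code points for positional letter shapes and for some ligatures (the Arabic Presentation Forms blocks). Whether Parsi Negar emits exactly these code points is an assumption and has not been verified here. The Python sketch below only shows what such "pre-shaped" characters look like as data, and how compatibility normalization maps them back to plain letters.

```python
import unicodedata

# Two pre-shaped code points from Arabic Presentation Forms-B.
shaped = "\uFE91\uFEFB"

for ch in shaped:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+FE91 ARABIC LETTER BEH INITIAL FORM
# U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# NFKC folds presentation forms back into the ordinary letters.
plain = unicodedata.normalize("NFKC", shaped)
print([f"U+{ord(ch):04X}" for ch in plain])
# ['U+0628', 'U+0644', 'U+0627']
```

A font that has glyphs for these code points can be drawn by a renderer with no shaping support, which would fit the behaviour described above.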

The problem with ParsiNegar is that by default it actually stores the text RTL, meaning it's already reversed. This does not work with AGS linewrapping, neither the LTR nor the RTL one.

I found that it's possible to work around this by disabling the "reverse" option in this program. If you do, then it seemingly works well in combination with AGS's RTL mode.
The nuance is that the text possibly looks "weird" in the script and translation file. I suspect that it will probably be hard to read for native speakers (as it looks literally reversed).

I am currently in the process of checking this out with Mehrdad on Discord. If he thinks that the workaround I found is not convenient for him, then the only option for us would be to introduce a 3rd selection to the "text direction" option, that would assume the text is only reversed in data.

Snarky

I'm glad to hear you're figuring out the ins and outs of it. If I understand what you're saying correctly, I believe there is still one part of your thinking that isn't quite right (at least, it confuses me), and it might be good to clear it up right away:

Quote from: Crimson Wizard on Tue 11/04/2023 20:21:51it only looks RTL when drawn in the application, but the data is LTR.

I think by this you mean that the string is stored with its beginning (i.e. the letter you're supposed to start reading at) at Char[0]. Great, that sounds like how it's meant to be. But it's wrong to say that this means "the data is LTR": the memory doesn't have any particular direction. We are used to arranging consecutive characters from left to right, but that's just habit, not anything inherent in the string or data (or properly designed Unicode string rendering libraries).

If I understand the Unicode documentation correctly, there is normally no explicit information stored in the string about whether it should be displayed LTR or RTL. Instead, each character (code point) is associated with a directionality flag, depending on the convention of the script it belongs to: letters in the Latin, Greek, Cyrillic, etc. alphabets are LTR, while letters in Arabic and Hebrew scripts are RTL. Punctuation characters are typically neutral (can go either way), and take their directionality from the surrounding characters. There are override codes that can force a particular directionality, which can be used for example to clarify if punctuation between a LTR and a RTL text chunk belongs to the preceding or following chunk (if it's the end of the LTR chunk, it should be placed at the right end of that; if it's the start of the RTL chunk, it should be placed at the right end of that).
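The per-character directionality can be inspected with Python's standard `unicodedata.bidirectional`, which returns the Unicode bidirectional class of a code point. The sample characters are arbitrary picks for illustration.

```python
import unicodedata

samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05D0",
    "Arabic alef": "\u0627",
    "exclamation mark": "!",
    "space": " ",
    "right-to-left override": "\u202E",
}

classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}
for label, bidi_class in classes.items():
    print(f"{label}: {bidi_class}")

# Latin A: L                   (left-to-right letter)
# Hebrew alef: R               (right-to-left letter)
# Arabic alef: AL              (right-to-left Arabic letter)
# exclamation mark: ON         (other neutral)
# space: WS                    (whitespace, also neutral)
# right-to-left override: RLO  (explicit override control)
```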

tl;dr: LTR/RTL is not a property of the way strings are stored in memory, and it gets confusing if we talk as if LTR is the true and natural way memory is ordered. For accuracy and clarity, the question is whether the strings are reversed, so that Char[0] represents the end (as it would be read in that script).

Snarky

Maybe it would be useful to make a table, to keep track of how AGS treats text that is:

-LTR (we know this works)
-RTL (Unicode)
-bidirectional (Unicode)
-pseudo-RTL (reversed, non-Unicode?)

In the case of text that is:

-Copied into IDE from an external editor
-Entered/edited in IDE
-Entered in translation file (Unicode)

For:
-Display in IDE
-Display in-game

As well as:
-Text entry in game (TextBox)

With AGS set to:
-RTL mode
-"Normal" mode

Edit: And by "how AGS treats," I mean:
-Do the characters appear in the right order/direction?
-Are lines of text correctly aligned?
-Are ligatures displayed correctly?
-Are string operations performed correctly? (.Append(), .Substring(), etc., as well as string substitution tags like %s)
-Does editing work correctly? (in IDE and TextBox), i.e. caret movement on entering/deleting a character

Mehrdad

@Crimson Wizard 
Unchecking "reverse" is a nice idea and works perfectly in the game. But the text doesn't show correctly in the Editor and I can't read it.

eri0o

A small memory dump: FreeType can draw text, but it needs HarfBuzz for ligatures, and the two have a circular dependency where each library depends on the other. So if you aren't building both from source together, you have to do a weird dance: build FreeType, build HarfBuzz pointing to FreeType, and then build FreeType again pointing to HarfBuzz. Ligatures also have special functions for emojis, such as changing the color or type of an emoji.

Now, neither supports the characters that invert text direction, so you need a bidi library. Care is also needed to pick a library with a suitable license.

Here's the issue for the feature of bidi in SDL_ttf : https://github.com/libsdl-org/SDL_ttf/issues/135

I remember there was also some drawing library (Cairo?) that did have support and it was what chromium used.

Crimson Wizard

From the engine's perspective it would of course be desirable to have a minimal number of settings and "types" of data, and to have most conversions done by compilers; for example, the compiler could un-reverse the "reversed" RTL text before packaging the game.

But in our case this would be either impossible or difficult to achieve, as the game script is allowed to have strings in multiple languages, and the scripting language allows switching Text Direction at runtime (see SetGameOption(OPT_RIGHTTOLEFT)). This means script strings may contain text of all kinds.

Resolving this at game compilation time would likely require a new preprocessor feature, using some kind of annotations added to the string literals. And that would definitely bring more issues with texts loaded from custom files.

Considering the above, the easiest option at the moment is to add a new mode to OPT_RIGHTTOLEFT, and fix all places where OPT_RIGHTTOLEFT has a meaning, accounting for this new mode.

I.e, OPT_RIGHTTOLEFT will be:
0 - normal LTR;
1 - normal RTL (that is - characters stored in the order of natural reading);
2 - pre-reversed RTL (that is - characters stored against the order of natural reading, and in order of how they are displayed left-to-right).
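The three values could be handled along the lines of the Python sketch below. It is a sketch under stated assumptions, not the engine's implementation: the enum and function names are hypothetical, width is a character count instead of a pixel measurement, and mode 2 is done by un-reversing, wrapping, and reversing each line again.

```python
from enum import IntEnum


class TextDirection(IntEnum):
    """Hypothetical names for the three proposed OPT_RIGHTTOLEFT values."""
    LTR = 0
    RTL = 1            # characters stored in reading order
    RTL_REVERSED = 2   # characters stored in left-to-right display order


def wrap_forward(text, width):
    """Greedy word wrap scanning the string front to back."""
    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if current and len(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def layout(text, width, mode):
    """Return (lines as drawn left to right, horizontal alignment)."""
    if mode == TextDirection.LTR:
        return wrap_forward(text, width), "left"
    if mode == TextDirection.RTL:
        # Wrap in reading order, then mirror every line for drawing.
        return [line[::-1] for line in wrap_forward(text, width)], "right"
    # RTL_REVERSED: recover reading order, wrap, mirror each line back.
    logical = text[::-1]
    return [line[::-1] for line in wrap_forward(logical, width)], "right"


print(layout("one two three four", 9, TextDirection.RTL))
print(layout("ruof eerht owt eno", 9, TextDirection.RTL_REVERSED))
# Both print (['owt eno', 'eerht', 'ruof'], 'right')
```

With this arrangement modes 1 and 2 give the same on-screen result for the same sentence; they differ only in how the characters are stored.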

How the engine handles the new mode 2 is purely internal to the engine. If new font libraries are added later, which may somehow affect the situation, this behavior may be adjusted accordingly.

Crimson Wizard

Also, I found something weird again; apparently there are GUI controls that never reverse the text in RTL mode: Buttons and ListBoxes.

Basically, only text that supports wrapping does this: Labels, speech, display boxes.

I'm not sure if dialog options or drawing text on drawing surface does this... DrawTextWrapped probably does, since it wraps.

This has been the case at least since AGS 3.2.1.

Crimson Wizard

The preliminary (unoptimized) version of "reverse RTL" solution code may be found here:
https://github.com/ivan-mogilko/ags-refactoring/tree/361--reversertl

It's unoptimized, because it does double text reverse when splitting (once before the split, and then second time after). I would have to somehow rewrite the splitting algorithm and make it capable of reading forwards and backwards...
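The trade-off can be sketched in Python. This is not the code from the linked branch: the helper names are hypothetical and width is counted in characters. `wrap_double_reverse` mirrors the described approach (reverse, split, reverse each line again), while `wrap_backward` is a single pass that reads the string from the end. For simple space-separated text the two agree.

```python
def wrap_forward(text, width):
    """Greedy word wrap scanning the string front to back."""
    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if current and len(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def wrap_double_reverse(text, width):
    """Reverse once before splitting, then reverse every line afterwards."""
    return [line[::-1] for line in wrap_forward(text[::-1], width)]


def wrap_backward(text, width):
    """Single pass that scans back to front and never reverses any text."""
    lines, current = [], ""
    for word in reversed(text.split(" ")):
        candidate = word if not current else word + " " + current
        if current and len(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


stored = "ruof eerht owt eno"
same = all(
    wrap_double_reverse(stored, w) == wrap_backward(stored, w)
    for w in range(1, 25)
)
print(same)  # True
```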

eri0o

Just to try to understand, all the context of what you are talking here is ags3? Because on ags4 (we are unicode only there, right?) we could use a bidi library - it's not simple to integrate and requires some refactoring, but it's possible. (There's also the issue of vertical text with horizontal typesetting as a next hard thing, which requires a bit more thought)

Crimson Wizard

Quote from: eri0o on Fri 14/04/2023 01:23:14Just to try to understand, all the context of what you are talking here is ags3?

Yes, because the Unicode feature needs enhancing to let these particular languages be used properly, and because other solutions require major changes to the font rendering, for which we do not have any estimate yet.

Quote from: eri0o on Fri 14/04/2023 01:23:14Because on ags4 (we are unicode only there, right?) we could use a bidi library

The case I'm trying to resolve is the "incorrect" unicode RTL text, which is stored in memory in reverse way compared to what it should normally be (it does not have these direction control characters). Because of that it has to be drawn as if it were LTR, but wrapped differently.

The reason this kind of text data is used is that our font renderer does not have correct ligature support for Arabic, Persian, etc. So the user relies on a special converter that generates text represented by a different kind of data, meant to be displayed with very particular fonts (I mentioned this in a comment above).
If the font renderer drew these languages correctly, with proper ligatures, then this converter hack would likely not even be necessary. The proper solution would require at least a replacement of the font renderer with something modern (I guess that's "Harfbuzz" that you mentioned). But until then users of these languages have to resort to this hack, whether that is ags3 or ags4.

Currently I'm looking for the minimalistic solution to the problem, with as little changes as possible.

Snarky

Quote from: Crimson Wizard on Fri 14/04/2023 01:32:42The case I'm trying to resolve is the "incorrect" unicode RTL text, which is stored in memory in reverse way compared to what it should normally be
Quote from: Crimson Wizard on Fri 14/04/2023 01:32:42The proper solution would require at least a replacement of the font renderer with something modern (I guess that's "Harfbuzz" that you mentioned). But until then users of these languages have to resort to this hack, whether that is ags3 or ags4.

In that case, I would strongly urge you to consider how much effort to put into a stopgap solution built on a hack (and introducing even more hacks). Because if I understand correctly, this is not a regression or really even a bug: it's just that the very limited hacky support AGS has long had for RTL scripts imposes some inconveniences on devs that make text formatting difficult (e.g. not being able to rely on automatic wrapping, which simply means they'll have to do so explicitly). So is it really worth it, or would it be better to devote that energy to a proper solution down the road?

Is it possible that some of the inconveniences can even be solved in script without any engine changes, simply by reversing the strings, doing any manipulations on them (including, potentially, finding the linebreak points, reordering the lines and inserting explicit linebreaks), and re-reversing them? (Or not re-reversing them, in the case of some GUI Controls apparently.)

Crimson Wizard

Quote from: Snarky on Fri 14/04/2023 09:29:05So is it really worth it, or would it be better to devote that energy to a proper solution down the road?

Indeed, it would be better to aim for a proper solution. The thing is, I don't know when this will be done, in which version etc, and at least a few users have already asked for this to work (and some have been asking for years).

Before this whole conversation I was not fully aware of how RTL is stored internally normally, so I needed to test this out anyway.

In regards to the quick solution that I've been testing, so far it is this:
* expand the existing "Right to left mode", adding a new choice (Right-to-left reversed);
* if this option is on, then when doing text splitting scan the text from right to left;
* also set "alignment to right" as with normal RTL;
* everything else stays the same.

I spent 3-4 hours yesterday adding this option and testing things, but got stuck at the line splitting, because the algorithm is hardcoded for left-to-right scanning. So I wanted to see whether I could write a "generic" one instead.

Meanwhile, I also found and fixed 2 actual bugs in 3.6.0, so that was not a fully wasted time...


Quote from: Snarky on Fri 14/04/2023 09:29:05Is it possible that some of the inconveniences can even be solved in script without any engine changes, simply by reversing the strings, doing any manipulations on them (including, potentially, finding the linebreak points, reordering the lines and inserting explicit linebreaks), and re-reversing them? (Or not re-reversing them, in the case of some GUI Controls apparently.)

Afaik this is what Mehrdad is currently doing: he puts linebreaks himself, everywhere where necessary.

Another workaround, which I mentioned above, is to keep using this converter program, but turn off the option to "reverse" text during conversion. In this case it works in the game (with the normal RTL setting), but it makes all the texts look reversed in the script and translation file, obviously making it inconvenient for the user (I am unaware whether a human can normally read Arabic and Persian in reverse the same way as e.g. English can be read in reverse).
