[Lazarus] String vs WideString

Marcos Douglas B. Santos via Lazarus

2017-08-12 19:46:09 UTC

Hi,

I have an "old" system that was coded in FPC 2.6.5.
Today I had to change something in the code and now I need to update
to FPC 3.0 and Lazarus 1.9.

This system uses a COM object. I made a class to wrap the configuration.

So, all string arguments in this class are WideString based.
The SetLicense method will receive a WideString but the source is a "string".

Look:

Lib.SetLicense(
  IniFile.ReadString('TheLib', 'license', '')
);

As you know, IniFile.ReadString returns a "string" and some internal
conversion is happening and the licence is not valid anymore.

If I put the licence as a string directly, it works:

Lib.SetLicense(
  'my_licence_here'
);

How can I change my code to work properly using Ini files, strings and
WideString?

Best regards,
Marcos Douglas
--

Mattias Gaertner via Lazarus

2017-08-12 20:32:47 UTC

On Sat, 12 Aug 2017 16:46:09 -0300

Post by Marcos Douglas B. Santos via Lazarus
[...]
Lib.SetLicense(
IniFile.ReadString('TheLib', 'license', '')
);

What encoding has the ini file?

Mattias
--

Marcos Douglas B. Santos via Lazarus

2017-08-12 20:43:29 UTC

On Sat, Aug 12, 2017 at 5:32 PM, Mattias Gaertner via Lazarus

Post by Mattias Gaertner via Lazarus
On Sat, 12 Aug 2017 16:46:09 -0300

Post by Marcos Douglas B. Santos via Lazarus
[...]
Lib.SetLicense(
IniFile.ReadString('TheLib', 'license', '')
);

What encoding has the ini file?

ANSI. A simple text file on Windows with only ANSI chars.

But I'm so sorry Mattias, it was my fault.
The program was reading the wrong file version (problem in paths...).

It works now, but I have one question:
What is the right way to code this so that I do not see this warning?

Warning: Implicit string type conversion from "AnsiString" to "WideString"

Regards,
Marcos Douglas
--

Mattias Gaertner via Lazarus

2017-08-12 20:49:56 UTC

On Sat, 12 Aug 2017 17:43:29 -0300

Post by Marcos Douglas B. Santos via Lazarus
[...]

Post by Mattias Gaertner via Lazarus
What encoding has the ini file?

ANSI. A simple text file on Windows with only ANSI chars.

Which one? Do you mean Windows CP-1252?

Post by Marcos Douglas B. Santos via Lazarus
[...]
Warning: Implicit string type conversion from "AnsiString" to "WideString"

Explicit type cast:

Lib.SetLicense(
  WideString(IniFile.ReadString('TheLib', 'license', ''))
);

Mattias
--

Marcos Douglas B. Santos via Lazarus

2017-08-12 20:56:58 UTC

On Sat, Aug 12, 2017 at 5:49 PM, Mattias Gaertner via Lazarus

Post by Mattias Gaertner via Lazarus
On Sat, 12 Aug 2017 17:43:29 -0300

Post by Marcos Douglas B. Santos via Lazarus
[...]

Post by Mattias Gaertner via Lazarus
What encoding has the ini file?

ANSI. A simple text file on Windows with only ANSI chars.

Which one? Do you mean Windows CP-1252?

Yes...
But would it make any difference?

Post by Mattias Gaertner via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
[...]
Warning: Implicit string type conversion from "AnsiString" to "WideString"

Lib.SetLicense(
WideString(IniFile.ReadString('TheLib', 'license', ''))
);

Wow... everywhere? :(

Regards,
Marcos Douglas
--
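
If casting at every call site feels heavy, one option is to keep the explicit conversion in a single helper. A minimal sketch, assuming a hypothetical helper name (ReadWideString) and a hypothetical file name (thelib.ini); neither comes from the thread, and the conversion itself still depends on how the program's default code page is set up:

```pascal
program inihelper;

{$mode objfpc}{$H+}

uses
  IniFiles;

// Hypothetical helper: the only place where String becomes WideString.
function ReadWideString(Ini: TCustomIniFile;
  const Section, Key: String): WideString;
begin
  // The explicit cast avoids the implicit-conversion warning.
  Result := WideString(Ini.ReadString(Section, Key, ''));
end;

var
  Ini: TIniFile;
  License: WideString;
begin
  Ini := TIniFile.Create('thelib.ini'); // assumed file name
  try
    License := ReadWideString(Ini, 'TheLib', 'license');
    Writeln('License length: ', Length(License));
  finally
    Ini.Free;
  end;
end.
```

With this in place, a call such as Lib.SetLicense(ReadWideString(Ini, 'TheLib', 'license')) needs no cast of its own.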

Bo Berglund via Lazarus

2017-08-12 22:21:55 UTC

On Sat, 12 Aug 2017 17:56:58 -0300, "Marcos Douglas B. Santos via

Post by Marcos Douglas B. Santos via Lazarus

Post by Mattias Gaertner via Lazarus
Which one? Do you mean Windows CP-1252?

Yes...
But would it make any difference?

I recently had a problem with an application that was converted from
old string type to AnsiString and seemingly worked in the new Unicode
environment.
However, we received reports that it had failed in some Asian
countries (Korea, China, Thailand) and upon checking it turned out
that the data inside a string used as buffer was changed because of
locale differences....

After switching out the affected variable declarations from AnsiString
to RawByteString the application seemingly started to work again also
on these locations.

So AnsiString is not safe either....

And after this I have spent some time to totally rework the use of
strings as buffers to instead use TBytes. Lots of work but guaranteed
to not sneak in unexpected conversions.

--
Bo Berglund
Developer in Sweden

--

Marcos Douglas B. Santos via Lazarus

2017-08-13 02:42:43 UTC

On Sat, Aug 12, 2017 at 7:21 PM, Bo Berglund via Lazarus

Post by Bo Berglund via Lazarus
On Sat, 12 Aug 2017 17:56:58 -0300, "Marcos Douglas B. Santos via

Post by Marcos Douglas B. Santos via Lazarus

Post by Mattias Gaertner via Lazarus
Which one? Do you mean Windows CP-1252?

Yes...
But would it make any difference?

I recently had a problem with an application that was converted from
old string type to AnsiString and seemingly worked in the new Unicode
environment.
However, we received reports that it had failed in some Asian
countries (Korea, China, Thailand) and upon checking it turned out
that the data inside a string used as buffer was changed because of
locale differences....
After switching out the affected variable declarations from AnsiString
to RawByteString the application seemingly started to work again also
on these locations.
So AnsiString is not safe either....
And after this I have spent some time to totally rework the use of
strings as buffers to instead use TBytes. Lots of work but guaranteed
to not sneak in unexpected conversions.

Isn't it simpler to use RawByteString instead of TBytes?

Regards,
Marcos Douglas
--

Bo Berglund via Lazarus

2017-08-13 09:19:35 UTC

On Sat, 12 Aug 2017 23:42:43 -0300, "Marcos Douglas B. Santos via

Post by Marcos Douglas B. Santos via Lazarus

Post by Bo Berglund via Lazarus
After switching out the affected variable declarations from AnsiString
to RawByteString the application seemingly started to work again also
on these locations.
So AnsiString is not safe either....
And after this I have spent some time to totally rework the use of
strings as buffers to instead use TBytes. Lots of work but guaranteed
to not sneak in unexpected conversions.

Isn't it simpler to use RawByteString instead of TBytes?

Well, initially just changing the declarations would seem to be
simpler. But given how the conversion problem sneaked up behind my
back, I thought it wiser to move all serial comm buffers from various
string types (string->AnsiString->RawByteString) to TBytes since that
is really guaranteed to be "the real thing".

Whenever there is a need for displaying the data or putting them into
a string type variable I have added a few utility functions to do the
conversions using the Move() procedure. Likewise I made a PosBin() for
searching for patterns like Pos() for strings etc.

--
Bo Berglund
Developer in Sweden

--
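
Utility functions of the kind Bo describes might look roughly like the sketch below. The thread does not show his code, so the function bodies, the name BytesToRaw, the zero-based return value of PosBin and the sample bytes are all assumptions made for illustration:

```pascal
program bytesdemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Copy raw bytes into a RawByteString without any code page conversion.
function BytesToRaw(const B: TBytes): RawByteString;
begin
  Result := '';
  SetLength(Result, Length(B));
  if Length(B) > 0 then
    Move(B[0], Result[1], Length(B));
end;

// Return the zero-based index of Pattern in Data, or -1 if it is absent.
function PosBin(const Pattern, Data: TBytes): SizeInt;
var
  i: SizeInt;
begin
  Result := -1;
  if Length(Pattern) = 0 then
    Exit;
  for i := 0 to Length(Data) - Length(Pattern) do
    if CompareMem(@Data[i], @Pattern[0], Length(Pattern)) then
      Exit(i);
end;

var
  Data, Pat: TBytes;
begin
  // Made-up serial frame: STX, 'A', $FF, ETX.
  SetLength(Data, 4);
  Data[0] := $02; Data[1] := $41; Data[2] := $FF; Data[3] := $03;
  SetLength(Pat, 2);
  Pat[0] := $FF; Pat[1] := $03;
  Writeln('Pattern found at index ', PosBin(Pat, Data));
  Writeln('Raw length: ', Length(BytesToRaw(Data)));
end.
```

Because Move() copies memory byte for byte, values such as $FF survive unchanged regardless of the locale.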

Juha Manninen via Lazarus

2017-08-13 10:51:19 UTC

On Sun, Aug 13, 2017 at 1:21 AM, Bo Berglund via Lazarus

Post by Bo Berglund via Lazarus
So AnsiString is not safe either....

That is a little misleading.
Actually using the Windows system codepage is not safe any more.
The current Unicode system in Lazarus maps AnsiString to use UTF-8.
Text with Windows codepage must be converted explicitly.
This is a breaking change compared to the old Unicode support in
Lazarus 1.4.x + FPC 2.6.x.
The right solution is to use Unicode everywhere. Windows codepages can
be seen as a historical remain, retained for backwards compatibility.
Now is year 2017, Unicode has been used for decades. Everybody should
use it by now.

Marcos Douglas, please change the encoding in your text file to UTF-8.
Every decent text editor, including the editor in Lazarus, has a
feature to do it.
Once the data is Unicode, it is all smooth sailing.
Data is converted between UTF-8 and UTF-16 losslessly.

One more thing:
Data for WideString/UnicodeString parameters in WinAPI functions are
converted automatically. You can ignore the warning or suppress it by
a type cast as Mattias showed.
However for PWideChar parameters you should create an explicit
temporary variable, usually UnicodeString but WideString for OLE.
Assigning to it from your "String" data converts encoding.
Then cast the new variable as the required pointer type.

Juha
--
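
The temporary-variable approach for PWideChar parameters could be sketched as follows. TakesPWideChar is a hypothetical stand-in for a real API function and the variable names are invented for the example; for OLE the temporary would be a WideString instead:

```pascal
program pwidedemo;

{$mode objfpc}{$H+}

// Hypothetical stand-in for an API that expects a PWideChar argument.
procedure TakesPWideChar(P: PWideChar);
var
  Copy16: UnicodeString;
begin
  Copy16 := P; // copies the characters up to the terminating #0
  Writeln('Received ', Length(Copy16), ' UTF-16 code units');
end;

var
  S: String;
  Tmp: UnicodeString;
begin
  S := 'my_licence_here';
  // Assigning to the temporary converts the encoding to UTF-16.
  Tmp := UnicodeString(S);
  // Only now take the pointer; Tmp keeps the data alive during the call.
  TakesPWideChar(PWideChar(Tmp));
end.
```

The point of the named temporary is lifetime: the pointer stays valid for as long as Tmp is in scope.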

Marcos Douglas B. Santos via Lazarus

2017-08-14 12:50:23 UTC

On Sun, Aug 13, 2017 at 7:51 AM, Juha Manninen via Lazarus

Post by Juha Manninen via Lazarus
On Sun, Aug 13, 2017 at 1:21 AM, Bo Berglund via Lazarus

Post by Bo Berglund via Lazarus
So AnsiString is not safe either....

That is a little misleading.
Actually using the Windows system codepage is not safe any more.
The current Unicode system in Lazarus maps AnsiString to use UTF-8.
Text with Windows codepage must be converted explicitly.
This is a breaking change compared to the old Unicode support in
Lazarus 1.4.x + FPC 2.6.x.
The right solution is to use Unicode everywhere. Windows codepages can
be seen as a historical remain, retained for backwards compatibility.
Now is year 2017, Unicode has been used for decades. Everybody should
use it by now.

"The right solution is to use Unicode everywhere."
I agree. But it would be best if the compiler used Unicode everywhere and
we developers used just one type called "string"... even if this
breaks old code. Maybe, instead of using "string", new code should
just use UnicodeString...

Well, I know that many people here have already had this "fight" about
Unicode, so let's forget about what the compiler "should" or should not
do.

Post by Juha Manninen via Lazarus
Marcos Douglas, please change the encoding in your text file to UTF-8.
Every decent text editor, including the editor in Lazarus, has a
feature to do it.
Once the data is Unicode, it is all smooth sailing.
Data is converted between UTF-8 and UTF-16 losslessly.

You're right.

Post by Juha Manninen via Lazarus
Data for WideString/UnicodeString parameters in WinAPI functions are
converted automatically. You can ignore the warning or suppress it by
a type cast as Mattias showed.
However for PWideChar parameters you should create an explicit
temporary variable, usually UnicodeString but WideString for OLE.
Assigning to it from your "String" data converts encoding.
Then cast the new variable as the required pointer type.

This is an ugly trick... but I understand what you mean.

Best regards,
Marcos Douglas
--

Michael Schnell via Lazarus

2017-08-14 13:19:05 UTC

Post by Marcos Douglas B. Santos via Lazarus
"The right solution is to use Unicode everywhere."

Embarcadero thought that this would not be the "right" solution. Otherwise
they would not have invented the encoding aware strings.

IMHO that was a good idea. They only completely failed to do a decent
specification and implementation.

-Michael
--

Graeme Geldenhuys via Lazarus

2017-08-14 13:55:47 UTC

Post by Juha Manninen via Lazarus
Now is year 2017, Unicode has been used for decades. Everybody should
use it by now.

Indeed, I can't agree more. Plus, I normally use UTF-8 for any text
files I create.

Regards,
Graeme

--

Juha Manninen via Lazarus

2017-08-13 11:18:23 UTC

On Sun, Aug 13, 2017 at 1:21 AM, Bo Berglund via Lazarus

Post by Bo Berglund via Lazarus
I recently had a problem with an application that was converted from
old string type to AnsiString and seemingly worked in the new Unicode
environment.

What was the old string type?

Post by Bo Berglund via Lazarus
However, we received reports that it had failed in some Asian
countries (Korea, China, Thailand) and upon checking it turned out
that the data inside a string used as buffer was changed because of
locale differences....

Unicode was designed to solve exactly the problems caused by locale differences.
Why don't you use it?

Post by Bo Berglund via Lazarus
After switching out the affected variable declarations from AnsiString
to RawByteString the application seemingly started to work again also
on these locations.
...
And after this I have spent some time to totally rework the use of
strings as buffers to instead use TBytes. Lots of work but
guaranteed to not sneak in unexpected conversions.

RawByteString is for text whose encoding is not meant to be converted.
It has its special use cases.
TBytes is usually for binary data.
Did I understand right: you use TBytes to hold strings having Windows
codepage encoding?
That sounds like a very dummy thing to do!
Again: Why not Unicode? Then you could throw away your hacks.

Juha
--

Bo Berglund via Lazarus

2017-08-13 16:41:09 UTC

On Sun, 13 Aug 2017 14:18:23 +0300, Juha Manninen via Lazarus

Post by Juha Manninen via Lazarus
On Sun, Aug 13, 2017 at 1:21 AM, Bo Berglund via Lazarus

Post by Bo Berglund via Lazarus
I recently had a problem with an application that was converted from
old string type to AnsiString and seemingly worked in the new Unicode
environment.

What was the old string type?

Note: The programs were started back in around 2000 using Delphi 7...

We used "string" as the container for processing serial data to/from
CNC machine tool controllers amongst others. This was triggered really
by the serial components, which mostly transferred char(acters) and
had methods for sending and receiving strings, even though we usually
used char.

Post by Juha Manninen via Lazarus

Post by Bo Berglund via Lazarus
However, we received reports that it had failed in some Asian
countries (Korea, China, Thailand) and upon checking it turned out
that the data inside a string used as buffer was changed because of
locale differences....

Unicode was designed to solve exactly the problems caused by locale differences.
Why don't you use it?

Again, these are old existing programs and we are not doing this
anymore for new programs. However, there is one problem still because
there is an interface point to the hardware, in the form of serial
components, which still handle chars...
And chars are nowadays Unicode chars, i.e. not mapping to bytes sent
by RS232...
And our data are NOT text, they are binary streams of bytes.

Post by Juha Manninen via Lazarus

Post by Bo Berglund via Lazarus
After switching out the affected variable declarations from AnsiString
to RawByteString the application seemingly started to work again also
on these locations.
...
And after this I have spent some time to totally rework the use of
strings as buffers to instead use TBytes. Lots of work but
guaranteed to not sneak in unexpected conversions.

RawByteString is for text whose encoding is not meant to be converted.
It has its special use cases.

My first attempt at "fixing" the problem in Asian locales was to use
RawByteString so as to inhibit conversions.
Still with these as comm buffers...
It seemed to work out, but to be safer I have reworked one application
to replace with TBytes everywhere comm data are handled.

Post by Juha Manninen via Lazarus
TBytes is usually for binary data.

Exactly, and this is why I made the comment that to be on the safe
side dealing with RS232 the buffers should be TBytes (or some other
similar construct).

Post by Juha Manninen via Lazarus
Did I understand right: you use TBytes to hold strings having Windows
codepage encoding?

No, definitively not. At the time we were not aware of any encoding at
all. To us a string was just a handy container for the serial data
like a dynamic array of byte with some useful functions available for
searching and things like that. I think we were not alone...

Post by Juha Manninen via Lazarus
Again: Why not Unicode? Then you could throw away your hacks.

The application itself is Unicode now but we had to run circles around
the RS232 comm part. When converting to Unicode we first set the comm
related strings to be AnsiString...

PS: We never programmed the serial interface directly, we always used
commercial RS232 components and they all dealt with char and string...
DS

--
Bo Berglund
Developer in Sweden

--

Juha Manninen via Lazarus

2017-08-13 20:41:34 UTC

On Sun, Aug 13, 2017 at 7:41 PM, Bo Berglund via Lazarus

Post by Bo Berglund via Lazarus
And our data are NOT text, they are binary streams of bytes.

I see. Then TBytes indeed is the best choice.
You have misused "String" or "AnsiString" from the beginning for binary data.
There have always been warnings against it.
The new Lazarus Unicode system did not create the problem but made it
more visible.

Marcos Douglas however has a different problem.
Your recommendation to use RawByteString or TBytes does not apply in
his case and thus was a bit misleading.

Juha
--

Michael Schnell via Lazarus

2017-08-14 08:25:14 UTC

Post by Juha Manninen via Lazarus
You have misused "String" or "AnsiString" from the beginning for binary data.
There have always been warnings against it.

While this might be true, it's decently silly, IMHO.

The name "String" can easily be interpreted as "String of things" and
does not necessarily mean "String of printable stuff".

The management Pascal always provided for strings (after the "Short
String" was not the only string type) (i.e. Operators, built-in
functions, lazy copy, reference counting) is perfectly applicable to
"Strings of things", and don't force any known encoding at all.

The drama only was introduced by Embarcadero's abysmal / sloppy
implementation of automatic code conversion for strings.

-Michael
--

Tony Whyman via Lazarus

2017-08-14 09:53:44 UTC

Post by Juha Manninen via Lazarus
Unicode was designed to solve exactly the problems caused by locale differences.
Why don't you use it?
Actually using the Windows system codepage is not safe any more.
The current Unicode system in Lazarus maps AnsiString to use UTF-8.
Text with Windows codepage must be converted explicitly.
This is a breaking change compared to the old Unicode support in
Lazarus 1.4.x + FPC 2.6.x.

If you are processing strings as "text" then you probably do not care
how it is encoded and can live with "breaking changes". However, if, for
some reason you are or need to be aware of how the text is encoded - or
are using string types as a useful container for binary data then, types
that sneak up on you with implicit type conversions or which have
semantics that change between compilers or versions, are just another
source of bugs.

PChar used to be a safe means to access binary data - but not anymore,
especially if you move between FPC and Delphi. (One of my gripes is that
the FCL still makes too much use of PChar instead of PByte with the
resulting Delphi incompatibility). The "string" type also used to be a
safe container for any sort of binary data, but when its definition can
change between compilers and versions, it is now something to be avoided.

As a general rule, I now always use PByte for any sort of string that is
binary, untyped or encoding to be determined. It works across compilers
(FPC and Delphi) with consistent semantics and is safe for such use.

I also really like AnsiString from FPC 3.0 onwards. By making the
encoding a dynamic attribute of the type, it means that I know what is
in the container and can keep control.

I am sorry, but I would only even consider using Unicodestrings as a
type (or the default string type) when I am just processing text for
which the encoding is a don't care, such as a window caption, or for
intensive text analysis. If I am reading/writing text from a file or
database where the encoding is often implicit and may vary from the
Unicode standard then my preference is for AnsiString. I can then read
the text (e.g. from the file) into a (RawByteString) buffer, set the
encoding and then process it safely while often avoiding the overhead
from any transliteration. PByte comes into its own when the file
contains a mixture of binary data and text.

Text files and databases tend to use UTF-8 or are encoded using legacy
Windows Code pages. The Chinese also have GB18030. With a database, the
encoding is usually known and AnsiString is a good way to read/write
data and to convey the encoding, especially as databases usually use a
variable length multi-byte encoding natively and not UTF-16/Unicode.
With files, the text encoding is usually implicit and AnsiString is
ideal for this as it lets you read in the text and then assign the
(implicit) encoding to the string, or ensure the correct encoding when
writing.

And anyway, I do most of my work in Linux, so why would I even want to
bother myself with arrays of widechars when the default environment is UTF8?

We do need some stability and consistency in strings which, as someone
else noted have been confused by Embarcadero. I would like to see that
focused on AnsiString with UnicodeString being only for specialist use
on Windows or when intensive text analysis makes a two byte encoding
more efficient than a variable length multi-byte encoding.

Tony Whyman
MWA

--
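
Tony's "read into a RawByteString buffer, then set the encoding" workflow could look something like this. It is only a sketch: the single byte stands in for data read from a file, CP-1252 is an assumed source encoding, and the cwstring unit is included on the assumption that a widestring manager is needed for the conversion step on Unix-like systems:

```pascal
program codepagedemo;

{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // assumed to be needed for code page conversion on Unix
{$endif}

var
  Buf: RawByteString;
begin
  // Pretend this byte came from a CP-1252 text file: $E9 is "e acute" there.
  Buf := '';
  SetLength(Buf, 1);
  Buf[1] := AnsiChar($E9);

  // Label the bytes as CP-1252 without touching them (Convert = False).
  SetCodePage(Buf, 1252, False);
  Writeln('Code page: ', StringCodePage(Buf), ', bytes: ', Length(Buf));

  // Convert the content to UTF-8 explicitly (Convert = True).
  SetCodePage(Buf, CP_UTF8, True);
  Writeln('Code page: ', StringCodePage(Buf), ', bytes: ', Length(Buf));
end.
```

The first call only changes the label, so no transliteration cost is paid until a conversion is actually requested.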

Marcos Douglas B. Santos via Lazarus

2017-08-14 13:11:27 UTC

On Mon, Aug 14, 2017 at 6:53 AM, Tony Whyman via Lazarus

Post by Juha Manninen via Lazarus
Unicode was designed to solve exactly the problems caused by locale differences.
Why don't you use it?
Actually using the Windows system codepage is not safe any more.
The current Unicode system in Lazarus maps AnsiString to use UTF-8.
Text with Windows codepage must be converted explicitly.
This is a breaking change compared to the old Unicode support in
Lazarus 1.4.x + FPC 2.6.x.

If you are processing strings as "text" then you probably do not care how it
is encoded and can live with "breaking changes". However, if, for some
reason you are or need to be aware of how the text is encoded - or are using
string types as a useful container for binary data then, types that sneak up
on you with implicit type conversions or which have semantics that change
between compilers or versions, are just another source of bugs.
PChar used to be a safe means to access binary data - but not anymore,
especially if you move between FPC and Delphi. (One of my gripes is that the
FCL still makes too much use of PChar instead of PByte with the resulting
Delphi incompatibility). The "string" type also used to be a safe container
for any sort of binary data, but when its definition can change between
compilers and versions, it is now something to be avoided.
As a general rule, I now always use PByte for any sort of string that is
binary, untyped or encoding to be determined. It works across compilers (FPC
and Delphi) with consistent semantics and is safe for such use.
I also really like AnsiString from FPC 3.0 onwards. By making the encoding a
dynamic attribute of the type, it means that I know what is in the container
and can keep control.
I am sorry, but I would only even consider using Unicodestrings as a type
(or the default string type) when I am just processing text for which the
encoding is a don't care, such as a window caption, or for intensive text
analysis. If I am reading/writing text from a file or database where the
encoding is often implicit and may vary from the Unicode standard then my
preference is for AnsiString. I can then read the text (e.g. from the file)
into a (RawByteString) buffer, set the encoding and then process it safely
while often avoiding the overhead from any transliteration. PByte comes into
its own when the file contains a mixture of binary data and text.
Text files and databases tend to use UTF-8 or are encoded using legacy
Windows Code pages. The Chinese also have GB18030. With a database, the
encoding is usually known and AnsiString is a good way to read/write data
and to convey the encoding, especially as databases usually use a variable
length multi-byte encoding natively and not UTF-16/Unicode. With files, the
text encoding is usually implicit and AnsiString is ideal for this as it
lets you read in the text and then assign the (implicit) encoding to the
string, or ensure the correct encoding when writing.

Unicode everywhere, and yet you are using AnsiString for everything...
Now I'm confused.

And anyway, I do most of my work in Linux, so why would I even want to
bother myself with arrays of widechars when the default environment is UTF8?

Maybe you do not have problems because you don't use Windows.

We do need some stability and consistency in strings which, as someone else
noted have been confused by Embarcadero. I would like to see that focused on
AnsiString with UnicodeString being only for specialist use on Windows or
when intensive text analysis makes a two byte encoding more efficient than a
variable length multi-byte encoding.

FPC and Lazarus claim they are cross-platform - this is a fact - and
because of that, IMHO, both should be used in only one way on every
system, don't you think?

Best regards,
Marcos Douglas
--

Tony Whyman via Lazarus

2017-08-14 13:21:57 UTC

Post by Marcos Douglas B. Santos via Lazarus
FPC and Lazarus claim they are cross-platform - this is a fact - and
because of that, IMHO, both should be used in only one way on every
system, don't you think?
Best regards,
Marcos Douglas

Precisely. But why this fixation on UTF-16/Unicode and not UTF8?

Lazarus is already a UTF8 environment.

Much of the LCL assumes UTF8.

UTF8 is arguably a much more efficient way to store and transfer data

UTF-16/Unicode can only store 65,536 characters while the Unicode
standard (that covers UTF8 as well) defines 136,755 characters.

UTF-16/Unicode's main advantage seems to be for rapid indexing of large
strings.

You may need UTF-16/Unicode support for accessing Microsoft APIs but
apart from that, why is it being promoted as the universal standard?
--

Mattias Gaertner via Lazarus

2017-08-14 13:46:54 UTC

On Mon, 14 Aug 2017 14:21:57 +0100

Post by Tony Whyman via Lazarus
[...]
Lazarus is already a UTF8 environment.
Much of the LCL assumes UTF8.

True.

Post by Tony Whyman via Lazarus
UTF8 is arguably a much more efficient way to store and transfer data

It depends.

Post by Tony Whyman via Lazarus
UTF-16/Unicode can only store 65,536 characters while the Unicode
standard (that covers UTF8 as well) defines 136,755 characters.

No.
UTF-16 can encode the full 1 million Unicode range. It uses one or
two words per codepoint. UTF-8 uses 1 to 4 bytes.
See here for more details:
https://en.wikipedia.org/wiki/UTF-16

Although you are right, that there are still many applications, that
falsely claim to support UTF-16, but only support the first $D800
codepoints.

Post by Tony Whyman via Lazarus
UTF-16/Unicode's main advantage seems to be for rapid indexing of large
strings.

That's only true for UCS-2, which is obsolete.

Post by Tony Whyman via Lazarus
You may need UTF-16/Unicode support for accessing Microsoft APIs but
apart from that, why is it being promoted as the universal standard?

Who does that?

Mattias
--
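
The "one or two words per codepoint" rule can be illustrated with the standard surrogate-pair arithmetic. A small sketch; the procedure name ToSurrogates and the sample code point U+1F600 are chosen for the example and do not come from the thread:

```pascal
program surrogatedemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Split a code point above U+FFFF into its UTF-16 surrogate pair.
procedure ToSurrogates(CodePoint: Cardinal; out HighSur, LowSur: Word);
begin
  Dec(CodePoint, $10000);                      // leaves a 20-bit value
  HighSur := Word($D800 or (CodePoint shr 10)); // upper 10 bits
  LowSur := Word($DC00 or (CodePoint and $3FF)); // lower 10 bits
end;

var
  H, L: Word;
begin
  ToSurrogates($1F600, H, L);
  Writeln(IntToHex(H, 4), ' ', IntToHex(L, 4)); // prints D83D DE00
end.
```

Code that treats every 16-bit unit as a whole character breaks on exactly these pairs, which is the UCS-2 limitation mentioned above.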

Tony Whyman via Lazarus

2017-08-14 14:11:15 UTC

Post by Mattias Gaertner via Lazarus

Post by Tony Whyman via Lazarus
You may need UTF-16/Unicode support for accessing Microsoft APIs but
apart from that, why is it being promoted as the universal standard?

Who does that?
Mattias

Because the obvious implication when someone argues against AnsiString
(from which UTF8String derives) and talks about Unicode is that they are
promoting UTF-16 and the UnicodeString type. Perhaps this is because I
am old enough to remember when MS first added wide characters to Windows
and that they called it "Unicode". To me, when people say "Unicode" they
mean Windows wide characters.

Perhaps the problem is the use of the word "Unicode". By trying to
embrace UTF8, UTF16 and UTF32 with the older UCS-2 it is perhaps too
ambiguous a term - especially as the Delphi/FPC UnicodeString type
exists and probably (but I'm not certain) means UTF-16.

What I see in FPC/Lazarus today is:

- UTF8 supported through AnsiString.

- A confusion of Widestring/UnicodeString for UTF-16 and legacy UCS-2.

- Nothing for UTF-32.

If nothing else, FPC Lazarus could do with a clean-up of both
terminology and string types. Indeed, why isn't there a single container
string type for all character sets where the encoding whether a legacy
code page, UTF8, UTF16 or UTF32 is simply a dynamic attribute of the
type - a sort of extended AnsiString?

Graeme Geldenhuys via Lazarus

2017-08-14 14:20:52 UTC

Post by Tony Whyman via Lazarus
ambiguous a term - especially as the Delphi/FPC UnicodeString type
exists and probably (but I'm not certain) means UTF-16.

Yes, that is f**ken annoying. FPC should have named it what it really is
- UTF16String! But instead they followed Delphi like a lemming and named
it UnicodeString.

In reality, UNICODE means text with an encoding of any of UTF-8,
UTF-16LE, UTF-16BE, or UTF-32.

In terms of Delphi and FPC, they decided Unicode = UTF-16. I'm not even
sure if they mean LE or BE.

Regards,
Graeme

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key: http://tinyurl.com/graeme-pgp
--

Sven Barth via Lazarus

2017-08-14 16:49:58 UTC

On 14.08.2017 16:21, "Graeme Geldenhuys via Lazarus" wrote <

Post by Graeme Geldenhuys via Lazarus

Post by Tony Whyman via Lazarus
ambiguous a term - especially as the Delphi/FPC UnicodeString type
exists and probably (but I'm not certain) means UTF-16.

Yes, that is f**ken annoying. FPC should have named it what it really is
- UTF16String! But instead they followed Delphi like a lemming and named it
UnicodeString.

Because the crowd demanding Delphi compatibility is larger than the crowd
demanding exact terminology.

Post by Graeme Geldenhuys via Lazarus
In reality, UNICODE means text with an encoding of any of UTF-8,
UTF-16LE, UTF-16BE, or UTF-32.
In terms of Delphi and FPC, they decided Unicode = UTF-16. I'm not even
sure if they mean LE or BE.

If I remember correctly it depends on the endianness of the platform...
Though I could be wrong.

Regards,
Sven

Michael Schnell via Lazarus

2017-08-15 07:59:38 UTC

Post by Sven Barth via Lazarus
Because the crowd demanding Delphi compatibility is larger than the
crowd demanding exact terminology.

... or even a revised concept avoiding the junk presented by Embarcadero :(

But obviously the fpc team has no choice.

-Michael
--

Sven Barth via Lazarus

2017-08-14 16:47:58 UTC

On 14.08.2017 16:11, "Tony Whyman via Lazarus" wrote <

If nothing else, FPC Lazarus could do with a clean-up of both terminology
and string types. Indeed, why isn't there a single container string type
for all character sets where the encoding whether a legacy code page, UTF8,
UTF16 or UTF32 is simply a dynamic attribute of the type - a sort of
extended AnsiString?

The main problem of such a dynamic type would be the inability to do fast
indexing as the compiler would need to insert runtime checks for the size
of a character. I had already thought the same, but then had to discard the
idea due to this.

Regards,
Sven

Michael Schnell via Lazarus

2017-08-15 08:03:14 UTC

Post by Sven Barth via Lazarus
The main problem of such a dynamic type would be the inability to do
fast indexing as the compiler would need to insert runtime checks for
the size of a character.

What "indexing" do you think of ?
Could you give an example where such a difference is supposed to get
important ?

(As you know I wrote a paper where I claimed the contrary. I'd like to
revise same if necessary.)

-Michael
--

Tony Whyman via Lazarus

2017-08-15 08:34:57 UTC

Post by Sven Barth via Lazarus
The main problem of such a dynamic type would be the inability to do
fast indexing as the compiler would need to insert runtime checks for
the size of a character. I had already thought the same, but then had
to discard the idea due to this.

Is this really a big problem? It is not as if it would be necessary to
do a table lookup every time you index a string as the indexing method
could be an attribute of the string and updated with the character
encoding attribute. Is it really that complicated for the compiler to
generate code that jumps to an indexing method depending upon a data
attribute?

Is your problem really more about the result type as, depending on the
character width, the result could be an AnsiChar or WideChar or a UTF8
character for which I don't believe there is a defined char type (other
than an arguable mis-use of UCS4Char)?

I can accept that a clear up of this area would also have to extend to
the char types as well - but I would also argue that that is well
overdue. On a quick count, I found 7 different char types in the system
unit.
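
To make the two questions above concrete, here is a minimal sketch of indexing through a run-time element size. TDynStr, ElemSize and CodeUnitAt are hypothetical names made up for illustration, not existing RTL types. The result is widened to 32 bits because the caller cannot know the element width at compile time, which is the result-type problem mentioned above.

```pascal
program dynindex;

{$mode objfpc}{$H+}

type
  // Hypothetical container: raw bytes plus a run-time element size.
  TDynStr = record
    ElemSize: Byte;          // 1, 2 or 4 bytes per code unit
    Data: array of Byte;
  end;

// Returns the code unit at 1-based index aIndex, widened to 32 bits.
function CodeUnitAt(const S: TDynStr; aIndex: SizeInt): LongWord;
var
  p: PByte;
begin
  p := @S.Data[(aIndex - 1) * S.ElemSize];
  // This run-time branch is what a fixed-width string type avoids.
  case S.ElemSize of
    1: Result := p^;
    2: Result := PWord(p)^;
    4: Result := PLongWord(p)^;
  else
    Result := 0;             // unsupported width in this sketch
  end;
end;

var
  s: TDynStr;
begin
  s.ElemSize := 2;
  SetLength(s.Data, 4);
  PWord(@s.Data[0])^ := $0041;   // 'A'
  PWord(@s.Data[2])^ := $20AC;   // Euro sign
  Writeln(HexStr(LongInt(CodeUnitAt(s, 2)), 4));
end.
```

The sketch only reads code units. It says nothing about variable-length encodings such as UTF-8, where one character may span several elements.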
--

Sven Barth via Lazarus

2017-08-16 17:29:17 UTC

Post by Tony Whyman via Lazarus

Post by Sven Barth via Lazarus
The main problem of such a dynamic type would be the inability to do
fast indexing as the compiler would need to insert runtime checks for
the size of a character. I had already thought the same, but then had
to discard the idea due to this.

Is this really a big problem? It is not as if it would be necessary to
do a table lookup everytime you index a string as the indexing method
could be an attribute of the string and updated with the character
encoding attribute. Is it really that complicated for the compiler to
generate code that jumps to an indexing method depending upon a data
attribute?

In a tight loop where one accesses the string character by character
(take Pos() for example) this will lead to a significant slowdown as the
compiler (without optimizations) will have to insert a call to the
lookup function for each access. While I generally don't consider
performance degradation as a backwards compatibility issue I do in this
case, due to the significant decrease in performance.

Take this evaluation example:

=== code begin ===

program tperf;

{$mode objfpc}{$H+}

uses
  SysUtils;

{ Stands in for the per-access call that a dynamically sized string would need. }
function lookup(const aStr: String; aIndex: SizeInt): Char;
begin
  Result := aStr[aIndex];
end;

var
  str: String;
  starttime, endtime: TDateTime;
  i, j: LongInt;
begin
  SetLength(str, 10000);

  { direct indexing }
  starttime := Now;
  for i := 0 to 10000 do
    for j := 1 to Length(str) do
      if str[j] <> '' then ;
  endtime := Now;

  Writeln('Direct: ', FormatDateTime('hh:nn:ss.zzz', endtime - starttime));

  { the same loop, but every access goes through a function call }
  starttime := Now;
  for i := 0 to 10000 do
    for j := 1 to Length(str) do
      if lookup(str, j) <> '' then ;
  endtime := Now;

  Writeln('Lookup: ', FormatDateTime('hh:nn:ss.zzz', endtime - starttime));
end.

=== code end ===

=== output begin ===

Direct: 00:00:01.766
Lookup: 00:00:02.061

=== output end ===

While this example is of course artificial it nevertheless shows the
slow down.

Post by Tony Whyman via Lazarus
Is your problem really more about the result type as, depending on the
character width, the result could be an AnsiChar or WideChar or a UTF8
character for which I don't believe there is a defined char type (other
than an arguable mis-use of UCS4Char)?

That is indeed also a problem. I might not have had that one in mind
with my mail above, but I did back then when I had brainstormed this.

Post by Tony Whyman via Lazarus
I can accept that a clear up of this area would also have to extend to
the char types as well - but I would also argue that that is well
overdue. On a quick count, I found 7 different char types in the system
unit.

And most important of all: any solution that is developed *MUST* be
backwards compatible, so that means that in the least that type aliases
would remain anyway.

Regards,
Sven
--

Mattias Gaertner via Lazarus

2017-08-15 09:13:13 UTC

On Mon, 14 Aug 2017 18:47:58 +0200

Post by Sven Barth via Lazarus
[...]
The main problem of such a dynamic type would be the inability to do fast
indexing as the compiler would need to insert runtime checks for the size
of a character. I had already thought the same, but then had to discard the
idea due to this.

IMHO the main problem of adding a new string type is
https://xkcd.com/927/

Mattias
--

Tony Whyman via Lazarus

2017-08-15 09:17:22 UTC

You can count me as a "like" on that one.

Post by Mattias Gaertner via Lazarus
IMHO the main problem of adding a new string type is
https://xkcd.com/927/

--

Michael Van Canneyt via Lazarus

2017-08-15 09:25:26 UTC

Post by Mattias Gaertner via Lazarus
On Mon, 14 Aug 2017 18:47:58 +0200

Post by Sven Barth via Lazarus
[...]
The main problem of such a dynamic type would be the inability to do fast
indexing as the compiler would need to insert runtime checks for the size
of a character. I had already thought the same, but then had to discard the
idea due to this.

IMHO the main problem of adding a new string type is
https://xkcd.com/927/

Exactly. I don't think we should add even more.

As it is now, FPC offers a way out for all cases:

WideString/UnicodeString for those that want 2-byte characters.
A codepage-aware single-byte string for those that want 1-byte characters.
The shortstring is even still available.

Attempting to store binary data in a string is not advisable.
Dynamic arrays, TBytes and - in the worst case - TBytesStream are powerful enough to
cover most use-cases in this area.

Michael.
--

Michael Schnell via Lazarus

2017-08-15 09:49:24 UTC

Post by Michael Van Canneyt via Lazarus
WideString/UnicodeString for those that want 2-byte characters.
A codepage-aware single-byte string for those that want 1-byte
characters.
The shortstring is even still available.

IM (often stated) O, this does not help as long as TStrings does not
support the string type the user is inclined to choose without forcing
an auto-conversion.

This obviously requires an (additional) fully dynamic string brand.

This (again obviously) is not the "Embarcadero way", but supposedly does
not necessarily lead to incompatibility regarding the user code.

-Michael

--

Michael Van Canneyt via Lazarus

2017-08-15 09:52:49 UTC

Post by Michael Schnell via Lazarus

Post by Michael Van Canneyt via Lazarus
WideString/UnicodeString for those that want 2-byte characters.
A codepage-aware single-byte string for those that want 1-byte characters.
The shortstring is even still available.

IM (often stated) O, this does not help as long as TStrings does not
support the string type the user is inclined to choose without forcing
an auto-conversion.

Please check TStrings in trunk. This exists.

procedure LoadFromFile(const FileName: string; AEncoding: TEncoding); overload; virtual;
procedure LoadFromStream(Stream: TStream; AEncoding: TEncoding); overload; virtual;

The only 'problem' is that TStrings uses a single-byte string.

This cannot be solved properly except by duplicating the classes unit.
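
A minimal sketch of how the overload quoted above might be used, assuming an FPC version whose Classes unit already has it (trunk at the time of this thread). The file name and its contents are made up for illustration.

```pascal
program loadenc;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  FileName = 'example_utf8.txt';   // hypothetical file name
  // 'k', a-umlaut (C3 A4), 's', 'e' as UTF-8 bytes, then a line feed
  Utf8Data: array[0..5] of Byte = ($6B, $C3, $A4, $73, $65, $0A);
var
  fs: TFileStream;
  sl: TStringList;
begin
  // Write a small UTF-8 file so the example is self-contained.
  fs := TFileStream.Create(FileName, fmCreate);
  try
    fs.WriteBuffer(Utf8Data, SizeOf(Utf8Data));
  finally
    fs.Free;
  end;

  sl := TStringList.Create;
  try
    // The file's encoding is stated explicitly instead of being guessed.
    sl.LoadFromFile(FileName, TEncoding.UTF8);
    Writeln('Lines read: ', sl.Count);
  finally
    sl.Free;
  end;
end.
```

The lines still end up in the single-byte string type that TStrings uses, so this addresses reading the file correctly, not the storage type.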

Michael.
--

Michael Schnell via Lazarus

2017-08-15 10:02:28 UTC

Post by Michael Van Canneyt via Lazarus
This cannot be solved properly except by duplicating the classes unit.

Sorry to disagree, but IMHO this can only be solved properly by defining
an additional fully dynamically encoded string type and use same for
TStrings (see ->
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support
)

But I am perfectly aware that implementing this would be a huge effort
(see other mail here), and nobody is entitled to ask for this. (I wrote
the article just to elaborate what was discussed in the fpc mailing list
at that time.)

-Michael
--

Mattias Gaertner via Lazarus

2017-08-15 10:11:37 UTC

On Tue, 15 Aug 2017 12:02:28 +0200

Post by Michael Schnell via Lazarus

Post by Michael Van Canneyt via Lazarus
This cannot be solved properly except by duplicating the classes unit.

Sorry to disagree, but IMHO this can only be solved properly by defining
an additional fully dynamically encoded string type and use same for
TStrings (see ->
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support
)

It does not explain what the characters of DynamicString are, does it?

Mattias
--

Michael Van Canneyt via Lazarus

2017-08-15 10:15:55 UTC

Post by Mattias Gaertner via Lazarus
On Tue, 15 Aug 2017 12:02:28 +0200

Post by Michael Schnell via Lazarus

Post by Michael Van Canneyt via Lazarus
This cannot be solved properly except by duplicating the classes unit.

Sorry to disagree, but IMHO this can only be solved properly by defining
an additional fully dynamically encoded string type and use same for
TStrings (see ->
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support
)

It does not explain what the characters of DynamicString are, does it?

I was just going to write that.

The problem of the element size is circumvented by simply not digging into it.

What does S[2] mean in your proposal ? Is it 1, 2, 4 or even 8 bytes ?

Michael.

--

Michael Schnell via Lazarus

2017-08-15 10:44:10 UTC

Post by Michael Van Canneyt via Lazarus
What does S[2] mean in your proposal ? Is it 1, 2, 4 or even 8 bytes ?

Regarding the users' appreciation, the S[x] notation is decently
incompatible between the different string types and compiler versions.

There were hundreds of complaints in all the appropriate forums and
mailing lists.

So not much additional harm can be done, anyway.

I suggest that it should be according to the character_size definition
stored in S, and the operation c := S[x] should transfer the appropriate
count of bits, provided the type of c allows for taking them.

This seems to be compatible to the current implementation of any 1-Byte
brand and UTF16.

-Michael
--

Michael Van Canneyt via Lazarus

2017-08-15 10:51:49 UTC

Post by Michael Schnell via Lazarus

Post by Michael Van Canneyt via Lazarus
What does S[2] mean in your proposal ? Is it 1, 2, 4 or even 8 bytes ?

Regarding the users' appreciation, the S[x] notation is decently
incompatible between the different string types and compiler versions.

Of course not.

It's 1 byte for ansistring, 2 bytes for widestring.

The point is that the compiler knows how many bytes it is based on the
declaration of S. In your proposal, it is dynamic, if I understand it
correctly.

Post by Michael Schnell via Lazarus
There were hundreds of complaints in all the appropriate forums and
mailing lists.

Complaints about what exactly ?

Post by Michael Schnell via Lazarus
So not much additional harm can be done, anyway.
I suggest that it should be according to the character_size definition
stored in S, and the operation c := S[x] should transfer the appropriate
count of bits, provided the type of c allows for taking them.

As far as I understand your proposal, this currently cannot be done ?

The compiler needs to know the S[X] size at compile time.
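
A small sketch of what "known at compile time" means here: the element type of S[X] follows from the declared type of S, so the compiler can pick the access width without any run-time check. The variable names are only for illustration.

```pascal
program elemsize;

{$mode objfpc}{$H+}

var
  a: AnsiString;
  w: WideString;
  ca: AnsiChar;
  cw: WideChar;
begin
  a := 'abc';
  w := 'abc';
  ca := a[2];   // element type fixed by the declaration of a: AnsiChar
  cw := w[2];   // element type fixed by the declaration of w: WideChar
  Writeln('AnsiString element: ', SizeOf(ca), ' byte(s)');
  Writeln('WideString element: ', SizeOf(cw), ' byte(s)');
end.
```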

Michael.
--

Michael Schnell via Lazarus

2017-08-15 10:34:45 UTC

Post by Mattias Gaertner via Lazarus
It does not explain what the characters of DynamicString are, does it?

I don't understand what you are asking.

The element size and encoding of a Dynamic String ("CP_ANY" in the
document) are not predefined, but depend on the content:

http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support

Post by Mattias Gaertner via Lazarus
*CP_ANY* = $FF00 // ElementSize dynamically assigned // fully
dynamical String for intermediate storing string content // just
assigned to the Type or variable, never used in the "Encoding" field
in the string header.

Hence it stores the "branding" when it is assigned to from a string with
a fixed branding (such as *CP_UTF8*), and the content is auto-converted
if necessary when assigning from CP_ANY to a fixed branded string variable.

If (in your example) the data is read from a file, a StringList based on
CP_ANY strings would keep the encoding/char_size of the data as it is
in the file (it would need to somehow get to know the presumed encoding
of the file, anyway) and store that information in the
EncodingBrandNumber and ElementSize fields (which do exist in any
"NewString" variable, anyway), in each String read.

If the user assigns an element of the stringlist to a fixed branding
(such as *CP_UTF8*), the content obviously is auto-converted if
necessary when assigning from CP_ANY to a fixed branded string
variable, as usual.

In fact I suppose that the current implementation of TStringlist does
not use new strings to store the data on the heap, but I never said that
trying to implement such idea would not require a lot of work.
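
For comparison, the single-byte strings of FPC 3.x already carry a code page in the string header, and a RawByteString parameter accepts them without conversion. The sketch below only shows that existing mechanism. It is not an implementation of the CP_ANY proposal, which would also make the element size dynamic. Report is a hypothetical helper name.

```pascal
program dyncp;

{$mode objfpc}{$H+}

// A RawByteString parameter takes any single-byte string as it is;
// the code page travels with the string data.
procedure Report(const S: RawByteString);
begin
  Writeln('code page ', StringCodePage(S), ', ', Length(S), ' byte(s)');
end;

var
  u: UTF8String;
  a: AnsiString;
begin
  u := 'abc';
  a := 'abc';
  Report(u);   // code page declared by the UTF8String type
  Report(a);   // code page depends on the system and compiler settings
end.
```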

-Michael

Graeme Geldenhuys via Lazarus

2017-08-15 17:18:01 UTC

Post by Michael Van Canneyt via Lazarus
The only 'problem' is that TStrings uses a single-byte string.

Why can't that be changed to a UnicodeString or UTF8String - after all,
the Unicode standard is meant to support all languages. I would have
thought that would be an obvious move for a Unicode-aware RTL. TStrings
could also be extended (if it hasn't already) to keep track of what
encoding is read in from file, and what encoding it should produce
when lines are extracted - in case those two encodings are not the same.

Regards,
Graeme

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key: http://tinyurl.com/graeme-pgp
--

Michael Schnell via Lazarus

2017-08-16 08:43:18 UTC

Post by Graeme Geldenhuys via Lazarus
Why can't that be changed to a UnicodeString or UTF8String

IMHO, any implementation of TStrings that forces a conversion (just
because the class uses TStrings and not due to a logical demand), is a
contradiction to providing code aware strings at all.

-Michael
--

Graeme Geldenhuys via Lazarus

2017-08-16 09:08:23 UTC

Post by Michael Schnell via Lazarus
IMHO, any implementation of TStrings that forces a conversion (just
because the class uses TStrings and not due to a logical demand), is a
contradiction to providing code aware strings at all.

But in FPC 3.x (using modern compiler modes - not TP or Mac) String =
UnicodeString. So it makes sense that TStrings should use UnicodeString
internally to store its data. The Unicode standard is also the only
standard that can support any language. So all Windows code-pages can be
supported with the single UnicodeString type.

Are you suggesting that internally TStrings should have different
storage for all possible languages, or some RawByteString type? So if
you load some non-Latin code-page text internally it still stores that
text as that non-Latin bytes? That would just over-complicate the
TStrings class. FPC is moving towards UnicodeString being used
internally for everything in the RTL, so why must TStrings be any different.

Regards,
Graeme

--

Michael Schnell via Lazarus

2017-08-16 09:33:04 UTC

So it makes sense that TStrings should use UnicodeString internally to
store its data. The Unicode standard is also the only standard that
can support any language.

But in fact "Unicode" is just a universal standard defining 64 bit
entities. The encoding of those varies: UTF-8, UTF-16 high byte first,
UTF-16 low byte first, 64 bit low byte first, 64 bit high byte first,
.... fpc and Delphi do support several of those as a string encoding
(and with that creating any number of problems).

-Michael
--

Mattias Gaertner via Lazarus

2017-08-16 09:55:04 UTC

On Wed, 16 Aug 2017 11:33:04 +0200

Post by Michael Schnell via Lazarus
[...]
But in fact "Unicode" is just a universal standard defining 64 bit
entities.

No.
1,114,112 possible code points need at most 21 bits. Due to encoding at
most 32bit.
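
The arithmetic behind those numbers, as a small sketch: the code space runs from 0 to $10FFFF, which is 17 planes of 65536 code points each.

```pascal
program codepoints;

{$mode objfpc}{$H+}

const
  MaxCodePoint = $10FFFF;          // highest Unicode code point
var
  count, bits: LongInt;
begin
  count := MaxCodePoint + 1;       // 17 planes * 65536 code points
  bits := 0;
  // Find the smallest bit count whose range covers MaxCodePoint.
  while (LongInt(1) shl bits) <= MaxCodePoint do
    Inc(bits);
  Writeln('Code points: ', count);
  Writeln('Bits needed: ', bits);
end.
```

This prints 1114112 and 21: 2^20 = 1048576 is too small, 2^21 = 2097152 is enough.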

Mattias
--

Michael Schnell via Lazarus

2017-08-16 10:15:47 UTC

Post by Mattias Gaertner via Lazarus
1,114,112 possible code points need at most 21 bits. Due to encoding
at most 32bit.

Sorry. Typo.
-Michael
--

Michael Schnell via Lazarus

2017-08-16 09:36:41 UTC

Post by Graeme Geldenhuys via Lazarus
Are you suggesting that internally TStrings should have different
storage for all possible languages,

Not at all. In the said paper I point out that a new fully dynamical
string encoding brand would be introduced and same is used for TStrings.
Nothing else will improve the class of problems that has been
under discussion for years.

-Michael (knowing that this will never happen)
--

Sven Barth via Lazarus

2017-08-16 17:35:23 UTC

Post by Graeme Geldenhuys via Lazarus

Post by Michael Schnell via Lazarus
IMHO, any implementation of TStrings that forces a conversion (just
because the class uses TStrings and not due to a logical demand), is a
contradiction to providing code aware strings at all.

But in FPC 3.x (using modern compiler modes - not TP or Mac) String =
UnicodeString. So it makes sense that TStrings should use UnicodeString
internally to store its data. The Unicode standard is also the only
standard that can support any language. So all Windows code-pages can be
supported with the single UnicodeString type.

You are wrong. The string types in 3.0.x and 3.1 are like this:

TP, Iso, ExtPas, MacPas, FPC, ObjFPC (or below modes with $H-):
    String = ShortString
Delphi (or other modes with $H+):
    String = AnsiString (or more precisely String(CP_ACP), meaning the
    system codepage)
Delphi_Unicode (or other modes with $H+ and $modeswitch unicodestring):
    String = UnicodeString
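
A short sketch of the last case, assuming the switch is written as `unicodestrings` in source code, which is how the FPC documentation spells it. Removing that line should give the AnsiString case of the list above.

```pascal
program strmode;

{$mode objfpc}{$H+}
{$modeswitch unicodestrings}   // without this line, String = AnsiString here

var
  s: String;
begin
  s := 'abc';
  // Reports the size in bytes of one element of the String type selected above.
  Writeln('Bytes per String element: ', StringElementSize(s));
end.
```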

Regards,
Sven
--

Graeme Geldenhuys via Lazarus

2017-08-16 23:30:09 UTC

Thanks for correcting me. I was thinking of the "$modeswitch
unicodestring" option.

Regards,
Graeme

--

wkitty42--- via Lazarus

2017-08-17 02:15:07 UTC

Thanks for correcting me. I was thinking of the "$modeswitch unicodestring" option.

will that modeswitch take care of the warning about explicit conversion between
ansistring and unicode string when one has

var foo : unicodestring;

writeln(padright(foo,5));

??

i wrote a quick and simple little array exhibit program for someone... i had
thought to try to embrace this new unicode stuff by using unicode strings... then
using padright and similar string manipulators gave me warnings about
ansistring conversions :?
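
one possible way to spell out the conversion, as a sketch only: PadRight in StrUtils is declared with the unit's own String type, so as far as I can tell a modeswitch in the calling program does not change what it accepts. An explicit cast should silence the implicit-conversion warning, but the conversion itself still happens and can still lose characters that the target code page lacks.

```pascal
program padconv;

{$mode objfpc}{$H+}

uses
  StrUtils;

var
  foo: UnicodeString;
begin
  foo := 'abc';
  // The cast makes the UnicodeString -> AnsiString conversion explicit
  // instead of leaving it to the compiler.
  Writeln('[', PadRight(AnsiString(foo), 5), ']');
end.
```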

NOTE: this may be because i have an older lazarus and fpc installed... lazarus
fixes 1.6.1 and fpc fixes 3.0.something...

--
NOTE: No off-list assistance is given without prior approval.
*Please keep mailing list traffic on the list unless*
*a signed and pre-paid contract is in effect with us.*
--

Luca Olivetti via Lazarus

2017-08-15 17:29:23 UTC

Attempting to store binary data in a string is not advisable. Dynamic
arrays, TBytes and - in the worst case - TBytesStream are powerful
enough to cover most use-cases in this area.

It has worked extremely well and reliably until fpc 2.6.4 (i.e. with
string=ansistring).
Does it not work in 3.x?
If not it's a big problem, not only for my code (that I can,
reluctantly, change) but for 3rd party libraries/components (e.g.
synapse comes to mind)

Bye

--
Luca Olivetti
Wetron Automation Technology http://www.wetron.es/
Tel. +34 93 5883004 (Ext.3010) Fax +34 93 5883007
--

Graeme Geldenhuys via Lazarus

2017-08-15 19:14:10 UTC

Post by Luca Olivetti via Lazarus
but for 3rd party libraries/components (e.g.
synapse comes to mind)

Then better start filing bug reports to all those 3rd party libraries
and components - they have been abusing the system and will silently
fail. Not to mention that FPC is almost at v3.0.4 and the new string
changes were introduced in v3.0.0 already.

Regards,
Graeme

--

Luca Olivetti via Lazarus

2017-08-15 19:22:10 UTC

Post by Graeme Geldenhuys via Lazarus

Post by Luca Olivetti via Lazarus
but for 3rd party libraries/components (e.g.
synapse comes to mind)

Then better start filing bug reports to all those 3rd party libraries
and components - they have been abusing the system and will silently
fail. Not to mention that FPC is almost at v3.0.4 and the new string
changes were introduced in v3.0.0 already.

Wait a minute, why "abuse"?
After all, before code aware strings, an ansistring could store any kind
of arbitrary data with no problem and no conversion, and made it
extremely easy to, e.g., add bytes to a buffer or find and extract data
from the same buffer.
*If* code that worked before (and dare I say without abusing the
language) suddenly breaks, the bug is in the compiler and not in the
library.
(I remarked the "if" because I don't know if that's the case, according
to Bo Berglund's experience it is)

Bye

--

Mattias Gaertner via Lazarus

2017-08-15 19:34:40 UTC

On Tue, 15 Aug 2017 21:22:10 +0200

Post by Luca Olivetti via Lazarus
[...]
*If* code that worked before (and dare I say without abusing the
language) suddenly breaks, the bug is in the compiler and not in the
library.

... unless of course the incompatibility is deliberate and documented.
In this case it is.

Mattias
--

Ondrej Pokorny via Lazarus

2017-08-15 19:38:34 UTC

Post by Mattias Gaertner via Lazarus
On Tue, 15 Aug 2017 21:22:10 +0200

Post by Luca Olivetti via Lazarus
[...]
*If* code that worked before (and dare I say without abusing the
language) suddenly breaks, the bug is in the compiler and not in the
library.

... unless of course the incompatibility is deliberate and documented.
In this case it is.

Furthermore, if you use(d) strings for binary data, just replace old
string for AnsiString/RawByteString (and Char for AnsiChar, PChar for
PAnsiChar) and you are good to go. Annoying but no big deal.
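
A minimal sketch of that replacement, assuming a RawByteString used as a byte buffer. The byte values are arbitrary and are written through AnsiChar casts, so no string-to-string assignment (and therefore no code page conversion) is involved.

```pascal
program rawbuf;

{$mode objfpc}{$H+}

var
  buf: RawByteString;
  i: Integer;
begin
  buf := '';
  SetLength(buf, 3);
  // Store raw byte values element by element.
  buf[1] := AnsiChar($C0);
  buf[2] := AnsiChar($00);
  buf[3] := AnsiChar($E9);
  for i := 1 to Length(buf) do
    Write(HexStr(Ord(buf[i]), 2), ' ');
  Writeln;
end.
```

Whether a conversion can still sneak in depends on what the buffer is later assigned to or passed into, so every routine along the way needs the same care.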

Ondrej
--

Luca Olivetti via Lazarus

2017-08-15 20:08:52 UTC

Post by Ondrej Pokorny via Lazarus

Post by Mattias Gaertner via Lazarus
On Tue, 15 Aug 2017 21:22:10 +0200

Post by Luca Olivetti via Lazarus
[...]
*If* code that worked before (and dare I say without abusing the
language) suddenly breaks, the bug is in the compiler and not in the
library.

... unless of course the incompatibility is deliberate and documented.
In this case it is.

Furthermore, if you use(d) strings for binary data, just replace old
string for AnsiString/RawByteString (and Char for AnsiChar, PChar for
PAnsiChar) and you are good to go. Annoying but no big deal.

If that's all it's OK then, thank you.

Bye

--

Luca Olivetti via Lazarus

2017-08-15 20:10:45 UTC

Post by Luca Olivetti via Lazarus

Post by Ondrej Pokorny via Lazarus

Post by Mattias Gaertner via Lazarus
On Tue, 15 Aug 2017 21:22:10 +0200

Post by Luca Olivetti via Lazarus
[...]
*If* code that worked before (and dare I say without abusing the
language) suddenly breaks, the bug is in the compiler and not in the
library.

... unless of course the incompatibility is deliberate and documented.
In this case it is.

Furthermore, if you use(d) strings for binary data, just replace old
string for AnsiString/RawByteString (and Char for AnsiChar, PChar for
PAnsiChar) and you are good to go. Annoying but no big deal.

If that's all it's OK then, thank you.

Sorry for the direct reply, it was meant for the list.

Bye

--

Michael Schnell via Lazarus

2017-08-16 08:51:52 UTC

Post by Ondrej Pokorny via Lazarus
Furthermore, if you use(d) strings for binary data, just replace old
string for AnsiString/RawByteString (and Char for AnsiChar, PChar for
PAnsiChar) and you are good to go. Annoying but no big deal.

This only works if all tools that you use do the same. And a major tool
for handling strings is TStrings and its siblings. You can hardly avoid
using them.

-Michael

--

Graeme Geldenhuys via Lazarus

2017-08-15 20:45:33 UTC

Post by Luca Olivetti via Lazarus
Wait a minute, why "abuse"?
After all, before code aware strings, an ansistring could store any kind
of arbitrary data with no problem and no conversion, and made it
extremely easy

Just listen to what you are saying.... A string type and you want to
store all kinds of non-string related data in that type. How is that not
"abuse"??? Use a TBytes, TStream or other binary byte based storage
mechanism. A string type was definitely not the right choice.

Regards,
Graeme

--

Luca Olivetti via Lazarus

2017-08-15 22:41:48 UTC

Post by Graeme Geldenhuys via Lazarus

Post by Luca Olivetti via Lazarus
Wait a minute, why "abuse"?
After all, before code aware strings, an ansistring could store any kind
of arbitrary data with no problem and no conversion, and made it
extremely easy

Just listen to what you are saying.... A string type and you want to
store all kinds of non-string related data in that type. How is that not
"abuse"??? Use a TBytes, TStream or other binary byte based storage
mechanism. A string type was definitely not the right choice.

A "string" was just a handy container for bytes so I think it was the
right choice for storing, er, bytes.

Bye

--

Graeme Geldenhuys via Lazarus

2017-08-15 23:17:32 UTC

Post by Luca Olivetti via Lazarus
A "string" was just a handy container for bytes so I think it was the
right choice for storing, er, bytes.

The type "String" has always been an alias to another type, and could
mean many things. eg: ShortString, AnsiString, and now UnicodeString.
Making the assumption that it will always be a container for byte sized
data was wrong.

In hindsight, had you used TBytes or TMemoryStream, it would have been
very clear that it is a storage container for byte sized data, and no
automatic conversion (by the compiler) would be done to data stored in
such containers.

Don't worry though, you were not alone in making that wrong assumption.
Many Delphi developers have made that mistake, and some are still making
that mistake today.

Regards,
Graeme

--

Luca Olivetti via Lazarus

2017-08-16 18:26:50 UTC

Post by Graeme Geldenhuys via Lazarus
In hind sight, using TBytes or TMemoryStream and it would have been very
clear that it is a storage container for byte sized data, and no
automatic conversion (by the compiler) would be done to data stored in
such containers.

Call me lazy but I don't want to reinvent the wheel and re-implement
from scratch the functionality that a plain ansistring provides and
TBytes to this day doesn't.
I mean, TBytes is just an "array of char". I can't (easily) add a byte
to the end, cut a slice of the bytes, find one byte in the array, etc.
OK, I can, but I have to program it all by myself while a string does
all that and more and probably it's a lot more efficient.
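
For what it's worth, a sketch of how little code the three operations named above need on a TBytes. AppendByte and FindByte are hypothetical helper names, not RTL routines, while Copy already works on dynamic arrays. Nothing is claimed here about efficiency compared to a string.

```pascal
program byteshelpers;

{$mode objfpc}{$H+}

uses
  SysUtils;   // declares TBytes

// Appends one byte to the end of the buffer.
procedure AppendByte(var Buf: TBytes; B: Byte);
begin
  SetLength(Buf, Length(Buf) + 1);
  Buf[High(Buf)] := B;
end;

// Returns the 0-based index of the first occurrence of B, or -1.
function FindByte(const Buf: TBytes; B: Byte): SizeInt;
var
  i: SizeInt;
begin
  Result := -1;
  for i := 0 to High(Buf) do
    if Buf[i] = B then
    begin
      Result := i;
      Exit;
    end;
end;

var
  buf, part: TBytes;
begin
  buf := nil;
  AppendByte(buf, $C0);
  AppendByte(buf, $01);
  AppendByte(buf, $E9);
  part := Copy(buf, 1, 2);   // slice: 2 bytes starting at index 1
  Writeln('Length: ', Length(buf));
  Writeln('Index of $E9: ', FindByte(buf, $E9));
  Writeln('Slice length: ', Length(part));
end.
```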

Bye

--

Luca Olivetti via Lazarus

2017-08-16 18:28:27 UTC

Post by Luca Olivetti via Lazarus

Post by Graeme Geldenhuys via Lazarus
In hind sight, using TBytes or TMemoryStream and it would have been
very clear that it is a storage container for byte sized data, and no
automatic conversion (by the compiler) would be done to data stored in
such containers.

Call me lazy but I don't want to reinvent the wheel and re-implement
from scratch the functionality that a plain ansistring provides and
TBytes to this day doesn't.
I mean, TBytes is just an "array of char". I can't (easily) add a byte
to the end, cut a slice of the bytes, find one byte in the array, etc.
OK, I can, but I have to program it all by myself while a string does
all that and more and probably it's a lot more efficient.

Not to mention that its index starts from 0. If I wanted to program in C
I would be programming in C, not pascal ;-)

Bye

--

Luca Olivetti via Lazarus

2017-08-16 22:46:33 UTC

Post by Luca Olivetti via Lazarus

Post by Graeme Geldenhuys via Lazarus
In hind sight, using TBytes or TMemoryStream and it would have been
very clear that it is a storage container for byte sized data, and no
automatic conversion (by the compiler) would be done to data stored in
such containers.

Call me lazy but I don't want to reinvent the wheel and re-implement
from scratch the functionality that a plain ansistring provides and
TBytes to this day doesn't.
I mean, TBytes is just an "array of char". I can't (easily) add a byte
to the end, cut a slice of the bytes, find one byte in the array, etc.
OK, I can, but I have to program it all by myself while a string does
all that and more and probably it's a lot more efficient.

Trunk supports Insert() and Delete() on dynamic arrays, Concat() and +
are on the near term ToDo list.
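
A sketch of what that looks like, assuming a compiler that already has the dynamic-array Insert() and Delete() mentioned above (trunk at the time of this thread). Older compilers will reject these calls.

```pascal
program dynarrops;

{$mode objfpc}{$H+}

var
  buf: array of Byte;
  i: Integer;
begin
  buf := nil;
  SetLength(buf, 3);
  buf[0] := $C0;
  buf[1] := $01;
  buf[2] := $E9;

  Insert($7F, buf, 1);   // insert one element before index 1
  Delete(buf, 0, 1);     // remove one element starting at index 0

  for i := 0 to High(buf) do
    Write(HexStr(buf[i], 2), ' ');
  Writeln;
end.
```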

I started using strings as communication buffers since delphi 2. There
weren't even dynamic arrays then...

Bye

--

Graeme Geldenhuys via Lazarus

2017-08-16 23:38:06 UTC

Post by Luca Olivetti via Lazarus
I started using strings as communication buffers since delphi 2. There
weren't even dynamic arrays then...

Well, linked lists have existed since the beginning of time. I used them
plenty in my TP days, and adding, inserting, indexing etc was pretty easy.
Maybe programmers have just become spoilt over time with all the "out of
the box" functionality and actually become lazy in coding.

Regards,
Graeme

--

Graeme Geldenhuys via Lazarus

2017-08-16 23:34:42 UTC

Post by Luca Olivetti via Lazarus
I mean, TBytes is just an "array of char".

NO! Char can now mean a 1-byte char or a 2-byte char (I don't know how
FPC plans to support Unicode surrogate pairs which will require
4-bytes). In the olden days (Delphi 7 and FPC 2.6.4) the Char type might
always have meant 1-byte, but it doesn't necessarily these days.

TBytes has always been a container for Byte data.

Regards,
Graeme

--

Michael Schnell via Lazarus

2017-08-16 09:01:26 UTC

How is that not "abuse"???

IMHO it's a major shortcoming to define "string" as "printable text". In
fact the name "String" does not suggest this at all. A "string" in my
understanding just is a sequence of similar "things".

A string type was definitely not the right choice.

Notwithstanding the discussion about the mere wording, this only would
hold, if the system would provide a differently named non "printable
text" basic type that comes with the features needed for such usage:
reference counting, lazy copy, simple operators for concatenating and
element extraction and replacement, built-in function for substring
locating, ...

-Michael
--

Michael Van Canneyt via Lazarus

2017-08-16 09:06:41 UTC

Post by Michael Schnell via Lazarus

How is that not "abuse"???

IMHO it's a major shortcoming to define "string" as "printable text".

On the contrary. That is exactly what it means.
Anything else is just a collection of bytes.

Michael.
--

Bo Berglund via Lazarus

2017-08-16 05:53:11 UTC

On Tue, 15 Aug 2017 21:22:10 +0200, Luca Olivetti via Lazarus

Post by Luca Olivetti via Lazarus
(I remarked the "if" because I don't know if that's the case, according
to Bo Berglund's experience it is)

Just to expand on my "experience" and the reason I posted:

My work on converting the old program started back a couple of years
when I went from Delphi 2007 (pre-unicode) to Delphi XE5 because we
wanted the GUI to be translatable to non-western languages.

But then all the communications functions (and these are many in this
utility application) broke because they used strings as containers for
the inherently binary serial data.

So I followed advice on the Embarcadero forum to switch to AnsiString
because that was really what the old string type was an alias for.
I had no great insight into the inner workings of the string handling
functions but I "knew" that AnsiString was a 1-byte per element and
(unicode)string was now a 2-byte per element container. The fact that
the code could alter the content of the AnsiString did not dawn on me
at all.
And the comm functions worked fine after the change (I tested a lot,
but of course only on my English Win7 computer).

Then some time ago there was a report of a failure of the new program
version that only happened in Korea, China and Thailand. In the log
files there was a very strange entry about finding an illegal command
byte when sending a command to the equipment.

It never triggered when I debugged the problem; for me and my
colleagues it worked flawlessly. So I had to add more logging and found
that the problem arose when the outgoing command was built. A certain
1-byte command was then expanded to 2 bytes with the wrong first byte!
The commands in the protocol are the first byte of the data of a
telegram and they are in range $C0..$E9.
When one of these (I no longer remember exactly which one) was used in
an assignment to the AnsiString buffer it was converted to $3F +
something that was never logged and the operation failed because the
equipment could not decode the command.

So I asked again on the forum and was steered towards RawByteString
because presumably that container would disallow conversions.
And when I changed this and sent a new version to the distributor in
Korea the problem was seemingly gone.

Based on this experience I wanted to alert the OP of the fact that
using AnsiString instead of string is not a cure-all for binary data,
you need to fix the codepage too, which is what the RawByteString does
for you....

But I have now moved on and replaced all comm related containers with
TBytes including modifying the serial component we have used.
(With some help from Remy Lebeau).
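
A minimal sketch of the TBytes approach, assuming FPC 3.0 or later with SysUtils. BuildTelegram and the payload values are hypothetical names and data for illustration, not taken from the actual comm class; only the $C0 command value comes from the range mentioned above.

```pascal
program TelegramBytes;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: prepend a 1-byte command to a binary payload.
// TBytes carries no code page, so no byte is ever converted.
function BuildTelegram(Cmd: Byte; const Payload: TBytes): TBytes;
begin
  SetLength(Result, 1 + Length(Payload));
  Result[0] := Cmd;
  if Length(Payload) > 0 then
    Move(Payload[0], Result[1], Length(Payload));
end;

var
  Data, Telegram: TBytes;
  I: Integer;
begin
  // Made-up payload bytes, for illustration only.
  SetLength(Data, 2);
  Data[0] := $01;
  Data[1] := $02;
  Telegram := BuildTelegram($C0, Data);
  // Dump the telegram as hex to confirm the command byte is unchanged.
  for I := 0 to High(Telegram) do
    Write(IntToHex(Telegram[I], 2), ' ');
  WriteLn;
end.
```

The same source should also be acceptable to Unicode-era Delphi, since TBytes and Move behave the same there, which matters for code shared between both compilers.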

--
Bo Berglund
Developer in Sweden

--

Bo Berglund via Lazarus

2017-08-16 06:01:24 UTC

On Wed, 16 Aug 2017 07:53:11 +0200, Bo Berglund via Lazarus

Post by Bo Berglund via Lazarus
But I have now moved on and replaced all comm related containers with
TBytes including modifying the serial component we have used.
(With some help from Remy Lebeau).

I forgot to mention that the problem area is located inside a non-GUI
class file for handling the communications, and this file is also used
in some programs written in FPC for Raspberry Pi target computers,
i.e. Linux, which is the reason for going to FPC.
So I want it to be both FPC and Delphi compatible...

--
Bo Berglund
Developer in Sweden

--

Juha Manninen via Lazarus

2017-08-16 08:59:20 UTC

On Wed, Aug 16, 2017 at 8:53 AM, Bo Berglund via Lazarus

Post by Bo Berglund via Lazarus
Based on this experience I wanted to alert the OP of the fact that
using AnsiString instead of string is not a cure-all for binary data,
you need to fix the codepage too, which is what the RawByteString does
for you....

Bo, everybody has known for decades that AnsiString is not for binary data.
Why do you proclaim it as a new discovery?
The OP's problem was completely different. It was about text encoding.
TBytes is clearly the right choice for your binary data, but this
discussion is not about binary data!

What does "AnsiString instead of string" mean?
String is typically an alias for AnsiString.

Your sentence about RawByteString is also wrong.
There is no automatic codepage conversion for RawByteString.

Juha
--

Michael Schnell via Lazarus

2017-08-16 08:47:37 UTC

Post by Luca Olivetti via Lazarus
It has worked extremely well and reliably until fpc 2.6.4 (i.e. with
string=ansistring).
Does it not work in 3.x?

I understand that storing unencoded bytes in UTF-8 strings (hence in fpc)
works as well as it always has, as long as all strings are declared with
the same encoding brand as TStrings (and friends), i.e. UTF-8, because
then there will never be a conversion.

But it does not work in Delphi, as here TStrings is defined to be UTF-16.

-Michael
--

Mattias Gaertner via Lazarus

2017-08-16 08:58:13 UTC

On Wed, 16 Aug 2017 10:47:37 +0200

Post by Michael Schnell via Lazarus

Post by Luca Olivetti via Lazarus
It has worked extremely well and reliably until fpc 2.6.4 (i.e. with
string=ansistring).
Does it not work in 3.x?

I understand that storing unencoded bytes in UTF-8 strings (hence in fpc)
works as well as it always has, as long as all strings are declared with
the same encoding brand as TStrings (and friends), i.e. UTF-8, because
then there will never be a conversion.
But it does not work in Delphi, as here TStrings is defined to be UTF-16.

This thread is going off topic.
Please start a new thread if you want to discuss Delphi strings.

Mattias
--

Michael Schnell via Lazarus

2017-08-16 09:09:17 UTC

Post by Mattias Gaertner via Lazarus
This thread is going off topic.
Please start a new thread if you want to discuss Delphi strings.

You can't discuss fpc's string problems without mentioning Delphi, as
they are a direct consequence of both Delphi compatibility and
Delphi incompatibility.

-Michael

--

Mattias Gaertner via Lazarus

2017-08-16 09:32:16 UTC

On Wed, 16 Aug 2017 11:09:17 +0200

Post by Michael Schnell via Lazarus

Post by Mattias Gaertner via Lazarus
This thread is going off topic.
Please start a new thread if you want to discuss Delphi strings.

You can't discuss fpc's string problems without mentioning Delphi, as
they are a direct consequence of both Delphi compatibility and
Delphi incompatibility.

The original post was about a string conversion warning.

Anyone who wants to discuss the grand picture of strings in FPC for
the millionth time should start a new topic.

Mattias
--

Michael Schnell via Lazarus

2017-08-16 09:41:10 UTC

Anyone who wants to discuss the grand picture of strings in FPC for the millionth time should start a new topic.

Right you are. And it will be far too late and futile anyway,
because of the reasons discussed a million times.

-Michael
--

wkitty42--- via Lazarus

2017-08-15 17:53:21 UTC

Post by Michael Van Canneyt via Lazarus
WideString/UnicodeString for those that want 2-byte characters.

what if 3 and 4 byte characters are required? will they also work in UnicodeStrings?

i'm looking at this from a linux POV but have been trying to come from the very
old school DOS TP stuff using codepages... especially needing to be able to read
codepage strings and properly convert all their characters to UTF-8...

converting back would be a huge help, too... even with the possible loss of
characters requiring replacing them with "?" or something to hold their place
and show they didn't convert... that or even leaving them in their 2, 3 or 4
byte form and let those using older codepage stuff see them raw...

--
NOTE: No off-list assistance is given without prior approval.
*Please keep mailing list traffic on the list unless*
*a signed and pre-paid contract is in effect with us.*
--

Michael Schnell via Lazarus

2017-08-16 09:12:40 UTC

Post by wkitty42--- via Lazarus
what if 3 and 4 byte characters are required? will they also work in UnicodeStrings?

UTF-8 and UTF-16 are just encoding variants for 32 bit Unicode
"characters", storing them in n (or 2*n) Bytes according to a simple
scheme.

-Michael

--

Juha Manninen via Lazarus

2017-08-16 10:22:34 UTC

On Wed, Aug 16, 2017 at 12:12 PM, Michael Schnell via Lazarus

UTF-8 and UTF-16 are just encoding variants for 32 bit Unicode "characters",
storing them in n (or 2*n) Bytes according to a simple scheme.

No, they are encodings for codepoints, not "characters" (whatever that means).
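
As a small illustration of the codepoint versus code unit distinction, here is a sketch assuming FPC 3.0 or later. It builds U+1F600, a codepoint outside the Basic Multilingual Plane, from its two UTF-16 code units and converts it with the RTL's UTF8Encode. The program name is made up.

```pascal
program SurrogatePair;
{$mode objfpc}{$H+}
var
  U: UnicodeString;
  S: UTF8String;
begin
  // U+1F600 spelled as its two UTF-16 code units (a surrogate pair),
  // so the example does not depend on the source file encoding.
  SetLength(U, 2);
  U[1] := WideChar($D83D);
  U[2] := WideChar($DE00);
  // Convert the same single codepoint to UTF-8.
  S := UTF8Encode(U);
  // Length counts code units, not codepoints, in both string types.
  WriteLn('UTF-16 code units: ', Length(U));
  WriteLn('UTF-8 bytes:       ', Length(S));
end.
```

One codepoint should show up here as two UTF-16 code units and four UTF-8 bytes, so a UnicodeString can hold such text, but neither encoding is fixed-width and Length never counts codepoints.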

Michael Schnell, your posts are completely off topic.
Unicode related topics clearly pull you like a magnet and then you
lose all control and start to proclaim your grand plan for a string
revamp.
It can continue for months as we remember from past years.
You should stop writing in this thread now. I agree with Mattias.

Juha
--

Michael Schnell via Lazarus

2017-08-16 10:26:48 UTC

Post by Juha Manninen via Lazarus
You should stop writing in this thread now. I agree with Mattias.

I perfectly agree with you. But you can't blame me for answering when
asked.

-Michael

--

Juha Manninen via Lazarus

2017-08-14 21:01:55 UTC

On Mon, Aug 14, 2017 at 5:11 PM, Tony Whyman via Lazarus

Indeed, why isn't there a single container string type for
all character sets where the encoding whether a legacy code page, UTF8,
UTF16 or UTF32 is simply a dynamic attribute of the type - a sort of
extended AnsiString?

As Sven Barth wrote, they have different size of char.

Tony Whyman, this issue has been discussed again and again for the
past 10+ years first in FPC mailing lists and then in Lazarus lists.
The current Unicode support in Lazarus works f***ing well and is
amazingly compatible with Delphi.
WinAPI parameters may require an explicit temporary UnicodeString
variable but even then the code is compatible with Delphi.

Tony Whyman, Marcos Douglas and Michael Schnell, please study the facts.
For starters, this is about the current Unicode support in Lazarus:
http://wiki.freepascal.org/Unicode_Support_in_Lazarus
I think the dynamic encoding and automatic conversion now work perfectly well.
If you have a piece of code where it does not work, please ask for
detailed info.

Juha
--

Tony Whyman via Lazarus

2017-08-15 09:15:36 UTC

Post by Juha Manninen via Lazarus
Tony Whyman, this issue has been discussed again and again for the
past 10+ years first in FPC mailing lists and then in Lazarus lists.
The current Unicode support in Lazarus works f***ing well and is
amazingly compatible with Delphi.
WinAPI parameters may require an explicit temporary UnicodeString
variable but even then the code is compatible with Delphi.
Tony Whyman, Marcos Douglas and Michael Schnell, please study the facts.
http://wiki.freepascal.org/Unicode_Support_in_Lazarus
I think the dynamic encoding and automatic conversion now work perfectly well.
If you have a piece of code where it does not work, please ask for
detailed info.

If a topic keeps on being discussed after 10+ years of argument, the
reason is usually either (a) the problem and its solution have not been
documented properly, or (b) the outcome is an unsatisfactory compromise.

In this case, I would argue that both are true.

I went back and read the wiki article you mentioned and was no more the
wiser as to why the current mess exists. Is it really no more than
because Delphi continues to screw up in this area, so must FPC? The body
of the article appears to be a set of notes - not necessarily wrong in
themselves but lacking the background and context needed to explain why
it is like it is.

This problem will keep coming up until it is fixed properly and, by
that, I mean that the solution is consistent, intuitively understandable
and well documented. Windows eccentricities also need to be kept to Windows.

Here is my wish list:

1. Stop using the term "Unicode".

It is too ambiguous. It is used as both an all embracing term for
multi-byte encoding and as a synonym for UTF16 and that is really
too confusing. The problem is made worse by having UnicodeString as
a two byte wide string type in both FPC and Delphi.

2. Clean up the char type.

When Wirth created the "char" type in Pascal it was a simple ASCII
or EBCDIC character. There are now seven different char types
(including type equivalence) with no guidelines on when each is
applicable. This is too many. Why shouldn't there be a single char
type that intuitively represents a single character regardless of
how many bytes are used to represent it. Yes, in a world where we
have to live with UTF8, UTF16, UTF32, legacy code pages and Chinese
variations on UTF8, that means that dynamic attributes have to be
included in the type. But isn't that the only way to have consistent
and intuitive character handling?

3. The problem with string handling today is that it is not based on a
consistent approach to the character type.

If you clean up character handling then the model for string
handling should become obvious. A string is after all no more than a
container for a character array and which should be constrained to
have the same character encoding. A string should intuitively
represent a string of text regardless of how many bytes are used to
represent each character and with dynamic attributes to tell you how
it is encoded.

4. FPC should clean up Delphi's mess for it. If a unified string type
follows a consistent model then it should be possible to make all Delphi
string types synonyms.

You will need to allow exceptions for legacy programs that insist on
manipulating the bytes themselves - but that is not rocket science.
There is also the issue of the Windows API and its insistence on
Wide Strings - but isn't that why calling conventions such as cdecl
and stdcall exist - to tell the compiler when it needs to reformat
the call for a given API convention.

Tony Whyman

Michael Schnell via Lazarus

2017-08-15 09:57:09 UTC

Post by Tony Whyman via Lazarus
In this case, I would argue that both are true.

And the culprit obviously is Embarcadero and not the fpc or the Lazarus
team, who did their best to create a compatible implementation
that is really workable on the multiple supported platforms (which E$
did not feel necessary when they released the encoding aware strings).

Maybe a better solution can be found, but who would want to nudge the
fpc / Lazarus developers to invest a huge amount of time to create it
and then make sure it is decently tested and stable?

-Michael
--

Bart via Lazarus

2017-08-15 12:02:16 UTC

Post by Tony Whyman via Lazarus
2. Clean up the char type.
Why shouldn't there be a single char
type that intuitively represents a single character regardless of
how many bytes are used to represent it.

You would have to define what "a single character" means in the first place.
This is especially important when it involves precomposed characters
and combining characters.
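
To illustrate the point about precomposed and combining characters, here is a minimal sketch assuming FPC 3.0 or later. Both strings are meant to display as the same "e with acute accent", yet they differ in length. The program and variable names are made up.

```pascal
program Combining;
{$mode objfpc}{$H+}
var
  Precomposed, Decomposed: UnicodeString;
begin
  // One codepoint: U+00E9, e with acute as a single precomposed unit.
  SetLength(Precomposed, 1);
  Precomposed[1] := WideChar($00E9);
  // Two codepoints: U+0065 (plain e) followed by U+0301 (combining acute).
  SetLength(Decomposed, 2);
  Decomposed[1] := WideChar($0065);
  Decomposed[2] := WideChar($0301);
  // Same visible "character", different number of elements.
  WriteLn(Length(Precomposed), ' vs ', Length(Decomposed));
end.
```

A char type that "represents a single character" would have to decide whether these two forms are one value or two, which is the definition problem raised here.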

Bart
--

Michael Schnell via Lazarus

2017-08-15 12:26:34 UTC

Why shouldn't there be a single char type that intuitively represents
a single character regardless of how many bytes are used to represent it.

I suppose by "char" you mean "single printable thingy". With Unicode it's
rather debatable what such a thingy is.

Hence a Unicode single char would need to just be a Unicode string.

-Michael
--

Mattias Gaertner via Lazarus

2017-08-15 12:53:00 UTC

On Tue, 15 Aug 2017 14:26:34 +0200

Post by Michael Schnell via Lazarus

Why shouldn't there be a single char type that intuitively represents
a single character regardless of how many bytes are used to represent it.

I suppose by "char" you mean "single printable thingy". With Unicode it's
rather debatable what such a thingy is.
Hence a Unicode single char would need to just be a Unicode string.

Do you mean a 'char' is a string in your proposal?

Mattias
--

Michael Van Canneyt via Lazarus

2017-08-15 12:53:38 UTC

Post by Mattias Gaertner via Lazarus
On Tue, 15 Aug 2017 14:26:34 +0200

Post by Michael Schnell via Lazarus

Why shouldn't there be a single char type that intuitively represents
a single character regardless of how many bytes are used to represent it.

I suppose by "char" you mean "single printable thingy". With Unicode it's
rather debatable what such a thingy is.
Hence a Unicode single char would need to just be a Unicode string.

Do you mean a 'char' is a string in your proposal?

That would be a neat recursive definition :)

Michael.
--

Michael Schnell via Lazarus

2017-08-15 14:44:30 UTC

Post by Mattias Gaertner via Lazarus
Do you mean a 'char' is a string in your proposal?

Nope. In my proposal there would be Chars for any statically encoded
String Type, hence 1, 2, 4, and 8 bytes wide. (Regarding statically
encoded string (and char) brands, it's just an extension of the existing
paradigm.)

I did not think about the necessity to also have a dynamically encoded
Char type. If it is needed, it (like a string) would need the additional
fields for encoding number and bytes_per_char, and the appropriate compiler
magic to handle them appropriately (working like a one-element string).

-Michael
--

Mattias Gaertner via Lazarus

2017-08-15 16:33:25 UTC

On Tue, 15 Aug 2017 16:44:30 +0200

Post by Michael Schnell via Lazarus

Post by Mattias Gaertner via Lazarus
Do you mean a 'char' is a string in your proposal?

Nope. In my proposal there would be Chars for any statically encoded
String Type, hence 1, 2, 4, and 8 bytes wide. (Regarding statically
encoded string (and char) brands, it's just an extension of the existing
paradigm.)

8 bytes?

Do you propose a string without the array operator [] ?

Mattias
--

Michael Schnell via Lazarus

2017-08-16 09:26:46 UTC

Post by Mattias Gaertner via Lazarus
Do you propose a string without the array operator [] ?

I don't understand what you mean by this.

Of course an appropriate "char" type for each string encoding brand
would be provided, hence a "CP_QWord Char" as an alias for QWord.

(Please keep in mind that in that paper (as explicitly pointed out)
"String" is not a synonym for "printable text" but for "sequence of
similar things". And here of course (at least in a 64 bit system) it's
extremely appropriate to allow for 64 bit elements. And of course this
is just a suggestion that could solve a certain class of problems but
needs a big effort to do and verify the modifications in the compiler
and the libraries.)

-Michael
--

Michael Schnell via Lazarus

2017-08-15 12:30:40 UTC

Post by Tony Whyman via Lazarus
3. The problem with string handling today is that it is not based on a
consistent approach to the character type.
If you clean up character handling then the model for string
handling should become obvious. A string is after all no more than
a container for a character array and which should be constrained
to have the same character encoding. A string should intuitively
represent a string of text regardless of how many bytes are used
to represent each character and with dynamic attributes to tell
you how it is encoded.
4. FPC should clean up Delphi's mess for it. If a unified string type
follows a consistent model then it should be possible to make all
Delphi string types synonyms.
You will need to allow exceptions for legacy programs that insist
on manipulating the bytes themselves - but that is not rocket
science. There is also the issue of the Windows API and its
insistence on Wide Strings - but isn't that why calling
conventions such as cdecl and stdcall exist - to tell the compiler
when it needs to reformat the call for a given API convention.

see ->
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support

-Michael

Juha Manninen via Lazarus

2017-08-16 10:05:43 UTC

On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus

Post by Tony Whyman via Lazarus
UTF-16/Unicode can only store 65,536 characters while the Unicode standard
(that covers UTF8 as well) defines 136,755 characters.
UTF-16/Unicode's main advantage seems to be for rapid indexing of large
strings.

That shows complete ignorance from your side about Unicode.
You consider UTF-16 as a fixed-width encoding. :(
Unfortunately many other programmers had the same wrong idea or they
were just lazy. The result anyway is a lot of broken UTF-16 code out
there.

On Tue, Aug 15, 2017 at 12:15 PM, Tony Whyman via Lazarus

Post by Tony Whyman via Lazarus
If a topic keeps on being discussed after 10+ years of argument, the reason
is usually either (a) the problem and its solution have not been documented
properly, or (b) the outcome is an unsatisfactory compromise.

Or (c) The people discussing are ignorant about the topic.

Post by Tony Whyman via Lazarus
I went back and read the wiki article you mentioned and was no more the
wiser as to why the current mess exists. Is it really no more than because
Delphi continues to screw up in this area, so must FPC? The body of the
article appears to be a set of notes - not necessarily wrong in themselves
but lacking the background and context needed to explain why it is like it is.

Hmmm...
Originally the page was a mess because it had lots of irrelevant
background info about the old obsolete LCL Unicode support. Text was
added by many people but none was removed.
Finally I cleaned the page. It now has most relevant info at the top
and then special cases and technical details later.
I am rather happy with the page now, it explains how to use Unicode
with Lazarus as clearly as possible.
However I am willing to improve it. What kind of background and
context would you need?

Post by Tony Whyman via Lazarus
1. Stop using the term "Unicode".

You can stop using it. No problem.
For others however it is a well defined international standard. See:
https://en.wikipedia.org/wiki/Unicode

Post by Tony Whyman via Lazarus
2. Clean up the char type.
...
Why shouldn't there be a single char type that intuitively represents
a single character regardless of how many bytes are used to represent it.

What do you mean by "a single character"?
A "character" in Unicode can mean about 7 different things. Which one
is your pick?
This question is for everybody in this thread who used the word "character".

Post by Tony Whyman via Lazarus
Yes, in a world where we have to live with UTF8, UTF16, UTF32, legacy code
pages and Chinese variations on UTF8, that means that dynamic attributes
have to be included in the type. But isn't that the only way to have
consistent and intuitive character handling?

What do you mean? The Chinese don't have a variation of UTF8.
UTF8 is a global, unambiguous encoding standard, part of Unicode.

The fundamental problem is that you want to hide the complexity of
Unicode by some magic String type of a compiler.
It is not possible. Unicode remains complex but the complexity is NOT
in encodings!
No, a codepoint's encoding is the easy part. For example I was easily
able to create a unit to support encoding agnostic code. See unit
LazUnicode in package LazUtils.
The complexity is elsewhere:
- "Character" composed of codepoints in precomposed and decomposed
(normalized) forms.
- Compare and sort text based on locale.
- Uppercase / Lowercase rules based on locale.
- Glyphs
- Graphemes
- etc.

I must admit I don't understand those complex parts well.
I do understand codeunits and codepoints, and I understand they are
the easy part.

Juha
--

Graeme Geldenhuys via Lazarus

2017-08-16 11:02:33 UTC

Post by Juha Manninen via Lazarus
Unfortunately many other programmers had the same wrong idea or they
were just lazy. The result anyway is a lot of broken UTF-16 code out
there.

Yeah, I see that even in commercial products and projects. It's very sad
to see. Hence I always promote UTF-8: you can't get it wrong as
easily as UTF-16. No endianness to worry about, no surrogate pairs, and
UTF-8 is ready for streaming (network or disk) out of the box.

Regards,
Graeme

--

Marcos Douglas B. Santos via Lazarus

2017-08-14 17:41:34 UTC

On Mon, Aug 14, 2017 at 10:21 AM, Tony Whyman via Lazarus

Post by Tony Whyman via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
FPC and Lazarus claim they are cross-platform (this is a fact) and
because of that, IMHO, both should be used in only one way on every
system, don't you think?
Best regards,
Marcos Douglas

Precisely. But why this fixation on UTF-16/Unicode and not UTF8?

I have no fixation on any Unicode flavor...
My "problem" is because I use Windows, not Linux where UTF8 is the default.

Post by Tony Whyman via Lazarus
Lazarus is already a UTF8 environment.
Much of the LCL assumes UTF8.
UTF8 is arguably a much more efficient way to store and transfer data
UTF-16/Unicode can only store 65,536 characters while the Unicode standard
(that covers UTF8 as well) defines 136,755 characters.
UTF-16/Unicode's main advantage seems to be for rapid indexing of large
strings.
You may need UTF-16/Unicode support for accessing Microsoft APIs but apart
from that, why is it being promoted as the universal standard?

I didn't propose that.
But take a look in other languages, see what they are using.

Best regards,
Marcos Douglas
--

Juha Manninen via Lazarus

2017-08-16 09:12:20 UTC

On Mon, Aug 14, 2017 at 4:11 PM, Marcos Douglas B. Santos via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
Unicode everywhere and you using AnsiString and doing everything...
Now I'm confused.

Yes, please read:
http://wiki.freepascal.org/Unicode_Support_in_Lazarus
I have advertised it so much that some people are already irritated,
but maybe you missed it so far.

Post by Marcos Douglas B. Santos via Lazarus
FPC and Lazarus claim they are cross-platform (this is a fact) and
because of that, IMHO, both should be used in only one way on every
system, don't you think?

Yes, and that's how it works.

Post by Marcos Douglas B. Santos via Lazarus
This is an ugly trick... but I understood what you mean.

This was about the explicit temporary UnicodeString variable for
WinAPI call parameters.
No, it is not ugly, the code remains 100% compatible with Delphi.
Please remember also that direct WinAPI calls are not needed in
cross-platform code.

Juha
--

Marcos Douglas B. Santos via Lazarus

2017-08-16 14:13:29 UTC

On Wed, Aug 16, 2017 at 6:12 AM, Juha Manninen via Lazarus

Post by Juha Manninen via Lazarus
On Mon, Aug 14, 2017 at 4:11 PM, Marcos Douglas B. Santos via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
Unicode everywhere and you using AnsiString and doing everything...
Now I'm confused.

http://wiki.freepascal.org/Unicode_Support_in_Lazarus
I have advertised it so much that some people are already irritated,
but maybe you missed it so far.

Thanks. I know about this page... unfortunately it looks like it is not
enough, since many others still complain.

Post by Juha Manninen via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
This is an ugly trick... but I understood what you mean.

This was about the explicit temporary UnicodeString variable for
WinAPI call parameters.
No, it is not ugly, the code remains 100% compatible with Delphi.
Please remember also that direct WinAPI calls are not needed in
cross-platform code.

This thread is not only about WinAPI. I have this problem because I
need to use a third-party Windows library, which uses WideString.

Best regards,
Marcos Douglas
--

Juha Manninen via Lazarus

2017-08-16 14:37:07 UTC

On Wed, Aug 16, 2017 at 5:13 PM, Marcos Douglas B. Santos via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
Thanks. I know about this page... unfortunately it looks like it is not
enough, since many others still complain.

What is missing? I can try to improve it.

Post by Marcos Douglas B. Santos via Lazarus
This thread is not only about WinAPI. I have this problem because I
need to use a third-party Windows library, which uses WideString.

Then just use WideString or UnicodeString where needed. It is not a problem.

Note, WideString is for OLE programming. Most often you should use
UnicodeString. Their memory management differs.

Juha
--

Marcos Douglas B. Santos via Lazarus

2017-08-16 14:48:43 UTC

On Wed, Aug 16, 2017 at 11:37 AM, Juha Manninen via Lazarus

Post by Juha Manninen via Lazarus
On Wed, Aug 16, 2017 at 5:13 PM, Marcos Douglas B. Santos via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
Thanks. I know about this page... unfortunately it looks like it is not
enough, since many others still complain.

What is missing? I can try to improve it.

I cannot speak for others, but for now I have this issue (about WideString).

Post by Juha Manninen via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
This thread is not only about WinAPI. I have this problem because I
need to use a third-party Windows library, which uses WideString.

Then just use WideString or UnicodeString where needed. It is not a problem.

Are you saying that I need to do this?
(following the first example in this thread)

=== begin ===
var
  U: UnicodeString;
  W: WideString;
begin
  U := IniFile.ReadString('TheLib', 'license', '');
  W := U;
  Lib.SetLicense(W);
  // ...
end;
=== end ===

...and I will not get a "Warning", right?

Post by Juha Manninen via Lazarus
Note, WideString is for OLE programming. Most often you should use
UnicodeString. Their memory management differs.

Ok... thanks... but in my case it is an OLE object that I need to use.

Best regards,
Marcos Douglas
--

Juha Manninen via Lazarus

2017-08-16 15:38:15 UTC

On Wed, Aug 16, 2017 at 5:48 PM, Marcos Douglas B. Santos via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
I cannot speak for others, but for now I have this issue (about WideString).

The section "Calling Windows API" says:
'Only the "W" versions of Windows API functions should be called. It
is like in Delphi except that you must assign strings to/from API
calls to UnicodeString variables or typecast with UnicodeString().'
Then it also explains the difference between WideString and UnicodeString.
I should add a mention about PWideChar parameters.
Anyway the idea is to keep the information useful and dense. Earlier
it was bloated and intimidating.

Post by Marcos Douglas B. Santos via Lazarus
Are you saying that I need to do this?
(following the first example in this thread)

No, if the parameter is WideString, not a pointer PWideChar, you can
just call it like you did. Suppress the warning as Mattias described if it
bothers you. You can also make a helper function so the conversion
happens in one place.
Yes, for OLE you need WideString.
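
One possible shape for such a helper, as a sketch only, assuming FPC 3.0 or later. ToWide is a hypothetical name, and the call to Lib.SetLicense is shown as a comment because Lib is the poster's own COM wrapper and is not available here.

```pascal
program LicenseHelper;
{$mode objfpc}{$H+}

// Hypothetical helper: the single place where string -> WideString happens.
// The explicit cast is the same one suggested earlier in the thread for
// avoiding the "Implicit string type conversion" warning.
function ToWide(const S: string): WideString;
begin
  Result := WideString(S);
end;

var
  W: WideString;
begin
  // In the real program this would be something like:
  //   Lib.SetLicense(ToWide(IniFile.ReadString('TheLib', 'license', '')));
  W := ToWide('my_licence_here');
  WriteLn(Length(W));
end.
```

Keeping the cast in one function means the call sites stay free of casts, and the conversion can be changed later in a single spot if the ini file encoding ever changes.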

Juha
--

Mattias Gaertner via Lazarus

2017-08-15 09:10:45 UTC

On Sat, 12 Aug 2017 17:56:58 -0300

Post by Marcos Douglas B. Santos via Lazarus
[...]

Post by Mattias Gaertner via Lazarus
Which one? Do you mean Windows CP-1252?

Yes...
But would it make any difference?

Just

Post by Marcos Douglas B. Santos via Lazarus

Post by Mattias Gaertner via Lazarus

Post by Marcos Douglas B. Santos via Lazarus
[...]
Warning: Implicit string type conversion from "AnsiString" to "WideString"

Lib.SetLicense(
WideString(IniFile.ReadString('TheLib', 'license', ''))
);

Wow... everywhere? :(

You could instead define an overloaded Lib.SetLicense(AnsiString). Or
you could disable this hint altogether for your project (not
recommended). Select the message in the Messages window. Right click
and click on add -vm....

Mattias
--



Luca Olivetti via Lazarus 2017-08-15 20:08:52 UTC

Luca Olivetti via Lazarus 2017-08-15 20:10:45 UTC

Luca Olivetti via Lazarus 2017-08-15 20:09:43 UTC

Michael Schnell via Lazarus 2017-08-16 08:51:52 UTC

Graeme Geldenhuys via Lazarus 2017-08-15 20:45:33 UTC

Luca Olivetti via Lazarus 2017-08-15 22:41:48 UTC

Graeme Geldenhuys via Lazarus 2017-08-15 23:17:32 UTC

Luca Olivetti via Lazarus 2017-08-16 18:26:50 UTC

Luca Olivetti via Lazarus 2017-08-16 18:28:27 UTC

Luca Olivetti via Lazarus 2017-08-16 22:46:33 UTC

Graeme Geldenhuys via Lazarus 2017-08-16 23:38:06 UTC

Graeme Geldenhuys via Lazarus 2017-08-16 23:34:42 UTC

Michael Schnell via Lazarus 2017-08-16 09:01:26 UTC

Michael Van Canneyt via Lazarus 2017-08-16 09:06:41 UTC

Bo Berglund via Lazarus 2017-08-16 05:53:11 UTC

Bo Berglund via Lazarus 2017-08-16 06:01:24 UTC

Juha Manninen via Lazarus 2017-08-16 08:59:20 UTC

Michael Schnell via Lazarus 2017-08-16 08:47:37 UTC

Mattias Gaertner via Lazarus 2017-08-16 08:58:13 UTC

Michael Schnell via Lazarus 2017-08-16 09:09:17 UTC

Mattias Gaertner via Lazarus 2017-08-16 09:32:16 UTC

Michael Schnell via Lazarus 2017-08-16 09:41:10 UTC

wkitty42--- via Lazarus 2017-08-15 17:53:21 UTC

Michael Schnell via Lazarus 2017-08-16 09:12:40 UTC

Juha Manninen via Lazarus 2017-08-16 10:22:34 UTC

Michael Schnell via Lazarus 2017-08-16 10:26:48 UTC

Juha Manninen via Lazarus 2017-08-14 21:01:55 UTC

Tony Whyman via Lazarus 2017-08-15 09:15:36 UTC

Michael Schnell via Lazarus 2017-08-15 09:57:09 UTC

Bart via Lazarus 2017-08-15 12:02:16 UTC

Michael Schnell via Lazarus 2017-08-15 12:26:34 UTC

Mattias Gaertner via Lazarus 2017-08-15 12:53:00 UTC

Michael Van Canneyt via Lazarus 2017-08-15 12:53:38 UTC

Michael Schnell via Lazarus 2017-08-15 14:44:30 UTC

Mattias Gaertner via Lazarus 2017-08-15 16:33:25 UTC

Michael Schnell via Lazarus 2017-08-16 09:26:46 UTC

Michael Schnell via Lazarus 2017-08-15 12:30:40 UTC

Juha Manninen via Lazarus 2017-08-16 10:05:43 UTC

Graeme Geldenhuys via Lazarus 2017-08-16 11:02:33 UTC

Marcos Douglas B. Santos via Lazarus 2017-08-14 17:41:34 UTC

Juha Manninen via Lazarus 2017-08-16 09:12:20 UTC

Marcos Douglas B. Santos via Lazarus 2017-08-16 14:13:29 UTC

Juha Manninen via Lazarus 2017-08-16 14:37:07 UTC

Marcos Douglas B. Santos via Lazarus 2017-08-16 14:48:43 UTC

Juha Manninen via Lazarus 2017-08-16 15:38:15 UTC

Mattias Gaertner via Lazarus 2017-08-15 09:10:45 UTC

about - legalese

Loading...