How to format specific text so that it displays correctly?


adoxa

Let me say it for the third time, also repeating Teddy: WideCharCount IS CHARACTERS NOT BYTES! When you use -1 it effectively does lstrlen itself, but it also includes the null terminator.

WideCharCount is how many characters (16-bit units) are in your source wide string (-1 to determine automatically); MultiByteCount is how big your buffer is (in bytes) for the resulting UTF-8 string.
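
To make the two units concrete, here is a small Python sketch. It is not the Win32 API, and the helper names are made up for illustration; it only counts what WideCharCount and MultiByteCount measure for the same text, without the null terminator.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units: what WideCharCount counts (no null)."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text: str) -> int:
    """Number of UTF-8 bytes: what the MultiByteCount buffer must hold (no null)."""
    return len(text.encode("utf-8"))


for sample in ("TEST", "ƒ", "💋"):
    print(repr(sample), utf16_units(sample), utf8_bytes(sample))
```

The two numbers only agree for plain ASCII text, which is why swapping them goes unnoticed until a symbol shows up.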

Teddy Rogers
10 hours ago, LCF-AT said:

Also in your code @Teddy Rogers I don't see any check what you got in EAX after second call to WideCharToMultiByte.

I used "Debug" to visually show the second return value of WideCharToMultiByte; if successful, it returns the number of bytes written to the buffer. See my screenshot using the "KISS MARK" below.

10 hours ago, LCF-AT said:

Also you do len the string using the value with the first call to WideCharToMultiByte instead of using -1h. Why?

"Len" is a PB command to obtain the number of characters in a string, without the null terminator. I use this value in cchWideChar because the function needs to know how many characters are in the string.

10 hours ago, LCF-AT said:

Just when I use a value of 0Bh instead of 5 (the function told me I could use) it does work. The question is why does it need 0Bh bytes to write lots of bytes I even don't need!?

In your first call to WideCharToMultiByte, cchWideChar is "-1". The function has worked out the string length to be three characters (two UTF-16 units for the symbol plus the null terminator), requiring five bytes in the buffer to store the converted string.

In the second call to WideCharToMultiByte you are passing "five" in cchWideChar, a number that was intended for cbMultiByte, the buffer size for storing the converted string. Now the function believes you have five characters in the string you are trying to convert, with a buffer size of five bytes. When the end of the buffer is reached it fails with ERROR_INSUFFICIENT_BUFFER. Those extra bytes look to be whatever is behind the first half of the null terminator.
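
The size arithmetic described here can be reproduced outside Windows. The Python sketch below uses hypothetical helper names and only mimics the counts; it shows why an explicit length gives 2 and 4 while "-1" gives 3 and 5 for the same symbol.

```python
def wide_units(text: str, include_null: bool) -> int:
    """UTF-16 code units in text, optionally counting the terminator."""
    units = len(text.encode("utf-16-le")) // 2
    return units + 1 if include_null else units


def utf8_size(text: str, include_null: bool) -> int:
    """UTF-8 bytes needed for text, optionally counting the terminator."""
    size = len(text.encode("utf-8"))
    return size + 1 if include_null else size


kiss = "\U0001F48B"  # KISS MARK

# Explicit length (lstrlenW / Len): the terminator is not counted.
print(wide_units(kiss, False), utf8_size(kiss, False))  # 2 4

# cchWideChar = -1: the terminator is counted on both sides.
print(wide_units(kiss, True), utf8_size(kiss, True))    # 3 5
```

Feeding the 5 back in as a character count asks the function to read five UTF-16 units from a string that only has three, which is where the stray bytes come from.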

You still have your characters and bytes mixed up...

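EnableExplicit

Define *lpMultiByteStr        ; destination buffer for the UTF-8 bytes
Define cbMultiByte.i          ; size of that buffer, in bytes
Define lpWideCharStr.s        ; source string (UTF-16)
Define cchWideChar.i          ; length of the source, in characters

lpWideCharStr = "💋"
cchWideChar = Len(lpWideCharStr)

; First call: no buffer, returns the number of bytes required.
cbMultiByte = WideCharToMultiByte_(#CP_UTF8, #Null, @lpWideCharStr, cchWideChar, #Null, #Null, #Null, #Null)
Debug cchWideChar
If cbMultiByte
  Debug cbMultiByte
  *lpMultiByteStr = AllocateMemory(cbMultiByte)
  
  If *lpMultiByteStr
    
    ; Second call: same cchWideChar, buffer size in cbMultiByte.
    Debug WideCharToMultiByte_(#CP_UTF8, #Null, @lpWideCharStr, cchWideChar, *lpMultiByteStr, cbMultiByte, #Null, #Null)

    ; The buffer has no null terminator, so read it by byte length.
    Debug PeekS(*lpMultiByteStr, cbMultiByte, #PB_UTF8 | #PB_ByteLength)
    ShowMemoryViewer(*lpMultiByteStr, cbMultiByte)
    
    ; Free only what was actually allocated.
    FreeMemory(*lpMultiByteStr)
  EndIf
EndIf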

image.png

Ted.

LCF-AT

Hey again,

oh boy! That's so "I paint the town red" confusing! :kick:

Let's summarize.

💋

1.) I use lstrlenW = 2 in EAX to get the cchWideChar length
2.) Using cchWideChar (2) with WideCharToMultiByte, which returns
	the cbMultiByte length (4) I would need as the buffer length
3.) Alloc that cbMultiByte length of 4 or more
4.) Calling WideCharToMultiByte with 2 & 4

The code would be like this....

invoke lstrlenW,addr _some_  ; 💋
mov esi, eax
invoke WideCharToMultiByte,CP_UTF8,0,addr _some_,esi,0,0,0,0
mov ebx, eax
mov edi, alloc(eax)
invoke WideCharToMultiByte,CP_UTF8,0,addr _some_,esi,edi,ebx,0,0

...below like this...

0019FF4C   00401163  /CALL to lstrlenW from bones.0040115E
0019FF50   0045F3AE  \String = "??" 💋
=
EAX = 2 length I need to use for WideCharCount

0019FF30   0040117F  /CALL to WideCharToMultiByte from bones.0040117A
0019FF34   0000FDE9  |CodePage = FDE9
0019FF38   00000000  |Options = 0
0019FF3C   0045F3AE  |WideCharStr = "??"
0019FF40   00000002  |WideCharCount = 2		<-- 2 from lstrlen before
0019FF44   00000000  |MultiByteStr = NULL
0019FF48   00000000  |MultiByteCount = 0
0019FF4C   00000000  |pDefaultChar = NULL
0019FF50   00000000  \pDefaultCharUsed = NULL
=
EAX = 4 length I need to use for MultiByteCount next time

0019FF30   004011A3  /CALL to WideCharToMultiByte from bones.0040119E
0019FF34   0000FDE9  |CodePage = FDE9
0019FF38   00000000  |Options = 0
0019FF3C   0045F3AE  |WideCharStr = "??"
0019FF40   00000002  |WideCharCount = 2		<-- 2 from lstrlen
0019FF44   005046A8  |MultiByteStr = 005046A8	<-- alloc buffer of 4 bytes
0019FF48   00000004  |MultiByteCount = 4	<-- buffer length
0019FF4C   00000000  |pDefaultChar = NULL
0019FF50   00000000  \pDefaultCharUsed = NULL
=
EAX = 4 bytes were written into the new buffer of MultiByteStr 005046A8

...so is this correct so far? I think so. Why didn't you guys tell me like this before? :) What a HARD BIRTH!

But when using the MultiByteToWideChar function I have to double the alloc size, like this...

invoke lstrlenA,ansi$("TEST")
mov esi, eax
invoke MultiByteToWideChar,CP_UTF8,0,ansi$("TEST"),esi,0,0
add eax, eax		<--- double size to alloc
mov ebx, eax
mov edi, alloc(ebx)
invoke MultiByteToWideChar,CP_UTF8,0,ansi$("TEST"),esi,edi,ebx

....maybe the best approach is really just to use -1 in all cases with those two functions and let the function calculate the length, and I just need to subtract the null terminator.

greetz

Teddy Rogers
2 hours ago, LCF-AT said:

...so is this correct so far? I think so. Why didn't you guys tell me like this before?

The issue was picked up multiple times. Plus you have the API docs to refer to, with code examples.

Looping back a bit, are you sure your text is "ANSI"? You won't be able to convert "ANSI" characters/ control codes by passing CP_UTF8 to MultiByteToWideChar. On their own those bytes are not valid UTF-8, so you will likely get $FFFD for everything after the ASCII characters (0-127). You will need to use CP_ACP...

Ted.

adoxa

Your MultiByteToWideChar is still wrong: you do need to double the returned length for the allocation (wide characters to bytes), but you still need to pass in the returned length (the buffer is in wide characters, not bytes).

LCF-AT

Hi,

so what about that MultiByteToWideChar function now? Wrong?! Yes, ansi$("TEST") is an ANSI string that I want to convert to WideChar using the CP_UTF8 code page, because I saved everything as CP_UTF8 before and I also want to handle symbol chars like this 💋, if there are any, in my UNICODE app. In this case I'm using MultiByteToWideChar with CP_UTF8 too. Why do you say it's wrong, Ted? It's working for me. Maybe it's not important whether I use CP_UTF8 or CP_ACP when reading ANSI text and CP_UTF8 text, which are pretty much the same.

So I converted my UNICODE text / symbols from my Listview to CP_UTF8 with the WideCharToMultiByte function, as we have been discussing all along. Now my content is converted and saved into a content file which uses the UTF-8 code page (which I need). When I start the app again it must read that content file back into the listview, and here I'm using CP_UTF8 again with the MultiByteToWideChar function to convert the entire content file into the WideChar format I need in my UNICODE-style app. What is wrong here? When I use CP_ACP instead of CP_UTF8, it no longer displays symbols like this 💋....

0069B090              F0 9F 92 8B                 	💋		<--- 💋 ReadFile
to // when using MultiByteToWideChar with CP_ACP I get that below...
0069B168              F0 00 78 01 19 20 39 20           ð.x 9		<--- Not displaying 💋 in LV

0069B090              F0 9F 92 8B                 	💋		<--- 💋 ReadFile
to // when using MultiByteToWideChar with CP_UTF8 I get that below...
005CB258              3D D8 8B DC                       =Ø‹Ü		<--- Does Display 💋 in LV

...you see? Guys, if it is as simple as you say, then just post a short example so I can see it.
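
The two dumps above can be reproduced with a few lines of Python. This is only a cross-check, and it assumes the system's CP_ACP is Windows-1252, which the second dump suggests but the thread does not state.

```python
raw = bytes.fromhex("F09F928B")   # the four bytes read from the file

as_utf8 = raw.decode("utf-8")     # comparable to CP_UTF8
as_1252 = raw.decode("cp1252")    # comparable to CP_ACP on a Windows-1252 system

# Show the resulting UTF-16 (little-endian) bytes, as a memory dump would.
print(as_utf8.encode("utf-16-le").hex(" "))  # 3d d8 8b dc
print(as_1252.encode("utf-16-le").hex(" "))  # f0 00 78 01 19 20 39 20
```

Read as UTF-8, the four bytes form one symbol stored as a surrogate pair. Read as Windows-1252, each byte becomes its own character, so the symbol is lost.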

@adoxa

16 hours ago, adoxa said:

Your MultiByteToWideChar is still wrong: you do need to double the returned length for the allocation (wide characters to bytes), but you still need to pass in the returned length (the buffer is in wide characters, not bytes).

I did double the space after the MultiByteToWideChar function. It tells me I need X size (wCHAR) and I double it via add eax,eax, and in EBX I still have the wCHAR size (not doubled) which I'm using on the second call to the MultiByteToWideChar function. Did you not see this, or what do you mean? Come on guys, don't make it harder to understand and handle than it should be. Thanks.

greetz

adoxa
4 hours ago, LCF-AT said:

in EBX I still have the wCHAR size (not doubled)

Really? Did you change it from what you posted?

On 5/5/2024 at 7:09 AM, LCF-AT said:
invoke MultiByteToWideChar,CP_UTF8,0,ansi$("TEST"),esi,0,0
add eax, eax		<--- double size to alloc
mov ebx, eax
mov edi, alloc(ebx)

So maybe this is what you thought you did, not what you actually did?

invoke MultiByteToWideChar,CP_UTF8,0,ansi$("TEST"),esi,0,0
mov ebx, eax
add eax, eax		<--- double size to alloc
mov edi, alloc(eax)
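
The distinction between the two registers comes down to two different numbers. The Python sketch below uses a hypothetical helper, not a Win32 call, to show both for the "TEST" example.

```python
def wide_buffer_sizes(source: bytes, codepage: str = "utf-8") -> tuple:
    """Return (cch, cb) for the UTF-16 result of converting source.

    cch: wide-character count, the value the sizing call returns and the
         value to pass as the buffer size on the second call.
    cb:  byte size of that buffer, the value to hand to the allocator.
    """
    text = source.decode(codepage)
    cch = len(text.encode("utf-16-le")) // 2
    cb = cch * 2
    return cch, cb


cch, cb = wide_buffer_sizes(b"TEST")
print(cch, cb)  # 4 8
```

In the assembly above, EBX should hold the first number and the allocation should use the second.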

 

Teddy Rogers
On 5/6/2024 at 4:06 AM, LCF-AT said:

so what about that MultiByteToWideChar function now? Wrong?! Yes, ansi$("TEST") is an ANSI string that I want to convert to WideChar using the CP_UTF8 code page, because I saved everything as CP_UTF8 before and I also want to handle symbol chars like this 💋, if there are any, in my UNICODE app. In this case I'm using MultiByteToWideChar with CP_UTF8 too. Why do you say it's wrong, Ted? It's working for me. Maybe it's not important whether I use CP_UTF8 or CP_ACP when reading ANSI text and CP_UTF8 text, which are pretty much the same.

When you are referring to ANSI do you actually mean the displayable/ printable 7-bit ASCII Latin-1 character set within Windows-1252 code-page or the ANSI supplementary set above it?

ANSI in Windows-1252 and ISO-8859-1 code-pages is the 8-bit character set that supplements the 7-bit ASCII character set.

If you don't know what I mean it may be worth doing a Google and reading up on these code-pages; Windows-1252, UTF-8 and UTF-16.

Windows-1252 code-page is segregated like this...

;     0-31    - ASCII Control Characters
;             - Control characters (not intended for display or printing).
;     32-126  - ASCII Characters
;             - Display and printable characters.
;     127     - ASCII Control Character
;             - Control character (not intended for display or printing).
;     128-159 - ANSI Characters
;             - Mostly printable characters in Windows-1252 (e.g. € and ƒ); control characters in ISO-8859-1.
;     160-255 - ANSI Characters
;             - Windows-1252 and ISO-8859-1 characters.
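
The 128-159 range is where Windows-1252 and ISO-8859-1 differ, which matters for characters such as ƒ. A short Python check, using Python's own "cp1252" and "latin-1" codecs as stand-ins for the two code-pages, shows the difference for a few sample bytes.

```python
import unicodedata

for value in (0x80, 0x83, 0x9F):
    raw = bytes([value])
    win = raw.decode("cp1252")    # Windows-1252 interpretation
    iso = raw.decode("latin-1")   # ISO-8859-1 interpretation
    # Category "Cc" means control character.
    print(hex(value), win, unicodedata.category(win), unicodedata.category(iso))
```

In ISO-8859-1 all three bytes are control characters, while in Windows-1252 they are printable.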

We can generate every byte value of the Windows-1252 code-page quite easily with a bit of code...

cbMultiByte = 256
*lpMultiByteStr_1252 = AllocateMemory(cbMultiByte)

If *lpMultiByteStr_1252
  
  For Char = 0 To 255
    PokeB(*lpMultiByteStr_1252+Char, Char)
  Next Char
  
  Debug PeekS(*lpMultiByteStr_1252+32, cbMultiByte-32, #PB_Ascii)
  ShowMemoryViewer(*lpMultiByteStr_1252, cbMultiByte)
  
  FreeMemory(*lpMultiByteStr_1252)
EndIf

We can see the results in the "Memory Viewer" in the screenshot below, with the displayable/ printable (ASCII and ANSI) characters shown in the "Debug Output" window, starting at offset 32.

image.png

If we use MultiByteToWideChar and tell it the text being converted is UTF8 (CP_UTF8) this is how all the ASCII and ANSI characters of a Windows-1252 code-page will be mapped to UTF-16...

EnableExplicit

Define *lpMultiByteStr
Define *lpWideCharStr
Define cbMultiByte.i
Define cchWideChar.i
Define Char.l

cbMultiByte = 256
*lpMultiByteStr = AllocateMemory(cbMultiByte)

If *lpMultiByteStr
  
  ; Create the character set 0 through to 255 (0xFF).
  
  For Char = 0 To 255
    PokeB(*lpMultiByteStr+Char, Char)
  Next Char

  ; Get the required buffer size, in characters. For CP_UTF8 the API docs
  ; require dwFlags to be 0 or MB_ERR_INVALID_CHARS, so pass 0 here.
  cchWideChar = MultiByteToWideChar_(#CP_UTF8, 0, *lpMultiByteStr, cbMultiByte, #Null, #Null)
  
  If cchWideChar
    
    ; Allocate two bytes per wide character; cchWideChar itself stays in characters.
    *lpWideCharStr = AllocateMemory(cchWideChar * 2)
    
    If *lpWideCharStr
      
      Debug MultiByteToWideChar_(#CP_UTF8, 0, *lpMultiByteStr, cbMultiByte, *lpWideCharStr, cchWideChar)
      
      Debug PeekS(*lpWideCharStr+64, cchWideChar-32, #PB_Unicode)
      ShowMemoryViewer(*lpWideCharStr, cchWideChar * 2)
      
      ; Free only what was actually allocated.
      FreeMemory(*lpWideCharStr)
    EndIf
  EndIf
  FreeMemory(*lpMultiByteStr)
EndIf

image.png

In the screenshot above you can see the ANSI characters/ controls are all 0xFFFD. The reason is that those byte values (128-255), standing on their own, are not valid UTF-8 sequences, so they can't be mapped to UTF-16. 0xFFFD is the Unicode replacement character for unknown or undecodable input and is visually represented by the question mark within the diamond.

image.png
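
The same effect can be seen without Windows. The Python sketch below decodes all 256 byte values as UTF-8 with replacement enabled. Python's decoder is not the Win32 one, so the exact number of replacement characters Windows produces may differ.

```python
all_bytes = bytes(range(256))

# errors="replace" substitutes U+FFFD for every undecodable sequence.
decoded = all_bytes.decode("utf-8", errors="replace")

# The ASCII half survives unchanged...
print(decoded[:128] == all_bytes[:128].decode("ascii"))  # True

# ...and every byte from 128 to 255 becomes a replacement character.
print(decoded.count("\ufffd"))  # 128
```

A strict decode of the same bytes (without errors="replace") raises UnicodeDecodeError at byte 0x80, which is a quick way to tell that the input was never UTF-8 in the first place.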

How do we map the ANSI code-page to UTF8? First use MultiByteToWideChar to convert to UTF-16 using CP_ACP (or the Windows-1252 code-page explicitly, if it is not the system default), which maps the ANSI characters (the ones that came out as 0xFFFD above) to their proper code points in UTF-16.

Once that is complete, use WideCharToMultiByte to convert from UTF-16 to UTF-8...

EnableExplicit

Define *lpMultiByteStr_1252
Define *lpMultiByteStr_UTF8
Define *lpWideCharStr_UTF16
Define cbMultiByte.i
Define cbMultiByte_UTF8.i
Define cchWideChar.i
Define Char.l

; Create the Windows-1252 character set 0 through to 255 (0xFF).
;     0-31    - ASCII Control Characters
;             - Control characters (not intended for display or printing).
;     32-126  - ASCII Characters
;             - Printable characters.
;     127     - ASCII Control Character
;             - Control character (not intended for display or printing).
;     128-159 - ANSI Characters
;             - Mostly printable characters in Windows-1252 (e.g. € and ƒ); control characters in ISO-8859-1.
;     160-255 - ANSI Characters
;             - Windows-1252 and ISO-8859-1 characters.

cbMultiByte = 256
*lpMultiByteStr_1252 = AllocateMemory(cbMultiByte)

If *lpMultiByteStr_1252
  
  For Char = 0 To 255
    PokeB(*lpMultiByteStr_1252+Char, Char)
  Next Char
  
  ; Get the required buffer size, in characters, for *lpWideCharStr_UTF16.
  
  cchWideChar = MultiByteToWideChar_(#CP_ACP, #MB_PRECOMPOSED, *lpMultiByteStr_1252, cbMultiByte, #Null, #Null)
  
  If cchWideChar
    
    ; Allocate two bytes per UTF-16 character; cchWideChar itself stays in characters.
    
    *lpWideCharStr_UTF16 = AllocateMemory(cchWideChar * 2)
    
    If *lpWideCharStr_UTF16
      
      ; Convert the string and update cchWideChar with the number of characters written to *lpWideCharStr_UTF16.
      
      cchWideChar = MultiByteToWideChar_(#CP_ACP, #MB_PRECOMPOSED, *lpMultiByteStr_1252, cbMultiByte, *lpWideCharStr_UTF16, cchWideChar)
      
      If cchWideChar
        
        ; Get the required buffer size, in bytes, for *lpMultiByteStr_UTF8.
        
        cbMultiByte_UTF8 = WideCharToMultiByte_(#CP_UTF8, #Null, *lpWideCharStr_UTF16, cchWideChar, #Null, #Null, #Null, #Null)
        
        If cbMultiByte_UTF8
          
          ; Allocate the memory.
          
          *lpMultiByteStr_UTF8 = AllocateMemory(cbMultiByte_UTF8)
          
          If *lpMultiByteStr_UTF8
            If WideCharToMultiByte_(#CP_UTF8, #Null, *lpWideCharStr_UTF16, cchWideChar, *lpMultiByteStr_UTF8, cbMultiByte_UTF8, #Null, #Null)
              
              Debug "Win-1252: " + PeekS(*lpMultiByteStr_1252+32, 224, #PB_Ascii)
              Debug "UTF-16  : " + PeekS(*lpWideCharStr_UTF16+64, cbMultiByte-32, #PB_Unicode)
              Debug "UTF-8   : " + PeekS(*lpMultiByteStr_UTF8+32, cbMultiByte_UTF8-32, #PB_UTF8 | #PB_ByteLength)
              
              ;ShowMemoryViewer(*lpMultiByteStr_1252, 256)
              ;ShowMemoryViewer(*lpWideCharStr_UTF16, cbMultiByte)
              ShowMemoryViewer(*lpMultiByteStr_UTF8, cbMultiByte_UTF8)
              
            EndIf
            FreeMemory(*lpMultiByteStr_UTF8)
          EndIf
        EndIf
      EndIf
      FreeMemory(*lpWideCharStr_UTF16)
    EndIf
  EndIf
  FreeMemory(*lpMultiByteStr_1252)
EndIf

image.png

To sum up all the above... if you are sure your text contains ANSI characters give MultiByteToWideChar the correct code-page to use.
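
The two-step route (ANSI to UTF-16, then UTF-16 to UTF-8) can be sketched in Python as well. The function name and the default "cp1252" code-page are assumptions for illustration; on Windows the first step corresponds to MultiByteToWideChar with the ANSI code-page and the second to WideCharToMultiByte with CP_UTF8.

```python
def ansi_to_utf8(data: bytes, ansi_codepage: str = "cp1252") -> bytes:
    """Re-encode ANSI bytes as UTF-8 by going through Unicode text."""
    text = data.decode(ansi_codepage)  # step 1: ANSI bytes -> Unicode
    return text.encode("utf-8")        # step 2: Unicode -> UTF-8 bytes


# 0x83 is ƒ, 0x80 is € and 0xE9 is é in Windows-1252.
print(ansi_to_utf8(b"\x83\x80 caf\xe9").hex(" "))
# c6 92 e2 82 ac 20 63 61 66 c3 a9
```

Each ANSI byte above 127 grows to two or three bytes in UTF-8, so the output buffer has to be sized from the conversion, not from the input length.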

If you intend to convert ASCII/ ANSI strings you may be better off using the ENG functions, as they have simpler parameter requirements: EngWideCharToMultiByte, EngMultiByteToWideChar, EngMultiByteToUnicodeN. See the attached example...

Ted.

EngMultiByteToUnicodeN.zip

LCF-AT

Hey Ted,

thanks again, Ted. Lots of input, which is confusing me more and more now. Alright, so I thought ANSI = ASCII, just another term. So at the moment I'm just using the CP_UTF8 code page flag for both APIs: MultiByteToWideChar (reading the content file I saved before) and WideCharToMultiByte (converting my app's internal UNICODE text / chars to export them into my content file). The text I have to deal with is already all UNICODE, and all chars are displayed correctly so far in edit controls / listview etc.

UNICODE text / char || WideCharToMultiByte,CP_UTF8 || WriteFile = Content Export text is CP_UTF8

CP_UTF8 text / char || ReadFile || MultiByteToWideChar,CP_UTF8 = Content Export is UNICODE

So this is how I use the functions now and it seems to work, so I don't see any issues yet.

Besides that problem (which I still don't know how to handle 100% correctly, see above), I found another display problem. Look at the image below...

U1_2024-05-06_223802.png.0c0d1ebeb2b819519065b74931cc9e20.png

...so here you can see my listview above with the text / symbols I added, and all are displayed correctly so far, also in the EDIT control below: the selected command with ƒƒƒƒƒƒ is displayed in the EDIT control below, and everything uses the same buffer. Now I want to edit entry 5 in my listview and double-click it to open the new EditBox DialogBox, where you can see 2 EDIT controls displaying the same content of the selected LV entry 5, but in this case (the new dialog box's EDIT controls) it's not displaying the ƒƒƒƒƒƒ chars correctly! But why? What is the problem here? The EDIT control under the LV does display the ƒƒƒƒƒƒ and the other one does not, but both use the same style / ex-style values and the same buffer. Do you know what the problem in this case is?

greetz

Teddy Rogers
18 hours ago, LCF-AT said:

Do you know what the problem in this case is?

What is the character code/ hex value of the incorrect character?

Assuming all the functions used are Unicode and the edit control styles are all good, I would check that the font is capable of displaying the character "ƒ"; try another font....

Ted.

LCF-AT

Hey Ted,

you are right! The other dialog window is using a different font than the main dialog window....

FONT 8,"Tahoma"		<-- IDD_MAIN 	DIALOGEX
FONT 8,"MS Sans Serif"	<-- IDD_DLGEDIT DIALOGEX

....

U2_2024-05-06_223802.png.22806d7ba0e82840f4da4e44d3f6c42a.png

...somehow funny. :) Just because of the font used. Thank you for that info, Ted. I don't want to keep bothering you with that WideChar / MultiByte function stuff, but do you have any other advice on how to deal with it? I know you have already told me a lot about it, and it looks like my method of using the functions works so far, but I'm still not sure about it because you said something different. I would like to handle those different string / code page things properly.

greetz

Teddy Rogers
On 5/8/2024 at 2:42 AM, LCF-AT said:

do you have any other advice on how to deal with it?

Read the API docs and function parameters carefully.

What else did you want to know?

Ted.

LCF-AT

Hey Ted,

20 hours ago, Teddy Rogers said:

What else did you want to know?

everything I need to know to prevent possible random errors / crashes while using those functions to read / save text / chars etc. One more question about that small app you made to show all ASCII characters 0-127 / 128-255 (ANSI). So I can use all of those chars without any issue as ASCII or UNICODE. Before you said I have to use CP_ACP instead of CP_UTF8 but in this case my app does display wrong chars. Have a look at the image I made below...

X1_2024-05-09_225512.png.6fd435bb85fd8ed975a0edc6170d8267.png

....above you can see the content file I created with my app (UNICODE) to export all text / chars to UTF8 (just ignore the /*=\ combos, which are just markers for the start / end of content parts). In the middle you can see the content read back using the CP_UTF8 flag as well: everything gets converted fine and all looks the same as in NOTEPAD itself. Below you can see what happens when I use the CP_ACP flag: in this case it does not display all chars correctly as before. Yes, of course it's mixed, but that is just how it is when using text + symbols in titles or text like "ABC 𝐔𝐋𝐓𝐑𝐀 💋 is :thumbs:". In this case I cannot just use ACP, right? My main goal is just to read / save / display all text / chars / symbols, whatever they are called, correctly as they really look, without displaying any strange symbols like in my listview image below using CP_ACP. That's all. My question now is whether I'm right so far in just using the CP_UTF8 flag to save / read that random mixed text etc.? So far it seems to work. So what do you say now? Am I safe with that method, without running into some error, or not?

greetz

Teddy Rogers
2 hours ago, LCF-AT said:

Before you said I have to use CP_ACP instead of CP_UTF8

That was in the context of the encoding using characters within the Windows-1252 code-page above the ASCII range, i.e. the actual ANSI characters.

2 hours ago, LCF-AT said:

My question now is whether I'm right so far in just using the CP_UTF8 flag to save / read that random mixed text etc.? So far it seems to work. So what do you say now? Am I safe with that method, without running into some error, or not?

If you are sure the encoding of the source is UTF8 then you should be fine specifying it as the code-page.

If the source is an HTML page you can try checking in your code whether it has declared the character encoding in the header using "meta charset", with one of these values...

<meta charset="UTF-8">
<meta charset="Windows-1252">
<meta charset="ISO-8859-1"> 

You could also check for a byte order mark (BOM) set in the HTML page...

https://www.w3schools.com/html/html_charset.asp
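
A minimal Python sketch of both checks follows. The function name, the regular expression and the 1024-byte scan window are assumptions for illustration, and real pages can declare their encoding in other ways (for example in the HTTP response header) that this does not cover.

```python
import codecs
import re
from typing import Optional

# Byte order marks to look for, checked in this order.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Matches <meta charset="..."> and the older http-equiv form.
_META = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def declared_encoding(page: bytes) -> Optional[str]:
    """Return the encoding named by a BOM or meta charset, or None."""
    for bom, name in _BOMS:
        if page.startswith(bom):
            return name
    match = _META.search(page[:1024])  # only scan the start of the page
    return match.group(1).decode("ascii").lower() if match else None


print(declared_encoding(b'<html><head><meta charset="Windows-1252"></head>'))
print(declared_encoding(codecs.BOM_UTF8 + b"plain text"))
print(declared_encoding(b"no declaration at all"))
```

The returned name is only a claim made by the page; the bytes still have to decode cleanly with it.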

Ted.

LCF-AT

Hey Ted,

yes, in my own case I export / import just by using the CP_UTF8 flag.

Good idea to check the charset of the HTML page. Let me ask again about text I have to deal with which is not from me (unknown) and which has no meta data / response header information. How do I deal with that? I know you already showed an example using the IsTextUnicode function (RtlIsTextUnicode in my case), but is it possible to just auto-convert the text I have to anything like UNICODE / ASCII / ANSI without checking the text itself?

greetz

Teddy Rogers

If you cannot detect the meta charset or a BOM, consider that, as of 13th May 2024, 98.2% of all websites are encoded in UTF8, with Windows-1252 and ISO-8859-1 making up 1.5%...
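
Given those odds, one common heuristic is to try a strict UTF-8 decode first and fall back to an ANSI code-page only if that fails. The Python sketch below shows the idea. The function name and the "cp1252" fallback are assumptions, and it remains a guess about the source, not a detection. On Windows, the strict step roughly corresponds to calling MultiByteToWideChar with CP_UTF8 and the MB_ERR_INVALID_CHARS flag.

```python
def decode_unknown(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else use the fallback."""
    try:
        return data.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as ANSI; unmapped bytes become U+FFFD.
        return data.decode(fallback, errors="replace")


print(decode_unknown("ABC 💋".encode("utf-8")))  # valid UTF-8, kept as is
print(decode_unknown(b"caf\xe9"))                # invalid UTF-8, read as Windows-1252
```

Plain ASCII passes the strict check either way, so the fallback only matters for bytes above 127.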

Ted.
