adoxa
Posted May 4, 2024

Let me say it for the third time, also repeating Teddy: WideCharCount IS CHARACTERS NOT BYTES! When you use -1 it effectively does lstrlen itself, but also includes the null.

WideCharCount is how many characters (16-bit units) are in your source wide string (-1 to determine automatically); MultiByteCount is how big your buffer is (in bytes) for the resulting UTF-8 string.
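The character-versus-byte distinction above can be checked without calling the Win32 API at all. The following is only a sketch in Python (not the thread's PureBasic or MASM); the variable names `wide_char_count` and `multi_byte_count` are made up for illustration and simply mirror what WideCharCount and MultiByteCount measure.

```python
# Illustration only: count UTF-16 code units and UTF-8 bytes for one string.
text = "💋"  # U+1F48B KISS MARK, outside the Basic Multilingual Plane

utf16 = text.encode("utf-16-le")  # little-endian, no BOM, as Windows stores wide strings
utf8 = text.encode("utf-8")

wide_char_count = len(utf16) // 2  # 16-bit units: the unit WideCharCount is measured in
multi_byte_count = len(utf8)       # bytes: the unit MultiByteCount is measured in

print(wide_char_count, multi_byte_count)          # 2 4
print(wide_char_count + 1, multi_byte_count + 1)  # 3 5 once the terminating null is included
```

The second line of output corresponds to passing -1, where the terminating null is counted on both sides.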
Teddy Rogers
Posted May 4, 2024

10 hours ago, LCF-AT said:
    Also in your code @Teddy Rogers I don't see any check what you got in EAX after second call to WideCharToMultiByte.

I used "Debug" to visually show the second return value of WideCharToMultiByte; if successful it returns the number of bytes written to the buffer. See my screenshot using the "KISS MARK" below.

10 hours ago, LCF-AT said:
    Also you do len the string using the value with the first call to WideCharToMultiByte instead of using -1h. Why?

"Len" is a PB command to obtain the number of characters in a string, without the null terminator. I use this value in cchWideChar because the function needs to know how many characters are in the string.

10 hours ago, LCF-AT said:
    Just when I use a value of 0Bh instead of 5 (the function told me I could use) it does work. The question is why does it need 0Bh bytes to write lots of bytes I even don't need!?

In your first call to WideCharToMultiByte, cchWideChar is "-1". The function worked out the string length as three characters (including the null), requiring five bytes in the buffer to store the converted string. In the second call to WideCharToMultiByte you are passing "five" in cchWideChar, a value that was intended for cbMultiByte, the buffer size for storing the converted string. Now the function believes you have five characters in the string you are trying to convert, with a buffer size of five bytes. When the end of the buffer is reached it fails with ERROR_INSUFFICIENT_BUFFER. Those extra bytes look to be whatever is behind the first half of the null terminator.

You still have your characters and bytes mixed up...

    EnableExplicit

    Define *lpMultiByteStr
    Define cbMultiByte.i
    Define lpMultiByteStr.s
    Define cchWideChar.i

    lpMultiByteStr = "💋"
    cchWideChar = Len(lpMultiByteStr)

    ; First call: no output buffer, returns the required buffer size in bytes.
    cbMultiByte = WideCharToMultiByte_(#CP_UTF8, #Null, @lpMultiByteStr, cchWideChar, #Null, #Null, #Null, #Null)

    Debug cchWideChar

    If cbMultiByte
      Debug cbMultiByte
      *lpMultiByteStr = AllocateMemory(cbMultiByte)

      If *lpMultiByteStr
        ; Second call: same character count, buffer size given in bytes.
        Debug WideCharToMultiByte_(#CP_UTF8, #Null, @lpMultiByteStr, cchWideChar, *lpMultiByteStr, cbMultiByte, #Null, #Null)
        ; The buffer has no null terminator, so read it by byte length.
        Debug PeekS(*lpMultiByteStr, cbMultiByte, #PB_UTF8 | #PB_ByteLength)
        ShowMemoryViewer(*lpMultiByteStr, cbMultiByte)
        ; Free only if the allocation succeeded.
        FreeMemory(*lpMultiByteStr)
      EndIf
    EndIf

Ted.
LCF-AT (Author)
Posted May 4, 2024

Hey again,

oh boy! That's so "I paint the town red" confusing! Let's summarize. 💋

1.) I use lstrlenW = 2 in EAX to get the cchWideChar length
2.) Using cchWideChar (2) with WideCharToMultiByte, which returns the cbMultiByte length (4) I would need as buffer length
3.) Alloc that cbMultiByte length of 4 or more
4.) Calling WideCharToMultiByte with 2 & 4

The code would be like this....

    invoke lstrlen,addr _some_        ; 💋
    mov esi, eax
    invoke WideCharToMultiByte,CP_UTF8,0,addr _some_,esi,0,0,0,0
    mov ebx, eax
    mov edi, alloc(eax)
    invoke WideCharToMultiByte,CP_UTF8,0,addr _some_,esi,edi,ebx,0,0

...below like this...

    0019FF4C 00401163 /CALL to lstrlenW from bones.0040115E
    0019FF50 0045F3AE \String = "??" 💋

    = EAX = 2 length I need to use for WideCharCount

    0019FF30 0040117F /CALL to WideCharToMultiByte from bones.0040117A
    0019FF34 0000FDE9 |CodePage = FDE9
    0019FF38 00000000 |Options = 0
    0019FF3C 0045F3AE |WideCharStr = "??"
    0019FF40 00000002 |WideCharCount = 2 <-- 2 from lstrlen before
    0019FF44 00000000 |MultiByteStr = NULL
    0019FF48 00000000 |MultiByteCount = 0
    0019FF4C 00000000 |pDefaultChar = NULL
    0019FF50 00000000 \pDefaultCharUsed = NULL

    = EAX = 4 length I need to use for MultiByteCount next time

    0019FF30 004011A3 /CALL to WideCharToMultiByte from bones.0040119E
    0019FF34 0000FDE9 |CodePage = FDE9
    0019FF38 00000000 |Options = 0
    0019FF3C 0045F3AE |WideCharStr = "??"
    0019FF40 00000002 |WideCharCount = 2 <-- 2 from lstrlen
    0019FF44 005046A8 |MultiByteStr = 005046A8 <-- alloc buffer of 4 bytes
    0019FF48 00000004 |MultiByteCount = 4 <-- buffer length
    0019FF4C 00000000 |pDefaultChar = NULL
    0019FF50 00000000 \pDefaultCharUsed = NULL

    = EAX = 4 bytes was written into new buffer of MultiByteStr 005046A8

...so is this correct now so far? I think so. Why did you guys not telling me like this before? What a HARD BIRTH!

But in case of using the MultiByteToWideChar function I have to double the alloc size like this...
    invoke lstrlenA,ansi$("TEST")
    mov esi, eax
    invoke MultiByteToWideChar,CP_UTF8,0,ansi$("TEST"),esi,0,0
    add eax, eax        <--- double size to alloc
    mov ebx, eax
    mov edi, alloc(ebx)
    invoke MultiByteToWideChar,CP_UTF8,0,ansi$("TEST"),esi,edi,ebx

....maybe the best is really just to use -1 in all cases with those 2 functions and let the function handle / calc the length, so I just need to subtract the terminating null.

greetz
Teddy Rogers
Posted May 4, 2024

2 hours ago, LCF-AT said:
    ...so is this correct now so far? I think so. Why did you guys not telling me like this before?

The issue was picked up multiple times. Plus you have the API docs to refer to, with code examples.

Looping back a bit, are you sure your text is "ANSI"? You won't be able to convert "ANSI" characters/control codes by passing CP_UTF8 to MultiByteToWideChar. They don't exist in that character set, and you will likely get $FFFD for everything above the ASCII range (127). You will need to use CP_ACP...

Ted.
adoxa
Posted May 5, 2024

Your MultiByteToWideChar is still wrong: you do need to double the returned length for the allocation (wide characters to bytes), but you still need to pass in the returned length (the buffer is in wide characters, not bytes).
LCF-AT (Author)
Posted May 5, 2024

Hi,

so what about that MultiByteToWideChar function now? Wrong?! Yes, the ansi$("TEST") is ANSI string I want to convert to WideChar using CP_UTF8 CodePage because I did save everything in CP_UTF8 CodePage before and I also want to handle Symbol Chars if any in there like this 💋 to use it in my UNICODE app. In this case I'm using MultiByteToWideChar with CP_UTF8 too. Why do you say it's wrong Ted? It's working for me. Maybe its not important whether using CP_UTF8 / CP_ACP when reading from a ANSI text & CP_UTF8 text which are pretty same.

So I did convert my UNICODE text / symbols from my Listview to CP_UTF8 with the WideCharToMultiByte function, as we talked about all the time before. Now my content was converted & saved into a content file which uses the UTF-8 CodePage (I need to have). Now when I run the app again it must read that content file back into the listview, and here I'm using CP_UTF8 again with the MultiByteToWideChar function to convert the entire content file into the WideChar format I need in my UNICODE style app. What is wrong here? When I use CP_ACP instead of CP_UTF8 then it will not display the symbols anymore like this 💋....

    0069B090 F0 9F 92 8B 💋 <--- 💋 ReadFile to
    // when using MultiByteToWideChar with CP_ACP I get that below...
    0069B168 F0 00 78 01 19 20 39 20 ð.x 9 <--- Not displaying 💋 in LV

    0069B090 F0 9F 92 8B 💋 <--- 💋 ReadFile to
    // when using MultiByteToWideChar with CP_UTF8 I get that below...
    005CB258 3D D8 8B DC =Ø‹Ü <--- Does Display 💋 in LV

...you see? Guys, if it is as simple as you say / think then just post a short example to see it.

@adoxa

16 hours ago, adoxa said:
    Your MultiByteToWideChar is still wrong: you do need to double the returned length for the allocation (wide characters to bytes), but you still need to pass in the returned length (the buffer is in wide characters, not bytes).

I did double the space after MultiByteToWideChar function. It's telling me I need X size (wCHAR) and I do double it via add eax,eax and in EBX I have still the wCHAR size (not doubled) I'm using on second call to MultiByteToWideChar function. Did you not see this or what do you mean? Come on guys, don't make it harder to understand & handle than it should be. Thanks.

greetz
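The two memory dumps in the post above can be reproduced outside the debugger. This is a hedged sketch in Python, assuming that its `cp1252` codec behaves like CP_ACP on a Western European Windows system for these four bytes; `utf16_dump` is a hypothetical helper written just for this illustration.

```python
# The four bytes read from the content file (UTF-8 encoding of U+1F48B).
raw = bytes.fromhex("F0 9F 92 8B")

as_utf8 = raw.decode("utf-8")      # what a CP_UTF8 conversion yields
as_cp1252 = raw.decode("cp1252")   # stand-in for CP_ACP (assumed to be Windows-1252)

def utf16_dump(s: str) -> str:
    """Return the UTF-16LE bytes of s as an upper-case hex dump."""
    return s.encode("utf-16-le").hex(" ").upper()

print(utf16_dump(as_utf8))    # 3D D8 8B DC
print(utf16_dump(as_cp1252))  # F0 00 78 01 19 20 39 20
```

With the Windows-1252 interpretation each of the four bytes becomes its own character (ð, Ÿ, ’, ‹), which is why the symbol disappears; only the UTF-8 interpretation yields the surrogate pair for 💋.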
adoxa
Posted May 6, 2024

4 hours ago, LCF-AT said:
    in EBX I have still the wCHAR size (not doubled)

Really? Did you change it from what you posted?

On 5/5/2024 at 7:09 AM, LCF-AT said:
    invoke MultiByteToWideChar,CP_UTF8,0,ansi$("TEST"),esi,0,0
    add eax, eax        <--- double size to alloc
    mov ebx, eax
    mov edi, alloc(ebx)

So maybe this is what you thought you did, not what you actually did?

    invoke MultiByteToWideChar,CP_UTF8,0,ansi$("TEST"),esi,0,0
    mov ebx, eax
    add eax, eax        <--- double size to alloc
    mov edi, alloc(eax)
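The ordering issue above (save the character count first, then double it only for the allocation) can be sketched as follows. This is Python, not MASM, and `multi_byte_to_wide` is a hypothetical stand-in that only imitates the size-query/convert pattern of MultiByteToWideChar with CP_UTF8; it is not the real API.

```python
def multi_byte_to_wide(src: bytes, cch_buffer: int = 0):
    """Hypothetical stand-in for MultiByteToWideChar with CP_UTF8.

    cch_buffer is the output capacity in 16-bit units, not bytes.
    With cch_buffer == 0 only the required unit count is returned.
    """
    encoded = src.decode("utf-8").encode("utf-16-le")
    needed = len(encoded) // 2
    if cch_buffer == 0:
        return needed, b""
    if cch_buffer < needed:
        raise BufferError("insufficient buffer")
    return needed, encoded

src = b"TEST"
cch, _ = multi_byte_to_wide(src)              # first call: size query, result in characters
cb_alloc = cch * 2                            # bytes to allocate (two bytes per UTF-16 unit)
written, buf = multi_byte_to_wide(src, cch)   # second call: pass characters, not bytes

print(cch, cb_alloc, written, len(buf))       # 4 8 4 8
```

The doubled value is used in exactly one place, the allocation; every value handed back to the conversion call stays in characters.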
Teddy Rogers
Posted May 6, 2024

On 5/6/2024 at 4:06 AM, LCF-AT said:
    so what about that MultiByteToWideChar function now? Wrong?! Yes, the ansi$("TEST") is ANSI string I want to convert to WideChar using CP_UTF8 CodePage because I did save everything in CP_UTF8 CodePage before and I also want to handle Symbol Chars if any in there like this 💋 to use it in my UNICODE app. In this case I'm using MultiByteToWideChar with CP_UTF8 too. Why do you say it's wrong Ted? It's working for me. Maybe its not important whether using CP_UTF8 / CP_ACP when reading from a ANSI text & CP_UTF8 text which are pretty same.

When you are referring to ANSI do you actually mean the displayable/printable 7-bit ASCII Latin-1 character set within the Windows-1252 code-page, or the ANSI supplementary set above it? ANSI in the Windows-1252 and ISO-8859-1 code-pages is the 8-bit character set that supplements the 7-bit ASCII character set. If you don't know what I mean it may be worth doing a Google and reading up on these code-pages: Windows-1252, UTF-8 and UTF-16.

The Windows-1252 code-page is segregated like this...

    ; 0-31    - ASCII Control Characters
    ;         - Control characters (not intended for display or printing).
    ; 32-126  - ASCII Characters
    ;         - Display and printable characters.
    ; 127     - ASCII Control Character
    ;         - Control character (not intended for display or printing).
    ; 128-159 - ANSI Characters
    ;         - Control characters in ISO-8859-1; mostly printable characters (such as € and ƒ) in Windows-1252.
    ; 160-255 - ANSI Characters
    ;         - Windows-1252 and ISO-8859-1 characters.

We can create the full Windows-1252 byte range quite easily with a bit of code...
    cbMultiByte = 256
    *lpMultiByteStr_1252 = AllocateMemory(cbMultiByte)

    If *lpMultiByteStr_1252
      For Char = 0 To 255
        PokeB(*lpMultiByteStr_1252+Char, Char)
      Next Char

      Debug PeekS(*lpMultiByteStr_1252+32, cbMultiByte-32, #PB_Ascii)
      ShowMemoryViewer(*lpMultiByteStr_1252, cbMultiByte)
      FreeMemory(*lpMultiByteStr_1252)
    EndIf

The results are visible in the "Memory Viewer" in the screenshot below; the "Debug Output" window shows the character set (ASCII and ANSI) starting at offset 32, the first of the displayable/printable ASCII characters.

If we use MultiByteToWideChar and tell it the text being converted is UTF8 (CP_UTF8), this is how all the ASCII and ANSI characters of a Windows-1252 code-page will be mapped to UTF-16...

    EnableExplicit

    Define *lpMultiByteStr
    Define *lpWideCharStr
    Define cbMultiByte.i
    Define cchWideChar.i
    Define Char.l

    cbMultiByte = 256
    *lpMultiByteStr = AllocateMemory(cbMultiByte)

    If *lpMultiByteStr
      ; Create the character set 0 through to 255 (0xFF).
      For Char = 0 To 255
        PokeB(*lpMultiByteStr+Char, Char)
      Next Char

      ; Get the required buffer size in characters.
      ; For CP_UTF8 the API docs require dwFlags to be 0 or MB_ERR_INVALID_CHARS.
      cchWideChar = MultiByteToWideChar_(#CP_UTF8, #Null, *lpMultiByteStr, cbMultiByte, #Null, #Null)

      If cchWideChar
        ; Each UTF-16 code unit is two bytes, so allocate cchWideChar * 2 bytes.
        *lpWideCharStr = AllocateMemory(cchWideChar * 2)

        If *lpWideCharStr
          ; The buffer size is passed in characters, not bytes.
          Debug MultiByteToWideChar_(#CP_UTF8, #Null, *lpMultiByteStr, cbMultiByte, *lpWideCharStr, cchWideChar)
          Debug PeekS(*lpWideCharStr+64, cbMultiByte-32, #PB_Unicode)
          ShowMemoryViewer(*lpWideCharStr, cchWideChar * 2)
          FreeMemory(*lpWideCharStr)
        EndIf
      EndIf

      FreeMemory(*lpMultiByteStr)
    EndIf

In the screenshot above you can see the ANSI characters/controls are all 0xFFFD. The reason for this is that those byte values on their own are not valid UTF-8 sequences, so they can't be mapped to UTF-16. 0xFFFD (U+FFFD) is the Unicode replacement character for unknown characters/controls and is visually represented by the question mark within the diamond.

How do we map the ANSI code-page to UTF8?
First use MultiByteToWideChar to convert to UTF-16 using CP_ACP (or the Windows-1252 code-page if that is not the system default), which maps the ANSI characters (the ones that came out as 0xFFFD) to their proper locations within UTF-16. Once that is complete, use WideCharToMultiByte to on-convert from UTF-16 to UTF-8...

    EnableExplicit

    Define *lpMultiByteStr_1252
    Define *lpMultiByteStr_UTF8
    Define *lpWideCharStr_UTF16
    Define cbMultiByte.i
    Define cbMultiByte_UTF8.i
    Define cchWideChar.i
    Define Char.l

    ; Create the Windows-1252 character set 0 through to 255 (0xFF).
    ; 0-31    - ASCII Control Characters
    ;         - Control characters (not intended for display or printing).
    ; 32-126  - ASCII Characters
    ;         - Printable characters.
    ; 127     - ASCII Control Character
    ;         - Control character (not intended for display or printing).
    ; 128-159 - ANSI Characters
    ;         - Control characters in ISO-8859-1; mostly printable characters in Windows-1252.
    ; 160-255 - ANSI Characters
    ;         - Windows-1252 and ISO-8859-1 characters.

    cbMultiByte = 256
    *lpMultiByteStr_1252 = AllocateMemory(cbMultiByte)

    If *lpMultiByteStr_1252
      For Char = 0 To 255
        PokeB(*lpMultiByteStr_1252+Char, Char)
      Next Char

      ; Get the required buffer size, in characters, for *lpWideCharStr_UTF16.
      cchWideChar = MultiByteToWideChar_(#CP_ACP, #MB_PRECOMPOSED, *lpMultiByteStr_1252, cbMultiByte, #Null, #Null)

      If cchWideChar
        ; Each UTF-16 code unit is two bytes, so allocate cchWideChar * 2 bytes.
        *lpWideCharStr_UTF16 = AllocateMemory(cchWideChar * 2)

        If *lpWideCharStr_UTF16
          ; Convert the string (buffer size passed in characters) and update cchWideChar
          ; with the number of characters written to *lpWideCharStr_UTF16.
          cchWideChar = MultiByteToWideChar_(#CP_ACP, #MB_PRECOMPOSED, *lpMultiByteStr_1252, cbMultiByte, *lpWideCharStr_UTF16, cchWideChar)

          If cchWideChar
            ; Get the required buffer size, in bytes, for *lpMultiByteStr_UTF8.
            cbMultiByte_UTF8 = WideCharToMultiByte_(#CP_UTF8, #Null, *lpWideCharStr_UTF16, cchWideChar, #Null, #Null, #Null, #Null)

            If cbMultiByte_UTF8
              ; Allocate the memory.
              *lpMultiByteStr_UTF8 = AllocateMemory(cbMultiByte_UTF8)

              If *lpMultiByteStr_UTF8
                If WideCharToMultiByte_(#CP_UTF8, #Null, *lpWideCharStr_UTF16, cchWideChar, *lpMultiByteStr_UTF8, cbMultiByte_UTF8, #Null, #Null)
                  Debug "Win-1252: " + PeekS(*lpMultiByteStr_1252+32, 224, #PB_Ascii)
                  Debug "UTF-16  : " + PeekS(*lpWideCharStr_UTF16+64, cbMultiByte-32, #PB_Unicode)
                  Debug "UTF-8   : " + PeekS(*lpMultiByteStr_UTF8+32, cbMultiByte_UTF8-32, #PB_UTF8 | #PB_ByteLength)

                  ;ShowMemoryViewer(*lpMultiByteStr_1252, 256)
                  ;ShowMemoryViewer(*lpWideCharStr_UTF16, cchWideChar * 2)
                  ShowMemoryViewer(*lpMultiByteStr_UTF8, cbMultiByte_UTF8)
                EndIf

                FreeMemory(*lpMultiByteStr_UTF8)
              EndIf
            EndIf
          EndIf

          ; Free only if the allocation succeeded.
          FreeMemory(*lpWideCharStr_UTF16)
        EndIf
      EndIf

      FreeMemory(*lpMultiByteStr_1252)
    EndIf

To sum up all the above... if you are sure your text contains ANSI characters, give MultiByteToWideChar the correct code-page to use. If you intend converting ASCII/ANSI strings you may be better off using the ENG functions as they have a simpler parameter requirement: EngWideCharToMultiByte, EngMultiByteToWideChar, EngMultiByteToUnicodeN. See the attached example...

Ted.

EngMultiByteToUnicodeN.zip
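The two-step route described in the post above (ANSI bytes to UTF-16 with the right code-page, then UTF-16 to UTF-8) can be sketched in a few lines. This is Python rather than PureBasic, and it assumes Python's `cp1252` codec as a stand-in for CP_ACP on a Windows-1252 system; the sample bytes are made up for the illustration.

```python
# "café ƒ" stored as Windows-1252: é is 0xE9 and ƒ is 0x83.
ansi_bytes = b"caf\xe9 \x83"

# Wrong code-page: treating the bytes as UTF-8 turns each high byte into U+FFFD.
wrong = ansi_bytes.decode("utf-8", errors="replace")

# Right code-page: decode as Windows-1252 first (the MultiByteToWideChar step)...
text = ansi_bytes.decode("cp1252")
# ...then encode to UTF-8 (the WideCharToMultiByte step).
utf8_bytes = text.encode("utf-8")

print(repr(wrong))             # 'caf\ufffd \ufffd'
print(text)                    # café ƒ
print(utf8_bytes.hex(" "))     # 63 61 66 c3 a9 20 c6 92
```

Note that the converted UTF-8 string is longer than the ANSI input (8 bytes instead of 6), which is why the required size has to be queried rather than assumed.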
LCF-AT (Author)
Posted May 6, 2024

Hey Ted,

thanks again Ted. Lots of input which is confusing me more & more now. Alright, so I thought ANSI = ASCII, just another term. So at the moment I'm just using the codepage flag CP_UTF8 for both APIs: MultiByteToWideChar (reading my content file I did save before) & WideCharToMultiByte (convert my app internal UNICODE text / chars to export it to my content file). The text I have to deal with is already in UNICODE only and all chars are displayed alright so far in edit controls / listview etc.

    UNICODE text / char || WideCharToMultiByte,CP_UTF8 || WriteFile = Content Export text is CP_UTF8
    CP_UTF8 text / char || ReadFile || MultiByteToWideChar,CP_UTF8 = Content Export is UNICODE

So this is how I use the functions now and it seems to work, so I don't see or get any issues yet. Besides that problem (I still don't know how to handle it correctly for 100% - above) I found another display problem. Look at this image below...

...so here you can see my listview above with the text / symbols I did add, and all are displaying correctly so far, also in the EDIT control below = the selected command with ƒƒƒƒƒƒ is displayed in the EDIT control below and all are using the same buffer. Now I want to edit entry 5 in my listview and double click it to call the new EditBox DialogBox, where you can see 2 EDIT controls displaying the same content of the selected LV entry 5, but in this case (new dialogbox Edit controls) it's not displaying the ƒƒƒƒƒƒ chars correctly! But why? What is the problem here? The EDIT control under the LV does display the ƒƒƒƒƒƒ and the other one does not, but both use the same style / ex values and the same buffer. Do you know what the problem in this case is?

greetz
Teddy Rogers
Posted May 7, 2024

18 hours ago, LCF-AT said:
    Do you know what the problem in this case is?

What is the character code/hex value of the incorrect character?

Assuming all the functions used are Unicode and the edit control styles are all good, I would check that the font is capable of displaying the character "ƒ"; try another font....

Ted.
LCF-AT (Author)
Posted May 7, 2024

Hey Ted,

you are right! The other dialog window is using a different font from the main dialog window....

    FONT 8,"Tahoma"          <-- IDD_MAIN DIALOGEX
    FONT 8,"MS Sans Serif"   <-- IDD_DLGEDIT DIALOGEX
    ....

...somehow funny. Just because of the used font. Thank you for that info Ted.

So I don't wanna keep bothering you because of that WideChar / Multi function stuff but do you have some another advice how to deal with it? I know you already told me lots of things about it, and it looks like my method of using the functions as I described works so far, but I'm still not sure about it because you said something else. I would like to be able to handle those different string / codepage things.

greetz
Teddy Rogers
Posted May 8, 2024

On 5/8/2024 at 2:42 AM, LCF-AT said:
    do you have some another advice how to deal with it?

Read the API docs and function parameters carefully. What else did you want to know?

Ted.
LCF-AT (Author)
Posted May 9, 2024

Hey Ted,

20 hours ago, Teddy Rogers said:
    What else did you want to know?

everything I need to know to prevent possible random errors / crashes while using those functions to read / save text / chars etc.

One more question about that small app you made to show all ASCII characters 0-127 / 128-255 (ANSI). So all of those chars I can use without any issue as ASCII or UNICODE. Before you said I have to use CP_ACP instead of CP_UTF8 but in this case my app does display wrong chars. Have a look at the image below I made...

....above you can see the content file I created with my app (UNICODE) to export all text / chars to UTF8 (just ignore the /*=\ combos, which are just markers for the start / end of content parts). In the middle you can see the content read when also using the CP_UTF8 flag; then everything gets converted fine and all looks the same as in NOTEPAD itself. Below you can see what happens when I use the CP_ACP flag; in this case it does not display all chars correctly as before. Yes of course it's mixed, but that is just how it is when using text + symbol things in titles or text like "ABC 𝐔𝐋𝐓𝐑𝐀 💋 is ". In this case I can not just use ACP, right. The main goal is just to read / save / display all text / chars / symbol things, whatever they are called, correctly as they really look, without displaying any strange symbols like in my listview image below using CP_ACP. That's all.

My question now is whether I'm right so far just using the CP_UTF8 flag to save / read those random mixed text etc? So far I see it seem to work. So what do you say now? Am I safe now with that method without to run into an error xy or not?

greetz
Teddy Rogers
Posted May 10, 2024

2 hours ago, LCF-AT said:
    Before you said I have to use CP_ACP instead of CP_UTF8

That was in the context of the encoding using characters within the Windows-1252 code-page above the Latin-1 ASCII character set - the actual ANSI characters.

2 hours ago, LCF-AT said:
    My question now is whether I'm right so far just using the CP_UTF8 flag to save / read those random mixed text etc? So far I see it seem to work. So what do you say now? Am I safe now with that method without to run into an error xy or not?

If you are sure the encoding of the source format is UTF8 then you should be fine specifying it as the code-page. If the source is a HTML page you can try checking in your code whether it has declared the character encoding in the header using "meta charset", like one of these values...

    <meta charset="UTF-8">
    <meta charset="Windows-1252">
    <meta charset="ISO-8859-1">

You could also check for a byte order mark (BOM) set in the HTML page...

https://www.w3schools.com/html/html_charset.asp

Ted.
LCF-AT (Author)
Posted May 10, 2024

Hey Ted,

yes, in my own case I export / import just by using the CP_UTF8 flag. Good idea to check the charset of the HTML page. Maybe asking again about text I have to deal with which is not from me (unknown) and which has no meta data / response header information about it. How to deal with that? I know you already showed an example of using the IsTextUnicode function (RtlIsTextUnicode in my case), but is it possible to just auto convert the text I have to anything like UNICODE / ASCII / ANSI without doing any check of the text itself?

greetz
Teddy Rogers
Posted May 13, 2024

If you cannot detect the meta charset or a BOM consider - as of 13th May 2024 - 98.2% of all websites are encoded in UTF8 with Windows 1252 and ISO-8859-1 making up 1.5%...

Ted.
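The fallback order suggested in the last few posts (look for a BOM, otherwise assume UTF-8, otherwise fall back to Windows-1252) might look like the sketch below. It is written in Python for brevity; `decode_unknown` is a hypothetical helper, the Windows-1252 fallback is an assumption based on the statistics quoted above, and UTF-32 BOMs are deliberately not handled.

```python
import codecs

def decode_unknown(raw: bytes) -> str:
    """Hypothetical helper: guess the encoding of text with no metadata."""
    # 1. A byte order mark, if present, identifies the encoding directly.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # 2. No BOM: try strict UTF-8, which rejects most non-UTF-8 input.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Assumed fallback: Windows-1252, replacing the few undefined bytes.
        return raw.decode("cp1252", errors="replace")

print(decode_unknown(b"\xef\xbb\xbfabc"))          # abc
print(decode_unknown("💋".encode("utf-8")))        # 💋
print(decode_unknown(b"caf\xe9"))                  # café
```

Strict UTF-8 decoding is tried before the fallback because random Windows-1252 text containing high bytes is very unlikely to form valid UTF-8 sequences by accident, whereas Windows-1252 decoding accepts almost any input.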