LCF-AT Posted April 19

Hi guys,

I have a small problem again with UNICODE / symbol characters in a text buffer that I want to convert to readable text and print on a static control. First I had text showing strange symbol characters instead of the real text, like this...

"Youâ€" is "You’ve"

...so I used the MultiByteToWideChar function with code page CP_UTF8 to convert my ANSI text buffer to UNICODE. After that the text displayed correctly using the SetDlgItemTextW function. Fine so far, I thought. Then I found another problem with another symbol, like this...

Q&amp;A is Q&A

...and I tried the same function as above, but in this case I got this result back...

Qamp;A

!? My question now is: when I have unknown text in a buffer in ASCII / ANSI style, how do I convert it into a buffer I can use with the SetDlgItemTextW (Unicode) function so that the text displays 100% correctly, like the original? What is the right method for this?

greetz
Teddy Rogers Posted April 20

I don't know where you are sourcing your text from; possibly you can check the BOM, if it exists. If the text comes from a reliable source, you could try utilising the IsTextUnicode function...

Ted.
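The BOM check suggested above can be sketched in portable C. This is only an illustration: `detect_bom` and the `bom_kind` enum are hypothetical names, not part of any Windows API, and a file without a BOM tells you nothing either way.

```c
#include <stddef.h>

/* Hypothetical helper: reports which byte-order mark, if any, starts a buffer. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    /* UTF-8 BOM: EF BB BF */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    /* UTF-16 little-endian BOM: FF FE (a UTF-32 LE BOM also starts this way) */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    /* UTF-16 big-endian BOM: FE FF */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

When the result is BOM_NONE the encoding still has to be guessed or known from the source, which is where IsTextUnicode or a UTF-8 validity check would come in.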
adoxa Posted April 20

If you want to change the "standard" (&<>) HTML entities, it would be simplest to search and replace manually; I'm not sure what the best approach would be if you want to convert unknown HTML to text.

Dialog text uses & to underline the next character, so each one should be replaced with && to get a literal &.
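A minimal sketch of the && replacement in portable C, assuming a NUL-terminated single-byte string. `escape_ampersands` is a hypothetical helper name; it returns the required length first so the caller can size the buffer. (Static controls also have the SS_NOPREFIX style, which turns off the & prefix handling altogether; that may be the simpler route if no underlining is wanted.)

```c
#include <stddef.h>

/* Hypothetical helper: copies src to dst, writing "&&" for every '&'.
   Returns the length needed, excluding the NUL terminator.
   Nothing is written when dst is NULL or dst_cap is too small. */
size_t escape_ampersands(const char *src, char *dst, size_t dst_cap)
{
    size_t needed = 0;
    for (const char *p = src; *p; ++p)
        needed += (*p == '&') ? 2 : 1;

    if (dst == NULL || dst_cap < needed + 1)
        return needed;              /* size query, or buffer too small */

    for (const char *p = src; *p; ++p) {
        if (*p == '&')
            *dst++ = '&';           /* extra '&' makes the next one literal */
        *dst++ = *p;
    }
    *dst = '\0';
    return needed;
}
```

Calling it once with a NULL destination gives the size to allocate (plus one for the terminator); the second call fills the buffer.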
LCF-AT Posted April 20

Hi guys,

I downloaded a text file from the internet, and when I print it into a static control I get wrong, mixed-up symbols or letters. I would like to prevent that, but how is the question. By the way, I tried that IsTextUnicode function, but I can only use the RtlIsTextUnicode function, and it always crashes inside...

    invoke RtlIsTextUnicode,addr STRINGBUFFER,sizeof STRINGBUFFER,IS_TEXT_UNICODE_ASCII16

...so I don't know about all those specific text / symbol styles, whatever they are called. It really sucks, and I just want some simple format / fix functions I can run over my text buffer to make it OK.

@adoxa Yes, it seems I have to remove those HTML entities from the text buffer to format it correctly, but how? Is there no ready-made function I could use? Otherwise I have to make it myself. How many HTML entities are there that I have to check for? Or which are the most commonly used? I made this quick function...

    Remove_HTML_Entities proc uses edi esi ebx _buffer:DWORD
        invoke szRep,_buffer,_buffer,chr$("&lt;"),   chr$("<")
        invoke szRep,_buffer,_buffer,chr$("&gt;"),   chr$(">")
        invoke szRep,_buffer,_buffer,chr$("&amp;"),  chr$('&')
        invoke szRep,_buffer,_buffer,chr$("&quot;"), chr$('"')
        invoke szRep,_buffer,_buffer,chr$("&apos;"), chr$("'")
        invoke szRep,_buffer,_buffer,chr$("&cent;"), chr$("¢")
        invoke szRep,_buffer,_buffer,chr$("&pound;"),chr$("£")
        invoke szRep,_buffer,_buffer,chr$("&yen;"),  chr$("¥")
        invoke szRep,_buffer,_buffer,chr$("&euro;"), chr$("€")
        invoke szRep,_buffer,_buffer,chr$("&copy;"), chr$("©")
        invoke szRep,_buffer,_buffer,chr$("&reg;"),  chr$("®")
        Ret
    Remove_HTML_Entities endp

...to remove some of those entities. It seems to work OK so far, but NOW I found another problem. When the entity &amp; is found and replaced with & and I send that string buffer to my static control, the "&" is not displayed!? Why? When I show that string buffer in a message box, the "&" is displayed. So why is the & not showing when it is used in a string? This also fails...

    invoke SendMessage,STATIC_HANDLE,WM_SETTEXT,0,chr$("You & Me")

...it shows "You Me" and not "You & Me". Why? Is there any style I have to enable to make it display the "&" as well?

greetz
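A table-driven alternative to the chain of szRep calls can be sketched in portable C. This is only a sketch under assumptions: the table is a small hand-picked subset, `decode_entities` is a hypothetical name, and the non-ASCII replacements are written as UTF-8 bytes, which presumes the buffer is later converted with CP_UTF8. A single left-to-right pass means text such as "&amp;lt;" decodes to "&lt;" and is not decoded a second time.

```c
#include <stddef.h>
#include <string.h>

struct entity { const char *name; const char *text; };

/* Small illustrative subset; replacements are UTF-8 encoded. */
static const struct entity k_entities[] = {
    { "&lt;",   "<"  },
    { "&gt;",   ">"  },
    { "&amp;",  "&"  },
    { "&quot;", "\"" },
    { "&apos;", "'"  },
    { "&copy;", "\xC2\xA9" },      /* U+00A9 */
    { "&euro;", "\xE2\x82\xAC" },  /* U+20AC */
};

/* Decodes known entities in place and returns the new length.
   In-place is safe because every replacement is shorter than its entity. */
size_t decode_entities(char *s)
{
    const size_t count = sizeof k_entities / sizeof k_entities[0];
    const char *in = s;
    char *out = s;

    while (*in) {
        size_t i;
        for (i = 0; i < count; ++i) {
            size_t name_len = strlen(k_entities[i].name);
            if (strncmp(in, k_entities[i].name, name_len) == 0) {
                size_t text_len = strlen(k_entities[i].text);
                memcpy(out, k_entities[i].text, text_len);
                out += text_len;
                in += name_len;
                break;
            }
        }
        if (i == count)
            *out++ = *in++;        /* no entity here: copy one byte */
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Unknown entities and numeric references such as &#8217; are left untouched by this sketch; handling arbitrary HTML would need a fuller table or a real HTML parser.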
Teddy Rogers Posted April 25

On 4/21/2024 at 6:18 AM, LCF-AT said: "I tried using that IsTextUnicode function but I can only use the RtlIsTextUnicode function and this does crash always inside"

Check the third parameter, it is an in/out pointer, so it needs the address of a variable holding the flags (or 0), not the flag value itself...

Quote: [in, out, optional] lpiResult

https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-istextunicode

Ted.
LCF-AT Posted April 25

OK, thanks Ted,

it seems to work OK so far when using 0 as the last parameter. I'm not sure whether it will also work with symbol stuff in the string like 💋🖤🧛♀️ etc. I cannot use those symbols in WinASM itself to test it quickly, so I just get ??????? there.

By the way, I have another question. I was trying to display the same text on a static control and an edit control, and I got an issue. The text is not displayed the same way when it has only an LF (10 / 0Ah) byte instead of CRLF (13,10 / 0Dh 0Ah). Why? The static control displays it with new lines, and the edit control displays everything in one line. That's strange, isn't it? Is there any extra flag I have to use for the edit control to treat the LF / 0Ah byte like a CRLF? At the moment I have written a function that replaces all 0Ah bytes in the text with 0D 0A to make it work, but for this I have to allocate a new buffer. I just want to know whether I can skip handling it manually like this and simply tell the edit control to display LF as CRLF as well.

greetz
Teddy Rogers Posted April 27

On 4/26/2024 at 5:54 AM, LCF-AT said: "In static control it does display with new lines and in edit control all is displaying in one line."

Have you tried setting the edit control with EM_FMTLINES?

On 4/26/2024 at 5:54 AM, LCF-AT said: "I also got just ??????? to see there."

I have some test code using IsTextUnicode that may be of help, see the code block below. Raymond Chen suggests some alternate options...

https://devblogs.microsoft.com/oldnewthing/20150223-00/?p=44613

    EnableExplicit

    ; **********************************************************************
    ;- Enumerations
    ; **********************************************************************

    #IS_TEXT_UNICODE_ASCII16            = $0001
    #IS_TEXT_UNICODE_REVERSE_ASCII16    = $0010
    #IS_TEXT_UNICODE_STATISTICS         = $0002
    #IS_TEXT_UNICODE_REVERSE_STATISTICS = $0020
    #IS_TEXT_UNICODE_CONTROLS           = $0004
    #IS_TEXT_UNICODE_REVERSE_CONTROLS   = $0040
    ;#IS_TEXT_UNICODE_BUFFER_TOO_SMALL  = $0000 ; MSDN has this documented yet the value does not exist?
    #IS_TEXT_UNICODE_SIGNATURE          = $0008
    #IS_TEXT_UNICODE_REVERSE_SIGNATURE  = $0080
    #IS_TEXT_UNICODE_ILLEGAL_CHARS      = $0100
    #IS_TEXT_UNICODE_ODD_LENGTH         = $0200
    #IS_TEXT_UNICODE_DBCS_LEADBYTE      = $0400
    #IS_TEXT_UNICODE_NULL_BYTES         = $1000
    #IS_TEXT_UNICODE_UNICODE_MASK       = $000F
    #IS_TEXT_UNICODE_REVERSE_MASK       = $00F0
    #IS_TEXT_UNICODE_NOT_UNICODE_MASK   = $0F00
    #IS_TEXT_UNICODE_NOT_ASCII_MASK     = $F000

    ; **********************************************************************
    ;- Declarations
    ; **********************************************************************

    Declare.i IsTextUnicodeV2(*String, *End)
    Declare.s ConcatenateFlags(FlagString.s, Flag.s, ValueToSearch, ValueToFind)
    Declare.s ConcatenateValues(lpiResult)

    ; **********************************************************************
    ;- IsTextUnicode - File Tests
    ; **********************************************************************

    Debug "ANSI"
    IsTextUnicodeV2(?ANSII_Start, ?ANSII_End)
    Debug "UTF8"
    IsTextUnicodeV2(?UTF8_Start, ?UTF8_End)
    Debug "UTF8_BOM"
    IsTextUnicodeV2(?UTF8_BOM_Start, ?UTF8_BOM_End)
    Debug "UTF16_BE"
    IsTextUnicodeV2(?UTF16_BE_Start, ?UTF16_BE_End)
    Debug "UTF16_LE"
    IsTextUnicodeV2(?UTF16_LE_Start, ?UTF16_LE_End)

    ; **********************************************************************
    ;- IsTextUnicode - Datasection Tests
    ; **********************************************************************

    Debug "DATA_ASCII"
    IsTextUnicodeV2(?DATA_ASCII_Start, ?DATA_ASCII_End)
    Debug "DATA_UNICODE"
    IsTextUnicodeV2(?DATA_UNICODE_Start, ?DATA_UNICODE_End)
    Debug "DATA_STRING"
    IsTextUnicodeV2(?DATA_STRING_Start, ?DATA_STRING_End)

    ; **********************************************************************
    ;- IsTextUnicode - Memory Tests
    ; **********************************************************************

    Define String1.s = "abcdefghijklmnopqrstuvwxyz"
    Define String2.s = "Bush hid the facts" ; https://en.wikipedia.org/wiki/Bush_hid_the_facts
    Define *Memory
    Define iSize

    Debug "MEMORY_ABC_ASCII"
    *Memory = Ascii(String1.s)
    iSize = StringByteLength(String1.s, #PB_Ascii)
    ;ShowMemoryViewer(*Memory, iSize)
    ;Debug PeekS(*Memory, Len(String1.s), #PB_Ascii)
    IsTextUnicodeV2(*Memory, *Memory+iSize)
    FreeMemory(*Memory)

    Debug "MEMORY_ABC_UTF8"
    *Memory = UTF8(String1.s)
    iSize = StringByteLength(String1.s, #PB_UTF8)
    ;Debug PeekS(*Memory, Len(String1.s), #PB_UTF8)
    ;ShowMemoryViewer(*Memory, iSize)
    IsTextUnicodeV2(*Memory, *Memory+iSize)
    FreeMemory(*Memory)

    ; https://en.wikipedia.org/wiki/Bush_hid_the_facts
    Debug "MEMORY_BUSH_ASCII"
    *Memory = Ascii(String2.s)
    iSize = StringByteLength(String2.s, #PB_Ascii)
    ;ShowMemoryViewer(*Memory, iSize)
    ;Debug PeekS(*Memory, Len(String2.s), #PB_Ascii)
    IsTextUnicodeV2(*Memory, *Memory+iSize)
    FreeMemory(*Memory)

    Debug "MEMORY_BUSH_UTF8"
    *Memory = UTF8(String2.s)
    iSize = StringByteLength(String2.s, #PB_UTF8)
    ;Debug PeekS(*Memory, Len(String2.s), #PB_UTF8)
    ;ShowMemoryViewer(*Memory, iSize)
    IsTextUnicodeV2(*Memory, *Memory+iSize)
    FreeMemory(*Memory)

    End

    Procedure.i IsTextUnicodeV2(*StringStart, *StringEnd)
      Protected lpiResult.l, Result, a
      Protected iSize = *StringEnd - *StringStart

      ; Create a temporary array structure.
      Structure istextunicode_value Align #PB_Structure_AlignC
        Flag.l
      EndStructure

      Protected Dim istextunicode_value.istextunicode_value(15)

      ; Add in the flag values.
      istextunicode_value(00)\Flag = #IS_TEXT_UNICODE_ASCII16
      istextunicode_value(01)\Flag = #IS_TEXT_UNICODE_REVERSE_ASCII16
      istextunicode_value(02)\Flag = #IS_TEXT_UNICODE_STATISTICS
      istextunicode_value(03)\Flag = #IS_TEXT_UNICODE_REVERSE_STATISTICS
      istextunicode_value(04)\Flag = #IS_TEXT_UNICODE_CONTROLS
      istextunicode_value(05)\Flag = #IS_TEXT_UNICODE_REVERSE_CONTROLS
      istextunicode_value(06)\Flag = #IS_TEXT_UNICODE_SIGNATURE
      istextunicode_value(07)\Flag = #IS_TEXT_UNICODE_REVERSE_SIGNATURE
      istextunicode_value(08)\Flag = #IS_TEXT_UNICODE_ILLEGAL_CHARS
      istextunicode_value(09)\Flag = #IS_TEXT_UNICODE_ODD_LENGTH
      istextunicode_value(10)\Flag = #IS_TEXT_UNICODE_DBCS_LEADBYTE
      istextunicode_value(11)\Flag = #IS_TEXT_UNICODE_NULL_BYTES
      istextunicode_value(12)\Flag = #IS_TEXT_UNICODE_UNICODE_MASK
      istextunicode_value(13)\Flag = #IS_TEXT_UNICODE_REVERSE_MASK
      istextunicode_value(14)\Flag = #IS_TEXT_UNICODE_NOT_UNICODE_MASK
      istextunicode_value(15)\Flag = #IS_TEXT_UNICODE_NOT_ASCII_MASK

      Debug "**********"
      Debug "iSize: " + iSize

      ; Cycle through all the flags stored in the array.
      ; lpiResult is in/out: it carries the tests to run in, and the tests passed out.
      For a = 0 To 15
        lpiResult.l = istextunicode_value(a)\Flag
        Result = IsTextUnicode_(*StringStart, iSize, @lpiResult)
        If Result
          Debug "Flag being checked: $" + Hex(istextunicode_value(a)\Flag, #PB_Long) + " | lpiResult: $" + Hex(lpiResult, #PB_Long)
          Debug ConcatenateValues(lpiResult)
        EndIf
      Next a
    EndProcedure

    Procedure.s ConcatenateValues(lpiResult)
      Protected FlagString.s

      ; Build a " | " separated list of every flag set in lpiResult.
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_ASCII16", lpiResult, #IS_TEXT_UNICODE_ASCII16)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_ASCII16", lpiResult, #IS_TEXT_UNICODE_REVERSE_ASCII16)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_STATISTICS", lpiResult, #IS_TEXT_UNICODE_STATISTICS)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_STATISTICS", lpiResult, #IS_TEXT_UNICODE_REVERSE_STATISTICS)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_CONTROLS", lpiResult, #IS_TEXT_UNICODE_CONTROLS)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_CONTROLS", lpiResult, #IS_TEXT_UNICODE_REVERSE_CONTROLS)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_SIGNATURE", lpiResult, #IS_TEXT_UNICODE_SIGNATURE)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_SIGNATURE", lpiResult, #IS_TEXT_UNICODE_REVERSE_SIGNATURE)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_ILLEGAL_CHARS", lpiResult, #IS_TEXT_UNICODE_ILLEGAL_CHARS)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_ODD_LENGTH", lpiResult, #IS_TEXT_UNICODE_ODD_LENGTH)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_DBCS_LEADBYTE", lpiResult, #IS_TEXT_UNICODE_DBCS_LEADBYTE)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_NULL_BYTES", lpiResult, #IS_TEXT_UNICODE_NULL_BYTES)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_UNICODE_MASK", lpiResult, #IS_TEXT_UNICODE_UNICODE_MASK)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_MASK", lpiResult, #IS_TEXT_UNICODE_REVERSE_MASK)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_NOT_UNICODE_MASK", lpiResult, #IS_TEXT_UNICODE_NOT_UNICODE_MASK)
      FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_NOT_ASCII_MASK", lpiResult, #IS_TEXT_UNICODE_NOT_ASCII_MASK)

      If FlagString.s = ""
        FlagString.s = "#NULL"
      EndIf

      ProcedureReturn FlagString.s
    EndProcedure

    Procedure.s ConcatenateFlags(FlagString.s, Flag.s, ValueToSearch, ValueToFind)
      Protected Space.s = " | "

      ; Append the flag name only when all bits of ValueToFind are set.
      If ValueToSearch & ValueToFind = ValueToFind
        If Not FlagString.s = ""
          FlagString.s = FlagString.s + Space.s
        EndIf
        FlagString.s = FlagString.s + Flag.s
      EndIf

      ProcedureReturn FlagString.s
    EndProcedure

    ; **********************************************************************
    ;- Datasection
    ; **********************************************************************

    DataSection
      ANSII_Start:
      IncludeBinary "readme_ANSI.txt"
      ANSII_End:

      UTF8_Start:
      IncludeBinary "readme_UTF8.txt"
      UTF8_End:

      UTF8_BOM_Start:
      IncludeBinary "readme_UTF8_BOM.txt"
      UTF8_BOM_End:

      UTF16_BE_Start:
      IncludeBinary "readme_UTF16_BE.txt"
      UTF16_BE_End:

      UTF16_LE_Start:
      IncludeBinary "readme_UTF16_LE.txt"
      UTF16_LE_End:

      DATA_ASCII_Start:
      Data.a "abcdefghijklmnopqrstuvwxyz"
      DATA_ASCII_End:

      DATA_UNICODE_Start:
      Data.u "abcdefghijklmnopqrstuvwxyz"
      DATA_UNICODE_End:

      DATA_STRING_Start:
      Data.s "abcdefghijklmnopqrstuvwxyz"
      DATA_STRING_End:
    EndDataSection

Ted.
LCF-AT Posted April 27

Hey Ted,

thanks for the new info so far. I tried that EM_FMTLINES message with TRUE and FALSE, but in both cases the text in my edit control still looks the same, without any line breaks; everything is displayed in one line. Multiline is enabled so I can see the entire long text inside the edit control, and I also set VSCROLL to scroll through the whole text, but as I said, everything gets displayed in one line. When I copy the one-line text and paste it into Notepad, everything shows correctly. It seems that Notepad already handles this by itself, but how do I handle it in my edit control?

About the IsTextUnicode issue: the "where it guesses wrong." page on Microsoft's site is forbidden. Otherwise, that code of yours looks pretty strange. So I have to check all possible flags, and then? That entire string / code page thing is really confusing. As I said, normally I was just using ANSI strings in my code, but that is limited when I have to deal with UNICODE / symbol characters. I still don't know what I should do in this case.

1.) Set __UNICODE__ EQU 1 in my source to make everything Unicode, and double the sizes after all functions that return string lengths (lstrlen / wsprintf / other functions). In the case of wsprintf I also have to check buffer sizes separately because it is limited to 1024 bytes.

2.) Use a mixed style of ANSI and UNICODE strings. Not so sure about that.

How do you guys handle that problem? How do you do that in PureBasic or C / C++? Do you need to care about it or not? Oh, by the way, I converted the one app we talked about a while ago entirely to Unicode and it seems to work, but I had to change a lot of things in the code and still don't know whether I did everything correctly.

greetz
adoxa Posted April 28

Regarding EM_FMTLINES, from the 2003 R2 PSDK help:

Quote: This message affects only the buffer returned by the EM_GETHANDLE message and the text returned by the WM_GETTEXT message. It has no effect on the display of the text within the edit control. The EM_FMTLINES message does not affect a line that ends with a hard line break. A hard line break consists of one carriage return and a line feed.

Vista and later have an extended style to recognise LF: ES_EX_ALLOWEOL_LF; Windows 10 has an option to set the EOL character: EM_SETENDOFLINE.
Teddy Rogers Posted April 28

On 4/28/2024 at 7:19 AM, LCF-AT said: "About that IsTextUnicode issue. The "where it guesses wrong." website on microsoft is forbidden."

Have a look on Wikipedia, it has an entry all about it; also check the references at the bottom of the page...

https://en.wikipedia.org/wiki/Bush_hid_the_facts

On 4/28/2024 at 7:19 AM, LCF-AT said: "Otherwise that code of yours looks pretty strange. So I have to check all possible flags and then? That entire string / page code thing is really confusing."

The code was a set of tests I completed on the function a long time ago to check its results against differently encoded text sources. Each flag is passed to the function to check against the string. It is probably a little convoluted to use for reference. To simplify the code, in short it looks like this...

    #IS_TEXT_UNICODE_STATISTICS = $0002

    lpiResult = #IS_TEXT_UNICODE_STATISTICS

    If IsTextUnicode_(?STRING_Start, ?STRING_End-?STRING_Start, @lpiResult)
      If lpiResult = #IS_TEXT_UNICODE_STATISTICS
        Debug "The text is probably Unicode, with the determination made by applying statistical analysis."
      Else
        Debug lpiResult
      EndIf
    EndIf

    DataSection
      STRING_Start:
      Data.a "Bush hid the facts"
      STRING_End:
    EndDataSection

Ted.
LCF-AT Posted April 29

Hi again,

@adoxa Thanks for the info. I could not test it yet because I don't have the dec or hex values of the defines for those new styles.

    WS_EX_ACCEPTFILES equ 10h
    ....
    ES_EX_ALLOWEOL_CR
    ES_EX_ALLOWEOL_LF
    ES_EX_ALLOWEOL_ALL
    ES_EX_CONVERT_EOL_ON_PASTE
    ES_EX_ZOOMABLE
    EM_SETENDOFLINE
    EC_ENDOFLINE_DETECTFROMCONTENT
    EC_ENDOFLINE_CRLF
    EC_ENDOFLINE_CR
    EC_ENDOFLINE_LF

I don't have them and haven't found them yet. I only found an old commctrl.h file from 2017. The other problem is that this is only supported on recent Vista or later. No Win7. Hmm. OK, the problem that could happen is that the text is not displayed the way I want. OK, let's test it. Just post the values next time or tell me where to find them. Thanks.

@Teddy Rogers OK, thank you too. I will try that function some more with different string types to see how to use it in the best manner.

greetz
Teddy Rogers Posted April 29

4 hours ago, LCF-AT said: "OK thanks you too. I will try to check that function some more with different string types to see how to use it in the best manner."

I think we went off on a tangent anyway; I was trying to suggest checking the source files you are downloading before converting to Unicode. If you are confident the source file is an ANSI file, there is the function EngMultiByteToUnicodeN that you can try using.

Can you attach the text file you are trying to convert? I'll have a look here...

Ted.
adoxa Posted April 30

If you want Win7 support then you have no choice: replace LF with CRLF.

    EM_SETEXTENDEDSTYLE             (ECM_FIRST + 10)
    ES_EX_ALLOWEOL_CR               0x0001L
    ES_EX_ALLOWEOL_LF               0x0002L
    ES_EX_ALLOWEOL_ALL              (ES_EX_ALLOWEOL_CR | ES_EX_ALLOWEOL_LF)
    ES_EX_CONVERT_EOL_ON_PASTE      0x0004L

    EM_SETENDOFLINE                 (ECM_FIRST + 12)
    EC_ENDOFLINE_DETECTFROMCONTENT  0
    EC_ENDOFLINE_CRLF               1
    EC_ENDOFLINE_CR                 2
    EC_ENDOFLINE_LF                 3

Edited April 30 by adoxa
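The "replace LF with CRLF" route can be sketched in portable C as a two-step size-then-copy helper, so the new buffer is allocated at exactly the right size. `lf_to_crlf` is a hypothetical name and the sketch assumes a NUL-terminated single-byte string; a UTF-16 version would be the same loop over 16-bit units.

```c
#include <stddef.h>

/* Hypothetical helper: writes src to dst with every bare LF expanded to CR LF.
   LFs already preceded by CR are left as they are.
   Returns the length needed, excluding the NUL terminator.
   Nothing is written when dst is NULL or dst_cap is too small. */
size_t lf_to_crlf(const char *src, char *dst, size_t dst_cap)
{
    size_t needed = 0;
    char prev = '\0';

    for (const char *p = src; *p; ++p) {
        if (*p == '\n' && prev != '\r')
            ++needed;               /* room for the inserted CR */
        ++needed;
        prev = *p;
    }

    if (dst == NULL || dst_cap < needed + 1)
        return needed;              /* size query, or buffer too small */

    prev = '\0';
    for (const char *p = src; *p; ++p) {
        if (*p == '\n' && prev != '\r')
            *dst++ = '\r';
        *dst++ = *p;
        prev = *p;
    }
    *dst = '\0';
    return needed;
}
```

Typical use is one call with a NULL destination to learn the size, an allocation of that size plus one, and a second call to fill it.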
LCF-AT Posted May 1

Hi again guys,

@adoxa What value is ECM_FIRST? What does this "L" mean? 0x0001L <-- what kind of value is this? I can only use equ with a hex or dec value, you know.

Alright, I found a strange issue with that CR LF thing. I made an app in ANSI and in UNICODE style, and in both I use an edit control to display a URL which I break at the URL parameter characters "?" and "&" to display the parameters line by line. In the ANSI app it works, but in the UNICODE app it does not work. Why? I do the same thing but get different results. This is my small code...

ANSI Version

    CONTROL "",IDC_INPUTPARAMETERS,"Edit",0x56a11004,5,90,400,140,0x00020200

    .elseif ax == IDC_BreakParam ; break URL at params and print them out
        invoke GetDlgItem,hWin,IDC_URL
        mov esi, eax
        invoke SendMessage,esi,WM_GETTEXTLENGTH,0,0
        .if eax
            invoke RtlZeroMemory,addr DropNameBuffer,sizeof DropNameBuffer
            invoke SendMessage,esi,WM_GETTEXT,sizeof DropNameBuffer,addr DropNameBuffer
            lea edi, DropNameBuffer
            xor ebx, ebx
            .while byte ptr [edi] != NULL
                .if byte ptr [edi] == '?' || byte ptr [edi] == '&'
                    mov byte ptr [edi], 10 ; cr
                    inc ebx
                .endif
                inc edi
            .endw
            invoke SetDlgItemText,hWin,IDC_INPUTPARAMETERS,addr DropNameBuffer

UNICODE Version

    CONTROL "",IDC_INPUTPARAMETERS,"Edit",0x56a11004,5,90,400,140,0x00020200

    .elseif ax == IDC_BreakParam ; break URL at params and print them out
        invoke GetDlgItem,hWin,IDC_URL
        mov esi, eax
        invoke SendMessage,esi,WM_GETTEXTLENGTH,0,0
        .if eax
            invoke RtlZeroMemory,addr DropNameBuffer,sizeof DropNameBuffer
            invoke SendMessage,esi,WM_GETTEXT,sizeof DropNameBuffer,addr DropNameBuffer
            lea edi, DropNameBuffer
            xor ebx, ebx
            .while word ptr [edi] != NULL
                .if byte ptr [edi] == '?' || byte ptr [edi] == '&'
                    mov byte ptr [edi], 10 ; cr
                    inc ebx
                .endif
                inc edi
                inc edi
            .endw
            invoke SetDlgItemText,hWin,IDC_INPUTPARAMETERS,addr DropNameBuffer

...so in both cases I check for '?' and '&', and when I find one I change it to the value 10 (dec) / 0A (hex) for CR or LF (I always forget which is which). Then I just send the changed string to the edit control via the SetDlgItemText function. In the ANSI version it works and displays all parameters line by line, but in the UNICODE version everything gets displayed in one line again. So what is the reason it works for A but not for W? The controls use the same style. It also makes me wonder why it works in the ANSI case.

@Teddy Rogers In the first place I'm trying to get rid of all those text styles. The main goal was to change my apps from ANSI to UNICODE / symbol support so that ALL my text displays correctly, like in Notepad. I also found out that with a UNICODE app ("__UNICODE__ EQU 1") I have to read / save my own text content as UTF-8, which means additionally using the WideCharToMultiByte function for all text I want to export to a file and the MultiByteToWideChar function to read all text from that file back into my app, to make everything display right. That's pretty annoying. I also have questions about how to find the right buffer length when using those functions. I see I can call them like this...

The text has a length of 00003AF1h bytes (15089)
edi = buffer with UTF-8 text from an external file

The call below should get the length I need...

    invoke MultiByteToWideChar,CP_UTF8,0,edi,-1h,0,0

    = 00003AE6h bytes (15078) = (11 bytes less)

Then I double that amount to allocate free space, and then call the same function with all parameters:

    0019FA10 00404631 /CALL to MultiByteToWideChar from bones.0040462C
    0019FA14 0000FDE9 |CodePage = FDE9
    0019FA18 00000000 |Options = 0
    0019FA1C 007FE940 |StringToMap = "......stringstuff....."
    0019FA20 00003AF1 |StringSize = 3AF1 (15089.)
    0019FA24 00802BB8 |WideCharBuf = 00802BB8
    0019FA28 000075CC \WideBufSize = 75CC (30156.)

    = eax 00003AE5h bytes / double size was written to buffer

...now, in the case of saving the text content...

    0019F4AC 004043BB /CALL to WideCharToMultiByte from bones.004043B6
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 0045FAFC |WideCharStr = "/*=\"
    0019F4BC FFFFFFFF |WideCharCount = FFFFFFFF (-1.)
    0019F4C0 00000000 |MultiByteStr = NULL
    0019F4C4 00000000 |MultiByteCount = 0
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

    = eax 5 but it should be 4!? Why 5?

    0019F4AC 004043E1 /CALL to WideCharToMultiByte from bones.004043DC
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 0045FB08 |WideCharStr = "/*=\"
    0019F4BC 00000008 |WideCharCount = 8
    0019F4C0 0019F4D8 |MultiByteStr = 0019F4D8
    0019F4C4 00000004 |MultiByteCount = 4 <---- I set to 4 not 5
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

    = eax 0 <-- why 0? / ERROR_INSUFFICIENT_BUFFER
    So the text was written to buffer!

    0019F4AC 004043FE /CALL to WideCharToMultiByte from bones.004043F9
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 0019F518 |WideCharStr = "Curl DL"
    0019F4BC FFFFFFFF |WideCharCount = FFFFFFFF (-1.)
    0019F4C0 00000000 |MultiByteStr = NULL
    0019F4C4 00000000 |MultiByteCount = 0
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

    = eax 8 but should be 7 = ?

    0019F4AC 0040442F /CALL to WideCharToMultiByte from bones.0040442A
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 0019F518 |WideCharStr = "Curl DL"
    0019F4BC 0000000E |WideCharCount = E (14.)
    0019F4C0 00839D80 |MultiByteStr = 00839D80
    0019F4C4 00000007 |MultiByteCount = 7
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

    = eax 0 but text was written to buffer!

Somehow it does not seem to return the right buffer size I need for the converted strings, and even when I call it with the correct buffer size it returns 0 although the string was written into the buffer correctly. It returns 0 for failure, but when I give it a bigger buffer size it also copies more bytes into the new buffer that I don't want there, such as random or 0 bytes. So my question is how to use those two functions correctly: first to get the number of bytes I need, and then to call the function correctly in the second round. Any advice would be welcome.

PS: I'm still not finished with my UNICODE app; I still have to double the sizes everywhere. In the case of WM_GETTEXTLENGTH I also need to double the result to allocate the correct buffer length to go on with. Changing an app's source from ANSI to UNICODE is a real super PITA. Not sure how you guys handle that problem in your coding languages like C++ and others.

greetz
atom0s Posted May 1

5 minutes ago, LCF-AT said: "What value is ECM_FIRST? What does this "L" mean? 0x0001L <-- what is this for a value? Just can use equ valueX hex or dec you know."

ECM_FIRST is equal to 0x1500 and is defined in the CommCtrl.h header of the Windows SDK.

A numerical value with the L suffix is interpreted as a 'long' value instead of the default 'int'.
LCF-AT Posted May 1

Do you mean...

    ECM_FIRST equ 1500

or

    ECM_FIRST equ 1500h

...as I said, the CommCtrl.h I found on my computer is from 2017 and does not have those defines inside. Long or int... I just know hex or dec.
adoxa Posted May 2

You've really never seen 0x before? You can't infer that it means hex? I have a CommCtrl.h from 2006 (2003 R2 PSDK) that defines ECM_FIRST.

Your Unicode version isn't working because you're still using bytes, not words (Windows Unicode uses 16-bit characters, UTF-16). CR is 13 (0Dh), LF is 10 (0Ah).

Your length is one more because -1 means determine the length automatically, including the NUL terminator. The length error is because the wide versions still use characters, not bytes.
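The two points above (work in 16-bit units, and count lengths in characters rather than bytes) can be sketched in portable C, with `uint16_t` standing in for a Windows wide character. `break_url_params` is a hypothetical name for the URL-splitting step from the earlier post; it compares whole 16-bit units and writes a full CR LF pair at each '?' or '&', with the capacity given in units, not bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies a NUL-terminated UTF-16 string from src to dst,
   replacing each '?' or '&' unit with CR LF.
   dst_cap and the return value are counted in 16-bit units (not bytes);
   the return value excludes the terminating NUL.
   Nothing is written when dst is NULL or dst_cap is too small. */
size_t break_url_params(const uint16_t *src, uint16_t *dst, size_t dst_cap)
{
    size_t needed = 0;
    for (const uint16_t *p = src; *p != 0; ++p)
        needed += (*p == '?' || *p == '&') ? 2 : 1;

    if (dst == NULL || dst_cap < needed + 1)
        return needed;              /* size query, or buffer too small */

    for (const uint16_t *p = src; *p != 0; ++p) {
        if (*p == '?' || *p == '&') {
            *dst++ = 0x000D;        /* CR as a full 16-bit unit */
            *dst++ = 0x000A;        /* LF as a full 16-bit unit */
        } else {
            *dst++ = *p;
        }
    }
    *dst = 0;
    return needed;
}
```

A buffer for the result needs (needed + 1) units, which is (needed + 1) * 2 bytes when allocating.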
LCF-AT Posted May 3

Hi,

of course I have seen 0x before! I just don't keep everything in mind forever. The CommCtrl.h file I found on my PC is from the MinGW include folder and there is nothing with ECM inside. The file is 122 KB.

Back to the Wide / Multi functions. The question is how I have to handle the byte counts I get back from those functions when using them with WriteFile. How do I handle it to be 100% sure? I have a listview with 2 rows, and in both I have inserted this for testing... 💋 ...symbol. Now, when I export this content using the CP_UTF8 code page, I would like to check the exact size I need, to allocate free space, copy the symbol as UTF-8 into my buffer and export it via the WriteFile function. How to handle it is the question. My app is UNICODE. Look... ...so you can see the 2 symbols, right? My exported file should in the end look like this...

/*=\💋/*=\/*=\💋/*=\

...when I open it in Notepad, and it should have the UTF-8 code page. The first thing I do is call the ListViewGetItemText / lstrlen function, and it returns 2 in eax, which means the content / symbol has a length of 2 bytes in this case, without 0 termination. Next I want to check the byte length of this symbol in CP_UTF8, which is what I want to write into the export file. For this I call...

    invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,-1h,NULL,NULL,NULL,NULL

    0019F4AC 00404626 /CALL to WideCharToMultiByte from bones.00404621
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 0019F518 |WideCharStr = "??" <-- _namebuffer
    0019F4BC FFFFFFFF |WideCharCount = FFFFFFFF (-1.)
    0019F4C0 00000000 |MultiByteStr = NULL
    0019F4C4 00000000 |MultiByteCount = 0
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

    _namebuffer below
    0019F518 3D D8 8B DC 00 00 =Ø‹Ü..

    eax = 5 with 0 termination

...so it tells me that I need 4 raw bytes to use with the WriteFile function. OK so far. As the next step I want to allocate free space and copy the new UTF-8 string into the buffer I want to use with WriteFile. The first question here is how MUCH buffer space I must allocate. It tells me 5 bytes, or 4 without termination. So here I just allocate 1000 bytes for testing and see what happens next. Now I want to write the string symbol into my new buffer as UTF-8...

    mov _new, alloc(1000)
    invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,_namelenghtnew,_new,1000,NULL,NULL

    0019F4AC 0040465A /CALL to WideCharToMultiByte from bones.00404655
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 0019F518 |WideCharStr = "??" <--- _namebuffer
    0019F4BC 00000005 |WideCharCount = 5
    0019F4C0 05525E80 |MultiByteStr = 05525E80
    0019F4C4 000003E8 |MultiByteCount = 3E8 (1000.)
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

    _namebuffer
    0019F518 3D D8 8B DC 00 00 =Ø‹Ü..

    eax = 7 / below filled buffer
    05525E80 F0 9F 92 8B 00 7E 00 00 💋.~..

...so why does it NOW return a length of 7? Before, it told me the length would be 5 with termination, and now I get 7 back in eax. Why does it differ? Do I have to use a byte length of 4 or of 6 without termination? That's one thing I don't understand yet. Now remember, there is another identical string / symbol. I run the same functions again, and the first check call again tells me the string has 5 bytes with termination (like before), and then I call the second API to write the string into the buffer as UTF-8 (same as before), but now I get a different result back in eax...

    invoke WideCharToMultiByte,CP_UTF8,0,_isCommand,_isCommandlenghtnew,_new_2,1000,NULL,NULL

    0019F4AC 004046A7 /CALL to WideCharToMultiByte from bones.004046A2
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 05520B00 |WideCharStr = "??" <--- _isCommand
    0019F4BC 00000005 |WideCharCount = 5
    0019F4C0 05526280 |MultiByteStr = 05526280
    0019F4C4 000003E8 |MultiByteCount = 3E8 (1000.)
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

    _isCommand
    05520B00 3D D8 8B DC 00 00 =Ø‹Ü..

    _new_2 / buffer filled below
    05526280 F0 9F 92 8B 00 00 EA AE AB 00 00 00 00 00 00 00 💋..ꮫ.......

    eax = 9

...so what's this now? It is totally different. Why did it put more into the buffer and return 9 in eax? It's the same step as before. I don't see what I did wrong to get that strange, too-long filled buffer on the second pass. Do you see what is wrong?

My goal is super simple: read my listview content, get the exact byte length I need for UTF-8, allocate the free space, copy the string buffer into the new buffer as UTF-8, and write the UTF-8 buffer to a file via the WriteFile function, where I need the exact byte length of the buffer without termination. So I ask again: do you see anything I did wrong to get those strange results? Otherwise, maybe you have a better idea of how to handle the export to file as UTF-8 and reading that content back via MultiByteToWideChar with CP_UTF8. I don't have that size / length problem handled 100% correctly yet. It's totally confusing too. Any good idea / help to manage that would be helpful. Or a tiny example of how to call the functions correctly.

greetz
adoxa Posted May 3
Again, WideCharCount is wide characters, not bytes, so it should be two, as lstrlen told you (0D83Dh, 0DC8Bh, for a UTF-16 surrogate pair; they make the four UTF-8 bytes 0F0h, 09Fh, 092h, 08Bh). Five is three more than you want, so you're also converting whatever else happens to be in the buffer.
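The point about "whatever else happens to be in the buffer" can be checked against the two dumps in the previous post. The following is only a sketch in Python, not the Win32 API: `convert_units` is a hypothetical helper that converts a fixed number of UTF-16 code units to UTF-8, and the bytes after the terminator in `buf1` and `buf2` are assumptions reconstructed from the output buffers shown in the dumps (`00 7E 00` and `00 00 EA AE AB`).

```python
KISS = "\U0001F48B"  # the symbol from the ListView

# One visible character, but two UTF-16 code units (a surrogate pair)
# and four UTF-8 bytes.
units = len(KISS.encode("utf-16-le")) // 2
utf8 = KISS.encode("utf-8")

def convert_units(raw: bytes, count: int) -> bytes:
    """Convert exactly `count` UTF-16LE code units from `raw` to UTF-8.

    Hypothetical helper: it mimics passing an explicit WideCharCount,
    so it does NOT stop at a zero terminator.
    """
    return raw[: count * 2].decode("utf-16-le").encode("utf-8")

# Memory as assumed from the dumps: the string, its terminator, then
# leftovers that only show up because the count was too large.
buf1 = bytes.fromhex("3D D8 8B DC 00 00 7E 00 00 00")
buf2 = bytes.fromhex("3D D8 8B DC 00 00 00 00 AB AB")

print(units, utf8.hex(" "))
print(len(convert_units(buf1, 5)), convert_units(buf1, 5).hex(" "))
print(len(convert_units(buf2, 5)), convert_units(buf2, 5).hex(" "))
print(len(convert_units(buf1, 2)), convert_units(buf1, 2).hex(" "))
```

With a count of 5 the two buffers give 7 and 9 bytes, matching the eax values in the dumps, while a count of 2 gives exactly the four bytes of the symbol in both cases.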
LCF-AT Posted May 3 Author
Hi again,

normally I should also be able to use that function to get the string length back, but somehow it gets messed up. When I take the return value of lstrlen x 2 (= 4 bytes in the case of 💋), allocate 4 bytes or a few more, and pass that to WideCharToMultiByte, it returns eax = 0 with ERROR_INSUFFICIENT_BUFFER, even though it did copy the 4 bytes into the buffer. That doesn't make much sense to me.

    0019F4AC 0040462F /CALL to WideCharToMultiByte from bones.0040462A
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 0019F518 |WideCharStr = "??" <--- 💋
    0019F4BC 00000004 |WideCharCount = 4 <---
    0019F4C0 006FA300 |MultiByteStr = 006FA300 <---
    0019F4C4 00000004 |MultiByteCount = 4 <---
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

    💋
    0019F518 3D D8 8B DC 00 =Ø‹Ü.

    buffer to return
    006FA300 00 00 00 00 00 00 00 00 ........

    eax = 0 (ERROR_INSUFFICIENT_BUFFER (0000007A))

    buffer after the call: 4 bytes of 💋 written as UTF-8
    006FA300 F0 9F 92 8B 00 00 00 00 💋....

When I use a larger value for MultiByteCount, like 16, it returns eax = 7. That is 5 bytes including the zero terminator plus extra trash. Why? Is it just me, or does it make no sense? When I use what I think is the right count of 4, it returns eax = 0 plus the buffer error. How do I check eax correctly after calling the function to verify whether it worked or failed? Do you understand what I mean?

greetz
CodeExplorer Posted May 3
https://stackoverflow.com/questions/215963/how-do-you-properly-use-widechartomultibyte

To get the size, first call the function once with no output buffer (NULL and 0):

    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
Teddy Rogers Posted May 3
9 hours ago, LCF-AT said:

    0019F4AC 0040462F /CALL to WideCharToMultiByte from bones.0040462A
    0019F4B0 0000FDE9 |CodePage = FDE9
    0019F4B4 00000000 |Options = 0
    0019F4B8 0019F518 |WideCharStr = "??" <---
    0019F4BC 00000004 |WideCharCount = 4 <--- INCORRECT! This is character count, not bytes. Parameter should be 2.
    0019F4C0 006FA300 |MultiByteStr = 006FA300 <---
    0019F4C4 00000004 |MultiByteCount = 4 <--- INCORRECT! If you first use the function it should return 2 for cbMultiByte.
    0019F4C8 00000000 |pDefaultChar = NULL
    0019F4CC 00000000 \pDefaultCharUsed = NULL

See my corrections above. Your code should look like this...

    EnableExplicit

    Define *lpMultiByteStr
    Define cbMultiByte.i
    Define lpMultiByteStr.s
    Define cchWideChar.i

    lpMultiByteStr = "??"
    cchWideChar = Len(lpMultiByteStr)

    ; First call: no output buffer, returns the required size in bytes.
    cbMultiByte = WideCharToMultiByte_(#CP_UTF8, #Null, @lpMultiByteStr, cchWideChar, #Null, #Null, #Null, #Null)

    If cbMultiByte
      *lpMultiByteStr = AllocateMemory(cbMultiByte)

      If *lpMultiByteStr
        ; Second call: same character count, buffer of the size reported above.
        Debug WideCharToMultiByte_(#CP_UTF8, #Null, @lpMultiByteStr, cchWideChar, *lpMultiByteStr, cbMultiByte, #Null, #Null)
        Debug PeekS(*lpMultiByteStr, cchWideChar, #PB_UTF8)
        ShowMemoryViewer(*lpMultiByteStr, cbMultiByte)
        ; Free only when the allocation succeeded.
        FreeMemory(*lpMultiByteStr)
      EndIf
    EndIf

WideCharToMultiByte - You seem to be mixing up bytes and characters and possibly the purpose of the function. Another thing to keep in mind is that the output is not a fixed length. The characters you are using "??" are in the first 255 bytes of the UTF-8 code-page and require two bytes. If you were to use "ЃЃ" the function would return four bytes required for the buffer, six if the characters were "ﳝﳝ", eight if "𐌀𐌀"...

Ted.
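The variable-length point can be checked without the Win32 API. Here is a small Python sketch, where `sizes` is a hypothetical helper that reports the UTF-16 code-unit count (what cchWideChar counts) and the UTF-8 byte count (what the size query returns) for a string without a terminator. One assumption worth stating: the "??" in the debugger dump is most likely the debugger's rendering of the 💋 surrogate pair rather than two literal question marks, so both are included.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-16 code units, UTF-8 bytes) for `text`, no terminator."""
    utf16_units = len(text.encode("utf-16-le")) // 2
    utf8_bytes = len(text.encode("utf-8"))
    return utf16_units, utf8_bytes

# Literal question marks, the three pairs from the post, and the symbol.
for sample in ["??", "ЃЃ", "ﳝﳝ", "𐌀𐌀", "\U0001F48B"]:
    units, nbytes = sizes(sample)
    print(f"{sample!r}: {units} UTF-16 code units, {nbytes} UTF-8 bytes")
```

Two literal question marks and the single 💋 both occupy two UTF-16 code units, yet they need two and four UTF-8 bytes respectively, which is why the byte count has to come from the size query rather than from the character count.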
LCF-AT Posted May 3 Author
Hi guys,

so I still get ERROR_INSUFFICIENT_BUFFER even though I allocate as many bytes as the new string needs. Why? Also, @Teddy Rogers, in your code I don't see any check of what you get in EAX after the second call to WideCharToMultiByte. You also take the length of the string and pass that value to the first WideCharToMultiByte call instead of using -1h. Why? So, in other words, I could just do the steps below...

    invoke lstrlen,addr _namebuffer   <-- any Unicode string
    add eax, eax                      <-- double the bytes in EAX
    mov _namelenght, eax
    mov _new, alloc(eax)
    invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,_namelenght,_new,_namelenght,NULL,NULL
    eax = 0 but the string was written / who cares about checking the function, right!?
    add _writtenbytes, fwrite (_filehandle,_new,_namelenght)

...! That's pretty uncool, of course. I really try to understand these functions, but without success when it comes to how many bytes I need and how to allocate the right space. Look at the example below that I made with 2 different strings. The WideCharToMultiByte documentation also tells me this...

    WideCharToMultiByte
    ------------------------------
    Return Values

    If the function succeeds, and cchMultiByte is nonzero, the return value is the number of bytes written to the buffer pointed to by lpMultiByteStr.
    If the function succeeds, and cchMultiByte is zero, the return value is the required size, in bytes, for a buffer that can receive the translated string.
    If the function fails, the return value is zero. To get extended error information, call GetLastError. GetLastError may return one of the following error codes:

    ERROR_INSUFFICIENT_BUFFER
    ERROR_INVALID_FLAGS
    ERROR_INVALID_PARAMETER

...now my short test code...

    the chr$("TEST") is a UNICODE string
    0045F3C4 >54 00 45 00 53 00 54 00 00 T.E.S.T..

    invoke WideCharToMultiByte,CP_UTF8,0,chr$("TEST"),-1h,0,0,0,0
    eax = 5                           <-- length I need + NULL termination
    mov ebx, eax
    mov edi, alloc(eax)               <--- alloc 5 bytes
    invoke WideCharToMultiByte,CP_UTF8,0,chr$("TEST"),ebx,edi,ebx,0,0
    eax = 5 bytes were written to the buffer in EDI

    005E8390 54 45 53 54 00 TEST.

...so in this case it seems to work: read the required size, allocate it, convert it. Now I try the same with the symbol...

    _some_ db 3Dh,0D8h,8Bh,0DCh,0,0   <-- symbol 💋 I read from the ListView

    invoke WideCharToMultiByte,CP_UTF8,0,addr _some_,-1h,0,0,0,0
    eax = 5
    mov ebx, eax
    mov edi, alloc(eax)               <-- alloc 5 bytes
    invoke WideCharToMultiByte,CP_UTF8,0,addr _some_,ebx,edi,ebx,0,0
    eax = 0 / GLE: ERROR_INSUFFICIENT_BUFFER (0000007A)

    EDI buffer is written
    006B46A8 F0 9F 92 8B 00 💋.

...so you see the difference. The first example works and the second does not, because the buffer is supposedly too small, BUT why is the question. The converted string needs 5 bytes too (including the zero terminator) and not more, yet the function fails telling me the buffer is not big enough. I tried increasing cchMultiByte from 5 to 6 to 7 up to 0Ah and it still fails. Only when I use 0Bh instead of the 5 the function told me to use does it work. The question is why it needs 0Bh bytes, writing lots of bytes I don't even need. Below are the bytes that get converted with 0Bh. I only need the first 4 bytes, so what is the rest? What is it for? Why does this function not work correctly in this case?

    006C43A8 F0 9F 92 8B 00 E4 95 94 E5 91 93 00 💋.䕔呓.

So my understanding is that the function just fails for whatever reason. Maybe you can try the same with the same string / symbol bytes as me and see what YOU get. If you understand why you get 05 on testing and 0Bh on converting, then please try to explain it to me. One more time: I just want to use the functions correctly to prevent errors, that's all.

PS: Does anyone know of, or have, a tiny app (x86) that uses those functions? Maybe if I debug it and see it in action I would understand and follow better what it does.

greetz
LCF-AT Posted May 3 Author
Hi again,

I think I found the solution now. Everything works fine when I use -1h for cchWideChar in both API calls. No idea why, but this works.

    invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,-1h,NULL,NULL,NULL,NULL
    mov _namelenght, eax        ; use for WideCharToMultiByte cchMultiByte
    dec eax                     ; subtract the 0 termination byte
    mov _namelenghtnew, eax     ; raw size, used for the write function
    add eax, eax                ; extra space to be safe
    mov _new, alloc(eax)
    invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,-1h,_new,_namelenght,NULL,NULL
    ...
    add _writtenbytes, fwrite (_filehandle,_new,_namelenghtnew)

In this case the function no longer returns NULL with the buffer error. I just don't know why it did not work when I used what I thought was the right value for cchWideChar in the examples I posted before. Now simply using -1h solves it. Does anyone know why?

greetz
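A likely answer to the "why", following adoxa's earlier explanation: the size query with -1h stops at the first zero terminator, but the earlier second call passed the returned byte count (5) as cchWideChar, which asks for five UTF-16 code units and so reads past the terminator. The sketch below is a rough Python model, not the real API: `wide_to_utf8` is a hypothetical stand-in covering only the count and size behaviour discussed here, and the two code units after the terminator in `memory` are assumptions reconstructed from the extra bytes E4 95 94 E5 91 93 in the 0Bh dump.

```python
def wide_to_utf8(units: list[int], cch_wide: int, cb_out: int) -> int:
    """Rough model of WideCharToMultiByte(CP_UTF8, ...) sizing rules.

    units    -- UTF-16 code units as they sit in memory
    cch_wide -- -1: read up to and including the first 0 unit;
                otherwise convert exactly that many code units
    cb_out   -- 0: size query; otherwise capacity of the output buffer
    Returns the byte count, or 0 when the buffer is too small.
    """
    if cch_wide == -1:
        cch_wide = units.index(0) + 1
    raw = b"".join(u.to_bytes(2, "little") for u in units[:cch_wide])
    needed = len(raw.decode("utf-16-le").encode("utf-8"))
    if cb_out == 0:
        return needed
    return needed if needed <= cb_out else 0

# 💋, its terminator, then whatever followed _some_ in memory (assumed).
memory = [0xD83D, 0xDC8B, 0x0000, 0x4554, 0x5453]

size = wide_to_utf8(memory, -1, 0)          # size query with -1
print(size)                                 # bytes incl. terminator
print(wide_to_utf8(memory, size, size))     # byte count reused as cchWideChar
print(wide_to_utf8(memory, size, 0x0B))     # succeeds only with 0Bh bytes
print(wide_to_utf8(memory, -1, size))       # -1 in both calls
print(wide_to_utf8(memory, 2, 4))           # explicit count of 2 code units
```

In this model the "TEST" case works by coincidence: its five UTF-8 bytes come from exactly five code units including the terminator, so reusing the byte count as a character count happens to be correct for plain ASCII. Using either -1h in both calls, or the lstrlen character count in both calls, keeps the size query and the conversion consistent.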