How to format specific text so that it displays correctly?

April 19, 2024

Hi guys,

So I've got a little problem again with those Unicode / symbol characters in a text buffer that I want to convert to readable text and print on a static control. At first I had text showing strange symbol characters instead of the real text, like this below...

"Youâ€"
is
"You’ve"

...and I was then using the MultiByteToWideChar function with code page CP_UTF8 to convert my ANSI text buffer to Unicode. After that the text displayed correctly using the SetDlgItemTextW function. Fine so far, I thought. Then I found another problem with another symbol, like this...

Q&amp;A
is
Q&A

...and I tried to use the same function as above, but in this case I got this result back...

Qamp;A

!? My question now is: when I have any unknown text in a buffer in ASCII / ANSI style, I want to convert this buffer into a 100% readable Unicode buffer that I can use with the SetDlgItemTextW function, so the text displays exactly like the original. What is the right method for this?
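The garbled apostrophe in the first example can be reproduced outside of Windows. The sketch below is Python rather than MASM, purely to show the byte-level effect. It assumes the downloaded bytes are UTF-8 and uses cp1252 as a stand-in for the ANSI code page; both are assumptions, not something confirmed for the actual file.

```python
# Assumption: the downloaded bytes are UTF-8; cp1252 stands in for the ANSI code page.
original = "You\u2019ve"            # "You've" written with a typographic apostrophe (U+2019)
raw = original.encode("utf-8")      # the bytes as they would arrive from the download

# Wrong: every UTF-8 byte is treated as one ANSI character.
garbled = raw.decode("cp1252")

# Right: the three bytes E2 80 99 are decoded together as one character,
# which is conceptually what MultiByteToWideChar does with CP_UTF8.
correct = raw.decode("utf-8")

print(garbled)
print(correct)
```

If the source really is UTF-8, decoding it as UTF-8 is the whole fix; no per-symbol patching should be needed.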

greetz

April 20, 2024

I don't know where you are sourcing your text from; possibly you can check the BOM, if it exists.

If the text is from a reliable source you could try utilising the IsTextUnicode function...
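A BOM check is just a comparison of the first few bytes of the buffer. As a rough sketch (in Python for brevity, with a hypothetical helper name `sniff_bom`; UTF-32 is deliberately ignored here, which is an assumption of this example):

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) when no BOM is present."""
    boms = (
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    )
    for bom, name in boms:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"hello"))
```

A missing BOM proves nothing either way: plain ANSI and BOM-less UTF-8 both return `(None, 0)`, so this only helps when the file actually carries one.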

Ted.

April 20, 2024

If you want to change "standard" (&<>) HTML entities it would be simplest to search and replace manually; not sure what the best approach would be if you want to convert unknown HTML to text.
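For the manual search-and-replace route, one detail is easy to get wrong: if `&amp;` is replaced before the other entities, text such as `&amp;quot;` gets decoded twice. A single pass avoids that. The sketch below is Python, not MASM, and the names `BASIC_ENTITIES` and `decode_basic_entities` are made up for the illustration; `html.unescape` is Python's standard-library decoder and covers the full named-entity list.

```python
import html
import re

# Hypothetical minimal table: only the five "standard" entities.
BASIC_ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}
_ENTITY_RE = re.compile(r"&(lt|gt|amp|quot|apos);")

def decode_basic_entities(text: str) -> str:
    # Single pass: each entity is decoded exactly once,
    # so "&amp;lt;" becomes "&lt;" and not "<".
    return _ENTITY_RE.sub(lambda m: BASIC_ENTITIES[m.group(1)], text)

print(decode_basic_entities("Q&amp;A"))
print(html.unescape("&copy; 2024 Q&amp;A"))   # full entity table from the standard library
```

With a chain of separate replace calls, the same effect can be had by handling `&amp;` last.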

Dialog text uses & to underline the next character, so they should be replaced with && for a literal &.
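The doubling is a plain string replacement done after any entity decoding. A minimal sketch, again in Python and with a hypothetical function name:

```python
def escape_mnemonic(text: str) -> str:
    """Double every '&' so that a dialog or static control shows it literally."""
    return text.replace("&", "&&")

print(escape_mnemonic("You & Me"))
print(escape_mnemonic("Q&A"))
```

Static controls also have the SS_NOPREFIX style, which turns off the prefix handling altogether; whether that fits depends on how the control is created.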

April 20, 2024

Hi guys,

I was downloading a text file from the internet, and when I print it into a static control I get wrong, mixed-up symbols or letters. I would like to prevent that, but how is the question. By the way, I tried using the IsTextUnicode function, but I can only use the RtlIsTextUnicode function and it always crashes inside...

invoke RtlIsTextUnicode,addr STRINGBUFFER ,sizeof STRINGBUFFER, IS_TEXT_UNICODE_ASCII16

...so I don't know about all those specific text / symbol encodings, or whatever they are called, but it really sucks and I just want some simple format / fix functions I can run over my text buffer to make it OK.

@adoxa

Yes, it seems I have to remove those HTML entities from the text buffer to format it correctly, but how? Is there no ready-made function I could use? Otherwise I have to make it myself. How many HTML entities are there that I have to check for? Or which are the most commonly used? I made this quick function...

Remove_HTML_Entities proc uses edi esi ebx _buffer:DWORD
	
	invoke szRep,_buffer,_buffer,chr$("&lt;"),   chr$("<")
	invoke szRep,_buffer,_buffer,chr$("&gt;"),   chr$(">")
	invoke szRep,_buffer,_buffer,chr$("&amp;"),  chr$('&')
	invoke szRep,_buffer,_buffer,chr$("&quot;"), chr$('"')
	invoke szRep,_buffer,_buffer,chr$("&apos;"), chr$("'")
	invoke szRep,_buffer,_buffer,chr$("&cent;"), chr$("¢")
	invoke szRep,_buffer,_buffer,chr$("&pound;"),chr$("£")
	invoke szRep,_buffer,_buffer,chr$("&yen;"),  chr$("¥")
	invoke szRep,_buffer,_buffer,chr$("&euro;"), chr$("€")
	invoke szRep,_buffer,_buffer,chr$("&copy;"), chr$("©")
	invoke szRep,_buffer,_buffer,chr$("&reg;"),  chr$("®")
	Ret
Remove_HTML_Entities endp

...to remove some of those entities. It seems to work OK so far, but now I found another problem. When the entity &amp; is found and replaced with & and I send that string buffer to my static control, the "&" is not displayed. Why? When I show that string buffer in a message box, the "&" is displayed. So why is the & not showing when I use it in a string? This also fails...

invoke SendMessage,STATIC_HANDLE,WM_SETTEXT,0,chr$("You & Me")
=
"You Me"
and not
"You & Me"

Why? Is there any style I have to enable to make it display the "&" as well?

greetz

April 21, 2024

Read the last sentence of my previous message again...

April 25, 2024

On 4/21/2024 at 6:18 AM, LCF-AT said:

I tried using the IsTextUnicode function, but I can only use the RtlIsTextUnicode function and it always crashes inside

Check the third parameter; it is an in/out parameter...

Quote

[in, out, optional] lpiResult

https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-istextunicode

Ted.

April 25, 2024

OK thanks Ted,

So it seems to work OK so far when using 0 as the last parameter. Not sure whether it will also work with those symbol characters in the string, 💋🖤🧛‍♀️ etc. I cannot use those symbols in WinASM itself to test it quickly, so I just get ??????? there.

By the way, I have another question. I was trying to display the same text on a static control and an edit control, and I got an issue. The text does not display the same when it has only an LF (10 / 0Ah) byte instead of CRLF (13,10 / 0Dh 0Ah). Why? In the static control it displays with new lines, and in the edit control everything is displayed on one line. That's strange, isn't it? Is there any extra flag I have to use for the edit control to treat the LF / 0Ah byte like a CRLF? At the moment I wrote a function to replace all 0Ah bytes in the text with 0Dh 0Ah to make it work, but for this I have to allocate a new buffer. I just want to know whether I can skip handling it manually like this and simply tell the edit control to display LF as CRLF.

greetz

April 27, 2024

On 4/26/2024 at 5:54 AM, LCF-AT said:

In the static control it displays with new lines, and in the edit control everything is displayed on one line.

Have you tried setting the edit control with EM_FMTLINES?

On 4/26/2024 at 5:54 AM, LCF-AT said:

so I just get ??????? there.

I have some test code using IsTextUnicode that may be of help, see code block below.

Raymond Chen suggests some alternate options...

https://devblogs.microsoft.com/oldnewthing/20150223-00/?p=44613

EnableExplicit

; ***********************************************************************************************************************
;- Enumerations
; ***********************************************************************************************************************

#IS_TEXT_UNICODE_ASCII16            = $0001
#IS_TEXT_UNICODE_REVERSE_ASCII16    = $0010

#IS_TEXT_UNICODE_STATISTICS         = $0002
#IS_TEXT_UNICODE_REVERSE_STATISTICS = $0020

#IS_TEXT_UNICODE_CONTROLS           = $0004
#IS_TEXT_UNICODE_REVERSE_CONTROLS   = $0040
;#IS_TEXT_UNICODE_BUFFER_TOO_SMALL  = $0000   ; MSDN has this documented yet the value does not exist?

#IS_TEXT_UNICODE_SIGNATURE          = $0008
#IS_TEXT_UNICODE_REVERSE_SIGNATURE  = $0080

#IS_TEXT_UNICODE_ILLEGAL_CHARS      = $0100
#IS_TEXT_UNICODE_ODD_LENGTH         = $0200
#IS_TEXT_UNICODE_DBCS_LEADBYTE      = $0400
#IS_TEXT_UNICODE_NULL_BYTES         = $1000

#IS_TEXT_UNICODE_UNICODE_MASK       = $000F
#IS_TEXT_UNICODE_REVERSE_MASK       = $00F0
#IS_TEXT_UNICODE_NOT_UNICODE_MASK   = $0F00
#IS_TEXT_UNICODE_NOT_ASCII_MASK     = $F000

; ***********************************************************************************************************************
;- Declarations
; ***********************************************************************************************************************

Declare.i IsTextUnicodeV2(*String, *End)
Declare.s ConcatenateFlags(FlagString.s, Flag.s, ValueToSearch, ValueToFind)
Declare.s ConcatenateValues(lpiResult)

; ***********************************************************************************************************************
;- IsTextUnicode - File Tests
; ***********************************************************************************************************************

Debug "ANSI"
IsTextUnicodeV2(?ANSII_Start, ?ANSII_End)

Debug "UTF8"
IsTextUnicodeV2(?UTF8_Start, ?UTF8_End)

Debug "UTF8_BOM"
IsTextUnicodeV2(?UTF8_BOM_Start, ?UTF8_BOM_End)

Debug "UTF16_BE"
IsTextUnicodeV2(?UTF16_BE_Start, ?UTF16_BE_End)

Debug "UTF16_LE"
IsTextUnicodeV2(?UTF16_LE_Start, ?UTF16_LE_End)

; ***********************************************************************************************************************
;- IsTextUnicode - Datasection Tests
; ***********************************************************************************************************************

Debug "DATA_ASCII"
IsTextUnicodeV2(?DATA_ASCII_Start, ?DATA_ASCII_End)

Debug "DATA_UNICODE"
IsTextUnicodeV2(?DATA_UNICODE_Start, ?DATA_UNICODE_End)

Debug "DATA_STRING"
IsTextUnicodeV2(?DATA_STRING_Start, ?DATA_STRING_End)

; ***********************************************************************************************************************
;- IsTextUnicode - Memory Tests
; ***********************************************************************************************************************

Define String1.s = "abcdefghijklmnopqrstuvwxyz"
Define String2.s = "Bush hid the facts"              ; https://en.wikipedia.org/wiki/Bush_hid_the_facts
Define *Memory
Define iSize

Debug "MEMORY_ABC_ASCII"
*Memory = Ascii(String1.s)
iSize = StringByteLength(String1.s, #PB_Ascii)
;ShowMemoryViewer(*Memory, iSize)
;Debug PeekS(*Memory, Len(String1.s), #PB_Ascii)
IsTextUnicodeV2(*Memory, *Memory+iSize)
FreeMemory(*Memory)

Debug "MEMORY_ABC_UTF8"
*Memory = UTF8(String1.s)
iSize = StringByteLength(String1.s, #PB_UTF8)
;Debug PeekS(*Memory, Len(String1.s), #PB_UTF8)
;ShowMemoryViewer(*Memory, iSize)
IsTextUnicodeV2(*Memory, *Memory+iSize)
FreeMemory(*Memory)

; https://en.wikipedia.org/wiki/Bush_hid_the_facts

Debug "MEMORY_BUSH_ASCII"
*Memory = Ascii(String2.s)
iSize = StringByteLength(String2.s, #PB_Ascii)
;ShowMemoryViewer(*Memory, iSize)
;Debug PeekS(*Memory, Len(String2.s), #PB_Ascii)
IsTextUnicodeV2(*Memory, *Memory+iSize)
FreeMemory(*Memory)

Debug "MEMORY_BUSH_UTF8"
*Memory = UTF8(String2.s)
iSize = StringByteLength(String2.s, #PB_UTF8)
;Debug PeekS(*Memory, Len(String2.s), #PB_UTF8)
;ShowMemoryViewer(*Memory, iSize)
IsTextUnicodeV2(*Memory, *Memory+iSize)
FreeMemory(*Memory)

End

Procedure.i IsTextUnicodeV2(*StringStart, *StringEnd)
  Protected lpiResult.l, Result, a
  Protected iSize = *StringEnd - *StringStart
  
  ; Create a temporary array structure.
  
  Structure istextunicode_value Align #PB_Structure_AlignC
    Flag.l
  EndStructure

  Protected Dim istextunicode_value.istextunicode_value(15)
  
  ; Add in the flag values.
  
  istextunicode_value(00)\Flag = #IS_TEXT_UNICODE_ASCII16
  istextunicode_value(01)\Flag = #IS_TEXT_UNICODE_REVERSE_ASCII16
  istextunicode_value(02)\Flag = #IS_TEXT_UNICODE_STATISTICS
  istextunicode_value(03)\Flag = #IS_TEXT_UNICODE_REVERSE_STATISTICS
  istextunicode_value(04)\Flag = #IS_TEXT_UNICODE_CONTROLS
  istextunicode_value(05)\Flag = #IS_TEXT_UNICODE_REVERSE_CONTROLS
  istextunicode_value(06)\Flag = #IS_TEXT_UNICODE_SIGNATURE
  istextunicode_value(07)\Flag = #IS_TEXT_UNICODE_REVERSE_SIGNATURE
  istextunicode_value(08)\Flag = #IS_TEXT_UNICODE_ILLEGAL_CHARS
  istextunicode_value(09)\Flag = #IS_TEXT_UNICODE_ODD_LENGTH
  istextunicode_value(10)\Flag = #IS_TEXT_UNICODE_DBCS_LEADBYTE
  istextunicode_value(11)\Flag = #IS_TEXT_UNICODE_NULL_BYTES
  istextunicode_value(12)\Flag = #IS_TEXT_UNICODE_UNICODE_MASK
  istextunicode_value(13)\Flag = #IS_TEXT_UNICODE_REVERSE_MASK
  istextunicode_value(14)\Flag = #IS_TEXT_UNICODE_NOT_UNICODE_MASK
  istextunicode_value(15)\Flag = #IS_TEXT_UNICODE_NOT_ASCII_MASK

  Debug "**********"
  Debug "iSize: " + iSize
  
  ; Cycle through all the flags stored in the array.
  
  For a = 0 To 15
    lpiResult.l = istextunicode_value(a)\Flag
    Result = IsTextUnicode_(*StringStart, iSize, @lpiResult)
    
    If Result
      Debug "Flag being checked: $" + Hex(istextunicode_value(a)\Flag, #PB_Long) + " | lpiResult: $" + Hex(lpiResult, #PB_Long)
      Debug ConcatenateValues(lpiResult)
    EndIf
  Next a

EndProcedure

Procedure.s ConcatenateValues(lpiResult)
  Protected FlagString.s
  
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_ASCII16",             lpiResult, #IS_TEXT_UNICODE_ASCII16)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_ASCII16",     lpiResult, #IS_TEXT_UNICODE_REVERSE_ASCII16)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_STATISTICS",          lpiResult, #IS_TEXT_UNICODE_STATISTICS)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_STATISTICS",  lpiResult, #IS_TEXT_UNICODE_REVERSE_STATISTICS)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_CONTROLS",            lpiResult, #IS_TEXT_UNICODE_CONTROLS)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_CONTROLS",    lpiResult, #IS_TEXT_UNICODE_REVERSE_CONTROLS)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_SIGNATURE",           lpiResult, #IS_TEXT_UNICODE_SIGNATURE)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_SIGNATURE",   lpiResult, #IS_TEXT_UNICODE_REVERSE_SIGNATURE)
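  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_ILLEGAL_CHARS",       lpiResult, #IS_TEXT_UNICODE_ILLEGAL_CHARS)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_ODD_LENGTH",          lpiResult, #IS_TEXT_UNICODE_ODD_LENGTH)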
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_DBCS_LEADBYTE",       lpiResult, #IS_TEXT_UNICODE_DBCS_LEADBYTE)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_NULL_BYTES",          lpiResult, #IS_TEXT_UNICODE_NULL_BYTES)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_UNICODE_MASK",        lpiResult, #IS_TEXT_UNICODE_UNICODE_MASK)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_REVERSE_MASK",        lpiResult, #IS_TEXT_UNICODE_REVERSE_MASK)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_NOT_UNICODE_MASK",    lpiResult, #IS_TEXT_UNICODE_NOT_UNICODE_MASK)
  FlagString.s = ConcatenateFlags(FlagString.s, "#IS_TEXT_UNICODE_NOT_ASCII_MASK",      lpiResult, #IS_TEXT_UNICODE_NOT_ASCII_MASK)
  
  If FlagString.s = ""
    FlagString.s = "#NULL"
  EndIf
  
  ProcedureReturn FlagString.s
EndProcedure

Procedure.s ConcatenateFlags(FlagString.s, Flag.s, ValueToSearch, ValueToFind)
  Protected Space.s = " | "
  
  If ValueToSearch & ValueToFind = ValueToFind
    If Not FlagString.s = ""
      FlagString.s = FlagString.s + Space.s
    EndIf
    
    FlagString.s = FlagString.s + Flag.s
  EndIf  
  
  ProcedureReturn FlagString.s
EndProcedure

; ***********************************************************************************************************************
;- Datasection
; ***********************************************************************************************************************

DataSection
  ANSII_Start:
  IncludeBinary "readme_ANSI.txt"
  ANSII_End:
  
  UTF8_Start:
  IncludeBinary "readme_UTF8.txt"
  UTF8_End:
  
  UTF8_BOM_Start:
  IncludeBinary "readme_UTF8_BOM.txt"
  UTF8_BOM_End:
  
  UTF16_BE_Start:
  IncludeBinary "readme_UTF16_BE.txt"
  UTF16_BE_End:
  
  UTF16_LE_Start:
  IncludeBinary "readme_UTF16_LE.txt"
  UTF16_LE_End:
  
  DATA_ASCII_Start:
  Data.a "abcdefghijklmnopqrstuvwxyz"
  DATA_ASCII_End:
  
  DATA_UNICODE_Start:
  Data.u "abcdefghijklmnopqrstuvwxyz"
  DATA_UNICODE_End:
  
  DATA_STRING_Start:
  Data.s "abcdefghijklmnopqrstuvwxyz"
  DATA_STRING_End:
EndDataSection

Ted.

April 27, 2024

Hey Ted,

thanks for the new info so far. I tried using the EM_FMTLINES message with TRUE and FALSE, but in both cases my text in the edit control still looks the same: no line breaks, everything is displayed on one line. Multiline is enabled so I can see the entire long text inside the edit control, and I also set VSCROLL to scroll through the whole text, but as I said, everything is displayed on one line. When I copy the one-line text and paste it into Notepad, it all shows correctly. It seems Notepad handles this by itself already, but how do I handle it in my edit control?

About that IsTextUnicode issue: the "where it guesses wrong." page on the Microsoft website is forbidden. Otherwise that code of yours looks pretty strange. So I have to check all possible flags, and then? That entire string / code page thing is really confusing.

As I said, normally I was just using ANSI strings in my code, but that is limited when I have to deal with Unicode / symbol characters. I still don't know what I should do in this case.

1.) Set __UNICODE__ EQU 1 in my source to make everything Unicode, and double the sizes after all functions that return the length of strings (lstrlen / wsprintf / other functions). Also, in the case of wsprintf I have to check buffer sizes separately because it is limited to 1024 bytes only.

2.) Use a mixed style of ANSI and Unicode strings.

Not so sure about it. How do you guys handle that problem? How do you do that in PureBasic or C / C++ etc.? Do you need to care about that or not?

Oh, by the way, I converted the app we talked about a while ago completely to Unicode and it seems to work, but I had to change a lot of things in the code and I still don't know whether I did everything correctly.

greetz

April 28, 2024

Regarding EM_FMTLINES, from the 2003 R2 PSDK help:

Quote

This message affects only the buffer returned by the EM_GETHANDLE message and the text returned by the WM_GETTEXT message. It has no effect on the display of the text within the edit control.

The EM_FMTLINES message does not affect a line that ends with a hard line break. A hard line break consists of one carriage return and a line feed.

Vista and later have an extended style to recognise LF: ES_EX_ALLOWEOL_LF; 10 has an option to set the EOL character: EM_SETENDOFLINE.

April 28, 2024

On 4/28/2024 at 7:19 AM, LCF-AT said:

About that IsTextUnicode issue: the "where it guesses wrong." page on the Microsoft website is forbidden.

Have a look on Wikipedia, it has an entry all about it, also check the references at the bottom of the page...

https://en.wikipedia.org/wiki/Bush_hid_the_facts

On 4/28/2024 at 7:19 AM, LCF-AT said:

Otherwise that code of yours looks pretty strange. So I have to check all possible flags, and then? That entire string / code page thing is really confusing.

The code consists of tests of the function that I wrote a long time ago to check its results against differently encoded text sources. Each flag is passed to the function to check against the string. It is probably a little convoluted to use for reference.

To simplify the code, in short it looks like this...

#IS_TEXT_UNICODE_STATISTICS = $0002

lpiResult = #IS_TEXT_UNICODE_STATISTICS

If IsTextUnicode_(?STRING_Start, ?STRING_End-?STRING_Start, @lpiResult)
  If lpiResult = #IS_TEXT_UNICODE_STATISTICS
    Debug "The text is probably Unicode, with the determination made by applying statistical analysis."
  Else
    Debug lpiResult
  EndIf
EndIf

DataSection
  STRING_Start:
  Data.a "Bush hid the facts"
  STRING_End:
EndDataSection

Ted.

April 29, 2024

Hi again,

@adoxa Thanks for the info. I could not test it yet because I don't have the defined dec or hex values for those new styles.

WS_EX_ACCEPTFILES                    equ 10h
....
ES_EX_ALLOWEOL_CR
ES_EX_ALLOWEOL_LF
ES_EX_ALLOWEOL_ALL
ES_EX_CONVERT_EOL_ON_PASTE
ES_EX_ZOOMABLE

EM_SETENDOFLINE

EC_ENDOFLINE_DETECTFROMCONTENT
EC_ENDOFLINE_CRLF
EC_ENDOFLINE_CR
EC_ENDOFLINE_LF

I don't have them and haven't found them yet. I just found an old commctrl.h file from 2017. The other problem is that it's only supported by some of the latest versions, Vista or later. No Win7. Hmm. OK, the problem that could happen is that the text is not displayed the way I want. OK, let's test it. Just post the values next time or tell me where to find them. Thanks.

@Teddy Rogers OK thanks you too. I will try to check that function some more with different string types to see how to use it in the best manner.

greetz

April 29, 2024

4 hours ago, LCF-AT said:

OK thanks you too. I will try to check that function some more with different string types to see how to use it in the best manner.

I think we went off on a tangent anyway, I was trying to suggest checking the source file/s you are downloading before converting to Unicode.

If you are confident the source file is an ANSI file, there is the function EngMultiByteToUnicodeN that you can try using.

Can you attach the text file you are trying to convert? I'll have a look here...

Ted.

April 30, 2024

If you want Win7 support then you have no choice: replace LF with CRLF.

EM_SETEXTENDEDSTYLE     (ECM_FIRST + 10)
ES_EX_ALLOWEOL_CR             0x0001L
ES_EX_ALLOWEOL_LF             0x0002L
ES_EX_ALLOWEOL_ALL            (ES_EX_ALLOWEOL_CR | ES_EX_ALLOWEOL_LF)
ES_EX_CONVERT_EOL_ON_PASTE    0x0004L

EM_SETENDOFLINE         (ECM_FIRST + 12)
EC_ENDOFLINE_DETECTFROMCONTENT  0
EC_ENDOFLINE_CRLF               1
EC_ENDOFLINE_CR                 2
EC_ENDOFLINE_LF                 3
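For the Windows 7 case, the LF to CRLF replacement only has to touch bare LFs, so text that already contains CRLF is not doubled up. A small sketch of that rule, in Python rather than MASM and with a hypothetical function name:

```python
import re

def lf_to_crlf(text: str) -> str:
    # Replace only bare LFs; an LF already preceded by CR is left alone.
    return re.sub(r"(?<!\r)\n", "\r\n", text)

sample = "line1\nline2\r\nline3"
print(repr(lf_to_crlf(sample)))
```

In the worst case (every character is an LF) the result is twice the input length, which gives a safe upper bound when the new buffer is allocated.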

Edited April 30, 2024 by adoxa

May 1, 2024

Hi again guys,

@adoxa What value is ECM_FIRST? What does this "L" mean? 0x0001L <-- what kind of value is this? I can only use equ with a hex or dec value, you know.

Alright, I found a strange issue with that CR LF thing. I made an app in both ANSI and Unicode style, and in both I use an edit control to display a URL which I break at the URL parameter characters "?" and "&" to display the parameters line by line. In the ANSI app it works, but in the Unicode app it does not. Why? I do the same thing but get different results. This is my small code...

ANSI Version
CONTROL "",IDC_INPUTPARAMETERS,"Edit",0x56a11004,5,90,400,140,0x00020200
		.elseif ax == IDC_BreakParam	; break URL at params and print them out
			invoke GetDlgItem,hWin,IDC_URL
			mov esi, eax
			invoke SendMessage,esi,WM_GETTEXTLENGTH,0,0	
			.if eax
				invoke RtlZeroMemory,addr DropNameBuffer,sizeof DropNameBuffer
				invoke SendMessage,esi,WM_GETTEXT,sizeof DropNameBuffer,addr DropNameBuffer
				lea edi, DropNameBuffer
				xor ebx, ebx
				.while byte ptr [edi] != NULL
					.if byte ptr [edi] == '?' || byte ptr [edi] == '&'
						mov byte ptr [edi], 10 ; cr
						inc ebx
					.endif
					inc edi
				.endw
					invoke SetDlgItemText,hWin,IDC_INPUTPARAMETERS,addr DropNameBuffer



UNICODE Version
CONTROL "",IDC_INPUTPARAMETERS,"Edit",0x56a11004,5,90,400,140,0x00020200
		.elseif ax == IDC_BreakParam	; break URL at params and print them out
			invoke GetDlgItem,hWin,IDC_URL
			mov esi, eax
			invoke SendMessage,esi,WM_GETTEXTLENGTH,0,0	
			.if eax
				invoke RtlZeroMemory,addr DropNameBuffer,sizeof DropNameBuffer
				invoke SendMessage,esi,WM_GETTEXT,sizeof DropNameBuffer,addr DropNameBuffer
				lea edi, DropNameBuffer
				xor ebx, ebx
				.while word ptr [edi] != NULL
					.if byte ptr [edi] == '?' || byte ptr [edi] == '&'
						mov byte ptr [edi], 10 ; cr
						inc ebx
					.endif
					inc edi
					inc edi
				.endw
					invoke SetDlgItemText,hWin,IDC_INPUTPARAMETERS,addr DropNameBuffer

...so in both cases I check for '?' and '&', and when I find one I replace it with the value 10 (dec) / 0Ah (hex) for CR or LF (I always forget which is which). Then I just send the changed string to the edit control via the SetDlgItemText function. In the ANSI version it works and displays all parameters line by line, but in the Unicode version everything is displayed on one line again. So what is the reason it works for A but not for W? The controls use the same style. It also makes me wonder why it works in the ANSI case at all.

@Teddy Rogers In the first place I'm trying to get rid of all those text encoding problems. The main goal was to change my apps from ANSI to Unicode / symbol support so that ALL my text displays correctly, like in Notepad. I also found out that in a Unicode app ("__UNICODE__ EQU 1") I have to save my own text content as UTF-8, using the WideCharToMultiByte function for all text I want to export to a file, and the MultiByteToWideChar function to read all text from that file back into my app, to make everything display right. That's pretty annoying. In this case I also have questions about how to find the right buffer length when using those functions. I see I can call them like this...

The text has a length of 00003AF1h bytes (15089)

edi = Buffer with text from extern file UTF-8
Function below should get length I need...
invoke MultiByteToWideChar,CP_UTF8,0,edi,-1h,0,0
=
00003AE6h bytes (15078) = (11 bytes less)
Then I double that space to alloc a free section
Then calling same function with all parameters
0019FA10   00404631  /CALL to MultiByteToWideChar from bones.0040462C
0019FA14   0000FDE9  |CodePage = FDE9
0019FA18   00000000  |Options = 0
0019FA1C   007FE940  |StringToMap = "......stringstuff....."
0019FA20   00003AF1  |StringSize = 3AF1 (15089.)
0019FA24   00802BB8  |WideCharBuf = 00802BB8
0019FA28   000075CC  \WideBufSize = 75CC (30156.)
=
eax 00003AE5h bytes / double size was written to buffer

...now in case of saving the text content...

0019F4AC   004043BB  /CALL to WideCharToMultiByte from bones.004043B6
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   0045FAFC  |WideCharStr = "/*=\"
0019F4BC   FFFFFFFF  |WideCharCount = FFFFFFFF (-1.)
0019F4C0   00000000  |MultiByteStr = NULL
0019F4C4   00000000  |MultiByteCount = 0
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL
=
eax 5 but it should be 4!=? Why 5?

0019F4AC   004043E1  /CALL to WideCharToMultiByte from bones.004043DC
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   0045FB08  |WideCharStr = "/*=\"
0019F4BC   00000008  |WideCharCount = 8
0019F4C0   0019F4D8  |MultiByteStr = 0019F4D8
0019F4C4   00000004  |MultiByteCount = 4  <---- I set to 4 not 5
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL
=
eax 0 <-- why 0? / ERROR_INSUFFICIENT_BUFFER
So the text was written to buffer!

0019F4AC   004043FE  /CALL to WideCharToMultiByte from bones.004043F9
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   0019F518  |WideCharStr = "Curl DL"
0019F4BC   FFFFFFFF  |WideCharCount = FFFFFFFF (-1.)
0019F4C0   00000000  |MultiByteStr = NULL
0019F4C4   00000000  |MultiByteCount = 0
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL
=
eax 8 but should be 7 = ?

0019F4AC   0040442F  /CALL to WideCharToMultiByte from bones.0040442A
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   0019F518  |WideCharStr = "Curl DL"
0019F4BC   0000000E  |WideCharCount = E (14.)
0019F4C0   00839D80  |MultiByteStr = 00839D80
0019F4C4   00000007  |MultiByteCount = 7
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL
=
eax 0 but text was written to buffer!

Somehow it doesn't seem to work correctly to get the right buffer size I need for the translated strings, and even when I use a buffer of the correct size it returns 0 although the string was written into the buffer correctly. It returns 0 (fail), but when I use a bigger buffer size it also copies more bytes into the new buffer that I don't want there, like random or 0 bytes. So my question is how to use those 2 functions correctly: first to get the number of bytes I need, and then to call the function correctly in the second round. Any advice would be welcome.

PS: I'm still not finished with my Unicode app, handling everything to double the sizes. Also in the case of WM_GETTEXTLENGTH I need to double the result to allocate the correct buffer length to go on working with it. Changing an app source from ANSI to Unicode is really a super PITA. Not sure how you guys handle that problem in your coding languages like C++ and others.

greetz

May 1, 2024

5 minutes ago, LCF-AT said:

What value is ECM_FIRST? What does this "L" mean? 0x0001L <-- what kind of value is this? I can only use equ with a hex or dec value, you know.

ECM_FIRST is equal to 0x1500 and is defined in the CommCtrl.h header of the Windows SDK.

A numerical value with the L suffix is used to have the value be interpreted as a 'long' value instead of the 'int' default.
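Putting the definitions quoted earlier in the thread together with ECM_FIRST = 0x1500, the numeric values can be worked out as below. This is a Python sketch that only does the arithmetic and prints the results in MASM `equ` notation; the helper `masm_hex` is made up for the illustration, and the constants are taken from the posts above rather than checked against a current SDK header.

```python
ECM_FIRST = 0x1500                      # hexadecimal, as stated above
EM_SETEXTENDEDSTYLE = ECM_FIRST + 10    # the "+ 10" and "+ 12" are decimal offsets
EM_SETENDOFLINE = ECM_FIRST + 12
ES_EX_ALLOWEOL_CR = 0x0001              # the trailing L only affects the C type, not the value
ES_EX_ALLOWEOL_LF = 0x0002
ES_EX_ALLOWEOL_ALL = ES_EX_ALLOWEOL_CR | ES_EX_ALLOWEOL_LF

def masm_hex(value: int) -> str:
    """Format a value as a MASM hex literal (leading 0 if it starts with A-F)."""
    digits = f"{value:X}"
    if digits[0] in "ABCDEF":
        digits = "0" + digits
    return digits + "h"

for name, value in [
    ("ECM_FIRST", ECM_FIRST),
    ("EM_SETEXTENDEDSTYLE", EM_SETEXTENDEDSTYLE),
    ("EM_SETENDOFLINE", EM_SETENDOFLINE),
    ("ES_EX_ALLOWEOL_ALL", ES_EX_ALLOWEOL_ALL),
]:
    print(f"{name:<22} equ {masm_hex(value)}")
```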

May 1, 2024

Do you mean...

ECM_FIRST	equ 1500
or
ECM_FIRST	equ 1500h

...as I said, the CommCtrl.h I found on my computer is from 2017 and does not have those defines inside. Long or int... I just know hex or dec.

May 2, 2024

You've really never seen 0x before? You can't infer that it means hex? I have a CommCtrl.h from 2006 (2003 R2 PSDK) that defines ECM_FIRST.

Your Unicode version isn't working because you're still using bytes, not words (Windows Unicode uses 16-bit characters, UTF-16). CR is 13 (0Dh), LF is 10 (0Ah).
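To make the "words, not bytes" point concrete, here is a small Python sketch of the intended result: each '?' or '&' becomes a CR LF pair, and in the wide (UTF-16) form every one of those characters occupies a 16-bit unit. The function name and the example.com URL are made up for the illustration.

```python
import re

def break_url_params(url: str) -> str:
    """Put each URL parameter on its own line, using CR LF (0Dh 0Ah) as the separator."""
    return re.sub(r"[?&]", "\r\n", url)

text = break_url_params("http://example.com/page?a=1&b=2")   # hypothetical URL
wide = text.encode("utf-16-le")   # the memory layout a W function works on

print(repr(text))
print(len(text), "characters,", len(wide), "bytes")
print(" ".join(f"{b:02X}" for b in wide[46:50]))   # the first CR LF pair as UTF-16 bytes
```

Note that the string grows by one character per separator, so the in-place overwrite used in the posted loop has no room for the extra CR.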

Your length is one more because -1 means determine length automatically, including the NUL terminator.

The length error is because the wide versions still use characters, not bytes.
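The numbers from the debugger output above can be checked by hand. This Python sketch (helper names are hypothetical) counts UTF-16 code units, which is what the wide-character count parameter expects, and UTF-8 bytes with and without the terminating NUL:

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units, i.e. the 'character' count the wide APIs use."""
    return len(text.encode("utf-16-le")) // 2

def utf8_sizes(text: str):
    """Return (bytes_without_nul, bytes_with_nul) for the UTF-8 form of text."""
    n = len(text.encode("utf-8"))
    return n, n + 1

for s in ("/*=\\", "Curl DL"):
    print(repr(s), utf16_units(s), utf8_sizes(s))
```

So "Curl DL" is 7 wide characters (not 14), needs 7 UTF-8 bytes, and a size query made with a length of -1 reports 8 because the NUL is counted.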

May 3, 2024

Hi,

Of course I have seen 0x before! I just don't keep everything in mind forever. The CommCtrl.h file I found on my PC is from the MinGW include folder and there is nothing with ECM inside. The file is 122 KB.

Back to the Wide / Multi functions. So the question is how I have to handle the byte counts I get back from those functions when using the WriteFile function. How do I handle it 100% correctly? I have a listview with 2 rows, and for testing I inserted this into both...

💋

...symbol. Now when I try to export this content using the CP_UTF8 code page, I would like to check the exact size I need to allocate, copy the symbol as UTF-8 into my buffer, and export it via the WriteFile function. How to handle it is the question. So my app is Unicode. Look...

...so you can see the 2 symbols, right? My exported file should look like this in the end...

/*=\💋/*=\/*=\💋/*=\

...when I open it in Notepad, and it should have the UTF-8 code page. The first thing I do is call the ListViewGetItemText / lstrlen function, and it returns 2 in eax, which means the content / symbol has a length of 2 bytes in this case, without the 0 termination. Next I want to check the byte length of this symbol when I use CP_UTF8, which is what I want to write into the export file. For this I call...

invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,-1h,NULL,NULL,NULL,NULL

0019F4AC   00404626  /CALL to WideCharToMultiByte from bones.00404621
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   0019F518  |WideCharStr = "??" <-- _namebuffer
0019F4BC   FFFFFFFF  |WideCharCount = FFFFFFFF (-1.)
0019F4C0   00000000  |MultiByteStr = NULL
0019F4C4   00000000  |MultiByteCount = 0
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL
_namebuffer below
0019F518  3D D8 8B DC 00 00                                =Ø‹Ü..

eax = 5 with 0 termination

...so it tells me that I would need 4 raw bytes to use with the WriteFile function. OK so far. As the next step I want to allocate free space to copy that new UTF-8 string into a buffer I want to use with WriteFile. The first question here is how MUCH buffer space I must allocate. It tells me 5 bytes, or 4 without the termination. So here I just allocate 1000 bytes for testing and see what happens next. Now I want to write the string symbol into my new buffer as UTF-8...

mov _new, alloc(1000)
invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,_namelenghtnew,_new,1000,NULL,NULL

0019F4AC   0040465A  /CALL to WideCharToMultiByte from bones.00404655
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   0019F518  |WideCharStr = "??" <--- _namebuffer
0019F4BC   00000005  |WideCharCount = 5
0019F4C0   05525E80  |MultiByteStr = 05525E80
0019F4C4   000003E8  |MultiByteCount = 3E8 (1000.)
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL
_namebuffer
0019F518  3D D8 8B DC 00 00                                =Ø‹Ü..                                              

eax = 7 / below filled buffer
05525E80  F0 9F 92 8B 00 7E 00 00                          ðŸ’‹.~..

....so why does it NOW return a length of 7 so it did told me before it would have a length of 5 + termination and now I got 7 in eax back. Why does it differ? Do I have to use a byte length of 4 or 6 without termination? That's one thing I don't understand yet.

Now remember, there is another same string / symbol...I run the same functions again and the first check functions tells me again the string would have 5 bytes + termination (like before too) and then I call the second API to write the string in buffer as UTF8 (same as before) but now I get a other results in eax back...

invoke WideCharToMultiByte,CP_UTF8,0,_isCommand,_isCommandlenghtnew,_new_2,1000,NULL,NULL

0019F4AC   004046A7  /CALL to WideCharToMultiByte from bones.004046A2
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   05520B00  |WideCharStr = "??"  <--- _isCommand
0019F4BC   00000005  |WideCharCount = 5
0019F4C0   05526280  |MultiByteStr = 05526280
0019F4C4   000003E8  |MultiByteCount = 3E8 (1000.)
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL

_isCommand                                               
05520B00  3D D8 8B DC 00 00                                =Ø‹Ü..
_new_2 / buffer filled below
05526280  F0 9F 92 8B 00 00 EA AE AB 00 00 00 00 00 00 00  ðŸ’‹..ê®«.......
eax = 9

...so what's this now? It is totally different. Why did it write more into the buffer and return 9 in eax? It's the same step as before. I don't see what I did wrong to get that strangely long buffer on the second pass. Do you see what is wrong?

So my goal is super simple: read my LV content / get the exact byte length I need for UTF-8 / allocate the space / copy the string into the new buffer as UTF-8 / write the UTF-8 buffer to file via the WriteFile function, where I need the exact byte length of the buffer without termination. So I ask again, do you see anything I did wrong to get those strange results? Otherwise maybe you have a better idea how to handle the export to file as UTF-8 and reading that content back via MultiByteToWideChar with CP_UTF8. I don't have that size / length problem handled 100% correctly yet. It's totally confusing. Any good idea / help to manage that would be helpful, or a tiny example of how to call the functions correctly.
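
The export / read-back round trip described above can be sketched without the Win32 calls. This is only a minimal Python illustration, not the MASM code from the thread: the helper names and the file name export.txt are made up. It shows that the length to write is the UTF-8 byte count without any terminator, and that decoding the file as UTF-8 restores the original string.

```python
import os
import tempfile

def export_utf8(path, text):
    """Write text as raw UTF-8 (no BOM, no terminator); return bytes written."""
    data = text.encode("utf-8")
    with open(path, "wb") as f:
        return f.write(data)

def import_utf8(path):
    """Read the whole file back and decode it from UTF-8."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")

kiss = "\U0001F48B"  # the symbol from the ListView, U+1F48B
path = os.path.join(tempfile.mkdtemp(), "export.txt")  # hypothetical file name
written = export_utf8(path, kiss)
restored = import_utf8(path)
print(written, restored == kiss)
```

For this symbol the write is 4 bytes, which matches the F0 9F 92 8B seen in the dumps.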

greetz

May 3, 2024

Again, WideCharCount is wide characters, not bytes, so it should be two, as lstrlen told you (0D83Dh, 0DC8Bh, for a UTF-16 surrogate pair; they make the four UTF-8 bytes 0F0h, 09Fh, 092h, 08Bh). Five is three more than you want, so you're also converting whatever else happens to be in the buffer.
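
The over-read described here can be reproduced outside the API. The following is a rough Python model, not the real WideCharToMultiByte: the helper `convert` is hypothetical, and the bytes placed after the terminator are assumptions reconstructed from the two output dumps (a stray 7E in the first case, AB AB in the second).

```python
KISS_UTF16 = bytes.fromhex("3DD88BDC")  # U+1F48B as UTF-16LE, as in the dumps

def convert(memory, wide_count):
    """Convert the first wide_count UTF-16 code units of memory to UTF-8."""
    units = memory[: wide_count * 2]  # 2 bytes per code unit
    return units.decode("utf-16-le").encode("utf-8")

# Assumed memory contents: the string, its terminator, then leftover bytes.
first_buffer = KISS_UTF16 + bytes.fromhex("00007E000000")
second_buffer = KISS_UTF16 + bytes.fromhex("00000000ABAB")

pair_only = convert(first_buffer, 2)    # just the surrogate pair
first_over = convert(first_buffer, 5)   # pair + terminator + 2 stray units
second_over = convert(second_buffer, 5)

print(len(pair_only), len(first_over), len(second_over))
print(second_over.hex(" "))
```

With a count of 2 only the surrogate pair is converted (4 bytes). With a count of 5 the result depends on what follows the string in memory, which would explain getting 7 for one buffer and 9 for the other.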

May 3, 2024

Hi again,

normally I should ALSO be able to use that function to get the string length back, but somehow it gets messed up. When I just take the return value from lstrlen x2 (= 4 bytes in the case of 💋), allocate 4 bytes or a few more, and pass that to WideCharToMultiByte, it returns eax 0 with ERROR_INSUFFICIENT_BUFFER even though it did copy the 4 bytes into the buffer. Somehow that does not make much sense to me.

0019F4AC   0040462F  /CALL to WideCharToMultiByte from bones.0040462A
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   0019F518  |WideCharStr = "??"  <--- 💋
0019F4BC   00000004  |WideCharCount = 4   <---
0019F4C0   006FA300  |MultiByteStr = 006FA300 <---
0019F4C4   00000004  |MultiByteCount = 4  <---
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL
💋
0019F518  3D D8 8B DC 00                                   =Ø‹Ü.
buffer to return
006FA300  00 00 00 00 00 00 00 00                          ........
=
eax 0 (ERROR_INSUFFICIENT_BUFFER (0000007A))
buffer was written 4 bytes 💋 UTF-8
006FA300  F0 9F 92 8B 00 00 00 00                          ðŸ’‹....

When I use a higher value for MultiByteCount, like 16, it returns eax 7. That is 5 bytes with 0 termination + extra trash. Why? Is it just me, or does it make no sense!? When I use the right count of 4 it returns eax 0 + the buffer error. How do I correctly check eax after calling the function to verify whether it worked or failed? Do you understand what I mean?

greetz

May 3, 2024

https://stackoverflow.com/questions/215963/how-do-you-properly-use-widechartomultibyte

For the size, first call it once like this:
int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);

May 3, 2024

9 hours ago, LCF-AT said:

0019F4AC   0040462F  /CALL to WideCharToMultiByte from bones.0040462A
0019F4B0   0000FDE9  |CodePage = FDE9
0019F4B4   00000000  |Options = 0
0019F4B8   0019F518  |WideCharStr = "??"  <--- 
0019F4BC   00000004  |WideCharCount = 4   <--- INCORRECT! This is character count, not bytes. Parameter should be 2.
0019F4C0   006FA300  |MultiByteStr = 006FA300 <---
0019F4C4   00000004  |MultiByteCount = 4  <--- INCORRECT! If you first use the function it should return 2 for cbMultiByte.
0019F4C8   00000000  |pDefaultChar = NULL
0019F4CC   00000000  \pDefaultCharUsed = NULL

See my corrections above. Your code should look like this...

EnableExplicit

Define *lpMultiByteStr
Define cbMultiByte.i
Define lpMultiByteStr.s
Define cchWideChar.i

lpMultiByteStr = "??"
cchWideChar = Len(lpMultiByteStr)

cbMultiByte = WideCharToMultiByte_(#CP_UTF8, #Null, @lpMultiByteStr, cchWideChar, #Null, #Null, #Null, #Null)

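If cbMultiByte
  
  ; Allocate exactly the number of bytes the first call reported.
  *lpMultiByteStr = AllocateMemory(cbMultiByte)
  
  If *lpMultiByteStr
    
    ; The second call performs the conversion and returns the bytes written.
    Debug WideCharToMultiByte_(#CP_UTF8, #Null, @lpMultiByteStr, cchWideChar, *lpMultiByteStr, cbMultiByte, #Null, #Null)
    
    Debug PeekS(*lpMultiByteStr, cchWideChar, #PB_UTF8)
    ShowMemoryViewer(*lpMultiByteStr, cbMultiByte)
    
    ; Free only a buffer that was actually allocated.
    FreeMemory(*lpMultiByteStr)
    
  EndIf
EndIf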

WideCharToMultiByte - You seem to be mixing up bytes and characters and possibly the purpose of the function.

Another thing to keep in mind is that the output length is not fixed. The characters you are using, "??", are in the ASCII range, where UTF-8 needs one byte per character, so they require two bytes. If you were to use "ЃЃ" the function would return four bytes required for the buffer, six if the characters were "ﳝﳝ", eight if "𐌀𐌀"...
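
The variable length is easy to check for the sample characters mentioned above. This is just a small Python sketch (it does not call WideCharToMultiByte) that compares the number of UTF-16 code units with the number of UTF-8 bytes for each two-character string.

```python
samples = ["??", "ЃЃ", "ﳝﳝ", "𐌀𐌀"]  # the strings from the post above

wide_counts = []
byte_counts = []
for text in samples:
    # UTF-16LE uses 2 bytes per code unit, so halve the encoded length.
    wide_counts.append(len(text.encode("utf-16-le")) // 2)
    byte_counts.append(len(text.encode("utf-8")))

for text, wide, utf8 in zip(samples, wide_counts, byte_counts):
    print(f"{text!r}: {wide} wide chars -> {utf8} UTF-8 bytes")
```

Only the last sample needs surrogate pairs, so it is the only one where the wide character count (4) differs from the number of visible characters (2).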

Ted.

May 3, 2024

Hi guys,

so I still get that ERROR_INSUFFICIENT_BUFFER error even though I allocate as many bytes as the new string needs. Why? Also, in your code @Teddy Rogers I don't see any check of what you get in EAX after the second call to WideCharToMultiByte. You also take the length of the string with Len and pass that value to the first call to WideCharToMultiByte instead of using -1h. Why? So in other words I can just do the steps below...

invoke lstrlen,addr _namebuffer			<-- Unicode string any
add eax, eax					<-- double bytes in EAX
mov _namelenght, eax
mov _new, alloc(eax)
invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,_namelenght,_new,_namelenght,NULL,NULL
eax = 0 but string was written / who cares about function checking right!?

add _writtenbytes, fwrite (_filehandle,_new,_namelenght)

...! That's pretty uncool of course. I am really trying to understand that function, but without success when it comes to the bytes I need and allocating the right space. Look at this example I made with 2 different strings. Also, the WideCharToMultiByte documentation tells me this...

WideCharToMultiByte
------------------------------
Return Values

If the function succeeds, and cchMultiByte is nonzero, the return value is the number of bytes written to the buffer pointed to by lpMultiByteStr. 
If the function succeeds, and cchMultiByte is zero, the return value is the required size, in bytes, for a buffer that can receive the translated string. 
If the function fails, the return value is zero. To get extended error information, call GetLastError. GetLastError may return one of the following error codes:
ERROR_INSUFFICIENT_BUFFER
ERROR_INVALID_FLAGS
ERROR_INVALID_PARAMETER

...now my short example test code...

chr$("TEST") is a UNICODE string
0045F3C4 >54 00 45 00 53 00 54 00 00                       T.E.S.T..

invoke WideCharToMultiByte,CP_UTF8,0,chr$("TEST"),-1h,0,0,0,0
	eax = 5 <-- length I need + NULL termination
mov ebx, eax
mov edi, alloc(eax) <--- alloc 5 bytes
invoke WideCharToMultiByte,CP_UTF8,0,chr$("TEST"),ebx,edi,ebx,0,0
	eax = 5 bytes was written to buffer in EDI
	005E8390  54 45 53 54 00                                   TEST.

...so in this case it seems to work: read the required size / allocate it / convert it. Now I try the same with the symbol...

_some_	db 3Dh,0D8h,8Bh,0DCh,0,0  <-- Symbol of 💋 I read from Listview
                                      
invoke WideCharToMultiByte,CP_UTF8,0,addr _some_,-1h,0,0,0,0
	eax = 5
mov ebx, eax
mov edi, alloc(eax) <-- alloc 5 bytes
invoke WideCharToMultiByte,CP_UTF8,0,addr _some_,ebx,edi,ebx,0,0
	eax = 0 / GLE: ERROR_INSUFFICIENT_BUFFER (0000007A)
	EDI buffer is written
	006B46A8  F0 9F 92 8B 00                                   ðŸ’‹.

...so you see the difference. The first example works and the second does not because of an insufficient buffer, BUT why is the question. The converted string needs 5 bytes too (including the 0 terminator) and not more, but the function fails, telling me the buffer is not big enough. I then tried increasing cchMultiByte from 5 to 6 to 7 up to 0Ah and it still fails. Only when I use a value of 0Bh, instead of the 5 the function told me to use, does it work. The question is why it needs 0Bh bytes to write lots of bytes I don't even need!? Below are the bytes that get converted with 0Bh. I only need the first 4 bytes, so what is the rest? What is it for? Why does the function not work correctly in this case?

006C43A8  F0 9F 92 8B 00 E4 95 94 E5 91 93 00              ðŸ’‹.ä•”å‘“.

So in my understanding the function just fails for whatever reason. Maybe you can try the same thing using the same string / symbol bytes as me and see what YOU get. If you understand why you get 05 on testing and 0Bh on converting, then please try to explain it to me. One more time, I just want to use the functions correctly to prevent errors, that's all.
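
One possible reading of the 05 versus 0Bh result is that the byte count returned by the first call (5) is reused as the wide character count in the second call. The Python model below is a sketch of that reading, not the real API: `convert` is a hypothetical helper, and the assumption that the ANSI text "TEST" follows _some_ in memory is inferred from the extra bytes E4 95 94 E5 91 93 in the dump.

```python
def convert(memory, wide_count):
    """Convert the first wide_count UTF-16 code units of memory to UTF-8."""
    return memory[: wide_count * 2].decode("utf-16-le").encode("utf-8")

test_w = "TEST\0".encode("utf-16-le")        # 5 code units incl. terminator
kiss_w = "\U0001F48B\0".encode("utf-16-le")  # 3 code units incl. terminator

# ASCII text: 5 code units become 5 bytes, so reusing the byte count works.
test_bytes = convert(test_w, 5)

# The symbol: 3 code units become 5 bytes.
kiss_bytes = convert(kiss_w, 3)

# Passing 5 as the wide count reads two code units past the terminator.
# Assumption: those 4 bytes are the ANSI string "TEST".
over_read = convert(kiss_w + b"TEST", 5)

print(len(test_bytes), len(kiss_bytes), len(over_read))
print(over_read.hex(" "))
```

Under that assumption the model gives 11 (0Bh) bytes for the over-read, with the same byte values as the dump, while the correct wide count of 3 gives the expected 5.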

PS: Does anyone know of, or have, a tiny app (x86) that uses those functions? Maybe if I debug it and see it in action I would be able to follow it better.

greetz

May 3, 2024

Hi again,

I think I found the solution now. Everything works fine when using -1h for cchWideChar in both API calls. No idea why, but this works.

invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,-1h,NULL,NULL,NULL,NULL
mov _namelenght, eax			; use for WideCharToMultiByte ccMultiByte
dec eax					; sub 0 termination byte
mov _namelenghtnew, eax			; raw size use for write function
add eax, eax				; extra space to be safe
mov _new, alloc(eax)
invoke WideCharToMultiByte,CP_UTF8,0,addr _namebuffer,-1h,_new,_namelenght,NULL,NULL
...
add _writtenbytes, fwrite (_filehandle,_new,_namelenghtnew)

In this case the function no longer returns NULL with the buffer error. I just don't know why it did not work when using the right value for cchWideChar in the examples I posted before. Now just using -1h solves it. Does anyone know why?

greetz
