Tuts 4 You

How to handle UNICODE chars?


LCF-AT


Hi guys,

I don't remember whether I already asked about this in the past (I can't find the post), so I'm asking now. A while ago I noticed that I have trouble with filenames that contain special characters such as symbols. You mostly see this in YouTube video titles that use funny symbols etc. The problem is that I have only been using the normal ASCII (A) functions in my MASM code, and those can't handle such symbol characters. Now I would like to fix that, but how?

Example: How do I get the complete, correct filename (with symbols) from a WM_DROPFILES message, edit controls etc.? How do I read those strange filenames and how do I handle them?

Filename 💘 With Strange 💌 Symbols.file

I need to handle the name above and later export it like this..

Filename 💘 With Strange 💌 Symbols_Output.file

All in all, I think I just forgot how this works, but I need to know it now to fix those problems. :) Maybe you can help quickly. Thanks.

greetz

Link to comment

Hello LCF,

declare the strings with dw (16-bit words) and use the Unicode (W) versions of the WinAPI functions. I guess they can handle filenames that contain emoji (I didn't try it).

NL.
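
For reference, a minimal sketch of the W-function route for a dropped file. It assumes the MASM32 SDK include files and a window that already accepts drops (DragAcceptFiles or WS_EX_ACCEPTFILES). OnDropFiles, wszPath and hFile are made-up names, and the code is untested:

```asm
.386
.model flat, stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\shell32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\shell32.lib

.data?
wszPath dw MAX_PATH dup(?)      ; UTF-16 buffer, one WORD per code unit
hFile   dd ?

.code
; Hypothetical helper: call it from the WM_DROPFILES handler with wParam.
OnDropFiles proc hDrop:DWORD
    ; Index 0 = first dropped file. The size argument counts characters, not bytes.
    invoke DragQueryFileW, hDrop, 0, addr wszPath, MAX_PATH
    .if eax != 0
        ; The wide path goes straight into the W version of CreateFile.
        invoke CreateFileW, addr wszPath, GENERIC_READ, FILE_SHARE_READ, \
               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
        mov hFile, eax
        .if eax != INVALID_HANDLE_VALUE
            ; ... ReadFile etc. would go here ...
            invoke CloseHandle, hFile
        .endif
    .endif
    invoke DragFinish, hDrop
    ret
OnDropFiles endp

end
```

The explicit W names work whether or not __UNICODE__ is defined, so the rest of an ASCII project can stay as it is. An emoji simply occupies two WORDs (a surrogate pair) in the buffer.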

Link to comment

You could either handle Unicode or strip out the non-ANSI chars. Personally, unless the name is in a different language, I would strip them out or change them, because emojis, for example, will only work on the site you are getting the files from and won't be shown in your filename, even if you handle Unicode.

Link to comment

Hi guys,

maybe a few words from me about my MASM sources. Ever since I started coding I have normally just used the ASCII (A) functions instead of the (W) functions. Yesterday I remembered something fearless said in the past about a specific equate in the MASM source that makes the app build in Unicode mode. I found this equate...

__UNICODE__ EQU 1

...which I have to set right where I include my inc/lib paths, before .const / .data / .code. So I set it in my source and tried to build, and got some errors because my source also uses macros like "chr$" to set strings inline. So I removed those macros and put the strings under .data. After those changes the exe was built and it ran, but not correctly: my LISTVIEW is no longer visible, among other things. All in all it doesn't work normally with those changes. I made a backup of my normal ASCII source so that I can go back to it. So the main question is how to handle those UNICODE or SYMBOL mixed filenames?

@NewLearner

So you mean I have to set the string like this...

.data
string_1			db '$url',0
string_2			db '$name',0
string_3			db '$path',0
string_4			db '$ext',0

to this?

.data
string_1			dw '$url',0
string_2			dw '$name',0
string_3			dw '$path',0
string_4			dw '$ext',0

....dw = define word (16 bits per element), right? What would that achieve?

@NOP

OK, how do I strip out those special chars? And in this case the question is also how to get / use the actual file path. If I strip the symbols out of the name, I can't access the file anymore because the file path doesn't match.

Example: I drag & drop a file whose name contains special chars into my app, and I need that path / filename for other processes I want to run, like ffmpeg and others. So I already have the problem of reading the filename IF it has those special chars; I already fail at WM_DROPFILES / the DragQueryFile function. :( I just wonder how to make it work with those filenames. Do you guys have a tiny example source / exe where you can drag a file in and write it out as a new file etc.? Every time I have to work with such strange filenames my apps fail to read them and I need to rename the files, which is really bad. I just need to find a working solution for this.

greetz

Link to comment

PS: By the way, I checked my newly built file in Olly and see that only my own strings changed from ASCII to UNICODE, but all calls still go to the ASCII (A) versions instead of W. Why? Shouldn't the "__UNICODE__ EQU 1" equate change the functions too?

Link to comment

Can't you strip out the characters before saving, so you save under a new filename and then do any processing on the saved file with the new filename?

Or do you process directly via the URL? In that case you have no option but to use the original name as it is, and you need to use Unicode.

Link to comment

Somehow I have to read the file first (CreateFile etc.), but how? In the end I also want to use the real filename for the saved file. No idea how to handle that stuff. :(

Link to comment

Hi again,

in the MASM help file I found this info...

Unicode Support



The MASM32 include files now have system wide unicode support by the inclusion of the equate __UNICODE__ at the start of the source code BEFORE any include files. This factor is critical as the include files need to know which prototype system is being used .

The form of the equate is as follows,

__UNICODE__ equ 1

The Windows API prototypes in the include file occur in this form,

AddAtomA PROTO STDCALL :DWORD
IFNDEF __UNICODE__
  AddAtom equ <AddAtomA>
ENDIF

AddAtomW PROTO STDCALL :DWORD
IFDEF __UNICODE__
  AddAtom equ <AddAtomW>
ENDIF

If you define the equate __UNICODE__ the UNICODE  form of the API is provided in source code, if the equate is not defined the ASCII form of the API is provided.

NOTE that all existing ASCII code written before unicode support work as they have always worked, the inclusion of the __UNICODE__ equate offers the additional option of coding directly in unicode without having to use the direct API name with the trailing "W".

...so I set that equate at the top of my source (__UNICODE__ equ 1), but it does not produce W functions; the A functions are still used. Does anyone know how to make it work? The build options for my project in WinAsm are these...

Assemble=/c /coff /Cp  /nologo
Link=/SUBSYSTEM:WINDOWS /RELEASE /VERSION:4.0

...maybe someone could give some advice.

greetz

Link to comment

Hi @jackyjask,

I'm not using MASM64, just x86.

\masm32\Bin
\masm32\include
\masm32\Lib

Something isn't working right. In one source I tried that equate, __UNICODE__ EQU 1, and most of the ASCII strings were changed to UNICODE strings, but no API call was changed...

004013CC  MOV DWORD PTR DS:[47D5D4],OFFSET ??00EA        ;  UNICODE "Name"  <-----
004013D6  MOV DWORD PTR DS:[47D5D0],7E
004013E0  PUSH OFFSET lvc                                ; /lParam = 47D5C8
004013E5  PUSH 1                                         ; |wParam = 1
004013E7  PUSH 101B                                      ; |Message = MSG(101B)
004013EC  PUSH DWORD PTR DS:[LISTVIEW]                   ; |hWnd = NULL
004013F2  CALL _OpenClipboard@4                          ; \SendMessageA  <---- Still A not W

0045F2EC=OFFSET ??00EA (UNICODE "Name")
DS:[0047D5D4]=00000000
bones.asm:611.  mov     lvc.pszText,chr$("Name")  <----

....and in my other test sources the equate __UNICODE__ EQU 1 changes nothing. No idea what the problem is. Do I have to use some specific include files?

Example:

__UNICODE__ EQU 1

.data
SOMETEST			db "StringHere",0		; <--- none of these strings change!?
SOMETEST2			db 'StringHere2',0
CMD_1				db "cmd.exe /k %s",0
CMD_2				db 'cmd.exe /c %s',0


The strings were not changed from ASCII to UNICODE
-----------------------------------------------------------
00402B5A  PUSH 0045F0A9                                  ; |Format = "cmd.exe /k %s"
00402B5F  PUSH EDI                                       ; |s = bones.<ModuleEntryPoint>
00402B60  CALL <JMP.&user32.wsprintfA>                   ; \wsprintfA
00402B65  ADD ESP,0C
00402B68  JMP SHORT 00402B7D
00402B6A  PUSH 0045F0A7                                  ; /<%s> = """
00402B6F  PUSH 0045F0B7                                  ; |Format = "cmd.exe /c %s"
00402B74  PUSH EDI                                       ; |s = bones.<ModuleEntryPoint>
00402B75  CALL <JMP.&user32.wsprintfA>                   ; \wsprintfA

Now, on the other hand, if I use the chr$ macro with '' or with "" I get these results...

invoke SendMessage,esi,EM_REPLACESEL,FALSE,chr$("$ext")	; <-- changes to UNICODE
invoke SendMessage,esi,EM_REPLACESEL,FALSE,chr$('$ext')	; <-- does not change, or gives this error

error A2071: initializer magnitude too large for specified size

Strange anyhow. So the UNICODE equate only changes the ASCII strings in my source that use the chr$ macro, but the functions are still A and the strings in the .data section are still ASCII too. So how do I make it work now? Can anyone post a tiny example source that I could try, to see whether it works?

greetz

Link to comment

Hi again,

here is an example....

__UNICODE__ EQU 1

include 	\masm32\include\masm32rt.inc

.data
T1	db "123",0
T2  db 'test1',0

.data?

.code
start:
	invoke MessageBox,0,chr$("ABC"),addr T2,MB_ICONINFORMATION
	invoke ExitProcess,eax

end start

In Olly...

00B31000 <ModuleEntryPoint>          PUSH 40                                        ; /Style = MB_OK|MB_ICONASTERISK|MB_APPLMODAL
00B31002                             PUSH 00B33004                                  ; |Title = "test1"
00B31007                             PUSH 00B3300A                                  ; |Text = "A"
00B3100C                             PUSH 0                                         ; |hOwner = NULL
00B3100E                             CALL <JMP.&user32.MessageBoxA>                 ; \MessageBoxA
00B31013                             PUSH EAX                                       ; /ExitCode = 53FDF0
00B31014                             CALL <JMP.&kernel32.ExitProcess>               ; \ExitProcess
00B31019                             INT3
00B3101A                             JMP DWORD PTR DS:[<&user32.MessageBoxA>]       ;  user32.MessageBoxA
00B31020                             JMP DWORD PTR DS:[<&kernel32.ExitProcess>]     ;  KERNEL32.ExitProcess

Only the chr$("ABC") was changed to Unicode. What's the problem?

greetz

Link to comment

OK, the masm32 case.

I installed it from the official site

and tried to build the unicode example from this dir: c:\masm32\examples\unicode_generic\template\

It has

    __UNICODE__ equ 1           ; uncomment to enable UNICODE build

at the very top of  template.asm

Unicode build:

[screenshot of the Unicode build]

Now, if you comment out that line like

;    __UNICODE__ equ 1           ; uncomment to enable UNICODE build

you get the ...A (x86) build, e.g.:

[screenshot of the ANSI build]

so I'm not sure what's going on on your side, sorry... until you share your project

or maybe try to get a vanilla masm32 and try again?
Link to comment

Hi @jackyjask,

thanks for the info. I tried to build that template source and I also just get the A functions out. :( Then I installed MASM (masm32v11r.zip) + WinAsm in Windows Sandbox and built a dialog with __UNICODE__ equ 1 at the top and a MessageBox call, and there it works and produces the W function. Hmm! But strings like this..

.data
T1	db "TEST",0
T2  db "CAP",0

...are not changed to Unicode and are still ASCII. My question here is: how do I deal with strings in sections (.data) and strings directly in the source? Which strings get changed to Unicode and which don't? Are there any rules?

I also have another problem. After installing in Windows Sandbox, the linker option /DYNAMICBASE:NO no longer works! Why?

/SUBSYSTEM:WINDOWS /RELEASE /VERSION:4.0 /DYNAMICBASE:NO "/LIBPATH:\Masm32\Lib" "C:\WinAsm\Templates\Dialog\bones\bones.obj" "C:\WinAsm\Templates\Dialog\bones\bones.res" "/OUT:C:\WinAsm\Templates\Dialog\bones\bones.exe" 
LINK : warning LNK4044: unrecognized option "DYNAMICBASE:NO"; ignored

PS: I will try a fresh MASM install on my main OS.

greetz

Link to comment

EDIT: So I have now installed the MASM SDK on my main OS too. I also get that DYNAMICBASE error when building now. Otherwise the functions do change to W now, but the strings in the section stay the same... why? Look..

__UNICODE__ EQU 1

.486                                      ; create 32 bit code
.model flat, stdcall                      ; 32 bit memory model
option casemap :none                      ; case sensitive 

include \masm32\include\windows.inc       ; main windows include file
include \masm32\macros\macros.asm         ; masm32 macro file

include \masm32\include\masm32.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc

includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib


; These strings stay ASCII
.data
T1	db "123",0
T2  db 'test1',0

.data?
buf	dd ?
.code
start:
	lea eax, chr$("ANI")	; <-- changes to unicode
	lea eax, chr$("ANI2")	; <-- changes to unicode
;	lea eax, chr$('ANI3')	; <-- does not build if __UNICODE__ is set
	invoke MessageBox,0,offset T1,offset T2,MB_ICONINFORMATION
	invoke MessageBox,0,chr$("Test"),chr$("Caption"),MB_ICONINFORMATION
	invoke ExitProcess,eax
	invoke wsprintf,addr buf,addr T1,eax
	invoke wsprintf,addr buf, chr$("123 %s"),eax

end start

...the APIs do change to W = OK, and the strings set with the chr$("XY") macro also change to Unicode, but not the strings in the section, which means the W functions get ASCII strings. How do I change all the ASCII strings in the .data section to Unicode at once? Is there any setting so I don't have to change every single string? Olly output below...

00401000 <ModuleEntryPoint>        LEA EAX,DWORD PTR DS:[??0019]
00401006                           MOV EDI,EDI                                   ; SINA.<ModuleEntryPoint>
00401008                           LEA EAX,DWORD PTR DS:[??002C]
0040100E                           PUSH 40
00401010                           PUSH OFFSET T2                                ; ASCII "test1"
00401015                           PUSH OFFSET T1                                ; ASCII "123"
0040101A                           PUSH 0
0040101C                           CALL <JMP.&user32.MessageBoxW>
00401021                           LEA ECX,DWORD PTR DS:[ECX]
00401024                           PUSH 40
00401026                           PUSH 0040302C                                 ; UNICODE "Caption"
0040102B                           PUSH 00403020                                 ; UNICODE "Test"
00401030                           PUSH 0
00401032                           CALL <JMP.&user32.MessageBoxW>
00401037                           PUSH EAX
00401038                           CALL <JMP.&kernel32.ExitProcess>
0040103D                           PUSH EAX
0040103E                           PUSH OFFSET T1                                ; ASCII "123"
00401043                           PUSH 00403050
00401048                           CALL <JMP.&user32.wsprintfW>
0040104D                           ADD ESP,0C
00401050 _start                    PUSH EAX
00401051                           PUSH 0040303C                                 ; UNICODE "123 %s"
00401056                           PUSH 00403050
0040105B                           CALL <JMP.&user32.wsprintfW>
00401060                           ADD ESP,0C
00401063                           INT3
00401064                           JMP DWORD PTR DS:[<&user32.MessageBoxW>]      ; user32.MessageBoxW
0040106A                           JMP DWORD PTR DS:[<&user32.wsprintfW>]        ; user32.wsprintfW
00401070                           JMP DWORD PTR DS:[<&kernel32.ExitProcess>]    ; KERNEL32.ExitProcess

greetz

Link to comment

Good progress!

regarding

>; This strings keep ASCII .data T1 db "123",0 T2 db 'test1',0

yeah, that's true... there is no auto-magic that will convert ASCII strings defined with db (define byte) into Unicode ones...

you have to manually replace all your db statements with some macro for Unicode strings...

or you could go even more hardcore: add a 0 after each char, e.g.:

Unicode db 55h,0,6Eh,0,69h,0,63h,0,6Fh,0,64h,0,65h,0,0,0

but this is a very brutal way, I don't like it
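
A slightly less brutal variant of the same idea is dw with one character per element, which gives the same bytes as the db line with the zeros. A minimal sketch, assuming the MASM32 SDK paths; wszCaption and wszText are made-up names and the code is untested:

```asm
.386
.model flat, stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib

.data
; "Unicode" as UTF-16LE: each quoted character becomes one 16-bit WORD.
wszCaption dw "U","n","i","c","o","d","e",0

; Characters above U+FFFF need two WORDs (a surrogate pair).
; U+1F498 (the first emoji in the example filename) is D83Dh, DC98h.
wszText    dw 0D83Dh, 0DC98h, 0

.code
start:
    ; The explicit W name is used, so this does not depend on __UNICODE__.
    invoke MessageBoxW, 0, addr wszText, addr wszCaption, MB_ICONINFORMATION
    invoke ExitProcess, 0
end start
```

Note that dw "Unicode" in one piece does not do this; the characters have to be listed individually. Whether the emoji is actually drawn in the message box depends on the system font.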

 

 

Also, inside the masm32 help files (*.chm) you can find a lot of info about asm macros...

eg:

sas: Assign a string to a LOCAL variable.  __UNICODE__ aware.

cst: Copy one zero terminated string to another.  __UNICODE__ aware.

also this:

[screenshot from the help file]

File location: c:\masm32\help\hlhelp.chm 

Link to comment

Update regarding the /DYNAMICBASE linker option:

just run link.exe /?

and check whether you see that option; if not, you are using a very old MS linker

you have to "borrow" some more modern files from VS

eg: c:\VS2019\VC\Tools\MSVC\14.29.30133\bin\Hostx86\x86\link.exe 

 

Link to comment

Hi @jackyjask,

thanks for trying to help me with that problem. :) So changing all the ASCII strings in my sections + all the other ASM / include files by hand sounds like horror! Somehow a bad idea to manage that manually. I will just try it. By the way, does it make more sense to code / build ONLY in UNICODE mode instead of ASCII?

How do I handle those UNICODE strings now? Do you have a list of macros & functions I can use instead of the ASCII macros? I see there are also some issues with certain UNICODE macro names, e.g. WSTR sometimes causes problems where UCSTR does not. Do you have some helpful info that could help me handle everything correctly in UNICODE style? As I said, I have only been using the normal ASCII style. :)

I checked the linker version and see that the MASM linker, version 5.12, does not support that option. I also found other linker files on my HDD here..

C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\Hostx86\x86\

...which contains linker version 14.29 with many more options I can use. I just copied that link.exe + some DLL files (mspdb140.dll, tbbmalloc.dll) into my MASM bin folder and now the build works.

greetz

Link to comment

Hello!

well, ideally I guess you need to wrap all your good old ASCII strings in some kind of macro that is AI enough (hehe) to produce either ASCII or UNICODE depending on whether __UNICODE__ is defined..

so yeah, some manual work will be needed

I'm not much of an asm person, more C/C++, which has a similar approach, e.g.:  https://devblogs.microsoft.com/oldnewthing/20040212-00/?p=40643

If you decide to always use Unicode from now on, you could start using the WSTR macro

if you want to be flexible, you could use your own
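
As a rough idea of what such a home-made macro could look like: TSTR below is a made-up name, not part of the MASM32 SDK, and the sketch is untested. It uses FORC to emit one dw per character when __UNICODE__ is defined and falls back to a plain db string otherwise:

```asm
; Hypothetical helper macro. The text is passed in angle brackets, without quotes:
;   TSTR szCaption, <Caption>
TSTR MACRO lbl:REQ, text:REQ
    IFDEF __UNICODE__
        lbl LABEL WORD
        FORC ch, <text>
            dw "&ch"            ;; one 16-bit WORD per character
        ENDM
        dw 0                    ;; 16-bit terminator
    ELSE
        lbl db "&text", 0       ;; ordinary zero terminated ASCII string
    ENDIF
ENDM

.data
TSTR szText,    <Test>
TSTR szCaption, <Caption>

.code
    ; MessageBox resolves to MessageBoxA or MessageBoxW via the include files,
    ; and the strings follow the same switch.
    invoke MessageBox, 0, addr szText, addr szCaption, MB_ICONINFORMATION
```

This only covers plain ASCII text. Characters that mean something to the macro language (such as <, >, !, & or quotes) will most likely break it, and anything outside ASCII still has to be written as numeric dw values.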

anyway, I encourage you to read the \macros\macros.asm file - it has tons of useful macros...

also don't hesitate to read the official (and very old school) forum, eg: https://masm32.com/board/index.php?topic=2054.msg29631#msg29631

 

2) regarding the linker - some people like the Pelles C binaries (assembler / linker / resource util) - you could grab them from the official site and use them in your .bat file :)

e.g. I like these:

[screenshot of the tools]

but if you like MS products - up to you

 

Link to comment
  • 11 months later...

Hi again,

today I looked at my test project again, trying to turn my ASCII app into a UNICODE app. It is still not working, and I'm not sure whether it will be possible to change everything in my source anyway. Now I have a problem using an API function from the shlwapi module called StrFormatByteSize64. Normally the file builds fine in ASCII mode, but when I set "__UNICODE__ EQU 1" I get this error...

error A2006: undefined symbol : StrFormatByteSize64

....but why? The inc & lib are declared and inside I can read this...

StrFormatByteSize64A PROTO STDCALL :DWORD,:DWORD,:DWORD,:DWORD
IFNDEF __UNICODE__
  StrFormatByteSize64 equ <StrFormatByteSize64A>
ENDIF

...so why does it fail when I use the __UNICODE__ equate?

greetz

Link to comment
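
A likely explanation for the error above: shlwapi exports StrFormatByteSize64A but no StrFormatByteSize64W. In the C headers a Unicode build maps StrFormatByteSize64 to StrFormatByteSizeW, which already takes a 64-bit size. The include shown above only defines the alias inside IFNDEF __UNICODE__, so with the equate set the plain name is never defined. A sketch of calling the W function directly; qwSize and wszSize are made-up names, and the four-DWORD prototype is an assumption about shlwapi.inc (untested):

```asm
; Fragment for an existing MASM32 project that already includes
; shlwapi.inc and shlwapi.lib.

.data?
qwSize  dq ?                ; 64-bit file size, e.g. filled by GetFileSizeEx
wszSize dw 32 dup(?)        ; receives the formatted size text as UTF-16

.code
    ; The 64-bit size is passed as two DWORDs: low half first, then high half.
    ; The last argument is the buffer size in characters, not bytes.
    invoke StrFormatByteSizeW, dword ptr [qwSize], dword ptr [qwSize+4], \
           addr wszSize, 32
```

An alternative that keeps the old name would be a private alias such as StrFormatByteSize64 equ <StrFormatByteSizeW> placed after the includes, guarded by IFDEF __UNICODE__.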
