RYDB3RG Posted December 21, 2018

Let's assume we have this code:

test_proc proc
  VM_EAGLE_BLACK_START
  add rax, rcx
  add rax, rdx
  add rax, rsi
  add rax, rdi
  ret
  VM_EAGLE_BLACK_END
test_proc endp

So we have a single basic block with multiple inputs (RAX, RCX, RDX, RSI, RDI) and a single output (RAX). The protected version of that has about 10,000,000 instructions (Themida 2.4.6.0 demo).

Let's run it through Unicorn and connect instructions via their side effects. While we are at it, let's assume we have an unlimited number of registers, so we can remove memory indirections and connect instructions directly. Out of the initial 10 million instructions, how many contribute directly or indirectly to the output in RAX? About 300,000 (log_ins.txt). That's a little better, but still too much.

An instruction is considered constant when all its inputs are constant: they are either literal constant values or they were produced by instructions that are constant themselves. A memory read from an address we never wrote ourselves is also considered constant (it might have been written by the pre-OEP program, but this is irrelevant). Using this definition of constant, out of the 300k instructions that contribute to the result, how many are constant and can thus be const-folded? A lot: themida.svg. This graph contains everything that contributes to the final value of RAX but is not constant. The left number is a step number which corresponds to log_ins.txt, the right number is the result of the instruction.

You can already kind of see the original program, but it's still a bit obfuscated. You can also kind of see what Themida does most of the time: modifying values. Again and again. And every modification relies on thousands of instructions to produce the values that are used for the modification. And they somehow manage that everything depends on everything else, which is kind of impressive. It certainly makes it hard to identify opcode handlers or eliminate dead stores.

Now the final question: can this graph be optimized? Yes. (I created the IR program out of the initial 300k instructions without dropping the consts or doing any other optimization. LLVM's IR API does const folding by itself, otherwise it would be the full 300k instructions.)

Using Unicorn to extract the instruction dependencies obviously works only for basic blocks; it doesn't work across control flow without additional work. But this is another topic.

themida.svg, log_ins.txt and ir.ll are attached.

tuts4you_themida.rar
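To make the constness definition above concrete, here is a minimal sketch of the propagation rule in C++. The Node type and field names are hypothetical illustrations, not RYDB3RG's actual tool:

    #include <vector>

    // Hypothetical node: one per executed instruction or leaf value,
    // linked to the nodes that produced its operands.
    struct Node {
        std::vector<const Node*> inputs; // producers of this node's operands
        bool isLeafInput = false;        // initial register state (RAX, RCX, ...)
        bool isConst = false;
    };

    // Literal immediates and reads of memory we never wrote are leaves with
    // no producers and count as constant; the function's input registers are
    // the only non-constant leaves. Visiting nodes in execution order means
    // every producer is classified before its consumers.
    bool classify(Node& n) {
        if (n.isLeafInput)
            return n.isConst = false;
        n.isConst = true;
        for (const Node* in : n.inputs)
            if (!in->isConst) { n.isConst = false; break; }
        return n.isConst;
    }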
deepzero Posted December 22, 2018

Quote
Now the final question: can this graph be optimized? Yes.

How exactly did you do that? It's not clear to me. Also, doesn't Themida use opaque predicates and (pseudo-)random input? How did you deal with that?
RYDB3RG Posted December 23, 2018 (Author)

ir.ll is a literal translation of log_ins.txt to LLVM. For example:

%node_449381 = add i64 %node_44931f, 1481190108

corresponds to this line of log_ins.txt:

449381 000000014004bf0d add r12, 0x58492adc   r12 = 11111110b8c7e635 -> r12 = 1111111111111111, rflags = 216

(1481190108 == 0x58492adc)

ir.ll is a lot smaller than log_ins.txt because everything that is const is const-folded immediately (LLVM's IR API does that; it's not even an optimization pass). Another example:

%node_44a169 = add i64 %node_44a121, 140697368289557

corresponds to:

44a169 0000000140073976 add rcx, rbx   rcx = 111111118281a713, rbx = 7ff6a8a86115 -> rcx = 111191082b2a0828, rflags = 206

log_ins.txt tells us the value of RBX, which is 7ff6a8a86115, which corresponds to 140697368289557. But why was it const-folded? If you track back the creation of RBX, you'll find that it doesn't depend on anything that is not const (check log_ins_44a169_rbx.txt for the full dependency trace of RBX).

Regarding predicates: you are right, Themida is full of JCCs. I made an optimized version of all of them in jcc_ir.ll (the predicate funcs are self-contained, so they will contain the same stuff again and again). You'll see that most of them are const. Some of them depend on input registers (search for "node_" in jcc_ir.ll). For example:

define i1 @func_1696b_4564e5(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %rflags) local_unnamed_addr #2 {
entry:
  %node_456430 = add i64 %rax, -3965984730278271229
  %node_456431 = and i64 %node_456430, -140697950694205
  %node_456457 = xor i64 %node_456431, 140698832079904
  %node_456462 = add i64 %node_456457, -140697951312666
  %node_45649c = and i64 %node_456462, 135
  %node_4564a2 = sub i64 %node_45649c, %node_456462
  %0 = trunc i64 %node_4564a2 to i8
  %cmp11.i.i = icmp eq i8 %0, -126
  ret i1 %cmp11.i.i
}

Since it depends on RAX, Themida cannot predict the result either, so both outcomes need to be valid. My tool doesn't support that yet, but my guess is that if I follow both paths and generate corresponding IR, LLVM will be able to optimize that as well, and either one path will become a giant NOP or it will be a duplicate of the other path.

Btw, Themida also adds predicates to the actual application code. In this case, node_45bbd2 is part of the "fruit", yet Themida uses its result for a JCC:

define i1 @func_1883f_4b1adf(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %rflags) local_unnamed_addr #2 {
entry:
  %node_45bbd2 = add i64 %rcx, %rax
  %0 = trunc i64 %node_45bbd2 to i8
  %cmp11.i.i = icmp eq i8 %0, 122
  ret i1 %cmp11.i.i
}

more_logs.rar
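A side note on why the const folding comes for free: LLVM's IRBuilder uses a constant folder by default, so CreateAdd and friends return an already-folded ConstantInt whenever both operands are constants, and no instruction is emitted at all. A self-contained sketch (assuming a reasonably recent LLVM; this is an illustration, not the author's tool):

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    int main() {
        LLVMContext ctx;
        Module mod("demo", ctx);
        IRBuilder<> b(ctx);

        Type *i64 = b.getInt64Ty();
        FunctionType *fty = FunctionType::get(i64, {i64}, false);
        Function *f = Function::Create(fty, Function::ExternalLinkage, "demo", &mod);
        b.SetInsertPoint(BasicBlock::Create(ctx, "entry", f));

        // Both operands constant: IRBuilder hands back a folded ConstantInt,
        // no 'add' instruction ever appears in the module.
        Value *c = b.CreateAdd(b.getInt64(0x58492adc), b.getInt64(1));

        // One operand is a runtime value: a real 'add' instruction is emitted.
        Value *v = b.CreateAdd(f->getArg(0), c);
        b.CreateRet(v);

        mod.print(outs(), nullptr); // only the second add shows up
    }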
Etor Madiv Posted December 23, 2018

Very nice results. So you generated the log with Unicorn Engine and didn't use Pin? And how much time did it take to generate the whole log? For building the IR, does your tool parse the log file and feed it to LLVM through the API, or did you use some sort of symbolic expressions that do this just in time? Too many questions
RYDB3RG Posted December 23, 2018 (Author)

I use a Pintool to run the executable to OEP, then dump all modules and heap memory to file. Then I load everything into Unicorn and jump to the protected function. In the UC_HOOK_CODE handler, I inspect the current instruction and create/connect the corresponding nodes. When the protected function returns, I grab the node(s) that wrote RAX last (or any other register) and walk backwards to generate a dot graph or IR. There's a bunch of different node implementations; each one knows how to generate itself as IR and tries to const-evaluate itself (given that its inputs are const themselves). I prefer Unicorn because it's so much easier to debug than a Pintool. My tool needs about 35 seconds to run the protected function from above and generate optimized IR.
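For readers who haven't used Unicorn: the tracing part of such a pipeline hangs off a single UC_HOOK_CODE callback that fires once per executed instruction. A minimal skeleton under those assumptions (Unicorn's C API; build_nodes is a hypothetical placeholder for the node builder, and the addresses are left symbolic):

    #include <unicorn/unicorn.h>
    #include <cstdint>

    // Hypothetical hook-in point for the dependency-graph builder:
    // read the bytes at `addr`, disassemble, create/connect nodes.
    static void build_nodes(uc_engine *uc, uint64_t addr, uint32_t size) {
        // ... uc_mem_read + disassembly + node bookkeeping ...
    }

    static void hook_code(uc_engine *uc, uint64_t addr, uint32_t size, void *user) {
        build_nodes(uc, addr, size); // called before every executed instruction
    }

    int main() {
        uc_engine *uc;
        uc_open(UC_ARCH_X86, UC_MODE_64, &uc);

        // Map the dumped modules/heap here (uc_mem_map + uc_mem_write)
        // and restore the register state from the OEP dump (uc_reg_write).

        uc_hook h;
        // begin=1, end=0 is the Unicorn idiom for "hook all addresses".
        uc_hook_add(uc, &h, UC_HOOK_CODE, (void *)hook_code, nullptr, 1, 0);

        // Run from the protected function's entry to its return address:
        // uc_emu_start(uc, func_entry, return_addr, 0, 0);

        uc_close(uc);
    }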
null_endian Posted December 23, 2018

This is awesome work! I really want to get back to reversing Themida but haven't had much time this last year. We should create a Themida reversing group & channel, that'd be fun. Never used Unicorn, gonna check it out.
deepzero Posted December 23, 2018

Quote
ir.ll is a literal translation of log_ins.txt to LLVM

How do you do that? Last time I checked, they were using all kinds of x86 oddities and EFLAGS fiddling to make this difficult. LLVM proved somewhat resistant to "seeing through" and optimizing out my manual emulation of the behavior. You are right that non-const predicates execute the same code in both cases. Often the whole thing, both paths, is just one big NOP. At least for Themida versions < 3.0. And how long does the initial Pin tracing of the 10 million instructions take?
RYDB3RG Posted December 23, 2018 (Author)

I have a bunch of funcs that compute EFLAGS for me, which I compile to bitcode and link with the IR of the protected function. Example for 8-bit ADD EFLAGS: https://godbolt.org/z/j82G0u. As you can see, LLVM has no problem optimizing that and produces a very good result.

2 hours ago, deepzero said:
And how long does the initial Pin tracing of the 10 million instructions take?

The 10 million instructions run in Unicorn, not Pin, and it takes 35 seconds. Pin is only used to get an OEP dump. Is there a demo of Themida 3.0? What did they change?
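In case the godbolt link rots: here is the general shape of such a flags helper, reconstructed from the description rather than copied from the author's code. It is a plain function computing the arithmetic EFLAGS of an 8-bit ADD, meant to be compiled to bitcode with clang so the optimizer can inline and fold it (__builtin_parity assumes clang/GCC):

    #include <cstdint>

    // EFLAGS bit positions.
    enum : uint64_t { CF = 1 << 0, PF = 1 << 2, AF = 1 << 4,
                      ZF = 1 << 6, SF = 1 << 7, OF = 1 << 11 };

    // Recompute the arithmetic flags of an 8-bit ADD. Linked as bitcode
    // next to the translated trace, constant inputs let LLVM fold the
    // whole call down to a literal flags value.
    uint64_t add8_flags(uint8_t a, uint8_t b) {
        uint8_t r = a + b;
        uint64_t f = 0;
        if ((unsigned(a) + b) > 0xffu)        f |= CF; // unsigned overflow
        if (__builtin_parity(r) == 0)         f |= PF; // even parity, low byte
        if (((a ^ b ^ r) & 0x10) != 0)        f |= AF; // carry out of bit 3
        if (r == 0)                           f |= ZF;
        if (r & 0x80)                         f |= SF;
        if (((a ^ r) & (b ^ r) & 0x80) != 0)  f |= OF; // signed overflow
        return f;
    }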
Etor Madiv Posted December 24, 2018

So LLVM does these optimizations out of the box. It is really impressive. I have more questions for you, though they may be off-topic since the virtualized instructions here do not contain JCC instructions. When virtualized code contains a real JCC instruction, for instance JE or JNE, how is it virtualized? Is it virtualized the same way as the junk opaque predicates? Or can you differentiate real JCCs from junk ones by the fact that neither path is a big NOP and the two paths do not contain identical code?
VirtualPuppet Posted December 24, 2018

3 hours ago, Etor Madiv said:
When virtualized code contains a real JCC instruction, for instance JE or JNE, how is it virtualized?

In the older CISC machines, there's a specific handler that stores EFLAGS, and a handler that jumps to the specified address if the stored EFLAGS meet specific criteria. In the newer FISH machines etc., there are two distinct handler types that compare against specific presets and then jump under specific conditions: one for internal JCCs whose targets are inside the VM, and one for external JCCs whose targets are outside the VM.
deepzero Posted December 24, 2018

Quote
I have a bunch of funcs that compute EFLAGS for me, which I compile to bitcode and link with the IR of the protected function.

I was thinking more of patterns like this:

pushfd
... tons of junk ...
xor dword [esp + eflags_offset], ebx
... tons of junk and more bit fiddling ...
popfd
jcc xyz

There are a bunch of other things like this that are difficult to translate away from x86 (which I called x86 oddities above). Another one would be excessive use of SBB and similar rarely-used instructions. I tried something similar to you (as far as I can tell, and a lot less successfully, it seems), but translating this stuff to LLVM IR always threw me off. I then went for translating asm instructions to C and emulating the EFLAGS fiddling (and other difficult things) in high-level C code. Then I fed the whole thing to clang and hoped clang would be smart enough to collapse and optimize it fully. It didn't go so well. If you don't mind, I would like to know more about the translating-asm-to-LLVM-IR bit. Really interesting stuff you have going on there, and impressive results! Also, Merry Christmas.
RYDB3RG Posted December 27, 2018 (Author)

On 12/24/2018 at 8:51 PM, Etor Madiv said:
Is it virtualized the same way as the junk opaque predicates? Or can you differentiate real JCCs from junk ones by the fact that neither path is a big NOP and the two paths do not contain identical code?

I currently have no way of telling whether a non-const predicate is garbage or the real thing, but the idea is that LLVM cleans that up for me by dropping the duplicate/empty ones. I've attached an IR version that includes all non-const JCCs, but unfortunately my tool follows only the true path (that's why ir.ll is full of unreachables).

Take deepzero's pattern:

ADD EAX, EAX
...
pushfd
... tons of junk ...
xor dword [esp + eflags_offset], ebx
... tons of junk and more bit fiddling ...
popfd
jcc xyz

My tool works like this: when it sees the ADD instruction, it creates two nodes, one that represents the ADD result and one that represents the EFLAGS value the ADD produces. The first node overwrites the current EAX node, the second overwrites the current EFLAGS node. PUSHFD depends on the node that wrote EFLAGS last, which is the ADD.FLAGS node created before, and updates my 'virtual mem' with it. When the tool sees the XOR, it knows which node the memory address refers to (the ADD.FLAGS node) and simply wraps it in a XOR node. POPFD also knows which node its address refers to, so it takes the XOR(ADD.FLAGS) node and overwrites the current EFLAGS node with that. The JCC finally knows that it depends on EFLAGS, and on which bits in particular, and uses that to produce a bool value.

So my tool has two CPUs: the one inside Unicorn and one that I update myself. My CPU is not as sophisticated as Unicorn's (it's basically just general-purpose registers and EFLAGS); the difference is that my CPU holds nodes that point to other nodes instead of concrete values. In a perfect tool, I could get rid of Unicorn and do all the updating based on my nodes only. This is what I did for VMProtect vs. LLVM, but Themida's handlers aren't as clean as VMP's, so I'm not sure that's doable without basically reimplementing Unicorn.

ir_with_jcc_one_path.rar
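To make the two-CPU idea concrete, here is a minimal sketch of such a shadow state walking the PUSHFD/XOR/POPFD example above. All type and function names are hypothetical; the point is only that every register and tracked stack slot holds a graph node instead of a concrete value:

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <vector>

    // A node is an operation plus links to the nodes that produced its inputs.
    struct Node {
        const char* op;                           // "add", "add.flags", "xor", ...
        std::vector<std::shared_ptr<Node>> inputs;
    };
    using NodeRef = std::shared_ptr<Node>;

    // Shadow CPU: same shape as the real one, but every slot holds a node.
    struct ShadowCpu {
        NodeRef eax, ebx, eflags;                 // other GPRs elided
        std::map<uint64_t, NodeRef> mem;          // address -> last writer node
    };

    // add eax, eax -> two nodes: the result and the flags it produced.
    void on_add(ShadowCpu& c) {
        auto res = std::make_shared<Node>(Node{"add",       {c.eax, c.eax}});
        c.eflags = std::make_shared<Node>(Node{"add.flags", {c.eax, c.eax}});
        c.eax    = res;
    }

    // pushfd: the stack slot now refers to whatever wrote EFLAGS last.
    void on_pushfd(ShadowCpu& c, uint64_t esp) { c.mem[esp] = c.eflags; }

    // xor [esp+off], ebx: wrap the node behind that address in a XOR node.
    void on_xor_mem(ShadowCpu& c, uint64_t addr) {
        c.mem[addr] = std::make_shared<Node>(Node{"xor", {c.mem[addr], c.ebx}});
    }

    // popfd: EFLAGS now points at XOR(ADD.FLAGS); the JCC reads from here
    // and picks out the bits it depends on to produce a bool.
    void on_popfd(ShadowCpu& c, uint64_t esp) { c.eflags = c.mem[esp]; }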