A Venture and Adventure into Decompilation of Self-Modifying Code

Research Statement and Proposal
Gregory Morse
www.gmorsecode.com
[email protected]

1 Research Statement

Since the advent of modern programming language compilers, whereby a set of human-readable instructions is syntactically and semantically parsed and then translated and optimized to a binary format readable by a machine or an interpreter, there has been a need for the reversal of that process, generally known as decompilation. Yet wide gaps of knowledge remain in decompilation, even though it can be modeled as a process identical to the one performed by a compiler, except that the input and output take on a different appearance. The Von Neumann architecture, on which modern computers are still based, requires that code and data reside together in memory and that programs operate on the contents of that same memory, which yields the possibility of code modifying itself, itself merely a form of compression or obfuscation of the original code. By analyzing self-modifying code, its implications for declarative programming languages, and its temporal behavior, a model for decompilation can be described that generalizes to and completely matches the problem description, handling the most complicated and general situations possible.
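To make the notion concrete, the following is a minimal sketch (an illustration assumed for this statement, with hypothetical instruction names, not tied to any real machine) of a toy virtual machine whose program is ordinary mutable data, so that one instruction can rewrite another before it executes:

```python
# A toy virtual machine whose program is ordinary mutable data: because code and
# data share one structure, an instruction can rewrite a later instruction.
def run(program):
    acc, pc = 0, 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "add":
            acc += arg
        elif op == "store_op":              # self-modification: patch instruction at index arg
            program[arg] = ("add", acc)     # the "code" at that index is rewritten at run time
        elif op == "halt":
            break
        pc += 1
    return acc

# The instruction at index 2 does not exist in its final form until index 1 has
# executed, which is exactly what defeats a purely static reading of the program.
print(run([("add", 5), ("store_op", 2), ("halt", 0)]))   # prints 10: halt was replaced by add 5
```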

2 Past and Future Research

While at the Queensland University of Technology, Cristina Cifuentes described in great detail various processes for decompilation, including the structure of the graphs involved and definitions of the various structures and elements required during the process. Definitions such as "basic blocks" and algorithms to produce the various C-language control-flow structures from a general graph are foundational elements that can be built upon (a minimal sketch of basic-block partitioning appears at the end of this section).

A means of restructuring self-modifying code has been attempted by Bertrand Anckaert, Matias Madou, and Koen De Bosschere in the paper A Model for Self-Modifying Code, yet the ideas there try to separate out the areas that are self-modifying, and specific kinds of code can break those assumptions to the point that a high-level translation can only be rendered by emitting a mathematical description of the entire instruction set alongside the actual data being executed. At certain times no assumptions can be made at all, which is problematic whenever there is potential for code modification: for example, if an external and unavailable library provides input to a routine, even the most advanced mathematical analysis may not be able to simplify certain self-modifying code any further than such an instruction-set description in code. Constraints would need to be provided, whether supplied by hand or derived from detailed analysis of external components. Constraints can mathematically reduce the problem or yield complete code-restructuring possibilities, and they are a crucial subject in generalizing decompilation.

Other research efforts and papers in the field of decompilation address incremental and fully dynamic algorithms for properties of directed graphs, including loop nesting forests, dominator trees, and topological ordering, which remains a topic open for research.

The topic will come up time and again, as it has practical applications as simple as source code recovery or as obscure as validation of code through self-checksums. It can be used as an optimization tool or as a means of obfuscation, sometimes by those protecting their software and at other times by malicious software writers seeking to avoid detection.
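As a concrete illustration of the basic-block definitions referenced above, the following is a minimal sketch assuming a toy instruction list with explicit branch-target indices, not Cifuentes' own notation:

```python
# Splitting a linear instruction list into basic blocks: leaders are the entry
# point, every branch target, and every instruction following a branch; each
# block runs from one leader up to the next.
def basic_blocks(instrs):
    """instrs: list of (mnemonic, branch_target_index_or_None)."""
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ("jmp", "jcc", "call"):
            if target is not None:
                leaders.add(target)       # branch target starts a new block
            if i + 1 < len(instrs):
                leaders.add(i + 1)        # fall-through instruction starts a new block
    starts = sorted(leaders)
    return [instrs[s:e] for s, e in zip(starts, starts[1:] + [len(instrs)])]

# A conditional branch at index 1 targeting index 3 yields three blocks.
print(len(basic_blocks([("mov", None), ("jcc", 3), ("add", None), ("ret", None)])))   # 3
```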

Dynamic Decompilation The idea behind this proposal is to create a decompilation algorithm which is generalized enough that every other algorithm to date is merely a simplified subset of it. Incremental and fully dynamic algorithms, although not strictly required, must be highlighted for the efficiency gained by eliminating static-pass analysis in decompilation and moving towards a one-pass, no-assumption algorithm. Self-modifying code will be handled even in the absolute worst-case scenarios where no determinations or optimizations can be made, and in cases where any significant optimization is possible, a temporal analysis algorithm would be applied to achieve optimal code structuring and data-flow optimization that can be expressed in a high-level language. In the worst case, a mathematical description of the processor instruction set, or a partial description if any simplification is possible, would appear in the output.

Complexity analysis for self-modifying code By temporally analyzing self-modifying code fragments, or their interactions with each other, a complexity can be determined which can serve as a useful indicator for automated scanning or as a theoretical research topic in itself. Where no constraints are present, unbounded complexity up to the order of the complexity of the processor instruction set itself must be taken into consideration. Given the extraordinary facilities on board a modern processor chip, with multiple stages, multiple cores, pipelines, caching, predictive branching, non-uniform numbers of clock cycles, and other considerations, determining the complexity of a modern processor is a research field in its own right, as simplification generally requires context. Furthermore, parallelism is important to this topic, as execution on multiple cores or threads versus a single atomic pathway of execution changes the implications of self-modifying code, which could in certain cases yield strange race conditions from which very complex behavior would result.

3 Research Highlights

Mathematical descriptions of processor instruction sets The utilization of a pseudo-code, high-level description of the entire processor instruction set would allow, in the most naïve sense, for generated code which simply defines the code being decompiled as a data set input to this processor emulator loop. Equivalence and compilability are maintained, yet the efficiency would be called into question. Given that high-level languages often have no way to express self-modifying code, a special compiler would be needed to translate such code back to its original binary form for the sake of optimization.
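A minimal sketch of that naïve fallback might look as follows, assuming a made-up two-byte toy instruction set purely for illustration: the original binary is carried verbatim as data, and the only generated logic is the emulator loop.

```python
# The binary being decompiled is emitted verbatim as data, and the generated
# program merely interprets it with an emulator loop for the instruction set.
CODE = bytes([0x01, 0x05,    # LOAD  5
              0x02, 0x03,    # ADD   3
              0xFF, 0x00])   # HALT

def emulate(code):
    acc, ip, mem = 0, 0, bytearray(code)      # code and data share one memory image
    while True:
        op, arg = mem[ip], mem[ip + 1]
        if op == 0x01:   acc = arg                    # LOAD imm
        elif op == 0x02: acc = (acc + arg) & 0xFF     # ADD imm
        elif op == 0x03: mem[arg] = acc               # STORE: may overwrite the code itself
        elif op == 0xFF: return acc                   # HALT
        ip += 2

print(emulate(CODE))   # 8: behaviourally equivalent to the original binary, but no simpler
```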

Temporal analysis of code which is stored as data A novel algorithm which tracks self-modifying code by treating it in a way similar to loop cycles: it is modeled parametrically as a temporal function, such that simplification or transformation can be done through a system of parametric equations, making use of partial derivatives with respect to the various time parameters, given that there could be any number of independent time variables depending on the complexity of the algorithm utilizing the self-modifying code.
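One way such a model might be written down, assuming for illustration that each modification is linear in its loop counter, is to treat the memory image as a function of the address and of the independent loop counters acting as time parameters:

\[
M(a;\, t_1,\dots,t_k) \;=\; M_0(a) \;+\; \sum_{i=1}^{k} \Delta_i(a)\, t_i,
\qquad
I(t_1,\dots,t_k) \;=\; \operatorname{decode}\!\bigl(M(p(t_1,\dots,t_k);\, t_1,\dots,t_k)\bigr),
\]

where \(M_0\) is the initial image, \(\Delta_i\) the per-iteration modification, \(p\) the program counter, and the partial derivatives \(\partial M/\partial t_i = \Delta_i(a)\) expose which time parameters a given instruction actually depends on, so that the system of parametric equations can either be simplified or shown to be irreducible.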

Uniformity of compilation and decompilation by merging generality Owing to the difficulty of decompilation and the difference in expressivity between machine code and high-level source code, there has been little attempt at combining the two processes into a single procedure which goes both ways. Yet the principles of compilation are fundamentally tied to those of decompilation, given that each is merely an optional verification followed by a translation and optimization process, in one direction or the other. This could yield better compilers that have more generalized structuring and optimization algorithms, as well as better test coverage for the tool produced.

The necessary reduction of overhead through incremental or full dynamic graph algorithms Decompilation cannot rely on syntax to accurately divide up the work into multiple stages or "passes" the way compilation can, owing to the stringent rules of high-level languages. Instead, the entire decompiled graph, ready to be translated to any other form, should be maintained incrementally as the code is analyzed, such that no part of the code is ever analyzed more than once and no assumptions are ever made. Static code-flow analysis makes a great number of assumptions, not only about the absence of self-modifying code but also about the reachability of code which may not be reachable, logically speaking. Incremental analysis should be coupled with incremental or even fully dynamic algorithms which handle the deletion of edges from a graph where appropriate, so that topological orders, dominator trees, and other important connected structures can be maintained efficiently as the code and data flow graphs grow and divide, while the many different properties needed for structuring, simplification, and analysis are preserved, allowing the decompilation process to proceed with certainty (a minimal sketch of one such incremental maintenance scheme follows).
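By way of example, one published approach in this family is the incremental topological ordering of Pearce and Kelly; the following minimal sketch, an assumed illustration of how such maintenance could look rather than an algorithm prescribed by this proposal, reorders only the affected region when a newly discovered edge violates the current order:

```python
# Maintain a topological order under edge insertions without re-sorting the graph.
class IncrementalTopo:
    def __init__(self, n):
        self.adj = [[] for _ in range(n)]
        self.radj = [[] for _ in range(n)]
        self.ord = list(range(n))             # node -> position; start with identity order

    def _dfs(self, start, edges, lo, hi):
        seen, stack = set(), [start]
        while stack:
            w = stack.pop()
            if w in seen or not (lo <= self.ord[w] <= hi):
                continue                      # prune nodes outside the affected region
            seen.add(w)
            stack.extend(edges[w])
        return seen

    def add_edge(self, u, v):
        self.adj[u].append(v)
        self.radj[v].append(u)
        if self.ord[u] < self.ord[v]:
            return                            # order already consistent with the new edge
        lo, hi = self.ord[v], self.ord[u]
        fwd = self._dfs(v, self.adj, lo, hi)  # affected nodes reachable from v
        if u in fwd:
            raise ValueError("edge would create a cycle")
        back = self._dfs(u, self.radj, lo, hi)  # affected nodes that reach u
        affected = (sorted(back, key=self.ord.__getitem__)
                    + sorted(fwd, key=self.ord.__getitem__))
        slots = sorted(self.ord[w] for w in affected)
        for w, s in zip(affected, slots):     # reuse the same positions: back before fwd
            self.ord[w] = s

t = IncrementalTopo(4)
for u, v in [(2, 1), (1, 0), (3, 2)]:         # discover edges 2->1, 1->0, 3->2 one at a time
    t.add_edge(u, v)
print(t.ord)   # a valid order: ord[3] < ord[2] < ord[1] < ord[0]
```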

Heuristical approaches to structuring code functions The idea of functions, or reusable units of code, is one that must be defined by heuristics, as it is an arbitrary distinction often based on the stack; given the popular optimization of inlining functions, it is one that requires further heuristic analysis to handle properly and efficiently. It is of course an absolute requirement of a decompiler, given that recursion would otherwise yield an infinitely large source code output, yet one that, if done too aggressively, might make the output more confusing and less readable. What heuristic tools can be used to allow for various user-defined levels of source code optimization is worth analyzing, as function definition could be seen as likely the most arbitrary distinction in the entire process.
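The following minimal sketch shows one such heuristic under simplifying assumptions (explicit call instructions with known targets, no tail calls or already-inlined callees in the input): call targets become candidate function entries, and targets with a single caller become candidates for folding back into their caller.

```python
# Recover candidate function boundaries from explicit call sites.
from collections import Counter

def candidate_functions(instrs, inline_threshold=1):
    """instrs: list of (mnemonic, optional_target_address)."""
    call_counts = Counter(t for op, t in instrs if op == "call" and t is not None)
    entries = set(call_counts)                                   # emit these as functions
    inline = {t for t, n in call_counts.items() if n <= inline_threshold}
    return entries, inline                                       # inline: fold into the caller

entries, inline = candidate_functions(
    [("call", 100), ("call", 200), ("call", 100), ("ret", None)])
print(sorted(entries), sorted(inline))   # [100, 200] [200]: address 200 has a single caller
```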

4 Motivations for Future Research

Until readily available decompilers which can produce compilable and accurate code exist, this area will remain an active research topic. Theoretical assessments of the problem must be well understood at a practically implementable level before the development of decompilers becomes abundant on the market. The prevalence and rise in use of interpreted languages, which allow certain important reductions through various assumptions, has caused interest in the more general Von Neumann problem to wane. Yet the problem remains a valid one, given that self-modifying code has implications in source code recovery, security, malicious software, compression, obfuscation, and other areas which software engineers will continue to regard as critical to their profession. The topic remains of interest in ACM Transactions on Programming Languages and Systems (TOPLAS), IEEE Transactions on Computers, and various conferences and journals on computing theory. Some future applications are:

Design of high-level languages which make productive use of self-modifying code No programming languages are designed around making use of self-modifying code for security, integrity, compression, and the other unique features it could offer. This is in part because such use depends on the instruction set, and high-level languages are by their very definition processor-independent. Yet optimization is a feature which is highly processor-dependent, and self-modifying code could be used to categorize aspects of a processor that are not normally thought of.

Translation between high-level languages Given the abundance of high-level languages on any given platform nowadays, there is constant interest in supporting more languages or translating between them with relative ease and simplicity, as well as in tasks like changing the bit size, whereby the code is equivalent yet the processor uses a different data and/or address bus width.

Translation between machine languages Oftentimes there are situations, especially with legacy products, where code developed for one processor must be run in another environment. If there is no source code, strictly performing binary translation becomes an option and is more efficient than the overhead of using an interpreter, given that a single translation would be enough to produce an equivalent set of binary instructions. Going back to source code is not necessary, but the challenges that

Finding new uses of self-modifying code If self-modifying code were more maintainable, well understood, and practical, then much new interest in development in that area could resume, potentially unlocking more efficient and clever methods of programming. Processor manufacturers could also see new ways of architecting their instruction sets and chips to take advantage of self-modifying code programming patterns, potentially reducing clock times, allowing for different parallel programming patterns, and increasing the efficiency of caching and predictive pathways. Processor manufacturers are typically facing "Moore's law" in terms of increasing the clock speed of chips through the reduction of transistor size, yet processors designed around self-modifying code could allow for groundbreaking reductions in the lengths of pathways for various operations. The instruction set itself could become self-modifying in the same spirit if more were understood in this area, which could potentially create a very secure and protected environment for computing or allow for a very significant content management control system, as an example.