
Research Statement and Proposal

Gregory Morse
www.gmorsecode.com
gregory.morse@live.com
1 Research Statement
Since the advent of modern programming language compilers, whereby a set of human-readable
instructions is syntactically and semantically parsed and then translated and optimized into a binary
format readable by a machine or an interpreter, there has been a need to reverse the process, a task
generally known as decompilation. Yet wide gaps of knowledge remain in decompilation, even though
it can be modeled as essentially the same process a compiler performs, with the input and output
merely exchanging appearance. The von Neumann architecture, on which modern computers are still
based, requires that code and data reside in the same memory and operate on the contents of that
memory; this allows code to modify itself, which is in effect a form of compression or obfuscation of
the original code. By analyzing self-modifying code, its implications for declarative programming
languages, and its temporal behavior, a model for decompilation can be described that generalizes and
completely matches the problem description, handling the most complicated and general situations
possible.
2 Past and Future Research
While at Queensland University of Technology, Cristina Cifuentes described in great detail various
processes for decompilation, including the structure of the relevant graphs and definitions of the
structures and elements required during the process. Definitions such as "basic blocks" and algorithms
that recover C-language control-flow structure from a general graph are foundational elements which
can be built upon.
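As a concrete illustration of these foundations, the following minimal Python sketch partitions a
decoded instruction listing into basic blocks using the classical "leaders" rule; the (address, mnemonic,
target) tuple representation and the branch mnemonics are simplifying assumptions for illustration, not
Cifuentes' own data structures.

# Minimal sketch: partition a decoded instruction listing into basic blocks.
# An instruction is (address, mnemonic, branch_target_or_None); this toy
# representation is an assumption for illustration only.

def basic_blocks(instructions):
    addrs = [addr for addr, _, _ in instructions]
    leaders = {addrs[0]}                      # the first instruction starts a block
    for i, (addr, mnem, target) in enumerate(instructions):
        if mnem in ("jmp", "jcc", "call", "ret"):
            if target is not None:
                leaders.add(target)           # a branch target starts a block
            if i + 1 < len(instructions):
                leaders.add(addrs[i + 1])     # the fall-through after a branch starts a block
    blocks, current = [], []
    for addr, mnem, target in instructions:
        if addr in leaders and current:
            blocks.append(current)
            current = []
        current.append((addr, mnem, target))
    if current:
        blocks.append(current)
    return blocks

if __name__ == "__main__":
    listing = [(0, "mov", None), (1, "jcc", 3), (2, "mov", None), (3, "ret", None)]
    for block in basic_blocks(listing):
        print(block)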
A means of restructuring self-modifying code has been attempted by Bertrand Anckaert, Matias
Madou, and Koen De Bosschere in "A Model for Self-Modifying Code", yet that approach tries to
separate out the regions which are self-modifying, and specific kinds of code can break its assumptions
to the point that a high-level translation can only be rendered by placing a mathematical description of
the entire instruction set alongside the actual data being executed. At times no assumptions can be
made at all; when there is potential for code modification, for example if an external and unavailable
library provides input to a routine, even the most advanced mathematical analysis may be unable to
simplify certain self-modifying code beyond such an instruction-set description in code. Constraints
would then need to be supplied, whether by hand or through detailed analysis of external components.
Constraints can mathematically reduce the problem or open up complete code-restructuring
possibilities, and they are a crucial subject in generalizing decompilation.
Other research efforts and papers in the field concern incremental and fully dynamic algorithms for
properties of directed graphs, including loop nesting forests, dominator trees, and topological ordering,
all of which remain open topics for research.
The topic will come up time and again, as it has practical applications as simple as source-code
recovery or as obscure as validation of code through self-checksums. It can be used as an optimization
tool or as a means of obfuscation, sometimes by those protecting their software and at other times by
malicious software writers seeking to avoid detection.
Dynamic Decompilation The idea behind this proposal is to create a decompilation algorithm
generalized enough that every other algorithm to date is merely a simplified subset of it. Incremental
and fully dynamic graph algorithms, although not strictly required, must be highlighted for their
efficiency in eliminating static-pass analysis and moving towards a one-pass, no-assumption algorithm.
Self-modifying code will be handled even in the absolute worst-case scenarios where no determinations
or optimizations can be made; wherever any significant optimization is possible, a temporal-analysis
algorithm would be applied to achieve optimal code structuring and data-flow optimization expressible
in a high-level language. In the worst case, the output would contain a mathematical description of the
processor instruction set, or a partial description if some simplification is possible.
Complexity analysis for self-modifying code By temporally analyzing self-modifying code
fragments and their interactions with each other, a complexity measure can be determined, which can
serve as a useful indicator for automated scanning or as a theoretical research topic in itself. Where no
constraints are present, an unbounded complexity on the order of the complexity of the processor
instruction set itself must be taken into consideration. Given the extraordinary facilities on board a
modern processor chip, with multiple pipeline stages, multiple cores, caching, branch prediction,
non-uniform clock-cycle counts and other considerations, determining the complexity of a modern
processor is a research field in its own right, since simplification generally requires context.
Furthermore, parallelism matters here: whether code runs on multiple cores or threads or along a single
atomic path of execution changes the implications of self-modifying code, which could in certain cases
yield race conditions with very complex resulting behavior.
Research Highlights
Mathematical descriptions of processor instruction sets A high-level, pseudo-code description of
the entire processor instruction set would allow, in the most naïve sense, generated code which simply
supplies the code being decompiled as a data-set input to a processor emulator loop. Equivalence and
compilability are maintained, yet the efficiency would be called into question. Given that high-level
languages usually have no way to express self-modifying code, a special compiler would be needed to
translate such code back to its original binary form for the sake of optimization.
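A minimal sketch, for a hypothetical toy instruction set invented here purely for illustration, of what
such naive fallback output might look like: the decompiled program is emitted only as a data array fed
into an emulator loop, and because code and data share one memory array, self-modification is
expressed directly.

# Minimal sketch of the naive fallback: the decompiled program is emitted as a
# data array fed into an emulator loop for a hypothetical toy instruction set.
# Opcodes: 0 = halt, 1 = load immediate, 2 = add registers, 3 = store to memory.

def run(memory):
    regs = [0] * 4
    pc = 0
    while True:
        op = memory[pc]
        if op == 0:                                   # halt
            return regs
        elif op == 1:                                 # load immediate: r[a] = imm
            regs[memory[pc + 1]] = memory[pc + 2]
            pc += 3
        elif op == 2:                                 # add: r[a] += r[b]
            regs[memory[pc + 1]] += regs[memory[pc + 2]]
            pc += 3
        elif op == 3:                                 # store: mem[addr] = r[a]
            memory[memory[pc + 2]] = regs[memory[pc + 1]]
            pc += 3
        else:
            raise ValueError("unknown opcode")

# The "decompiled" program is just this data; instruction 3 overwrites the
# immediate of a later load, i.e. the code modifies itself at runtime.
program = [1, 0, 5,      # r0 = 5
           3, 0, 8,      # mem[8] = r0  (patches the immediate of the next load)
           1, 1, 0,      # r1 = <patched to 5 at runtime>
           2, 0, 1,      # r0 += r1
           0]            # halt
print(run(program))       # r0 ends up as 10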
"emporal analysis of code #hich is stored as data + novel algorithm which tracks self#
modifying code by treating it in a similar way to loop cycles where it is modeled parametrically as a
temporal function such that simplification or transformation can be done through a system of
parametric e!uations and the ability to make use of partial derivatives with respect to various time
parameters given that there could be any number of independent time variables depending on the
complixity of the algorithm utilizing the self#modifying code.
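One possible formalization, using notation introduced here purely as an illustrative sketch rather than
taken from any published model: let m(a, t_1, ..., t_k) denote the byte at address a after t_1, ..., t_k
iterations of k independent modification loops. If each loop i applies a fixed increment delta_i(a) per
iteration, then

m(a, t_1, \dots, t_k) = m(a, 0, \dots, 0) + \sum_{i=1}^{k} \delta_i(a)\, t_i ,
\qquad
\frac{\partial m}{\partial t_i} = \delta_i(a) ,

so a vanishing partial derivative with respect to t_i certifies that loop i never modifies address a, and
the corresponding fragment can be decompiled as ordinary static code.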
Uniformity of compilation and decompilation by merging generality Owing to the difficulty of
decompilation and the difference in expressivity between machine code and high-level source code,
there has been little attempt to combine the two processes into a single procedure which works in both
directions. Yet the principles of compilation are fundamentally tied to those of decompilation, since
each is merely an optional verification followed by a translation and optimization process in one
direction or the other. A merged view could yield better compilers with more generalized structuring
and optimization algorithms, as well as better test coverage for the resulting tool.
"he necessary reduction of o&erhead through incremental or full dynamic graph algorithms
(ecompilation cannot rely on syntax to accurately use multiple stages or )passes* to divide up
the work like can be done due to the stringent rules of high#level languages. 0nstead the entire
decompiled graph ready to be translated to any other form should be maintained incrementally as the
code is analyzed such that no part of the code is ever analyzed more than once, and no assumptions are
ever made. Static code#flow analysis makes a great number of assumptions even beyond merely self#
modifying code but also that of reach#ability of code which may not be reachable logically speaking.
0ncremental analysis should be coupled with incremental or even full dynamic algorithms which handle
the deletion of edges to a graph where appropriate so that topological orders, dominator trees and other
important connected structures can be maintained efficiently given that the code and data flow graphs
will grow and divide appropriately while many different properties must be maintained to allow for the
structuring and simplifications or analysis which must take place to proceed with certainty in the
decompilation process.
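As an illustration of the kind of structure involved, the following sketch maintains a topological order
of control-flow nodes while edges are discovered incrementally, in the spirit of the Pearce-Kelly
algorithm; the class layout and naming are a simplification under assumed inputs, not a reference
implementation, and edge deletion and dominator maintenance are omitted.

# Sketch of incrementally maintaining a topological order while control-flow
# edges are discovered, loosely following the Pearce-Kelly approach.

class IncrementalTopo:
    def __init__(self, nodes):
        self.succ = {n: set() for n in nodes}
        self.pred = {n: set() for n in nodes}
        self.ord = {n: i for i, n in enumerate(nodes)}   # current topological index

    def add_edge(self, u, v):
        if u == v or v in self.succ[u]:
            return
        self.succ[u].add(v)
        self.pred[v].add(u)
        lb, ub = self.ord[v], self.ord[u]
        if lb >= ub:
            return                     # order is already consistent with the new edge
        fwd = self._reach(v, self.succ, lambda n: self.ord[n] <= ub)
        if u in fwd:
            raise ValueError("edge would create a cycle")
        bwd = self._reach(u, self.pred, lambda n: self.ord[n] >= lb)
        affected = sorted(bwd, key=self.ord.get) + sorted(fwd, key=self.ord.get)
        slots = sorted(self.ord[n] for n in affected)
        for n, pos in zip(affected, slots):
            self.ord[n] = pos          # ancestors of u now precede descendants of v

    def _reach(self, start, adj, in_region):
        seen, stack = {start}, [start]
        while stack:
            n = stack.pop()
            for m in adj[n]:
                if m not in seen and in_region(m):
                    seen.add(m)
                    stack.append(m)
        return seen

topo = IncrementalTopo(["entry", "a", "b", "exit"])
topo.add_edge("b", "a")                # an edge against the initial order triggers a local reorder
print(sorted(topo.ord, key=topo.ord.get))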
Heuristical approaches to structuring code functions The idea of functions, or reusable units of
code, must be defined by heuristics: it is an arbitrary distinction, often based on the stack, and given
the popular optimization of function inlining it requires further heuristic analysis to be carried out
properly and efficiently. Recovering functions is of course an absolute requirement of a decompiler,
since recursion would otherwise yield infinitely large source-code output, yet if done too aggressively
it can make the output more confusing and less readable. Which heuristic tools can support various
user-defined levels of source-code optimization is worth analyzing, since function definition may be
the most arbitrary distinction in the entire process.
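A minimal sketch of one such heuristic, reusing the toy (address, mnemonic, target) representation
assumed earlier: direct call targets become candidate function entries, and a user-tunable call-site
threshold decides whether a rarely referenced target is kept as a function or treated as inlineable code.

# Heuristic sketch for function-entry recovery: direct call targets become
# candidate entries; a threshold on the number of distinct call sites decides
# whether a target is kept as a function or folded back inline.

from collections import Counter

def candidate_functions(instructions, min_call_sites=1):
    calls = Counter(target for _, mnem, target in instructions
                    if mnem == "call" and target is not None)
    entries = {addr for addr, count in calls.items() if count >= min_call_sites}
    entries.add(instructions[0][0])        # the program entry point is always a function
    return sorted(entries), calls

listing = [(0, "call", 10), (3, "call", 10), (6, "call", 20), (9, "ret", None),
           (10, "mov", None), (13, "ret", None),
           (20, "mov", None), (23, "ret", None)]
entries, calls = candidate_functions(listing, min_call_sites=2)
print(entries)   # with the threshold at 2, address 20 is treated as inlineable
print(calls)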
3 Motivations for Future Research
Until decompilers that produce compilable and accurate code are readily available, this area will
remain an active research topic. Theoretical assessments of the problem must be understood at a
practically implementable level before decompilers become abundant on the market. The prevalence
and rising use of interpreted languages, which allow certain important reductions through various
assumptions, has reduced interest in the more general von Neumann problem. Yet the problem remains
a valid one, since self-modifying code has implications for source-code recovery, security, malicious
software, compression, obfuscation and other areas that software engineers will continue to regard as
critical to their profession. The topic remains of interest to ACM Transactions on Programming
Languages and Systems (TOPLAS), IEEE Transactions on Computers and various conferences and
journals on computing theory. Some future applications are:
Design of high-level languages which make productive use of self-modifying code No
programming languages are designed around self-modifying code for the security, integrity,
compression and other unique features it could offer. This is in part because self-modification depends
on the instruction set, while high-level languages are by definition processor-independent. Yet
optimization is highly processor-dependent, and self-modifying code could be used to characterize
aspects of a processor that are not normally considered.
"ranslation %et#een high-le&el languages 1iven the abundance of high#level languages on
any given platform nowadays, there is constant interest in supporting more languages or going between
them with relative ease and simplicity as well as tasks like changing the bit size whereby the code is
e!uivalent yet the processor uses a different size data and:or address bus.
"ranslation %et#een machine languages /ften times, there are situations especially with legacy
products where code developed for one processor must be run on another environment. 0f there is no
source code, strictly performing binary translation becomes an option and is more efficient than the
overhead of using an interpreter given that one interpretation would be enough to produce an
e!uivalent set of binary instructions. 1oing back to a source code is not necessary but the challenges
that
Finding new uses of self-modifying code If self-modifying code were more maintainable, better
understood and practical, interest in developing with it could resume, potentially unlocking more
efficient and clever methods of programming. Processor manufacturers could also find new ways of
architecting their instruction sets and chips to take advantage of self-modifying programming patterns,
potentially reducing clock times, enabling different parallel-programming patterns and increasing the
efficiency of caching and predictive pathways. Processor manufacturers are typically up against
"Moore's law" in increasing the clock speed of chips through reduction of transistor size, yet
processors designed around self-modifying code could allow groundbreaking reductions in the lengths
of pathways for various operations. The instruction set itself could become self-modifying in the same
spirit if more were understood in this area, which could potentially create a very secure and protected
environment for computing, or enable a significant content-management control system, as an
example.
