Software Analytics and Evolution

Team Report 2016


Gerald Czech, Bernhard Dorninger,
Michael Pfeiffer, Michael Moser and Josef Pichler
January 13, 2017

1 Introduction and Overview


The SCCH research focus Software Analytics and Evolution (SAE) employs
modern analytical and design approaches in Software Engineering with the goal
of facilitating the development, maintenance, and evolution of complex technical
software systems while ensuring the highest quality.
The analytical part involves code analysis, architectural analysis, reverse
engineering, and software testing; the design part includes software architecture,
domain-specific languages, and code generation. This report introduces
activities, projects, software, and research topics related to software analysis.
Software analysis at SCCH is applied for Static Code Analysis,
Architecture Conformance Checking, Documentation Generation, and Program
Comprehension.

Static Code Analysis. Static code analysis is a widely used quality as-
surance measure for detecting a variety of defects, for example, violations
of coding conventions and programming rules, array boundary overruns,
division by zero, infinite loops, data races, and dead code.

Architecture Conformance Checking. Deviations of the actual
implementation from the initially defined architecture throughout a software
system's development and evolution lead to architectural erosion. Continuous
automated conformance checking can be an effective countermeasure.
Thereby, static code analysis is used to extract the actual architecture
and evaluate it against the reference architecture.

Documentation Generation. Accurate program documentation is
indispensable for software maintenance and evolution. However, manually
written documentation often becomes outdated when the software
is changed. This challenge can be tackled by generating program documentation
from source code, following the idea of literate programming.
For instance, the Rulebook Generator generates technical documentation
from source code using a minimal set of additional annotations to control
the generation process.

Program Comprehension. Static code analysis can be used to facilitate
program comprehension in several ways. For instance, we have developed
a toolkit to propagate domain concepts as well as physical dimensions
to cryptic local and global program identifiers along the data flow of a
program. This approach facilitates comprehension of legacy systems, for
which usually neither documentation nor the original author is available
anymore.
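To illustrate the idea, the following minimal Java sketch (all names and the
unit representation are our own assumptions, not the actual toolkit) propagates
seed unit annotations along extracted assignments until a fixed point is reached:

```java
import java.util.*;

// Hypothetical sketch: propagating physical units along assignments
// (a stand-in for the data flow of a program) until a fixed point.
public class UnitPropagation {
    // An assignment "target = f(sources)" as extracted by static analysis.
    record Assignment(String target, List<String> sources) {}

    public static void main(String[] args) {
        // Seed annotations, e.g. provided by a domain expert.
        Map<String, String> units = new HashMap<>();
        units.put("inletTemp", "degC");
        units.put("massFlow", "kg/s");

        List<Assignment> flow = List.of(
                new Assignment("t1", List.of("inletTemp")),  // t1 = inletTemp
                new Assignment("x",  List.of("t1")),         // x  = t1 * 0.5
                new Assignment("m2", List.of("massFlow")));  // m2 = massFlow

        // Propagate units along the data flow until nothing changes.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Assignment a : flow) {
                String u = units.get(a.sources().get(0));
                if (u != null && !u.equals(units.get(a.target()))) {
                    units.put(a.target(), u);
                    changed = true;
                }
            }
        }
        units.forEach((var, unit) -> System.out.println(var + " : " + unit));
    }
}
```

A real implementation additionally has to combine units across arithmetic
operators; this sketch only shows the fixed-point propagation itself.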

In the next section, we briefly describe ongoing research projects with respect
to software analysis. In these projects we usually build and enhance software
tools facilitating certain Software Engineering activities. We give an overview
of our tool landscape in Section 3. In Section 4, we highlight emerging research
topics that may eventually become part of SCCH's research agenda in the future.

2 Projects and Research Tracks


The focus of SAE is on cooperative research together with industrial partners
as well as with academic partners, in particular with research institutes at
Johannes Kepler University Linz. Application-oriented research within the frame
of COMET is organized in multi-firm research projects with several industrial
partners. The majority of our team's research is conducted in the multi-firm
COMET project NEXT (Domain Knowledge Extraction and Generation of Software),
but SAE competences are also used in other projects.
The aim of the project NEXT is to bridge the gap between existing (legacy)
software predominant in industry and the model-based construction of software
systems. In this project we develop and apply an industrial-strength DSL-based
approach to recover specifications from (legacy) systems and to generate
software systems based on extracted or manually created specifications.

2.1 A DSL for Rapid Development and Adaptation of Software Components
The COMET partner Siemens AG Austria - Transformers Linz is a world-leading
producer of power transformers. To define the electrical, thermal, and geometric
properties of a power transformer, Siemens AG Austria develops and maintains
custom design software. Due to advances in the domain of power transformers
and arising customer requirements, these software systems are subject to
constant change and evolution. To rapidly adapt the design software to new
requirements, we have successfully used generative and transformational
approaches, e.g. XSD-to-code transformation, which directly integrate concepts
of the underlying domain.
During the last year we made major progress towards a domain-specific
language (DSL) [1] for the design and definition of power transformers. The DSL is
developed in close cooperation with our industrial partner and enables domain
experts to hierarchically structure the components of a power transformer, define
properties of components, and define data bindings for these properties. Models,
i.e. programs written in our DSL, are consumed by a domain-specific generator
framework that outputs the various software artifacts needed to fully automatically
integrate definitions of new power transformers, or components thereof, with
the preexisting design software. Using the created DSL, domain experts can
describe new types of power transformers in a very compact and efficient way.
This not only facilitates the rapid development and adaptation of software
components but also improves the understandability and, eventually, the
maintainability of the design software for power transformers.
The achievements of the current year stimulate further insights into an iterative,
user-centered development process for domain-specific languages. Furthermore,
we plan to investigate the role of domain-specific modeling, and modeling in
general, in the maintenance and evolution of software systems.

2.2 Documentation Generation from Annotated Source Code
The COMET partner Siemens AG Austria - Transformers Weiz develops and
maintains a large corpus of software applied in the domain of electrical
engineering. Due to our COMET partner's high requirements concerning precise
technical documentation, the effort to create documentation and to keep existing
documentation in sync with code is substantial. In response to this,
Siemens AG Austria - Transformers Weiz launched a research project together
with SCCH aiming to reduce this effort.
A major result of this project is the Rulebook Generator (RbG) [2]. RbG was
originally designed and implemented to automatically extract technical
documentation from Fortran 90 programs in the electrical engineering domain. In
this scenario, a Fortran programmer annotates program code with special tags
and uses RbG to generate the corresponding documentation from the annotated
source code.
During the last year we made major progress concerning RbG's out-of-the-box
experience and general usability. The out-of-the-box experience was improved
by introducing a tolerant parsing technique, which we refer to as
try-and-ignore parsing (TAI) [3]. TAI gracefully deals with incompatibilities
between the Open Fortran Parser, reused by RbG, and the parsing technology used
by our COMET partner. We improved the usability of RbG by unifying the different
approaches to configure documentation generation with RbG. Moreover, we
enhanced the integration of RbG with the development environments used at our
company partner.
To further broaden RbG's applicability, we started to create parsing
technology targeting Common Intermediate Language (CIL) code of the .NET
platform. We hope to integrate this promising approach in the upcoming year.
In addition to work packages focusing on RbG and documentation generation,
we partly shifted our focus to analyzing the program execution of technical
software. The analysis iteratively triggers program executions with predefined
input values; collects the program outputs created thereby; aggregates,
analyzes, and compares these output values; and provides basic functionality
to define the presentation of analysis results. Current efforts resulted in a
first tool for program run analysis and delta optimization, called PRADO,
which will see major advances in the upcoming year.
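The following Java sketch illustrates the underlying idea of such program run
analysis; the two program versions, the input set, and the tolerance are purely
illustrative assumptions, not PRADO itself:

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch of program run analysis: trigger runs with
// predefined inputs, collect the outputs, and compare two program
// versions by their output deltas.
public class RunDeltaAnalysis {
    public static void main(String[] args) {
        // Stand-ins for two versions of a technical computation.
        Function<Double, Double> v1 = x -> 2.0 * x + 1.0;
        Function<Double, Double> v2 = x -> 2.0 * x + 1.001;

        List<Double> inputs = List.of(0.0, 1.0, 10.0, 100.0);
        double tolerance = 1e-2; // assumed acceptance threshold

        for (double in : inputs) {
            double out1 = v1.apply(in), out2 = v2.apply(in);
            double delta = Math.abs(out1 - out2);
            System.out.printf("input=%8.2f  v1=%10.3f  v2=%10.3f  delta=%s%n",
                    in, out1, out2, delta > tolerance ? delta + " (!)" : "ok");
        }
    }
}
```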

2.3 Program Analysis for Understanding Legacy Software
The engineering and production of steel at our COMET partner voestalpine
Stahl is typically controlled by process control software. Process control
software embeds so-called process models, which are continuously adapted,
improved, and enhanced. Process models are implemented in Fortran, C, and C++,
have been maintained over decades since the seventies and eighties of the last
century, and are typically developed by a single engineer only. If this person
retires, we literally encounter a legacy code problem, where a different
engineer has to take over a model implementation he/she has (in the worst case)
never worked with before. This is what happened to our company partner.
Frequently, the source code is the only reliable documentation of a process
model due to outdated manual documentation. If only source code is available,
it is hard to comprehend the implemented specification, e.g. the mathematical
core model.
To overcome this problem and to help our partner regain knowledge of
some of its core assets, we created an approach and tooling to automatically
re-document process models using static and dynamic code analysis techniques
[4]. The key idea in this project was to extract how the output values of a
process model are computed from a set of input values. To this end, we
statically analyze dependencies and data flow in process models and provide
filtered views on the source code, represented as tables, showing all program
statements that influence the computation of a single output value.
Due to the complexity of process models, these tables get very long and
provide only little benefit for domain experts. To further improve
understandability and reduce the complexity of the resulting documentation, we
dynamically analyze the execution of process models. To do so, we execute the
program analysis with concrete input values provided by domain experts. This
leads to more compact result tables, since only executed program paths are
included, which facilitates the overall comprehension of process models.
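The static part of this filtering corresponds to a backward slice over the
data dependencies. A minimal Java sketch of this idea, with an assumed
dependency representation, could look as follows:

```java
import java.util.*;

// Hedged sketch of the filtered-view idea: given data dependencies
// between program statements, collect all statements that influence
// the computation of one output value (a backward slice).
public class BackwardSlice {
    public static Set<Integer> slice(Map<Integer, List<Integer>> deps, int criterion) {
        Set<Integer> visited = new LinkedHashSet<>();
        Deque<Integer> work = new ArrayDeque<>(List.of(criterion));
        while (!work.isEmpty()) {
            int stmt = work.pop();
            if (visited.add(stmt)) {
                // Follow the statements this one depends on.
                work.addAll(deps.getOrDefault(stmt, List.of()));
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Statement numbers and their data dependencies (assumed example):
        // 1: a = input(); 2: b = input(); 3: c = a + 1; 4: out = c * b
        Map<Integer, List<Integer>> deps = Map.of(3, List.of(1), 4, List.of(3, 2));
        System.out.println("Statements influencing 'out': " + slice(deps, 4));
        // -> [4, 3, 2, 1]; unrelated statements would be filtered out.
    }
}
```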

2.4 Multi-Language Re-Documentation of Legacy Software


The COMET partner OÖGKK conducts a software migration project in which
the source code of the COBOL legacy software system LGKK is automatically
transformed into a corresponding Java software system. The new Java system
must substitute the legacy COBOL batch system transparently with respect
to surrounding systems, providing the same functionality without altering
existing interfaces. The goal of our research project is to develop an approach
and tool support for automatic re-documentation, namely to generate
documentation for both the original COBOL legacy system and the new Java
system. Documentation in the form of flowchart diagrams showing the
input/output behavior of batch programs was requested to support
programmers and business analysts in maintaining and enhancing the Java software.
Our approach for documentation generation supports two scenarios. In one
scenario, documentation generated from both COBOL and Java source code is
used to support the actual transformation. In the other scenario, documentation
continuously generated from the Java source code shall facilitate maintenance of
the migrated Java software. So far, a feasibility prototype that generates the
desired documentation has been developed. The generated documentation was
positively evaluated by potential stakeholders of the industrial partner [5].

2.5 Knowledge Extraction from Industrial User Interfaces
Human machine interfaces (HMI) play an essential role in operating industrial
facilities and machines. Depending on the range and variability of a
manufacturer's product portfolio, a huge library of GUI software may exist.
This poses quite a challenge when it comes to testing or re-engineering. Static
analysis helps to unveil valuable inherent knowledge and prepare it for further
analysis and processing.
Our COMET partner ENGEL Austria is a large injection molding machine
manufacturer, offering a broad portfolio of molding machines tailorable to
specific customer needs. For this purpose, each machine can be equipped and
configured with numerous different options. Since ENGEL not only builds the
machines but also develops both the machine control and HMI software, the high
product variability is of course reflected in that software as well. Currently
ENGEL is completely re-engineering its HMI framework on the basis of a new
UI technology. Naturally, it is desirable to reduce the effort of migrating the
existing screens to the new platform, and thus ENGEL wants to automate the
migration process as far as possible.
This goal shall be reached by employing static analysis techniques: we extract
the internal structure of the HMI screens and the control system context
they are used in, i.e. which PLC variables they access. This internal structure
also comprises (hardcoded) variants depending on the presence of PLC variables
or their values, which are evaluated at HMI startup time. Subsequently,
HMI screens are assembled depending on the outcome of this evaluation. Hence,
we filter out these variants and the conditions that lead to them. In another
step, we analyze the usage patterns of method calls to certain UI widgets.
The resulting analysis model is then transformed (per screen) into an object
model designed for the new HMI framework and supportive code fragments.

2.6 Static Code Analysis for PLC Programs


The goal of a research project together with the COMET partner TRUMPF
Maschinen Austria is the automatic static analysis of PLC code for a
Bernecker & Rainer Automation Studio project consisting of Structured Text and
C source code. In the context of code analytics, we have provided parsers for
the Automation Studio project and for Structured Text, and we use our existing
C/C++ frontend to obtain an abstract syntax tree for the source code. A rule
engine, which can be extended by implementing so-called rules in Java code, is
used to generate code metrics and to check coding guidelines. The rule engine
traverses the complete AST of the entire project and notifies rules at their
points of interest.
The coding guideline rules check the naming conventions of variable names and
data type names. There is a rule that checks that all fields of an aggregate data
structure are used. We have implemented a subset of the PLCopen rules (18 of 63).
Calculated metrics include classical software metrics: McCabe's cyclomatic
complexity, the Halstead metrics, lines of code, and the Microsoft Maintainability
Index.
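A minimal Java sketch of this rule-engine design (all names are assumptions,
not the actual implementation) shows how rules register for the node kinds
they are interested in and are notified during the AST traversal:

```java
import java.util.*;
import java.util.regex.Pattern;

// Assumed, simplified AST and rule-engine skeleton.
public class RuleEngineSketch {
    record Node(String kind, String name, List<Node> children) {}

    interface Rule {
        String interestedIn();                          // node kind, e.g. "VarDecl"
        void check(Node node, List<String> findings);   // notified by the engine
    }

    // Example rule: variable names must be lowerCamelCase.
    static class NamingRule implements Rule {
        private static final Pattern CAMEL = Pattern.compile("[a-z][a-zA-Z0-9]*");
        public String interestedIn() { return "VarDecl"; }
        public void check(Node n, List<String> findings) {
            if (!CAMEL.matcher(n.name()).matches())
                findings.add("Bad variable name: " + n.name());
        }
    }

    // The engine traverses the complete AST and notifies matching rules.
    static void traverse(Node node, List<Rule> rules, List<String> findings) {
        for (Rule r : rules)
            if (r.interestedIn().equals(node.kind())) r.check(node, findings);
        for (Node child : node.children()) traverse(child, rules, findings);
    }

    public static void main(String[] args) {
        Node ast = new Node("Program", "p", List.of(
                new Node("VarDecl", "motorSpeed", List.of()),
                new Node("VarDecl", "MOTOR_limit", List.of())));
        List<String> findings = new ArrayList<>();
        traverse(ast, List.of(new NamingRule()), findings);
        findings.forEach(System.out::println); // -> Bad variable name: MOTOR_limit
    }
}
```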
Different formatters can be used for the output. Our tool can be run as
part of the build process within Automation Studio. The message of a rule
violation is formatted in such a way that the developer can click on it and the
corresponding file is opened at the line that contains the rule violation. We
provide a plugin for SonarQube that exports the metrics and rule violations
into a SonarQube instance. Additionally, this plugin contains copy-and-paste
detection.
For automatic architecture extraction, we have written an exporter to the
Neo4j graph database. It exports the project structure (a project consists of
packages, programs, libraries, and source files), the call graph between POUs
(program organization units, i.e. actions, functions, function blocks, and
programs), the read/write relations of POUs to global variables, the member
relation of a field to its containing data structure, and the IEC 61131-3
software configuration that contains the tasks that are run periodically.
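One simple way to realize such an export, sketched below under our own
assumptions about node labels and relationship types, is to emit Cypher
statements that a Neo4j instance can ingest:

```java
import java.util.*;

// Hedged sketch of a call-graph export: the extracted call graph is
// written out as Cypher statements (node and relationship creation).
// The POU names, the :POU label, and the :CALLS type are assumptions.
public class CallGraphExport {
    public static void main(String[] args) {
        Map<String, List<String>> calls = Map.of(
                "MainProgram", List.of("InitFB", "ControlLoop"),
                "ControlLoop", List.of("ReadSensors"));

        // One MERGE per caller node, then one MERGE per call edge.
        for (String pou : calls.keySet())
            System.out.printf("MERGE (:POU {name: '%s'});%n", pou);
        for (var e : calls.entrySet())
            for (String callee : e.getValue())
                System.out.printf(
                    "MATCH (a:POU {name: '%s'}) MERGE (b:POU {name: '%s'}) "
                    + "MERGE (a)-[:CALLS]->(b);%n", e.getKey(), callee);
    }
}
```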

3 Software
Figure 1 shows libraries, frameworks, and tools developed in various projects.
In this section, we briefly describe tools developed and used within various
projects (see previous section), mostly together with company partners.

Figure 1: Libraries, frameworks, and toolkits (DocIO, VARAN, RbG,
Metamorphosis, SeSy, @engine, TabFlow, DefUse, StaticCA, CodeClone,
codeanalytics, reUI, Metrics, frontends).

The Rulebook Generator (RbG) [2] is a novel tool intended for the generation
of high-quality documentation from the source code of scientific and engineering
applications. RbG extracts mathematical formulae, decision tables, and function
plots from program statements by means of static code analysis and generates
the corresponding documentation in different formats, including the Open Document
Format and LaTeX. Annotations in source code comments are used to define the
structure of the generated documents, to include additional textual and graphical
descriptions, and to control the extraction of formulae at a fine-grained level.

The VARAN tool [4] combines static and dynamic program analysis for
documentation generation. Static code analysis, which is based on the RbG tool,
extracts the input/output behavior in the form of formulae and decision tables
from source code. Dynamic program analysis allows developers to examine the
input/output behavior of single program executions and thereby gain insight
into standard behavior and exceptional cases.
Metamorphosis [6] is a multi-language toolkit for static code analysis in
technical domains. It provides language-independent data structures (e.g.
abstract syntax tree, call graph) and analysis algorithms (e.g. constructing a
call graph) based on OMG's ASTM standard for abstract syntax trees. This toolkit
was designed to integrate existing (open source) parsers and to transform
language-specific parser output (e.g. a parse tree) into the language-independent
AST. Furthermore, Metamorphosis is based on an annotation engine that is capable
of using annotations on (input) data for symbolic analysis of program code. For
instance, one can annotate program variables with physical units and then
analyze the program statements to detect inconsistencies with respect to
physical units. The same annotation engine is reused in the DocIO tool, which
forwards text given by Doxygen annotations to printf statements in order to
document which variables are written to which files.
SeSy [7] is an approach and tool support to extract specifications from
source code by means of symbolic execution. Symbolic execution makes it possible
to identify input and output data, the actual computation, as well as the
constraints of a particular computation, independently of the program structure.
Due to dynamic symbolic execution, SeSy outperforms RbG when analyzing
unstructured source code.
The TabFlow tool [8] was developed for the reverse engineering of PL/SQL
code into a more abstract and comprehensible representation. For this, the tool
computes the data flow between database tables and reports the data values
contained in result tables together with conditions determined by means of
symbolic execution.
The reUI tool [9] extracts the internal structure of the GUI screens of an
injection molding machine, their variants, and the control system context they
are used in. Furthermore, the usage patterns of method calls to certain UI
widgets are analyzed.
The StaticCodeAnalyzer tool (StaticCA) [10] is a tool for the static code
analysis of IEC 61131-3 programs, which is capable of detecting a range of
issues commonly occurring in PLC programming. The tool employs different
analysis methods, such as pattern matching on program structures, control flow
and data flow analysis, and, especially, call graph and pointer analysis
techniques.
Our frameworks and toolkits were also used for research activities within and
outside SAE. For instance, the DAS research focus at SCCH used our frameworks
for clone detection in stored procedures (PL/SQL) by means of a tree-based
similarity measure [11]. Further frameworks calculate simple software metrics
and analyze def/use (definition and usage) relations between (global) data
objects and procedural functions. An online demo of some tools can be found at
http://codeanalytics.scch.at/.

4 Future Topics
In this section we enumerate and briefly describe emerging research topics in the
context of SAE. Topics are identified by considering relevant scientific
conferences (e.g. SANER, ICSME, ICSE) in the context of the current work in our
research group. Identified topics are pursued on a strategic level and may
eventually become part of our research agenda. For instance, research on
source-to-source translation by statistical machine learning, with correct
translation rates just above 50%, cannot be applied to migration projects at the
moment. However, we expect to improve our approaches for extracting knowledge
from source code by integrating them with machine learning (or software
analytics in general).

4.1 Code Analysis for Energy Efficiency


The growth of data centers and the limited battery lifetime of ubiquitous mobile
devices force owners and makers of such systems to limit the energy consumption
caused by the software executed on these systems [12]. The energy consumption
of software is either estimated [13][14][15][16] or measured at hardware level
(i.e. by processing data from an energy measurement chip [17]). Investigations
have been done for well-known and often used source code chunks such as
collection classes [12][18] and design patterns [19].
Hasan et al. show in [12] that the energy consumption of collections in Java
depends on:

- The used data structures: Lists, Maps, and Sets differ widely depending
on their implementations and the used collection framework.

- The type of operations: insertions are more expensive than removal and
traversal.

- The input size: energy consumption rises significantly for most operations
with more than 500 elements.

They conclude that memory consumption is not a significant driving factor
behind the differences in energy consumption. Their analysis of the executed
byte code suggests that increasing workload (execution time) increases energy
consumption [12].
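Since [12] links energy consumption to workload, a rough flavor of such
investigations can be given with execution time as a proxy; the following
illustrative Java sketch compares the insertion cost of different collection
implementations (real studies measure energy at hardware level and use proper
benchmarking harnesses, which this naive timing loop is not):

```java
import java.util.*;
import java.util.function.Supplier;

// Illustrative sketch only: execution time as a crude workload proxy
// for comparing insertions across collection implementations.
public class CollectionWorkload {
    static long timeInsertions(Supplier<Collection<Integer>> factory, int n) {
        Collection<Integer> c = factory.get();
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) c.add(i);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 500_000;
        Map<String, Supplier<Collection<Integer>>> candidates = Map.of(
                "ArrayList", ArrayList::new,
                "LinkedList", LinkedList::new,
                "HashSet", HashSet::new);
        candidates.forEach((name, factory) ->
                System.out.printf("%-10s %8.1f ms%n", name,
                        timeInsertions(factory, n) / 1e6));
    }
}
```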
However, Pinto et al. [18] caution that execution time is not always a
reliable indicator of energy consumption when targeting concurrent programs,
although it is in most cases. They show a huge increase in energy consumption
for operations on thread-safe collections (not considered in [12]) compared to
their non-thread-safe implementations.
In [20], Manotas et al. propose an Energy-Optimization Decision Support
Framework for software engineers. The framework is a method that takes the
application code, potential changes, optimization parameters, and context
information to find an energy-optimized application that is semantically
similar to the input application code. This is done by profiling several
transformations of the application and selecting the most optimized one
according to the defined parameters. Manotas et al. provide a concrete
implementation that switches Java collection classes in given applications
based on their API. They admit that the overall cost of an energy optimization
can be very high (5 to 110 hours for their test applications) for collecting
and processing the power samples.

4.2 Machine Learning from Source Code
Machine learning gives computers the ability to learn without being explicitly
programmed. It tries to create programs that improve their performance at
specific, selected tasks through experience. The concepts and applications of
machine learning techniques have been around for decades but have seen a
significant boost in recent years due to increased computational power, the
ubiquity of computing, and new emerging application domains (e.g. big data
analytics). Today a plethora of machine learning concepts exist, such as
artificial neural networks, deep learning, genetic algorithms, or Bayesian
networks.
This trend has been picked up by software engineering research and computer
science, and today we see numerous research activities from different research
fields turning towards the application of machine learning and AI in general.
Machine learning is applied to improve software maintenance tasks, increase
software reuse [21], improve the detection and prediction of defects [22][23],
and support program translation [24], program comprehension and software
analysis [25], software reliability prediction [26], and software development
tasks [27][28].
In code analysis, machine learning has been successfully applied to code
comment analysis. In particular, unsupervised clustering techniques are used
for capturing the main themes of documents. In this way, software artifacts can
be grouped into themes (e.g. concurrency) and sub-themes [29]. Oda et al. [24]
use a learnable statistical machine translation approach to translate Python
code into pseudo code. This enables readers unfamiliar with a certain
programming language to understand program sources. Parr and Vinju [28] create
a code formatter that uses machine learning to abstract formatting rules from a
corpus of source code files. During the training phase, the formatter is
provided with a language grammar, a corpus of formatted source code files, and
an indentation size, and it derives directives for inserting whitespace,
injecting newlines, and indenting tokens.
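To give a flavor of such statistical models over source code, the following toy
Java sketch (our own simplification, far simpler than the cited systems) trains
a bigram model over code tokens and suggests the most likely successor token:

```java
import java.util.*;

// Toy bigram model over code tokens; real systems ([24], [28]) use far
// richer models and grammar context.
public class BigramTokenModel {
    public static void main(String[] args) {
        List<String> corpus = List.of(
                "for", "(", "int", "i", "=", "0", ";",
                "for", "(", "int", "j", "=", "0", ";");

        // Count bigram frequencies: previous token -> next token -> count.
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i + 1 < corpus.size(); i++)
            counts.computeIfAbsent(corpus.get(i), k -> new HashMap<>())
                  .merge(corpus.get(i + 1), 1, Integer::sum);

        // Suggest the most frequent successor of "for".
        String next = counts.getOrDefault("for", Map.of()).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse("?");
        System.out.println("After 'for', suggest: " + next); // -> (
    }
}
```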

4.3 Natural Language Processing in Software Engineering


In Software Engineering, quite a lot of artifacts contain unstructured textual
information, frequently in the form of natural language. Such artifacts range
from requirements documents and architectural and design documents to user
documentation, such as manuals. Natural language text also exists in bug
reports and in communication messages such as emails and discussion forum
entries. Last but not least, natural language is present in source code itself,
in source code documentation and comments. The information present in such
natural language fragments may constitute important knowledge valuable for a
wide variety of tasks in software engineering. To extract that knowledge,
techniques from Text Retrieval (TR) and Natural Language Processing (NLP) can
provide the necessary means.
As its name implies, TR is a branch of Information Retrieval (IR) that
concentrates on information stored in textual form. TR techniques (e.g. text
normalization, stemming, POS tagging, splitting, stop word removal, acronym
expansion) and advanced model concepts like the Vector Space Model (VSM) or
Latent Semantic Indexing (LSI) provide an important pillar for NLP. Manning
et al. [30] provide an overview of IR and particularly TR concepts and techniques.
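As a minimal illustration of the Vector Space Model, the following Java sketch
turns two texts into term-frequency vectors and computes their cosine
similarity; real TR pipelines add normalization, stemming, stop word removal,
and tf-idf weighting:

```java
import java.util.*;

// Minimal VSM sketch: documents as term-frequency vectors, similarity
// as the cosine of the angle between them.
public class VsmSketch {
    static Map<String, Integer> termFrequencies(String doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc.toLowerCase().split("\\W+"))
            tf.merge(term, 1, Integer::sum);
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (var e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        for (int v : a.values()) na += (double) v * v;
        for (int v : b.values()) nb += (double) v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Two (assumed) requirement statements with overlapping vocabulary.
        var r1 = termFrequencies("The system shall log every failed login attempt");
        var r2 = termFrequencies("Every failed login attempt shall be logged");
        System.out.printf("similarity = %.2f%n", cosine(r1, r2));
    }
}
```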

According to Allen [31], NLP refers to "computer systems that analyze, attempt
to understand, or produce one or more human languages, such as English,
Japanese, Italian, or Russian. The input might be text, spoken language, or
keyboard input. The task might be to translate to another language, to
comprehend and represent the content of text, to build a database or generate
summaries, or to maintain a dialogue with a user as part of an interface for
database/information retrieval."
However, NLP not only relies on the techniques provided by TR, it also uses
methods and techniques from other scientific fields, such as artificial
intelligence, machine learning, computational linguistics, probability theory,
and statistics. Amongst others, Manning and Schütze [32] provide a detailed
introduction to NLP.
NLP supports a wide range of tasks covering almost all aspects of Software
Engineering. It may be applied in Requirements Engineering tasks for a
software system as well as during the design, implementation, documentation,
and maintenance of a software system.

4.3.1 NLP Concepts and Tasks


In a recent, informal survey, Arnaoudova et al. [33] identify some core concepts
and tasks in NLP:

1. Language Models define probabilities for word sequences and allow
generating queries matching the targeted language. See Manning [30], Chapter
12, for an introduction to language models.

2. Syntactic Analysis: The goal of syntactic analysis is to identify
grammatical relations between words and the role of words in a language (by
part-of-speech tagging). POS tagging relies on probabilistic models concerning
neighboring words and word probabilities in general. The Stanford CoreNLP
Toolkit [34] is a collection of software tools providing a pipeline for
language analysis tasks (a usage sketch follows after this list). Amongst
others, it contains special parsers (PCFG, Probabilistic Context Free Grammar)
and algorithms for basic operations like tokenizing, phrase splitting, word
tagging, and others.

3. Semantic Analysis aims to find meaningful relations between words. It
is used to build ontologies and lexicons. WordNet [35], an electronic lexical
database, is considered to be the most important resource available to
researchers in computational linguistics, text analysis, and many related
areas. Its design is inspired by current psycholinguistic and computational
theories of human lexical memory. English nouns, verbs, adjectives, and adverbs
are organized into synonym sets, and different relations link the synonym sets.
4. Sentiment & Emotion Analysis (SEA) aims to find out the polarity of
a text, i.e. to find positive and negative statements. It involves retrieving
information from discussion forums, reviews, and social networks. SEA is
deemed less significant in classical Software Engineering tasks than in other
fields of science (e.g. the social sciences). However, with social and
psychological aspects becoming more important in Software Engineering, SEA
is expected to gain significance. A good introduction is provided by Liu
[36]. An example of the application of SEA in Software Engineering is given
by Guzman and Maalej [37], who analyze user opinions on software app features
to allow developers to filter significant and/or irrelevant reviews.
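The POS-tagging sketch referenced in item 2 above follows the classic Stanford
CoreNLP pipeline API [34]; the input sentence is an arbitrary example:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

// POS tagging with the Stanford CoreNLP toolkit [34] (CoreNLP 3.x API).
public class PosTaggingExample {
    public static void main(String[] args) {
        // Configure a pipeline with tokenization, sentence splitting, POS.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("The parser splits the input into tokens.");
        pipeline.annotate(document);

        // Print each token together with its part-of-speech tag.
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class))
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class))
                System.out.println(token.word() + " / "
                        + token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
    }
}
```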

4.3.2 Software Engineering Tasks Supported by NLP


As mentioned, NLP supports a wide range of tasks in Software Engineering.
The already cited survey of Arnaoudova et al. [33] has identified over 20 tasks
benefiting from NLP. The following items aim to cover the most notable ones.

1. Requirements Engineering and Analysis is a field which traditionally
involves heavy use of natural language. Unsurprisingly, there has been a
lot of research regarding the use of NLP in RE. The work of Falessi et
al. [38] provides a good overview of NLP techniques in this field. A
typical application, published by the same authors [39], deals with the
detection of equivalent requirements in specification databases.
2. Reverse Engineering is obviously a task benefiting heavily from NLP
techniques. Reverse engineering usually lays the foundation for other SE
tasks like refactoring, generating documentation, or locating features.
Depending on the final purpose, the techniques used range from simple text
retrieval methods to building complex software ontologies. For instance,
Landhäußer and Genaid [40] build ontologies of source code and link user
stories written in natural language to code artifacts. Another example is
delivered by Thomas et al. [41], who study the evolution of software on the
basis of topic models.

3. Refactoring often targets the monitoring of adherence to code conventions
and the (semi)automatic adjustment of identifiers, e.g. finding poor variable,
method, or class naming in source code, identifying inconsistencies, and
providing suggestions for better naming or even performing changes. Gupta
et al. [42] aid program comprehension by adapting POS tagging specifically
for the SE domain, with their results aiding and improving program analysis
and search tasks within source code. Allamanis et al. [43] use NLP
techniques to infer preferred variable naming by learning from programmers'
habits. The same authors also describe an approach to suggest consistent
class and method names by including class/method usage context in their
analysis [44]. Bavota et al. [45] employ graph algorithms for refactoring
class relationships by analyzing semantic relationships between classes and
methods, with textual information gained from source code acting as basic
information; here, too, program comprehension builds the base for refactoring.
Arnaoudova et al. [46] introduce and elaborate on Linguistic Antipatterns
(recurring poor practices in the naming, documentation, and choice of
identifiers in an implementation), which may hinder program comprehension.

4. Documentation Generation is another important field where NLP techniques
play a role, being used to generate whole, meaningful sentences and/or to
extract keywords from texts. An example of such an application is provided
by Haiduc et al. [47], whose results support program comprehension by the
automatic generation of source code summaries. Moreno et al. [48] have made
similar efforts, generating natural language class summaries from Java code
and dealing with the automatic generation of release notes [49]. Panichella
et al. [50] analyze developer communication (emails, mailing lists, discussion
boards) to enhance source code descriptions.
5. Concept Location deals with finding specific issues within the codebase of
a software system or related data, for instance finding the starting point of a
change to the code based on a certain change request. Other issues to be located
may also include certain functional features or specific bugs. Dit et al.
[51] have published an overview of feature location techniques, albeit not
all relying on NLP. Marcus and Haiduc [52] discuss the application of text
retrieval techniques to support concept location, specifically in the
context of software changes. Shepherd et al. [53] employ natural language
processing in conjunction with static analysis for feature location with
their search engine Find-Concept. Hill et al. [54] follow the idea of query
expansion and refinement in their approach to feature location based on
contextual searching. Instead of focusing on verbs and direct objects, their
analysis centers on three types of phrases: noun phrases, verb phrases, and
prepositional phrases.
6. SW Categorization aims to classify aspects of a software system (e.g.
based on source code, binaries, usage profiles, or API definitions). Examples
of work in this field are provided by Huang et al. [55], who classify defects
to aid developers in prioritizing requested changes. In [56], Escobar-Avila
et al. describe their effort to leverage semantic information extracted
from Java bytecode and software profiles to automatically categorize Java
software projects, aiming to support the clustering of large software
repositories like GitHub or SourceForge.

4.3.3 Present and Future of NLP in Software Engineering


According to Arnaoudova et al. [33], NLP is one of the fastest growing research
areas in SE. However, the number of publications peaked in 2012 and has been
slowly declining since. Arnaoudova et al. identify the need for more benchmarks
and open data to support comparison with previous approaches as the current
burning issue. Other trends include the combination of different approaches
to get better overall results and the adaptation of NLP and TR techniques to
the properties of concrete tasks and datasets in Software Engineering.

4.4 Dynamic Symbolic Execution


The execution of a program with symbolic values instead of concrete values
is called symbolic execution. When program execution branches, conceptually
both branches are followed at once, maintaining on each path the set of
constraints, called the path condition, which must hold for execution along
this path [57]. When a path terminates, the path condition holds the
precondition for this execution path, and the symbolic result holds the
postcondition [58]. Two common issues are the huge number of potential paths
through a program and the environment problem (for example, system calls that
cannot be analyzed) [59].

To overcome these challenges, dynamic symbolic execution uses concrete input
values to execute a program along exactly one path and, alongside that path,
performs symbolic execution with symbolic input values [7]. New concrete input
values that exercise different control paths in the program can be found by
negating the collected constraints one by one and solving them with a
constraint solver [59]. Since the number of paths through a program can be
overwhelming, strategies have to be applied to avoid starvation; maximizing
code coverage can be such a goal. Tools like SAGE and KLEE can generate
thousands of test cases, some of which might reveal a bug (the program crashes
or throws an exception). The recorded test case can then be used to reproduce
the bug. However, this approach cannot decide whether a new test case produces
a meaningful result.
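The following toy Java sketch illustrates the principle on a single branch;
the brute-force search stands in for a real constraint solver and is our own
simplification:

```java
import java.util.*;

// Toy sketch of dynamic symbolic execution: run the program with a
// concrete input while recording the branch condition symbolically;
// negating the recorded constraint and solving it yields an input that
// exercises the other path.
public class ConcolicSketch {
    // Program under test: two paths, guarded by x > 10.
    static String program(int x, List<String> pathCondition) {
        if (x > 10) { pathCondition.add("x > 10"); return "path A"; }
        else        { pathCondition.add("!(x > 10)"); return "path B"; }
    }

    public static void main(String[] args) {
        List<String> pc = new ArrayList<>();
        System.out.println("x=0 takes " + program(0, pc) + ", path condition " + pc);

        // Negate the collected constraint and "solve" it by brute force
        // over a small domain (a stand-in for a real constraint solver).
        for (int candidate = -100; candidate <= 100; candidate++) {
            if (candidate > 10) { // satisfies the negated constraint x > 10
                List<String> pc2 = new ArrayList<>();
                System.out.println("x=" + candidate + " takes "
                        + program(candidate, pc2) + ", path condition " + pc2);
                break;
            }
        }
    }
}
```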

References
[1] M. Mernik, J. Heering, and A. M. Sloane, "When and how to develop
domain-specific languages," ACM Comput. Surv., vol. 37, pp. 316–344,
Dec. 2005.

[2] M. Moser, J. Pichler, G. Fleck, and M. Witlatschil, "RbG: A documentation
generator for scientific and engineering software," in 2015 IEEE 22nd
International Conference on Software Analysis, Evolution, and Reengineering
(SANER), IEEE, Mar. 2015.

[3] G. Fleck, W. Kirchmayr, M. Moser, L. Nocke, J. Pichler, R. Tober, and
M. Witlatschil, "Experience report on building ASTM based tools for
multi-language reverse engineering," in SANER, pp. 683–687, IEEE Computer
Society, 2016.

[4] W. Kirchmayr, M. Moser, L. Nocke, J. Pichler, and R. Tober, "Integration
of static and dynamic code analysis for understanding legacy source code,"
in 2016 IEEE International Conference on Software Maintenance and Evolution
(ICSME), pp. 543–552, IEEE, 2016.

[5] B. Dorninger, M. Moser, and J. Pichler, "Multi-language re-documentation
to support a COBOL to Java migration project," in 2017 IEEE International
Conference on Software Analysis, Evolution, and Reengineering (SANER 2017),
2017.

[6] C. Klammer and J. Pichler, "Towards tool support for analyzing legacy
systems in technical domains," in 2014 Software Evolution Week - IEEE
Conference on Software Maintenance, Reengineering, and Reverse Engineering
(CSMR-WCRE), pp. 371–374, Feb. 2014.

[7] J. Pichler, "Specification extraction by symbolic execution," in Reverse
Engineering (WCRE), 2013 20th Working Conference on, pp. 462–466, IEEE,
Oct. 2013.

[8] M. Habringer, M. Moser, and J. Pichler, "Reverse engineering PL/SQL legacy
code: An experience report," in 2014 IEEE International Conference on
Software Maintenance and Evolution, pp. 553–556, Sept. 2014.

[9] B. Dorninger, J. Pichler, and A. Kern, "Using static analysis for knowledge
extraction from industrial user interfaces," in 2015 IEEE International
Conference on Software Maintenance and Evolution (ICSME), pp. 497–500,
Sept. 2015.

[10] H. Prähofer, F. Angerer, R. Ramler, and F. Grillenberger, "Static code
analysis of IEC 61131-3 programs: Comprehensive tool support and
experiences from large-scale industrial application," IEEE Transactions on
Industrial Informatics, vol. PP, no. 99, pp. 1–1, 2016.

[11] R. Stumptner, C. Lettner, B. Freudenthaler, J. Pichler, W. Kirchmayr, and
E. Draxler, "Maintaining and analyzing production process definitions using
a tree-based similarity measure," in Case-Based Reasoning Research and
Development: 23rd International Conference, ICCBR 2015, Frankfurt am Main,
Germany, September 28-30, 2015, Proceedings (E. Hüllermeier and M. Minor,
eds.), (Cham), pp. 366–380, Springer International Publishing, 2015.
[12] S. Hasan, Z. King, M. Hafiz, M. Sayagh, B. Adams, and A. Hindle, "Energy
profiles of Java collections classes," in Proceedings of the 38th
International Conference on Software Engineering - ICSE '16, ACM, 2016.

[13] C. Wilke, S. Richly, S. Götz, C. Piechnick, and U. Aßmann, "Energy
consumption and efficiency in mobile applications: A user feedback study,"
in The 2013 IEEE International Conference on Green Computing and
Communications (GreenCom 2013), pp. 134–141, 2013.

[14] D. Li, S. Hao, W. G. Halfond, and R. Govindan, "Calculating source line
level energy information for Android applications," in Proceedings of the
International Symposium on Software Testing and Analysis (ISSTA), July
2013.

[15] C. Seo, S. Malek, and N. Medvidovic, "Component-level energy consumption
estimation for distributed Java-based software systems," in Component-Based
Software Engineering: 11th International Symposium, CBSE 2008, Karlsruhe,
Germany, October 14-17, 2008, Proceedings (M. R. V. Chaudron, C. Szyperski,
and R. Reussner, eds.), (Berlin, Heidelberg), pp. 97–113, Springer Berlin
Heidelberg, 2008.

[16] K. Liu, G. Pinto, and Y. D. Liu, "Data-oriented characterization of
application-level energy optimization," in International Conference on
Fundamental Approaches to Software Engineering, pp. 316–331, Springer
Berlin Heidelberg, 2015.

[17] A. Hindle, A. Wilson, K. Rasmussen, E. J. Barlow, J. C. Campbell, and
S. Romansky, "GreenMiner: A hardware based mining software repositories
software energy consumption framework," in Proceedings of the 11th Working
Conference on Mining Software Repositories, pp. 12–21, ACM, 2014.
14
[18] G. Pinto, K. Liu, F. Castor, and Y. D. Liu, "A comprehensive study on
the energy efficiency of Java thread-safe collections," in 2016 IEEE
International Conference on Software Maintenance and Evolution (ICSME '16),
IEEE, 2016.

[19] A. Litke, K. Zotos, A. Chatzigeorgiou, and G. Stephanides, "Energy
consumption analysis of design patterns," International Journal of
Electrical, Computer, Energetic, Electronic and Communication Engineering,
vol. 1, no. 11, pp. 1655–1659, 2007.

[20] I. Manotas, L. Pollock, and J. Clause, "SEEDS: A software engineer's
energy-optimization decision support framework," in Proceedings of the 36th
International Conference on Software Engineering, ICSE 2014, (New York,
NY, USA), pp. 503–514, ACM, 2014.

[21] S. Maggo and C. Gupta, "A machine learning based efficient software
reusability prediction model for Java based object oriented software,"
International Journal of Information Technology and Computer Science,
vol. 6, pp. 1–13, Jan. 2014.

[22] P. Kumar and Y. Singh, "An empirical study of software reliability
prediction using machine learning techniques," International Journal of
System Assurance Engineering and Management, vol. 3, no. 3, pp. 194–208,
2012.

[23] R. Malhotra, "A systematic review of machine learning techniques for
software fault prediction," Appl. Soft Comput., vol. 27, pp. 504–518,
Feb. 2015.

[24] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and
S. Nakamura, "Learning to generate pseudo-code from source code using
statistical machine translation (T)," in Proceedings of the 2015 30th
IEEE/ACM International Conference on Automated Software Engineering (ASE),
ASE '15, (Washington, DC, USA), pp. 574–584, IEEE Computer Society, 2015.

[25] I. Griffith, S. Wahl, and C. Izurieta, "TrueRefactor: An automated
refactoring tool to improve legacy system and application comprehensibility."

[26] V. Raychev, M. Vechev, and A. Krause, "Predicting program properties
from big code," in Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, POPL '15, (New York, NY,
USA), pp. 111–124, ACM, 2015.

[27] M. Bruch, M. Monperrus, and M. Mezini, "Learning from examples to improve
code completion systems," in Proceedings of the 7th Joint Meeting of the
European Software Engineering Conference and the ACM SIGSOFT Symposium on
The Foundations of Software Engineering, ESEC/FSE '09, (New York, NY, USA),
pp. 213–222, ACM, 2009.

[28] T. Parr and J. J. Vinju, "Towards a universal code formatter through
machine learning," in International Conference on Software Language
Engineering (SLE) 2016, 2016.

[29] C. Zhai, A. Velivelli, and B. Yu, "A cross-collection mixture model for
comparative text mining," in Proceedings of KDD '04, pp. 743–748, 2004.
[30] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information
Retrieval. New York, NY, USA: Cambridge University Press, 2008.

[31] J. F. Allen, "Natural language processing," in Encyclopedia of Computer
Science, pp. 1218–1222, Chichester, UK: John Wiley and Sons Ltd., 2003.

[32] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language
Processing. MIT Press, 1999.

[33] V. Arnaoudova, S. Haiduc, A. Marcus, and G. Antoniol, "The use of text
retrieval and natural language processing in software engineering," in
Proceedings of the 37th International Conference on Software Engineering -
Volume 2, ICSE '15, (Piscataway, NJ, USA), pp. 949–950, IEEE Press, 2015.

[34] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and
D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in
Association for Computational Linguistics (ACL) System Demonstrations,
pp. 55–60, 2014.

[35] C. Fellbaum, WordNet. Wiley Online Library, 1998.

[36] B. Liu, "Sentiment analysis and opinion mining," Synthesis Lectures on
Human Language Technologies, vol. 5, no. 1, pp. 1–167, 2012.

[37] E. Guzman and W. Maalej, "How do users like this feature? A fine grained
sentiment analysis of app reviews," in 2014 IEEE 22nd International
Requirements Engineering Conference (RE), pp. 153–162, IEEE, 2014.

[38] D. Falessi, G. Cantone, and G. Canfora, "A comprehensive characterization
of NLP techniques for identifying equivalent requirements," in Proceedings
of the 2010 ACM-IEEE International Symposium on Empirical Software
Engineering and Measurement, p. 18, ACM, 2010.

[39] D. Falessi, G. Cantone, and G. Canfora, "Empirical principles and an
industrial case study in retrieving equivalent requirements via natural
language processing techniques," IEEE Transactions on Software Engineering,
vol. 39, no. 1, pp. 18–44, 2013.

[40] M. Landhäußer and A. Genaid, "Connecting user stories and code for test
development," in Proceedings of the Third International Workshop on
Recommendation Systems for Software Engineering, RSSE '12, (Piscataway, NJ,
USA), pp. 33–37, IEEE Press, 2012.

[41] S. W. Thomas, B. Adams, A. E. Hassan, and D. Blostein, "Studying software
evolution using topic models," Science of Computer Programming, vol. 80,
Part B, pp. 457–479, 2014.

[42] S. Gupta, S. Malik, L. Pollock, and K. Vijay-Shanker, "Part-of-speech
tagging of program identifiers for improved text-based software engineering
tools," in 2013 21st International Conference on Program Comprehension
(ICPC), pp. 3–12, IEEE, 2013.
[43] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, "Learning natural coding
conventions," in Proceedings of the 22nd ACM SIGSOFT International
Symposium on Foundations of Software Engineering, FSE 2014, (New York, NY,
USA), pp. 281–293, ACM, 2014.

[44] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, "Suggesting accurate
method and class names," in Proceedings of the 2015 10th Joint Meeting on
Foundations of Software Engineering, ESEC/FSE 2015, (New York, NY, USA),
pp. 38–49, ACM, 2015.

[45] G. Bavota, A. De Lucia, A. Marcus, and R. Oliveto, "Automating extract
class refactoring: An improved method and its evaluation," Empirical
Software Engineering, vol. 19, no. 6, pp. 1617–1664, 2014.

[46] V. Arnaoudova, M. Di Penta, and G. Antoniol, "Linguistic antipatterns:
What they are and how developers perceive them," Empirical Software
Engineering, vol. 21, no. 1, pp. 104–158, 2016.

[47] S. Haiduc, J. Aponte, and A. Marcus, "Supporting program comprehension
with source code summarization," in Proceedings of the 32nd ACM/IEEE
International Conference on Software Engineering - Volume 2, ICSE '10,
(New York, NY, USA), pp. 223–226, ACM, 2010.

[48] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, and
K. Vijay-Shanker, "Automatic generation of natural language summaries for
Java classes," in 2013 21st International Conference on Program
Comprehension (ICPC), pp. 23–32, IEEE, 2013.

[49] L. Moreno, G. Bavota, M. D. Penta, R. Oliveto, A. Marcus, and G. Canfora,
"ARENA: An approach for the automated generation of release notes," IEEE
Transactions on Software Engineering, vol. PP, no. 99, pp. 1–1, 2016.

[50] S. Panichella, J. Aponte, M. Di Penta, A. Marcus, and G. Canfora, "Mining
source code descriptions from developer communications," in Program
Comprehension (ICPC), 2012 IEEE 20th International Conference on,
pp. 63–72, IEEE, 2012.

[51] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk, "Feature location in
source code: A taxonomy and survey," Journal of Software: Evolution and
Process, vol. 25, no. 1, pp. 53–95, 2013.

[52] A. Marcus and S. Haiduc, "Text retrieval approaches for concept location
in source code," pp. 126–158. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2013.

[53] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker, "Using
natural language program analysis to locate and understand action-oriented
concerns," in Proceedings of the 6th International Conference on
Aspect-Oriented Software Development, AOSD '07, (New York, NY, USA),
pp. 212–224, ACM, 2007.

[54] E. Hill, L. Pollock, and K. Vijay-Shanker, "Automatically capturing source
code context of NL-queries for software maintenance and reuse," in 2009
IEEE 31st International Conference on Software Engineering, pp. 232–242,
May 2009.
[55] L. Huang, V. Ng, I. Persing, M. Chen, Z. Li, R. Geng, and J. Tian,
"AutoODC: Automated generation of orthogonal defect classifications,"
Automated Software Engineering, vol. 22, no. 1, pp. 3–46, 2015.

[56] J. Escobar-Avila, M. Linares-Vásquez, and S. Haiduc, "Unsupervised
software categorization using bytecode," in Proceedings of the 2015 IEEE
23rd International Conference on Program Comprehension, pp. 229–239, IEEE
Press, 2015.

[57] C. Cadar, D. Dunbar, and D. Engler, "KLEE: Unassisted and automatic
generation of high-coverage tests for complex systems programs," in
Proceedings of the 8th USENIX Conference on Operating Systems Design and
Implementation, OSDI '08, (Berkeley, CA, USA), pp. 209–224, USENIX
Association, 2008.

[58] C. Csallner, N. Tillmann, and Y. Smaragdakis, "DySy: Dynamic symbolic
execution for invariant inference," in Proc. 30th ACM/IEEE International
Conference on Software Engineering (ICSE), pp. 281–290, ACM, May 2008.

[59] P. Godefroid, M. Y. Levin, and D. Molnar, "SAGE: Whitebox fuzzing for
security testing," Queue, vol. 10, pp. 20:20–20:27, Jan. 2012.
