Publications
2011
"Scalable Analysis of Conceptual Data Models", , , and . In ISSTA 2011:
2011 International Symposium on Software Testing and Analysis, July 2011.
Conceptual data models describe information
systems without the burden of implementation
details, and are increasingly used to generate
code. They could also be analyzed for consistency
and to generate test data except that the
expressive constraints supported by popular
modeling notations make such analysis
intractable. In an earlier empirical study of
conceptual models created at LogicBlox Inc.,
Smaragdakis, Csallner, and Subramanian found that
a restricted subset of ORM, called ORM-, includes
the vast majority of constraints used in practice
and, moreover, allows scalable analysis. After
that study, however, LogicBlox Inc. obtained a new
ORM modeling tool, which supports discovery and
specification of more complex constraints than the
previous tool. We report findings of a follow-up
study of models constructed using the more
powerful tool. Our study finds that LogicBlox
developers increasingly rely on a small number of
features not in the ORM- subset. We extend ORM-
with support for two of them: objectification and
a restricted class of external uniqueness
constraints. The extensions significantly improve
our ability to analyze the ORM models created by
developers using the new tool. We also show that a
recent change to ORM has rendered the original
ORM- algorithms unsound, in general; but that an
efficient test suffices to show that these
algorithms are in fact sound for the ORM-
constraints appearing in any of the models
currently in use at LogicBlox.
"Approximating Constraint Propagation in
Datalog", , , and . In CICLOPS 2011:
11th International Colloquium on Implementation of Constraint and LOgic Programming Systems, July 2011.
We present a technique exploiting Datalog with aggregates
to improve the performance of programs with arithmetic (in)equalities.
Our approach employs a source-to-source program transformation which
approximates the propagation technique from Constraint Programming.
The experimental evaluation of the approach shows good run time speed-
ups on a range of non-recursive as well as recursive programs. Further-
more, our technique improves upon the previously reported in the liter-
ature constraint magic set transformation approach.
"More efficient Datalog queries:
Subsumptive tabling beats magic sets", , and . In SIGMOD/PODS 2011:
2011 International Conference on Management of
Data, June 2011.
Given a set of Datalog rules, facts, and a query,
answers to the query can be inferred bottom-up
starting with the facts or top-down starting with
the query. The dominant strategies to improve the
performance of answering queries are reusing
answers to subqueries for top-down methods, and
transforming rules based on demand from the query,
such as the well-known magic sets transformation,
for bottomup methods. However, the performance of
these strategies vary drastically, and the most
effective method remained unknown.
This paper describes precise time and space
complexity analysis for efficient implementation
of Datalog queries using subsumptive tabling, a
top-down evaluation method with more reuse of
answers than the dominant tabling strategy, and
shows that subsumptive tabling beats bottom-up
evaluation of rules after magic sets
transformation in both time and space
complexities. It also describes subsumptive demand
transformation, a novel method for transforming
the rules so that bottom-up evaluation of the
transformed rules mimics subsumptive tabling; we
show that the time complexity of bottom-up
evaluation after this transformation is equal to
the the time complexity of top-down evaluation
with subsumptive tabling. The paper further
describes subsumption optimization, an
optimization to increase the use of subsumption in
subsumptive methods, and shows its application in
the derivation of a well-known demand-driven
pointer analysis algorithm. We support our
analyses and comparisons through experiments with
applications in ontology queries and program
analysis.
"Taming the Wildcards: Combining Definition- and Use-Site Variance",
,
, and
. In
PLDI 2011: ACM SIGPLAN Conference on Programming Language Design and Implementation.
Variance allows the safe integration of parametric and subtype
polymorphism. Two flavors of variance, definition-site versus use-
site variance, have been studied and have had their merits hotly
debated. Definition-site variance (as in Scala and C#) offers simple
type-instantiation rules, but causes fractured definitions of naturally
invariant classes; Use-site variance (as in Java) offers simplicity
in class definitions, yet complex type-instantiation rules that elude
most programmers.
We present a unifying framework for reasoning about variance.
Our framework is quite simple and entirely denotational, that is, it
evokes directly the definition of variance with a small core calculus
that does not depend on specific type systems. This general frame-
work can have multiple applications to combine the best of both
worlds: for instance, it can be used to add use-site variance anno-
tations to the Scala type system. We show one such application in
detail: we extend the Java type system with a mechanism that mod-
ularly infers the definition-site variance of type parameters, while
allowing use-site variance annotations on any type-instantiation.
Applying our technique to six Java generic libraries (including
the Java core library) shows that 20-58% (depending on the library)
of generic definitions are inferred to have single-variance; 8-63% of
method signatures can be relaxed through this inference, and up to
91% of existing wildcard annotations are unnecessary and can be
elided.
"Pick Your Contexts Well: Understanding Object-Sensitivity",
,
, and
. In
POPL 2011: ACM SIGACT/SIGPLAN Symposium on Principles of Programming Languages.
Object-sensitivity has emerged as an excellent context abstraction
for points-to analysis in object-oriented languages. Despite its practical
success, however, object-sensitivity is poorly understood. For
instance, for a context depth of 2 or higher, past scalable implementations
deviate quite significantly from the original definition of an
object-sensitive analysis. The reason is that the analysis has many
degrees of freedom, relating to which context elements are picked
at every method call and object creation. We offer a clean model
for the analysis design space, and discuss a formal and informal understanding
of object-sensitivity and of how to create good objectsensitive
analyses. The results are surprising in their extent. We
find that past implementations have made a sub-optimal choice of
contexts, to the severe detriment of precision and performance. We
define a âfull-object-sensitiveâ analysis that results in significantly
higher precision, and often performance, for the exact same context
depth. We also introduce âtype-sensitivityâ as an explicit approximation
of object-sensitivity that preserves high context quality
at substantially reduced cost. A type-sensitive points-to analysis
makes an unconventional use of types as context: the context types
are not dynamic types of objects involved in the analysis, but instead
upper bounds on the dynamic types of their allocator objects.
Our results expose the influence of context choice on the quality
of points-to analysis and demonstrate type-sensitivity to be an idea
with major impact: It decisively advances the state-of-the-art with
a spectrum of analyses that simultaneously enjoy speed (several
times faster than an analogous object-sensitive analysis), high scalability
(comparable to analyses with much less context-sensitivity),
and precision (comparable to the best object-sensitive analysis with
the same context depth).
2010
"SecureBlox: Customizable Secure Distributed Data Processing",
,
,
,
,
, and
In
SIGMOD/PODS 2010: 2010 International Conference on Management of Data,
June 2010.
We present SecureBlox, a declarative system that unifies a distributed query processor with a security policy framework. SecureBlox decouples security concerns from system specification, allowing easy reconfiguration of a system's security properties to suit a given execution environment. Our implementation of SecureBlox is a series of extensions to LogicBlox, an emerging commercial Datalog-based platform for enterprise software systems. SecureBlox enhances LogicBlox to enable distribution and static meta-programmability, and makes novel use of existing LogicBlox features such as integrity constraints. SecureBlox allows meta-programmability via BloxGenerics - a language extension for compile-time code generation based on the security requirements and trust policies of the deployed environment. We present and evaluate detailed use-cases in which SecureBlox enables diverse applications, including an authenticated declarative routing protocol with encrypted advertisements and an authenticated and encrypted parallel hash join operation. Our results demonstrate SecureBlox's abilities to specify and implement a wide range of different security constructs for distributed systems as well as to enable tradeoffs between performance and security.
"Querying Data Provenance",
,
, and
. In
SIGMOD/PODS 2010: 2010 International Conference on Management of Data,
June 2010.
Many advanced data management operations (e.g., incremental maintenance, trust assessment, debugging schema mappings, keyword search over databases, or query answering in probabilistic databases), involve computations that look at how a tuple was produced, e.g., to determine its score or existence. This requires answers to queries such as, "Is this data derivable from trusted tuples?"; "What tuples are derived from this relation?"; or "What score should this answer receive, given initial scores of the base tuples?". Such questions can be answered by consulting the provenance of query results.
In recent years there has been significant progress on formal models for provenance. However, the issues of provenance storage, maintenance, and querying have not yet been addressed in an application-independent way. In this paper, we adopt the most general formalism for tuple-based provenance, semiring provenance. We develop a query language for provenance, which can express all of the aforementioned types of queries, as well as many more; we propose storage, processing and indexing schemes for data provenance in support of these queries; and we experimentally validate the feasibility of provenance querying and the benefits of our indexing techniques across a variety of application classes and queries.
"Morphing: Structurally Shaping a Class by Reflecting on Others",
, and
. To Appear In
TOPLAS: Transactions on Programming Languages and Systems.
Show Abstract
Hide Abstract
We present MorphJ: a language for specifying general classes whose
members are produced by iterating over members of other classes. We
call this technique "class morphing" or just "morphing". Morphing
extends the notion of genericity so that not only types of methods and
fields, but also the structure of a class can vary according to
type variables. This adds a disciplined form of meta-programming to
mainstream languages and allows expressing common programming patterns
in a highly generic way that is otherwise not supported by
conventional techniques. For instance, morphing lets us write generic
proxies (i.e., classes that can be parameterized with another class
and export the same public methods as that class); default
implementations (e.g., a generic do-nothing type, configurable for any
interface); semantic extensions (e.g., specialized behavior for
methods that declare a certain annotation); and more. MorphJ's
hallmark feature is that, despite its emphasis on generality, it
allows modular type checking: a MorphJ class can be checked
independently of its uses. Thus, the possibility of supplying a type
parameter that will lead to invalid code is detected early---an
invaluable feature for highly general components that will be
statically instantiated by other programmers. We demonstrate the
benefits of morphing with several examples, including a MorphJ
reimplementation of DSTM2, a software transactional memory library,
which reduces 1,484
lines of Java reflection and bytecode engineering library calls to
just 586
lines of MorphJ code.
2009
"Automated Test Input Generation for Software That Consumes ORM Models",
,
, and
.
In On the Move to Meaningful Internet Systems: OTM 2009, Confederated International Conferences, CoopIS, DOA, IS, and ODBASE 2009, November 2009.
Show Abstract
Hide Abstract
Software tools that analyze and generate code from ORM conceptual schemas are highly susceptible to feature interaction bugs. When testing such tools, test suites are needed that cover many combinations of features, including combinations that rarely occur in practice. Manually creating such a test suite is extremely labor-intensive, and the tester may fail to cover feasible feature combinations that are counter-intuitive or that rarely occur. This paper describes ATIG, a prototype tool for automatically generating test suites that cover diverse combinations of ORM features. ATIG makes use of combinatorial testing to optimize coverage of select feature combinations within constraints imposed by the need to keep the sizes of test suites manageable. We have applied ATIG to generate test inputs for an industrial strength ORM-to-Datalog code generator. Initial results suggest that it is useful for finding feature interaction errors in tools that operate on ORM models.
"Strictly Declarative Specification of Sophisticated Points-to Analyses",
, and
. In
OOPSLA '09: Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2009
We present the Doop framework for points-to analysis of Java programs. Doop builds on the idea of specifying pointer analysis algorithms declaratively, using Datalog: a logic-based language for defining (recursive) relations. We carry the declarative approach further than past work by describing the full end-to-end analysis in Datalog and optimizing aggressively using a novel technique that takes into account Datalog incremental evaluation.
"Exception Analysis and Points-to Analysis: Better Together",
, and
. In
ISSTA '09: 2009 International Symposium on Software Testing and Analysis,
July 2009.
Exception analysis and points-to analysis are typically done in complete separation. Past algorithms for precise exception analysis (e.g., pairing throw clauses with catch statements) use pre-computed points-to information. Past points-to analyses either unsoundly ignore exceptions, or conservatively compute a crude approximation of exception throwing (e.g., considering an exception throw as an assignment to a global variable, accessible from any catch clause).
We show that this separation results in significant slowdowns or vast imprecision. The two kinds of analyses are interdependent: neither can be performed accurately without the other. The interdependency leads us to propose a joint handling for performance and precision. We show that our exception analysis is expressible highly elegantly in a declarative form, and can apply to points-to analyses of varying precision. In fact, our specification of exception analysis is ``fully precise'', as it models closely the Java exception handling semantics. The necessary approximation is provided only through whichever abstractions are used for contexts and objects in the base points-to analysis.
Our combined approach achieves similar precision relative to exceptions (exception-catch links) as the best past precise exception analysis, with a runtime of seconds instead of tens of minutes. At the same time, our analysis achieves much higher precision of points-to information (an average of half as many values for each reachable variable for most of the DaCapo benchmarks) than points-to analyses that treat exceptions conservatively, all at a fraction of the execution time.
2008
"Relating Constraint Handling Rules to Datalog",
,
,
and
Molham Aref.
In Fifth Workshop on Constraint Handling Rules,
July 2008
Show Abstract
Hide Abstract
DatalogLB is an extension of Datalog supporting global stratification
of negation and functional dependencies, designed for use in industrial-scale decision
automation applications. Constraint Handling Rules (CHR) is a declarative
rule-based programming language, particularly suitable for specifying custom
constraint solvers at a high level. Our goal is to enhance DatalogLB with
CHR-like capabilities in order to improve its expressive power and open it to
specification of general-purpose constraint solvers for industrial applications. In
this paper we relate the two formalisms and define a translation of a significant
class of CHR programs into DatalogLB. It turns out that the translation enables
reasoning about the properties of CHR programs at a high level of Datalog logic.
"Experience Report: Using Tools and Domain Expertise to Remediate Architectural Violations in the LogicBlox Software Base",
,
,
and
. In
ICSE '09: Proceedings of the IEEE International Conference on Software Engineering, May 2009
Show Abstract
Hide Abstract
When modeling the architecture of an existing software system, developers often find inconsistencies between the conceptual and the as-built architecture. To impose the conceptual view on the code often involves large refactoring to remediate architectural violations. This paper reports our experience applying large refactoring to remediate an architectural violation in LogicBlox, a large, multi-language multi-platform system. We used DSM-based analysis in conjunction with a suite of code analysis tools to identify and effect large refactorings. A key insight of this experience is the value of automatically generated proto-interfaces, which may help experts identify standard abstractions around which to structure the refactoring effort.We contribute a process for refactoring that includes the generation of proto-interfaces and the explicit inclusion of expert review.