Publications

2011

"Scalable Analysis of Conceptual Data Models", Matthew McGill, Laura Dillon, and Kurt Stirewalt. In ISSTA 2011: 2011 International Symposium on Software Testing and Analysis, July 2011.
PDF Show Abstract Hide Abstract
Conceptual data models describe information systems without the burden of implementation details, and are increasingly used to generate code. They could also be analyzed for consistency and to generate test data except that the expressive constraints supported by popular modeling notations make such analysis intractable. In an earlier empirical study of conceptual models created at LogicBlox Inc., Smaragdakis, Csallner, and Subramanian found that a restricted subset of ORM, called ORM-, includes the vast majority of constraints used in practice and, moreover, allows scalable analysis. After that study, however, LogicBlox Inc. obtained a new ORM modeling tool, which supports discovery and specification of more complex constraints than the previous tool. We report findings of a follow-up study of models constructed using the more powerful tool. Our study finds that LogicBlox developers increasingly rely on a small number of features not in the ORM- subset. We extend ORM- with support for two of them: objectification and a restricted class of external uniqueness constraints. The extensions significantly improve our ability to analyze the ORM models created by developers using the new tool. We also show that a recent change to ORM has rendered the original ORM- algorithms unsound, in general; but that an efficient test suffices to show that these algorithms are in fact sound for the ORM- constraints appearing in any of the models currently in use at LogicBlox.
"Approximating Constraint Propagation in Datalog", Dario Campagna, Beatta Sarna-Starosta, and Tom Schrijvers. In CICLOPS 2011: 11th International Colloquium on Implementation of Constraint and LOgic Programming Systems, July 2011.
PDF Show Abstract Hide Abstract
We present a technique exploiting Datalog with aggregates to improve the performance of programs with arithmetic (in)equalities. Our approach employs a source-to-source program transformation which approximates the propagation technique from Constraint Programming. The experimental evaluation of the approach shows good run time speed- ups on a range of non-recursive as well as recursive programs. Further- more, our technique improves upon the previously reported in the liter- ature constraint magic set transformation approach.
"More efficient Datalog queries: Subsumptive tabling beats magic sets", Tuncay Tekle, and Yanhong A. Liu. In SIGMOD/PODS 2011: 2011 International Conference on Management of Data, June 2011.
PDF Show Abstract Hide Abstract
Given a set of Datalog rules, facts, and a query, answers to the query can be inferred bottom-up starting with the facts or top-down starting with the query. The dominant strategies to improve the performance of answering queries are reusing answers to subqueries for top-down methods, and transforming rules based on demand from the query, such as the well-known magic sets transformation, for bottomup methods. However, the performance of these strategies vary drastically, and the most effective method remained unknown.

This paper describes precise time and space complexity analysis for efficient implementation of Datalog queries using subsumptive tabling, a top-down evaluation method with more reuse of answers than the dominant tabling strategy, and shows that subsumptive tabling beats bottom-up evaluation of rules after magic sets transformation in both time and space complexities. It also describes subsumptive demand transformation, a novel method for transforming the rules so that bottom-up evaluation of the transformed rules mimics subsumptive tabling; we show that the time complexity of bottom-up evaluation after this transformation is equal to the the time complexity of top-down evaluation with subsumptive tabling. The paper further describes subsumption optimization, an optimization to increase the use of subsumption in subsumptive methods, and shows its application in the derivation of a well-known demand-driven pointer analysis algorithm. We support our analyses and comparisons through experiments with applications in ontology queries and program analysis.
"Taming the Wildcards: Combining Definition- and Use-Site Variance", John Altidor, Shan Shan Huang, and Yannis Smaragdakis. In PLDI 2011: ACM SIGPLAN Conference on Programming Language Design and Implementation.
PDF Show Abstract Hide Abstract
Variance allows the safe integration of parametric and subtype polymorphism. Two flavors of variance, definition-site versus use- site variance, have been studied and have had their merits hotly debated. Definition-site variance (as in Scala and C#) offers simple type-instantiation rules, but causes fractured definitions of naturally invariant classes; Use-site variance (as in Java) offers simplicity in class definitions, yet complex type-instantiation rules that elude most programmers.

We present a unifying framework for reasoning about variance. Our framework is quite simple and entirely denotational, that is, it evokes directly the definition of variance with a small core calculus that does not depend on specific type systems. This general frame- work can have multiple applications to combine the best of both worlds: for instance, it can be used to add use-site variance anno- tations to the Scala type system. We show one such application in detail: we extend the Java type system with a mechanism that mod- ularly infers the definition-site variance of type parameters, while allowing use-site variance annotations on any type-instantiation. Applying our technique to six Java generic libraries (including the Java core library) shows that 20-58% (depending on the library) of generic definitions are inferred to have single-variance; 8-63% of method signatures can be relaxed through this inference, and up to 91% of existing wildcard annotations are unnecessary and can be elided.
"Pick Your Contexts Well: Understanding Object-Sensitivity", Yannis Smaragdakis, Martin Bravenboer, and Ondrej Lhotak. In POPL 2011: ACM SIGACT/SIGPLAN Symposium on Principles of Programming Languages.
PDF Show Abstract Hide Abstract
Object-sensitivity has emerged as an excellent context abstraction for points-to analysis in object-oriented languages. Despite its practical success, however, object-sensitivity is poorly understood. For instance, for a context depth of 2 or higher, past scalable implementations deviate quite significantly from the original definition of an object-sensitive analysis. The reason is that the analysis has many degrees of freedom, relating to which context elements are picked at every method call and object creation. We offer a clean model for the analysis design space, and discuss a formal and informal understanding of object-sensitivity and of how to create good objectsensitive analyses. The results are surprising in their extent. We find that past implementations have made a sub-optimal choice of contexts, to the severe detriment of precision and performance. We define a âfull-object-sensitiveâ analysis that results in significantly higher precision, and often performance, for the exact same context depth. We also introduce âtype-sensitivityâ as an explicit approximation of object-sensitivity that preserves high context quality at substantially reduced cost. A type-sensitive points-to analysis makes an unconventional use of types as context: the context types are not dynamic types of objects involved in the analysis, but instead upper bounds on the dynamic types of their allocator objects. Our results expose the influence of context choice on the quality of points-to analysis and demonstrate type-sensitivity to be an idea with major impact: It decisively advances the state-of-the-art with a spectrum of analyses that simultaneously enjoy speed (several times faster than an analogous object-sensitive analysis), high scalability (comparable to analyses with much less context-sensitivity), and precision (comparable to the best object-sensitive analysis with the same context depth).

2010

"SecureBlox: Customizable Secure Distributed Data Processing", William Marczak, Shan Shan Huang, Martin Bravenboer, Micah Sherr, Boon Thau Loo, and Molham Aref In SIGMOD/PODS 2010: 2010 International Conference on Management of Data, June 2010.
PDF Show Abstract Hide Abstract
We present SecureBlox, a declarative system that unifies a distributed query processor with a security policy framework. SecureBlox decouples security concerns from system specification, allowing easy reconfiguration of a system's security properties to suit a given execution environment. Our implementation of SecureBlox is a series of extensions to LogicBlox, an emerging commercial Datalog-based platform for enterprise software systems. SecureBlox enhances LogicBlox to enable distribution and static meta-programmability, and makes novel use of existing LogicBlox features such as integrity constraints. SecureBlox allows meta-programmability via BloxGenerics - a language extension for compile-time code generation based on the security requirements and trust policies of the deployed environment. We present and evaluate detailed use-cases in which SecureBlox enables diverse applications, including an authenticated declarative routing protocol with encrypted advertisements and an authenticated and encrypted parallel hash join operation. Our results demonstrate SecureBlox's abilities to specify and implement a wide range of different security constructs for distributed systems as well as to enable tradeoffs between performance and security.
"Querying Data Provenance", Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. In SIGMOD/PODS 2010: 2010 International Conference on Management of Data, June 2010.
PDF Show Abstract Hide Abstract
Many advanced data management operations (e.g., incremental maintenance, trust assessment, debugging schema mappings, keyword search over databases, or query answering in probabilistic databases), involve computations that look at how a tuple was produced, e.g., to determine its score or existence. This requires answers to queries such as, "Is this data derivable from trusted tuples?"; "What tuples are derived from this relation?"; or "What score should this answer receive, given initial scores of the base tuples?". Such questions can be answered by consulting the provenance of query results. In recent years there has been significant progress on formal models for provenance. However, the issues of provenance storage, maintenance, and querying have not yet been addressed in an application-independent way. In this paper, we adopt the most general formalism for tuple-based provenance, semiring provenance. We develop a query language for provenance, which can express all of the aforementioned types of queries, as well as many more; we propose storage, processing and indexing schemes for data provenance in support of these queries; and we experimentally validate the feasibility of provenance querying and the benefits of our indexing techniques across a variety of application classes and queries.
"Morphing: Structurally Shaping a Class by Reflecting on Others", Shan Shan Huang, and Yannis Smaragdakis. To Appear In TOPLAS: Transactions on Programming Languages and Systems.
Show Abstract Hide Abstract
We present MorphJ: a language for specifying general classes whose members are produced by iterating over members of other classes. We call this technique "class morphing" or just "morphing". Morphing extends the notion of genericity so that not only types of methods and fields, but also the structure of a class can vary according to type variables. This adds a disciplined form of meta-programming to mainstream languages and allows expressing common programming patterns in a highly generic way that is otherwise not supported by conventional techniques. For instance, morphing lets us write generic proxies (i.e., classes that can be parameterized with another class and export the same public methods as that class); default implementations (e.g., a generic do-nothing type, configurable for any interface); semantic extensions (e.g., specialized behavior for methods that declare a certain annotation); and more. MorphJ's hallmark feature is that, despite its emphasis on generality, it allows modular type checking: a MorphJ class can be checked independently of its uses. Thus, the possibility of supplying a type parameter that will lead to invalid code is detected early---an invaluable feature for highly general components that will be statically instantiated by other programmers. We demonstrate the benefits of morphing with several examples, including a MorphJ reimplementation of DSTM2, a software transactional memory library, which reduces 1,484 lines of Java reflection and bytecode engineering library calls to just 586 lines of MorphJ code.

2009

"Automated Test Input Generation for Software That Consumes ORM Models", Matthew J. Mcgill, Kurt Stirewalt, and Laura K. Dillon. In On the Move to Meaningful Internet Systems: OTM 2009, Confederated International Conferences, CoopIS, DOA, IS, and ODBASE 2009, November 2009.
Show Abstract Hide Abstract
Software tools that analyze and generate code from ORM conceptual schemas are highly susceptible to feature interaction bugs. When testing such tools, test suites are needed that cover many combinations of features, including combinations that rarely occur in practice. Manually creating such a test suite is extremely labor-intensive, and the tester may fail to cover feasible feature combinations that are counter-intuitive or that rarely occur. This paper describes ATIG, a prototype tool for automatically generating test suites that cover diverse combinations of ORM features. ATIG makes use of combinatorial testing to optimize coverage of select feature combinations within constraints imposed by the need to keep the sizes of test suites manageable. We have applied ATIG to generate test inputs for an industrial strength ORM-to-Datalog code generator. Initial results suggest that it is useful for finding feature interaction errors in tools that operate on ORM models.
"Strictly Declarative Specification of Sophisticated Points-to Analyses", Martin Bravenboer, and Yannis Smaragdakis. In OOPSLA '09: Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2009
PDF Show Abstract Hide Abstract
We present the Doop framework for points-to analysis of Java programs. Doop builds on the idea of specifying pointer analysis algorithms declaratively, using Datalog: a logic-based language for defining (recursive) relations. We carry the declarative approach further than past work by describing the full end-to-end analysis in Datalog and optimizing aggressively using a novel technique that takes into account Datalog incremental evaluation.
"Exception Analysis and Points-to Analysis: Better Together", Martin Bravenboer, and Yannis Smaragdakis. In ISSTA '09: 2009 International Symposium on Software Testing and Analysis, July 2009.
PDF Show Abstract Hide Abstract
Exception analysis and points-to analysis are typically done in complete separation. Past algorithms for precise exception analysis (e.g., pairing throw clauses with catch statements) use pre-computed points-to information. Past points-to analyses either unsoundly ignore exceptions, or conservatively compute a crude approximation of exception throwing (e.g., considering an exception throw as an assignment to a global variable, accessible from any catch clause). We show that this separation results in significant slowdowns or vast imprecision. The two kinds of analyses are interdependent: neither can be performed accurately without the other. The interdependency leads us to propose a joint handling for performance and precision. We show that our exception analysis is expressible highly elegantly in a declarative form, and can apply to points-to analyses of varying precision. In fact, our specification of exception analysis is ``fully precise'', as it models closely the Java exception handling semantics. The necessary approximation is provided only through whichever abstractions are used for contexts and objects in the base points-to analysis. Our combined approach achieves similar precision relative to exceptions (exception-catch links) as the best past precise exception analysis, with a runtime of seconds instead of tens of minutes. At the same time, our analysis achieves much higher precision of points-to information (an average of half as many values for each reachable variable for most of the DaCapo benchmarks) than points-to analyses that treat exceptions conservatively, all at a fraction of the execution time.

2008

"Relating Constraint Handling Rules to Datalog", Beata Sarna-Sarosta, David Zook, Emir Pasalic and Molham Aref. In Fifth Workshop on Constraint Handling Rules, July 2008
Show Abstract Hide Abstract
DatalogLB is an extension of Datalog supporting global stratification of negation and functional dependencies, designed for use in industrial-scale decision automation applications. Constraint Handling Rules (CHR) is a declarative rule-based programming language, particularly suitable for specifying custom constraint solvers at a high level. Our goal is to enhance DatalogLB with CHR-like capabilities in order to improve its expressive power and open it to specification of general-purpose constraint solvers for industrial applications. In this paper we relate the two formalisms and define a translation of a significant class of CHR programs into DatalogLB. It turns out that the translation enables reasoning about the properties of CHR programs at a high level of Datalog logic.
"Experience Report: Using Tools and Domain Expertise to Remediate Architectural Violations in the LogicBlox Software Base", Kurt Stirewalt, Spencer Rugaber, Hwa-You Hsu and David Zook. In ICSE '09: Proceedings of the IEEE International Conference on Software Engineering, May 2009
Show Abstract Hide Abstract
When modeling the architecture of an existing software system, developers often find inconsistencies between the conceptual and the as-built architecture. To impose the conceptual view on the code often involves large refactoring to remediate architectural violations. This paper reports our experience applying large refactoring to remediate an architectural violation in LogicBlox, a large, multi-language multi-platform system. We used DSM-based analysis in conjunction with a suite of code analysis tools to identify and effect large refactorings. A key insight of this experience is the value of automatically generated proto-interfaces, which may help experts identify standard abstractions around which to structure the refactoring effort.We contribute a process for refactoring that includes the generation of proto-interfaces and the explicit inclusion of expert review.