Last week I attended the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2017). The event was hosted at Paris 6 which is part of the Sorbonne, University of Paris. It was one of the best POPLs I can remember. The papers are both interesting and informative (you can read them all, for free, from the Open TOC), and the talks I attended were generally of very high quality. (Videos of the talks will be available soon—I will add links to this post.) Also, the attendance hit an all-time high: more than 720 people registered for POPL and/or one of its co-located events.
In this blog post I will highlight a few of my favorite papers at this POPL, as well as the co-located N40AI event, which celebrated 40 years of abstract interpretation. Disclaimer: I do not have time to describe all of the great stuff I saw, and I could only see a fraction of the whole event. So just because I don’t mention something here doesn’t mean it isn’t equally great. 1
Fast Polyhedra Abstract Domain
A significant portion of the POPL program covered the area of static program analysis. My favorite paper in this category was
Fast Polyhedra Abstract Domain, written by Gagandeep Singh, Markus Puschel, and Martin Vechev, from ETH Zurich. The paper presents a new implementation of the convex polyhedra abstract domain used in numeric static program analysis. Such an analysis considers the possible numeric values that variables could take on during a program’s execution. The analysis might do this to, for example, check whether indices of an array are always within bounds to thus ensure the absence of buffer overflow vulnerabilities.
Several numeric abstract domains have been designed, and several implementations of each are available (e.g., in the popular APRON and PPL libraries). The most precise domain, but the most expensive, is Convex Polyhedra (CP). In existing implementations, CP is significantly slower than the least expensive domain, Intervals. As such, many analyses have tended to avoid using CP.
Singh et al’s paper presents a new implementation of CP that performs significantly better than existing implementations. Quoting the abstract:
Our experimental results demonstrate massive gains in both space and
time: we show end-to-end speedups of two to five orders of magnitude
compared to state-of-the-art Polyhedra implementations as well as
significant memory gains, on all larger benchmarks. In fact, in many
cases our analysis terminates in seconds where prior code runs out
of memory or times out after 4 hours.
(Emphasis was mine.) There are two key elements of their approach. First, the authors use a hybrid representation of the underlying domain, rather than a single representation. Second, they maintain a separate polyhedron for each set of related variables, rather than a single polyhedron for all variables.
Elaborating on the first idea: There are two implementations of convex polyhedra that are popular, the “constraint” approach, and the “generator” approach. These two approaches differ in the running time of different operations. Notably, operations that are exponential in one implementation are polynomial in the other. As such, we might implement the domain to switch from using one representation when an exponential operation is polynomial in the other. I understand that APRON does this. But converting between representations is itself an exponential operation, so it’s not a panacea.
The second idea is where the rubber hits the road, apparently. Many times, variables used in a program have independent values; e.g., because the program does not compare them to one another directly or transitively. When too many variables are all put in a single polyhedron, however, the result is bad performance. The paper proposes to separately maintain so-called permissable decompositions of variables so that a separate polyhedron is maintained for each set of related variables, at no loss to precision. Having three small polyhedra rather than one big one helps overall costs. Moreover, it apparently also underpins more efficient conversions between representations.
The dramatic performance improvements are exciting. An algorithmic revolution in SAT solving ultimately resulted in orders-of-magnitude performance improvements, which prompted a renaissance in program analysis and program synthesis tools that target SAT. Perhaps a similar revolution is in store for numeric static analysis.
Since it was introduced by Dwork, McSherry, Nissim, and Smith in 2006, the concept of differential privacy (DP) has underpinned a substantial amount of research on data privacy. The motivation of DP is to ensure that computations over a collection of individuals’ data do not reveal too much about any single individual. Both Apple and Google have started using DP in some of their data-driven products (in RAPPOR, and various end-user predictive software, respectively).
Several research projects have considered programming languages or systems that ensure DP by construction, e.g., the Fuzz programming language and the Airavat framework for MapReduce computations. However, these systems cannot express sophisticated DP algorithms that boast both private and precise outcomes. For example, neither system could express Dwork and Roth’s Sparse Vector analysis algorithm.
Zhang and Kifer’s LightDP fills this gap. The approach’s first key element is a type system that captures the connection between two runs of the program, written in a fairly standard imperative language. 2 These runs are over “adjacent” inputs, e.g., those that differ because one contains a particular individual’s data while the other does not. LightDP’s second key element is a program transformation that takes a correctly typed program and extracts calculations of its impact on the overall privacy budget. This is, roughly, the difference in probability of a particular outcome of the program depending on which adjacent input is used. The privacy budget calculations (which account for the program’s use of noise to obscure the outcome) can be solved by an off-the-shelf model checker. The paper shows that LightDP can be used to prove that the Sparse Vector Method, and several other sophisticated algorithms, can be expressed (and verified) with LightDP with minimal programmer effort.
The use of relational typing to drive a program transformation that exposes the impact on privacy budget is really intriguing to me. I’m excited to explore how the approach might be extended to prove leakage bounds on cryptographic algorithms. I am particularly interested in those for oblivious data structures that also rely on randomness to make secret-dependent outcomes hard to distinguish.
Type Soundness Proofs with Definitional Interpreters
Small-step operational semantics express individual steps of a program’s execution, while big-step semantics express entire executions. Type soundness proofs on big-step semantics tend to be simpler to construct, but conventional wisdom has it that such proofs do not cover non-terminating executions (at least not without a co-inductive interpretation of the operational rules). But, as pointed out in Amin and Rompf’s POPL paper, Type Soundness Proofs with Definitional Interpreters, Jeremy Siek previously showed that it can be done for the simply-typed lambda calculus (STLC) with three easy lemmas. And these lemmas can be proved with standard structural induction—no coinduction required. The paper title’s mention of a definitional interpreter comes from the fact that big-step semantics can be viewed as a (definitional) interpreter, expressed as a recursive function.
The key idea of Siek’s technique is to extend the semantics with a fuel argument. Each recursive call to the interpreter deducts one from the fuel. If the fuel hits zero, there is a “timeout.” If a property (such as type soundness) is proved for all executions of fuel n, for all n, then it must hold for all finite executions. Specifically, infinite executions are by definition not “stuck,” which is what type soundness is ultimately trying to show. Amin and Rompf’s contribution is to show that this idea scales beyond simple languages like STLC to far more sophisticated languages involving subtyping, polymorphism, and mutable references. In particular, they use this technique to prove correct (with a mechanization in Coq) the Dependent Object Type (DOT) calculus which underpins the type system of Scala.
I had seen the idea of indexing an interpreter function with a “fuel” argument in Pierce’s Software Foundations chapter extension on the IMP imperative language. In that case the index was needed to prove the semantics function terminates, as required by Coq. The use of fuel also resembles step-indexing a logical relation (e.g., per Amed) and indexing a semantics with costs (e.g., per Hoffman’s “potentials” for resource aware ML). Despite knowing of these past works, I had not made the leap that one could similarly use fuel to simplify type soundness proofs, and I had not seen the technique in other papers I’ve read.
Now I am hoping to retrofit some ongoing work to use big-step rather than small-step semantics, and to thereby simplify the currently delicate and complicated proofs. One thing I’ve found I don’t like, so far, is that the semantics is expressed as a Coq function in the paper, while the type system is expressed with inference rules; there is a disconnect in notation between the two. I am experimenting with ways of bringing them more in sync.
In general, I’d be pleased to see more papers (or monographs, blog posts, essays, pearls, etc.) on best practices for mechanizing mathematics. Such practices can ultimately be transformative in helping the community move forward faster, just as Wright and Felleisen’s original paper on syntactic type soundness has done over the last 20+ years.
40 years of AI
My final day in Paris was spent at the celebration of 40 years of Abstract Interpretation (counted since the publication of Cousot and Cousot’s seminal paper introducing abstract interpretation at POPL’77). Overall the talks were really interesting; the organizers (listed on the linked page) did a fantastic job! My two favorite talks of the day were Francesco Logozzo‘s talk about the use of abstract interpretation at Facebook, and Patrick Cousot‘s retrospective on the origins and development of abstract interpretation, and what might be in store.
Francesco left Microsoft Research a couple of years ago to work on static analysis tools at Facebook. In his talk, he reported on a static taint analysis tool for PHP programs that now regularly runs on millions of lines of Facebook code. Incredibly, the tool returns very precise results in 25 minutes when run on a 48-core machine with 256 GB. Francesco’s exhortation to the attendees was to focus on what makes his analysis work: modular, parallelizable designs, to emphasize scalability. To make sure you are doing that, he suggested forcing grad students to build their analyses out of 100 Raspberry Pis, rather than using a proper computer! I was excited to hear that Facebook’s taint analysis code may be released open-source, and that Francesco is preparing a paper about the analysis.
Patrick’s talk was really interesting. It’s rare that we get to learn about the origins of a scientific discovery; usually, we just see the final result. The road to the development of abstract interpretation occurred over many years, so that quite a lot of work was covered in that first paper. Patrick highlighted key ideas along the way (several of which were due to the late Radhia Cousot while she was a student), and key moments of encouragement by fellow scientists. The latter half of his talk showed how ideas suggested in that first paper, or in his thesis, eventually came to fruition in papers published 10, 20, or even more years later. It was very inspiring to see how such a progressive research agenda has unfolded over time. One interesting point that came up in the talk was the fact that Patrick has never published a book about abstract interpretation. He now feels this was a mistake; 3 I wonder if that means a book is in the offing …
In all, I really enjoyed my time in Paris at POPL. It has galvanized me to do more reading, thinking, and writing! And I suspect the same is true for the many others who attended.
- I also attended PLMW just before POPL, and gave a talk. I may discuss that in another blog post. ↩
- In general, reasoning about multiple executions is important for capturing how one execution might reveal more information than another (or none at all, per the noninterference property). ↩
- He said that he felt if he published a book it might stymie new ideas—everyone might focus on his presentation rather than consider new directions. ↩