Introduction
Introduction
- Flashfill already exists
- programming by example techniques for learning data transformations
- semi-supervised learning to reduce ambiguity
-
InputDataGraph to represent large set of logical patterns shared across input data
- challenges with flash fill:
- needs to store lots of possible programs
- slow to operate on larger strings
- BlinkFill learns a graph structure which can “encode the set of all logical
structured shared across a set of column values”
- sub-paths in the graph to learn and disambiguate between a large number of DSL expressions.
- DSL of BlinkFill does not support conditionals/loops but supports more substring extraction tasks
DSL
- two token types:
- regex tokens
- ProperCase, CAPS, lowercase, Digits, Alphabets, Alphanumeric, Whitespace, StartT, endT, ProperCaseWithSpaces, CAPSWSpaces, lowercaseWSpaces, AlphabetsWSpaces
- constant string tokens
- regex tokens
- version space algebra (VSA) - idea to succinctly represent an exponential number of programs in polynomial space
- viewed as a DAG
- leaf nodes programs
- intermediate union nodes as set-union of children VSAs
- intermediate join nodes as VSAs with k-ary function F with k inputs
- viewed as a DAG
Approach and InputDataGraphs
- goal is to find a program which satisfies all input output examples AND
- the subexpresions of the program P are contained within a data structure which holds a set of subexpressions derived from the input dataset and the DSL
- nodes in IDG correspond to different indices within a set of strings
- indices represented with a labeling function
- algorithm computes IDG for each example, then unions examples into a single graph containing the common sub expressions. -
Questions
Summary
The authors of BlinkFill present an improvement to the FlashFill predecessor introduced in Microsoft Excel. The authors devise an interesting approach to compressing the total program space utilized during the learning phase. As the amount of input data grows, they are able to contain the graph size to a much smaller size. This helps them improve the speed of learning the synthesis program and enables an efficient graph search with dijkstra’s to pick the most likely program candidate once the initial set of programs is generated.
Strengths
- This work shows that novel methods and techniques can be used to further improve the performance of particular synthesis problems
- The authors develop a strong mathematical model for their DAG representation
Weaknesses
The main weaknesses of BlinkFill is that is it constrained to string manipulation programs, and even then, maintains a limited set of possible regex operations. It does not support conditionals or loops like its predecessor FlashFill which potentially limits its use cases depending on how complicated the data transformations may be
What does BlinkFill use as behavioral constraints? Structural constraints? Search strategy?
- Behavioral
- input/output examples
- Structural
- Regex subset + string constant DSL
- Search Strategy
- bottom up approach guided by the InputDataGraph, then highest ranked program chosen (searches DAG for shortest path with a modified Dijkstra’s)
What is the main technical insight of BlinkFill wrt FlashFill?
The main technical insight with BlinkFill is the ability to represent the input
space in a compressed format as the InputDataGaph
and once the possible
programs are generated from the InputDataGraph
a most likely candidate is
chosen by ranking the possible programs which reduces the number of examples
required to synthesize a correct program.
-
Write a program in the BlinkFill DSL (L_s) that extracts conference acronyms and years; the program should satisfy the following examples:
> "Programming Language Design and Implementation (PLDI), 2019, Phoenix AZ" -> "PLDI 2019"
> "Principles of Programming Languages (POPL), 2020, New Orleans LA" -> "POPL 2020",
Concat(SubStr(Position("(", 1, Start), Position(")", 1, Start)), SubStr(Position(", ", 1, Start), Position(",", 2, Start)))
As described in the paper, the position expressions of the BlinkFill DSL only support matching a single token from Table 1, e.g. C. Could we extend the algorithm to support sequences of tokens, e.g. C ws $ to match caps followed by whitespace follows by end of string? How would this change the construction and intersection of input data graphs (use Fig 6 as an example).
I think it is possible, but would be difficult and likely affect the performance benefits of BlinkFill. If the DSL supported compound regex formulas I presume that the IDG would probably be much larger in size as you would likely need to express exponentially more edges since the combination of operators increases the search space of the synthesis program