This HTML page was converted from the original PDF using Google Gemini 1.5 Pro Experimental gemini-1.5-pro-exp-0801
- the prompt was "Convert this document to neatly styled semantic HTML". More details on my blog.
The conversion is missing some illustrations and tables and has not been verified as correct against the PDF.
Jeff Shute
Google, Inc.
Shannon Bales
Google, Inc.
Matthew Brown
Google, Inc.
Jean-Daniel Browne
Google, Inc.
Brandon Dolphin
Google, Inc.
Romit Kudtarkar
Google, Inc.
Andrey Litvinov
Google, Inc.
Jingchi Ma
Google, Inc.
John Morcos
Google, Inc.
Michael Shen
Google, Inc.
David Wilhite
Google, Inc.
Xi Wu
Google, Inc.
Lulan Yu
Google, Inc.
sql-pipes-paper@google.com
SQL has been extremely successful as the de facto standard language for working with data. Virtually all mainstream database-like systems use SQL as their primary query language. But SQL is an old language with significant design problems, making it difficult to learn, difficult to use, and difficult to extend. Many have observed these challenges with SQL, and proposed solutions involving new languages. New language adoption is a significant obstacle for users, and none of the potential replacements have been successful enough to displace SQL.
In GoogleSQL, we've taken a different approach - solving SQL's problems by extending SQL. Inspired by a pattern that works well in other modern data languages, we added piped data flow syntax to SQL. The results are transformative - SQL becomes a flexible language that's easier to learn, use and extend, while still leveraging the existing SQL ecosystem and existing userbase. Improving SQL from within allows incrementally adopting new features, without migrations and without learning a new language, making this a more productive approach to improve on standard SQL.
Jeff Shute, Shannon Bales, Matthew Brown, Jean-Daniel Browne, Brandon Dolphin, Romit Kudtarkar, Andrey Litvinov, Jingchi Ma, John Morcos, Michael Shen, David Wilhite, Xi Wu, and Lulan Yu. SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL. PVLDB, 17(12): 4051–4063, 2024. doi:10.14778/3685800.3685826
SQL has been tremendously successful, standing the test of time for 50 years [17] as the primary language for structured data processing, supported near-universally across databases and query tools.
SQL is not an easy language to learn or use. Even for expert users, SQL is challenging to read, write and work with, which hurts user productivity. Several alternative languages have been proposed, but none have gained widespread adoption or displaced SQL. Migrating away from existing SQL ecosystems is expensive and generally unappealing for users.
This paper presents a different approach. After describing the most critical problems with the SQL language, we present a solution - adding pipe-structured data flow syntax to SQL. This makes SQL more flexible, extensible and easy to use. This paradigm works well in other languages like Kusto's KQL [5] and in APIs like Apache Beam[1]. We show pipe syntax can be added to SQL too, without removing anything, and while maintaining full backwards compatibility and interoperability.
In SQL, the standard clauses occur in one rigidly defined order. Expressing anything else requires subqueries or other workarounds. With pipe syntax, operations can be composed arbitrarily, in any order. This increases flexibility, radically simplifies the user experience, and enables clean language extension.
For example, standard SQL cannot express multi-level aggregations without subqueries, resulting in queries with complex "inside-out" data flow. This is query 13 from the TPC-H benchmark:
SELECT c_count, COUNT(*) AS custdist
FROM
(SELECT c_custkey, COUNT (o_orderkey) c_count
FROM customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
With pipe syntax, equivalent logic can be expressed sequentially, applying operators top-to-bottom in arbitrary orders.
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT (o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
Stonebraker writes in [13] (via [15]):
My biggest complaint about System R is that the team never stopped to clean up SQL... All the annoying features of the language have endured to this day. SQL will be the COBOL of 2020, a language we are stuck with that everybody will complain about.
It's been 50 years. It's time to clean up SQL. This paper shows one promising approach.
We've implemented pipe syntax in GoogleSQL[19], the SQL dialect and implementation shared across all¹ SQL systems at Google including F1[28], BigQuery[2], Spanner [19] and Procella [18], and the open-source release, ZetaSQL[12]. GoogleSQL is a shared, reusable component, enabling many systems to share the same SQL dialect. This shared component allowed implementing pipe syntax in one place and then enabling it across many products.
Many critiques of SQL have been written, from 1984[21] through 2024[25]. [15] describes many challenges in detail.
Fundamentally, the SQL language is difficult for users. For beginners, SQL is hard to learn. Even for expert users, SQL is awkward and annoying – it's hard to write, hard to read and hard to edit.
Here, we describe several syntax issues. (Pipe syntax addresses all of these!)
SQL's challenges start from its basic query syntax: SELECT ... FROM ... WHERE ... GROUP BY
, etc. This operation² order is rigid and arbitrary and doesn't reflect the actual data flow, which starts with table scans in the FROM
. Figure 1 illustrates this disconnect.
While common, this operation order is far from universal, and expressing any other ordering requires using subqueries. Reordering specific clauses could improve readability somewhat, but wouldn't address the limited expressivity.
SQL works around rigid clause order in particular cases by adding duplicate clauses, which increases language complexity. To filter results before and after aggregation, and after window functions, SQL uses WHERE
, HAVING
and QUALIFY
. These all express the same filtering operation, in different places, with different syntaxes, and with subtly different rules and behaviors.
Many simple operations can only be expressed using subqueries:
SELECT
so they can be referenced multiple times by name, in later SELECT
items, WHERE
clauses, JOINS
, etc.¹Except CloudSQL hosted third-party databases and Postgres-compatible AlloyDB.
²Terminology: We use operation as in relational algebra operation, generalized to include the query engine's supported logical operations. We use operator to describe particular syntactic clauses in the query, including pipe operators.
Subqueries require indentation to reflect nesting clearly, which hurts readability. Editing SQL typically involves frequent refactoring, wrapping query fragments into subqueries, indenting, and adding repetitive SELECT
and GROUP BY
clauses, before the desired operation can finally be added.
With the inline FROM
clause, data flow in standard SQL starts in the middle with the most deeply nested table reference. Then logic builds outwards, applying operations both above and below the starting point, while traversing outwards through layers of nested subqueries.
This "inside-out" structure makes tracing through SQL logic difficult. In large queries, it can be hard to even find the starting point.
WITH
clauses help, but don't fully solve this. WITH
clauses (defining Common Table Expressions or CTEs) can help split up queries, reducing nesting and moving initial logical blocks earlier in the query text, but CTE solutions have several caveats:
WITH CTEk AS (SELECT * FROM CTEk-1)
".FROM
clause, referencing an earlier CTE by name. It's still not possible to read the whole query top-to-bottom.CTEk
referencing just CTEk-1
.Even locally within single queries, behavior is complex and non-local, and syntax is repetitive. SELECT
and GROUP BY
are thoroughly intertwined, and aggregation modifies the behavior of other clauses. The same columns are cross-referenced in SELECT
, GROUP BY
and ORDER BY
, and corresponding edits are often required in three places.
Extending SQL with new query operations is difficult. TVFs are a powerful extension mechanism but the call syntax makes TVFs awkward to actually use since they compose poorly with built-in operations and other TVFs, requiring more nested subqueries as a workaround. (More in 4.3.)
These syntax issues combine to make SQL difficult to learn, and excessively difficult to work with, even for SQL experts.
SQL's foundational semantics work very well. The relational data model, where everything is a table and relational operations consume and produce tables, provides excellent composability, allowing users to factor logic arbitrarily with views, TVFs, temp tables, and subqueries. Subqueries also support natural transitions between expressions and tables. While compound data types weren't considered originally, SQL extends cleanly to support structured data types like arrays, structs, protocol buffers[28] and JSON[26].
Declarative semantics differentiate SQL from many other languages and APIs, and make SQL particularly well-suited for expressing query workloads. SQL, and relational algebra, clearly separate the syntactic and semantic description of what to compute from the implementation details of how to compute it. Query optimization is fundamental in SQL and a key enabler for performance, efficiency and scalability, decoupling logical requests from physical storage, query execution strategies, and optimizations like indexing.
Finally, SQL is pervasive. SQL works in almost all database and query systems, making basically all data available through SQL somewhere, and joinable across systems using federated engines like F1 or BigQuery. Many front-ends, business intelligence tools, data modeling layers, object-relational mappings and other tools are built assuming a SQL backend.
This extensive ecosystem and large userbase make SQL sticky. Adopting non-SQL systems is more difficult and leads to fragmentation across the user's data ecosystem, where other SQL systems are likely still used.
Most alternative-to-SQL languages are lacking in some of these areas. None have the ecosystem or existing userbase, and many are lacking the foundational semantics or composability to express relational algebra fully. Declarative semantics are often lacking, sometimes because alternative platforms don't have the same optimization capabilities, forcing users to entangle business logic and execution strategy.
It seems possible to create a new language that keeps the good properties of SQL, while offering better syntax and usability. Why hasn't this happened yet?
Actually, maybe it has. Many alternative "better than SQL" languages have been proposed (e.g. [11],[25]). None have become mainstream or widespread solutions that seem like SQL replacements, for several reasons:
The challenges that come with learning and adopting a new language, and migrating to it, make it a difficult choice for an organization to adopt an alternative language. Without provable large improvements, it's hard to commit to migration. Large migrations are notoriously expensive, slow and failure-prone.[23]
Creating a new language with SQL parity often means building on the same relational algebra fundamentals, supporting at least the same features. The resulting language may effectively be SQL but with different syntax, making it insufficiently different to justify learning a new language and moving to new tools.
Some challenges are mitigated if the new language is implemented as a proxy that rewrites into SQL. That allows a new language to reuse an excellent existing SQL implementation. Challenges remain getting users productive in a new language, migrating existing code to a new tool, and dealing with limited interoperability between the new language and legacy SQL. Proxied rewriters also suffer some practical issues - they introduce additional production dependencies which may compromise latency or reliability, they don't feel like first-class APIs in the underlying systems, and they make debugging correctness and performance difficult when query that actually runs is decoupled from what the user wrote.
Outside the SQL context, conventional wisdom is that it takes a decade for a new language to succeed and become mainstream - consider C -> C++, C++ -> Rust, Java -> Kotlin, JavaScript -> Dart, etc.[30]
Given SQL's problems, and the challenges involved with replacing SQL, we propose an alternative approach: Fixing SQL.
We assess SQL at various levels:
JOIN
, ORDER BY
, etc), all work reasonably well.So we will just fix item #4, leaving the rest as is. Fortunately, this is easy to do. As we've seen in other languages, pipe-structured data flow works well. We simply reuse the existing query clauses and operations from SQL, preserving their existing syntax where practical, and recast them as independent operators that can be composed arbitrarily into queries using pipe-structured data flow.
This gives us an easy-to-use variant of SQL, which requires minimal new learning since it uses all the same concepts and mostly the same syntax. The data model and top-level statements (CREATE VIEW
, etc) are identical, the concepts and operations are the same, and the expression language is identical. The query operation syntaxes are modified but familiar, and they now compose in a simple and intuitive way.
Section 4 describes our language design in more detail.
Chamberlain describes the origin and evolution of SQL in [16], including the original design goal to have a "walk up and read" property, motivating SQL's pseudo-English syntax. This was a significant tradeoff since it meant SQL was not a "functional" or composable language, as would be expected and preferred by more technical users. This design choice has aged poorly as SQL's complexity has increased over time, beyond the core language from SQL-92.
Many critiques of SQL have been written (see 2.1). [25] is a good example, describing several challenges similar to those we describe. Then it proposes a new alternative language, SaneQL, which like our proposal, cleanly expresses data flow through relational operations. Like other SQL alternatives, this proposal would require users to learn a new language and migrate away from their SQL ecosystem, which are significant obstacles.
PRQL [11] is another recent alternative language, implemented as a front-end that translates to SQL. It also provides composable relational operations, but the unfamiliar syntax and detachment from SQL make adoption challenging.
Our work contrasts with SaneSQL and PRQL by solving the same problems inside SQL, rather than by replacing SQL and hoping users will adopt new tools. The concepts we're using are not particularly novel, but applying them inside SQL is new.
SQL++[26] also extends and modernizes SQL, but is primarily focused on improving support for structured data types like JSON, without addressing SQL's core syntax problems.
Python DataFrame APIs like Pandas[9] have become popular. For many users, Python and data frame APIs are more usable than SQL, especially for ad hoc data exploration and wrangling. These APIs also offer operator chaining with pipe-like data flow. Data frame implementations typically execute imperatively within a Python process, so they don't get the declarative semantics, scalability and performance of SQL systems. While data frame APIs are popular, they fill slightly different niches, and are unlikely to replace SQL.
Others have argued that SQL is too low-level and should be replaced by higher-level languages or APIs (e.g. [6], [7]). Higher-level languages work well in many contexts, but don't replace SQL. Typically, they are implemented as front-ends that generate SQL. Pipe SQL syntax is also useful here, to simplify generating SQL, and to make generated SQL easier to understand and debug.
User-friendliness is less important in the context of generated queries, but humans still read and write SQL frequently. Many across the industry see value in improving SQL usability [24][27] and like Google, have been adding incremental usability features like GROUP BY ALL
[4]. Pipe syntax offers a more transformative usability improvement.
[29] chronicles repeated attempts over many decades to replace SQL and the relational model, and how the industry keeps returning to SQL when alternatives fail. It argues this will continue for decades more, and new ideas won't become mainstream until they're available in SQL. Pipe syntax makes SQL and the relational model extensible (see 4.3), facilitating experimentation and quick adoption of new features and paradigms. It's not necessary to invent new languages and systems to prove out new concepts first if extending SQL directly is easy.
Any query can have zero or more pipe operators as a suffix, delineated with the pipe character "|>
". Each pipe operator starts with an operator name (one or more keywords) followed by its own argument grammar. Many of these operators reuse the existing grammar for standard SQL clauses. Operators can be applied in any order, any number of times.
Additionally, we make standalone FROM
clauses valid queries, which can be followed by pipe operators to build arbitrary queries. The basic grammar is:
<query> :=
{all existing query syntaxes}
| "FROM" <from_body> New: FROM as a query
| <query> "|>" <pipe_operator> New: pipe suffixes
<pipe_operator> :=
"WHERE" <expression>
| "ORDER" "BY" <order_by_body>
| "JOIN" <table_expression> [["AS"] alias]
Figure 2 shows how this aligns syntax with semantics.
Figure 3 shows the pipe operators added in GoogleSQL to achieve parity with standard SQL queries.
We define multiple projection operators for convenience:³
FROM orders
|> SET o_orderstatus = LOWER (o_orderstatus)
|> EXTEND ROUND (o_totalprice) AS dollars
|> DROP o_comment
> AS result
|> SELECT o_custkey, result.dollars
SELECT
produces a new table with exactly the listed columns, like the outermost SELECT
in a table subquery. Standard features like SELECT DISTINCT
and SELECT *
are also supported.EXTEND
propagates the existing table, adding additional computed columns, similar to SELECT *, new_column
. EXTEND
also preserves table aliases, which can be necessary for disambiguation (e.g. WHERE t1.x = t2.x
).DROP
removes columns, similar to SELECT * EXCEPT (column)
in GoogleSQL.SET
replaces column values, similar to SELECT * REPLACE (expression AS column)
in GoogleSQL.AS
introduces a table alias for the current row, which is occasionally useful.³Except where noted, example queries use TPC-H tables. They are runnable in ZetaSQL [12] using the execute_query
tool.
DROP
and SET
aren't strictly necessary but are convenient. EXTEND
is useful to project values without losing existing columns or table aliases. For example, EXTEND
can compute new columns between JOIN
s, while preserving existing table aliases.
Unlike standard SQL, projection operators can be applied repeatedly to incrementally compute expressions without (non-standard) lateral column references. Those computed columns are then visible to use in operators like WHERE
. The SQL optimizer should still combine these computations as appropriate.
FROM lineitem
|> EXTEND 1_quantity * 1_extendedprice AS cost
|> EXTEND cost * 1_discount AS discount
|> WHERE discount > 1000
|> AGGREGATE SUM(cost), SUM(discount)
These projection operators are allowed to compute window functions (with OVER
), but cannot do aggregation. Aggregation is a standalone operator, which simplifies syntax and semantics and preserves operator independence.
Full-table aggregation is expressed with a list of aggregate columns to compute.
FROM orders
|> AGGREGATE SUM(o_totalprice) AS price, COUNT(*) AS cnt
Aggregation with grouping is expressed by adding a GROUP BY
as part of the AGGREGATE
operator⁵.
FROM orders
|> AGGREGATE SUM(o_totalprice) AS price, COUNT(*) AS cnt
GROUP BY EXTRACT(year FROM o_orderdate) AS year
The output table contains the grouping columns (which can be computed expressions, with assigned aliases) if any, and then the aggregate columns. GROUPING SETS
, etc also work here.
⁴Rationale: We originally had a separate WINDOW
operator but found using it inconvenient. Projecting scalar and window functions at the same time is common. Another awkward pattern occurred when users compute a window function just to use it in WHERE
or ORDER BY
and then need to drop the computed column.
⁵Rationale: We used the same operator name for full-table and grouped aggregation to minimize edit distance between these operations. Unfortunately, this puts the grouping and aggregate columns in different orders in the syntax and output. Putting GROUP BY
first would require adding a required keyword before the AGGREGATE
list.
Unlike standard SQL where columns must be repeated between the SELECT
and GROUP BY
, pipe AGGREGATE
is a single standalone operator that achieves both. Expression matching (between GROUP BY
and SELECT
, including grouping by ordinals or aliases), isn't needed since the grouping expressions occur just once.
Since AGGREGATE
already produces exactly the grouping and aggregate columns, it's often unnecessary to write a SELECT
in queries that includes pipe AGGREGATE
.
The pipe WHERE
operator can be added anywhere, including after AGGREGATE
. Separate operators aren't needed for HAVING
and QUALIFY
.
People ask why not just allow writing operator clauses in arbitrary orders without using a pipe character. Some other languages have this structure without using pipe characters – notably GQL [3], the SQL-adjacent Graph Query Language.
The pipe character is useful for several reasons.
SELECT
and GROUP BY
are co-dependent and have side-effects on other clauses.AGGREGATE GROUP BY
) which are part of the same operator, but are naturally formatted on multiple lines. Without pipe characters, they can be misread as independent operators.As a thought exercise, if it was possible to parse statements without semicolon separators, or SELECT
lists or functions calls without commas between arguments, would those syntaxes be preferable?
|>
' for the pipe character?The most natural and obvious choice would be to use '|
'. Unfortunately, in GoogleSQL (and many other SQL dialects), '|
' is already used for bitwise OR
, and reusing it for pipe operators causes parsing ambiguities. e.g.,
FROM Part
| SELECT *, p_size + 1
| EXTEND p_type
| SELECT p_name
| AGGREGATE -COUNT(*)
could be parsed as
FROM Part
| SELECT *, (p_size + 1 | extend) AS p_type
| SELECT (p_name | aggregate) COUNT(*)
This claim on '|
' is unfortunate since bitwise math is used rarely in SQL queries, while pipe operators will be used frequently. It would be preferable to use the better syntax for pipe operators, relegating bitwise OR
to an alternate syntax.
To avoid this conflict, we used '|>
', which is also used for similar purposes in other languages (including JavaScript[10], F#, R, and OCaml).
We explored deprecating existing usage and reclaiming '|
' for pipe syntax. In existing usage, we saw '|
' used mostly within parenthesized expressions (where it would be unambiguous), or where simple lookahead heuristics (for numeric literals or parentheses) could resolve most cases. Such heuristic solutions could allow overloading '|
' for both purposes, avoiding most breaking changes, but these approaches do not seem robust. More drastic solutions requiring users to migrate queries did not seem desirable.
Intermediate (and final) results in SQL are tables. A table has one or more columns, optionally with names, and zero or more rows. Relational operations produce a table as output, and SQL queries resolve as a tree of relational operations.
With pipe syntax, a query starts with a nullary relational operation, which produces an initial table. These syntaxes can be used to start a pipe query:
SELECT ... FROM ... WHERE ... GROUP BY
, etc).SELECT
clause without FROM
, which produces a single row.TABLE t
, which produces a full table scan of one table.FROM
clause, written alone.The first three items are supported as queries already. The last item is new, and allows queries to start with a FROM
clause. The FROM
clause can contain table scans, joins, TVF calls, array UNNEST
, VALUES
, and other common syntaxes to produce an initial result table.
Zero or more pipe operators can follow that initial query fragment. Each pipe operator is a unary relational operation that takes one table as input and produces one table as output. Most operators take additional arguments to specify exactly what to compute. Some operators (like JOIN
) accept table-typed arguments (possibly subqueries), effectively making them binary (or N-ary) relational operations. All pipe operators treat the pipe input table as the primary table input, and produce one pipe output table.
Intermediate tables have a visible schema describing an ordered list of columns, optionally with names, reflecting what SELECT *
would produce for that table. Additionally, there can be hidden columns (called pseudo-columns in GoogleSQL, which are selectable by name but are not included in SELECT *
) and table aliases (which can be dereferenced using alias.column_name
). The name scope for an intermediate table describes the visible columns, plus these hidden columns and table aliases.
Each pipe operator is a self-contained operation that can see only its input table and its arguments, and can only resolve names from the scope associated with the pipe input table. (Inside subqueries, correlated references to names from outer queries are also allowed.) Pipe operators have no visibility to earlier or later operations in the same query, so complex scoping or data flow is impossible by construction. This makes pipe operators naturally composable, so they can be applied in any order, any number of times.
Each pipe operator can be defined by specifying:
Most operators fit in one of these categories, depending how they affect input columns:
WHERE
, ORDER BY
, LIMIT
SELECT
- produces a new table with the specified columnsAGGREGATE
- produces a new table with the grouping and aggregate columnsCALL
- produces a new table with a TVF's output schemaEXTEND
, DROP
, SET
- projection operators that modify the column listJOIN
- extends the input table with columns and a table alias for the joined tableOrder preservation property is used with ORDER BY
and LIMIT
. As in standard SQL, table scans and most operations produce unordered tables. ORDER BY
makes its result ordered. Specific operators (primarily projection operators like SELECT
) preserve order. If LIMIT
is applied on an input table with order, it selects the top-N rows. Otherwise, LIMIT
selects N rows arbitrarily.
Queries with pipe syntax still have declarative semantics. Pipe syntax appears to imperatively describe the computation as sequential steps, but the syntax specifies declarative semantics only, not an execution strategy. As in standard SQL, queries are typically converted to an algebraic form and then optimized, such that they run faster but produce equivalent results, “as if" the query was run as written. ("Equivalence" allows for ordering differences and other artifacts that can result from reordering joins, using indexes, etc.)
Unlike standard SQL, where resolving a query to relational operations is complex, pipe operators correspond basically 1:1 with relational operations. Simple relational operations can always be expressed as one pipe operator. In a few cases, for user convenience, pipe operators can expand to multiple sequential relational operations. For example, SELECT DISTINCT
produces a projection and then an aggregation. Projection operators like SELECT
can include window functions, and thus expand to a projection followed by windowed aggregation. Other operators that accept expressions may introduce a projection before the main operation.
This duality between relational algebra and SQL syntax is useful and does not exist with standard SQL. Bi-directional translation is possible - SQL generators no longer need to introduce complex transformations that make generated SQL difficult to read. Optimized query plans could be converted back into equivalent pipe SQL explaining clearly how a query will actually run. Query engines might support a feature to execute a pipe query as written, without optimization.
Table-valued functions (TVFs) are a powerful extension mechanism in SQL, effectively allowing users to add arbitrary relational operations.
TVFs with only scalar arguments are nullary relational operations, which can be called in FROM
clauses to produce tables dynamically. With SQL TVFs, these are effectively parameterized views. Non-SQL TVFs can produce arbitrary input tables.
TVFs that take at least one table argument are unary or N-ary relational operations. Non-SQL TVFs with table arguments can extend SQL with arbitrary new relational operations.
The major caveat when extending SQL using TVFs is that the syntax is too awkward. For example, BigQuery uses TVFs to express ML model lookups [31], which can look like:
SELECT *
FROM ML.PREDICT(
MODEL `my_project.imdb_classifier`,
(
SELECT *
FROM ML.PREDICT(
MODEL `my_project.nnlm_embedding_model, `,
(SELECT '<text>' AS input, 7 AS rating))
)
)
With pipe syntax, these TVFs can be invoked directly using the CALL
operator without using nested subqueries, like this:
SELECT '<text>' AS input, 7 AS rating
|> CALL ML.PREDICT(MODEL `my_project.nnlm_embedding_model`)
|> CALL ML.PREDICT(MODEL `my_project.imdb_classifier`)
Any TVF can still be called with the usual syntax in FROM
or JOIN
, but TVFs that take table arguments can also be called with the pipe CALL
operator. The provides near-first-class syntax for operators implemented as TVFs.
A standard TVF that can be invoked like
FROM some_tvf((SELECT ...), arg2, arg3, ...)
can be invoked with pipe CALL
like this:
SELECT ...
|> CALL some_tvf(arg2, arg3, ...)
The TVF's first table-typed argument is treated like a this
argument (in object-oriented call style), and implicitly receives the pipe input table. Other arguments are written with the usual argument syntax.
This allows invoking TVFs in a natural way, expressing the logical order of computation without excessive nesting and subqueries. The difference is particularly stark when considering nested or chained TVF calls, which require deeply nested subqueries in standard syntax (as in the example above), but look like flat sequential calls in pipe syntax.
TVF CALL
s with pipe syntax are as composable as any pipe operator, so they can be used like language extensions. CALL
s are limited to using function call syntax, but optional named arguments and structured types allow significant flexibility. (A potential extension could allow TVFs to provide a plug-in grammar snippet, allowing TVFs to have first-class syntax too. Grammar extensions are impractical in standard SQL, but one pipe operator can be parsed independently between successive pipe characters.)
Standard SQL is difficult to extend and evolve, even for built-in operations. Many operations don't fit naturally anywhere in standard SQL grammar. New operations must be forced into the SELECT ... FROM ... GROUP BY
flow somewhere, which usually involves complex syntactic and semantic interactions.
Standard SQL's pseudo-English syntax depends heavily on reserved keywords to parse unambiguously. Adding new syntax without new reserved keywords is often difficult, or requires opportunistically reusing existing reserved keywords in novel ways. New reserved keywords are always breaking changes for some potential existing queries, making such additions difficult.
This may partly explain why SQL has evolved slowly, and why new functionality sometimes has awkward syntax.
For example, consider PIVOT
[20] (which is non-standard but widely supported), which produces multiple columns with slices of aggregate values taken from multiple rows, “pivoting" rows to become columns. For example, this query:
SELECT *
FROM (
SELECT n_name, c_acctbal AS bal, c_mktsegment
FROM customer JOIN nation ON c_nationkey = n_nationkey
) PIVOT(SUM(bal) AS bal
FOR n_name IN ('PERU', 'KENYA', 'JAPAN'))
behaves like a shorthand for:
SELECT
c_mktsegment,
SUM(IF(n_name = 'PERU', bal, NULL)) AS bal_PERU,
SUM(IF(n_name = 'KENYA', bal, NULL)) AS bal_KENYA,
SUM(IF(n_name = 'JAPAN', bal, NULL)) AS bal_JAPAN
FROM
(SELECT n_name, c_acctbal AS bal, c_mktsegment
FROM customer JOIN nation ON c_nationkey = n_nationkey)
GROUP BY c_mktsegment
In standard SQL, there's no natural place to put the PIVOT
operator, so it's kludged in as a suffix on subqueries in the FROM
clause. PIVOT
acts on all columns of the input table, so it's almost always necessary to use a subquery to prepare the input table, and it's always required to have more query syntax around PIVOT
, at least doing SELECT *
. So PIVOT
queries are always complicated.
For language implementers, without a uniform framework for adding operators, PIVOT
is a complex special case in the grammar, that can attach after tables, subqueries, or TVFs, with complex interactions with other suffix clauses like TABLESAMPLE
. These interactions can trigger subtle bugs so tests must cover all combinations.
Now contrast PIVOT
as a pipe operator. While the operation is exactly the same, using exactly the same syntax inside the PIVOT
clause, it fits naturally as a pipe operator, making it easy to call as a fully composable operator that can be parsed, implemented and tested in isolation.
And it's easy for users to call, without subqueries and with minimal wrapping. Syntax before and after PIVOT
is only needed if there's useful logic to express.
FROM customer JOIN nation ON c_nationkey = n_nationkey
|> SELECT n_name, c_acctbal AS bal, c_mktsegment
|> PIVOT(SUM(bal) AS bal
FOR n_name IN ('PERU', 'KENYA', 'JAPAN'))
Similar challenges exist adding other operations added to SQL, both in the standard and as product-specific extensions. Some examples:
MATCH_RECOGNIZE
[8] (from SQL:2016) adds a mini-language for analyzing sequential patterns. Like with PIVOT
, there's no natural place to add the syntax, so it gets added as another table suffix inside FROM
, requiring subqueries before and after to prepare input and consume output.New operations like these make more sense as pipe operators, since pipe syntax makes them both easier to use and easier to specify and implement. Pipe operators are naturally composable and have locally defined syntax (no need for new reserved keywords!) and semantics. This potentially unlocks significant innovation in SQL functionality and usability that is too difficult in standard SQL.
Pipe operators proposed so far aim for parity with existing SQL syntaxes, but novel new operators might be supported with pipe syntax only, maybe with TVF syntax as a fallback.
Figure 5 shows some additional pipe operators we have prototyped as language extensions, going beyond what can be expressed in standard SQL.
Debugging SQL can be challenging (see also 5.3) compared to imperative languages where control flow is obvious in the code. These operators facilitate many common debugging patterns by making it easier to extract or validate intermediate state in a SQL query with simple local syntax additions.
These examples show how pipe syntax enables novel language extensions without any complex interactions with other SQL syntaxes. New pipe operators can always be used anywhere in a query, while they often wouldn't fit well anywhere inside standard SQL.
Since pipe syntax is implemented as a feature within the existing GoogleSQL dialect, the new syntax is fully interoperable and composable with standard syntax. Any query can be written in purely standard syntax, standard syntax plus pipes, or fully with pipes starting from FROM
. Subqueries can be written in either syntax, inside queries written in either syntax. Views and TVFs can be declared using either syntax. DDL and DML statements containing queries can use either syntax.⁶
⁶Operations like INSERT INTO
and CREATE [TEMP] TABLE
could also work as terminal pipe operators on the end of a query. We haven't tried that so far. Keeping those verbs at the start distinguishes pure queries from statements with side-effects.
This allows users to adopt pipe syntax incrementally, where they find it useful, while continuing to use any existing SQL code they have. Users might write new queries in pipe form, or convert to pipe form when updating or debugging existing queries. Where refactoring (adding table subqueries, etc) would be needed anyway, users might instead refactor into pipe form, making further logic changes easier.
Supporting incremental adoption, without requiring migration or bifurcating SQL codebases, is a major advantage over alternate-language proposals. Supporting the new syntax inside query engines, without a separate translation proxy, provides full interoperability automatically, including for example, calling views bidirectionally.
This section covers our experience using GoogleSQL with pipe syntax, including several use cases that benefit.
After an initial implementation phase, iterating on the language design with a small group of early users, we stabilized the language and made pipe syntax available to users broadly inside Google.
Over the following six months, we've seen adoption and usage increase steadily - see figure 4. The two initial spikes follow announcements sharing pipe syntax on a SQL users mailing list. The second followed removing opt-in settings and making pipe syntax available by default. After initial usage spikes from users trying our demo queries, we see many users continue to use pipe syntax as part of their daily work, with further usage growth as the feature spreads virally to more users.
The third spike in June followed a SQL workshop presented at a user conference, to users with a wide range of SQL knowledge. The workshop included a 40-minute tutorial on pipe syntax, which was enough time to introduce and teach the language, walking through several examples applying it. Feedback showed workshop users were excited about pipe syntax, and we've seen significant continued usage.
Users are applying SQL with pipe syntax for widely varying use cases, including ad hoc queries, queries backing dashboards and reports, data processing pipelines, and libraries of reusable SQL functions and TVFs.
We've seen this adoption despite limited documentation and incomplete tooling (e.g. less auto-complete support inside pipe syntax). This adoption has been self-serve with minimal need for support. Many users understand pipe syntax immediately and find it compelling, reporting that it greatly improves their productivity and user experience in SQL. Several users have commented that they "love" using SQL with pipe syntax. The pain points solved here clearly resonate with users.
We haven't yet done formal user experience research on SQL with pipe syntax but plan to as future work. We conjecture that pipe syntax is easier to teach new users because it decouples SQL's conceptual ideas (aggregation, outer joins, window functions, etc) from syntactic complexity. Teaching pipe syntax first may even accelerate teaching standard syntax, since users can learn the relational operations first, understanding clearly how they work, and later learn how those operations are expressed (poorly) in standard SQL.
Pipe syntax naturally expresses queries with linear operator structure. Complex queries with tree-like structure don't map as directly to pipe syntax.
A common pattern is to start a query by joining data from several sources (in a FROM
clause) and then apply a linear sequence of operations (filtering, aggregating, projecting, ordering, etc) to compute the result after those initial joins. This pattern works well using pipe operators for the work following the initial FROM
.
The FROM
clause itself is most often written sequentially using JOIN ON/USING
syntax, expressing a left-deep join tree. Left-deep joins are exactly what the pipe JOIN
operator expresses, so a FROM
clause written with JOIN
syntax can be written with pipe JOIN
s easily. Pipe joins offer additional flexibility, since projections (SELECT
or EXTEND
), filtering (WHERE
) or even aggregation can easily be done between joins, without any subqueries.
When the first item in a FROM
clause is a table subquery, that can naturally be unnested as the pipe input query. Joins having subqueries on the right (non-left-deep joins) can still be expressed as a pipe JOIN
to the same subquery.
Other techniques for factoring tree-like or DAG-like queries still apply. Subqueries can be extracted as WITH
clauses or temp tables, where this helps to express complex structure or break up long queries.
In practice, breaking up long queries is somewhat less necessary, since pipe queries are shorter and don't need as much nesting or indentation.
Pipe syntax works particularly well while editing or debugging SQL queries. Since logic flows top-to-bottom rather than inside-out, users can write queries sequentially, starting with a simple table scan and then incrementally adding operators, running the query as necessary to observe the result so far. Building queries incrementally in standard syntax is more difficult, often requiring adding wrapper subqueries, with new logic above and below the subquery.
Adding logic (filtering, projection, etc) anywhere within a query is easy, simply by inserting pipe operators, without refactoring or subqueries. It's surprisingly convenient that individual pipe operators can be trivially commented out.
For debugging, queries in pipe syntax have a useful prefix property. Each prefix of a query (up to a pipe character) is also a valid query. In editors with a "run selection" feature, users can select a prefix of the query and run it to see the intermediate query result up to that point. (Without editor support, adding a semicolon or commenting out the tail of the query also works.) For example, users can easily see the rows before and after an aggregation in the middle of a query by running two prefixes of the query.
During ad hoc debugging, it's also very convenient to add a LIMIT
, WHERE
or AGGREGATE
on the end of a query, or after a selected query prefix, to see interesting rows or statistics on a final or intermediate result. For example, adding |> AGGREGATE COUNT(*) GROUP BY column
extracts and counts the distinct values in some column. Ad hoc data exploration often involves computing different aggregates of the same data. This is easy with pipe syntax with the combined AGGREGATE GROUP BY
operator since alternate grouping can be computed by editing that single clause, without needing to update (or even include) a corresponding SELECT
.
SQL development environments are generally much less helpful than modern IDEs for other languages. Standard SQL is not amenable to typical IDE features because queries are too long and inconveniently structured, unlike imperative languages with short independent statements.
Auto-completion is difficult in standard SQL. The clauses are in the wrong order - auto-completing in SELECT
isn't possible until after the FROM
clause is written below to indicate what tables are in scope. Many changes require edits in multiple places (e.g. SELECT
and GROUP BY
, and repeatedly SELECT
ing the same column in subqueries), so adding completions at one cursor location is inadequate. Many edits require refactoring first, adding subqueries rather than just inserting text.
Auto-completion works well with pipe SQL. The query flows sequentially, top to bottom, so the required context is available from text above. New operators are often added at the end of the query, but adding new operators in the middle also works. The prefix property means the auto-completer can analyze a query prefix to know exactly which names are in scope at any point. Many changes are possible with one local edit, without requiring corresponding edits or refactoring elsewhere.
The prefix property makes building an interactive SQL debugger plausible. An IDE could allow clicking on any pipe operator and then displaying intermediate query results at that operator, with metadata about columns in scope, or stats about that result. Single-stepping through pipe operators could work, showing the intermediate result after each operator. (Single-stepping appears to imply imperative semantics, but the query prefix to that point could still be treated as a declarative query and optimized as usual. The user would step through business logic incrementally, not physical operators.) Debugging like this wouldn't work well with standard syntax because it's difficult to isolate substrings (not prefixes) of a standard syntax query as usefully executable fragments, other than whole subqueries. Combining tool support with language extensions (see 4.3.3) could be even more powerful.
SQL code generators are common in reporting tools, object-relational mappings (ORMs) and other contexts. Generating standard SQL is excessively difficult. An application typically has some data structure or API describing the desired query as logical operations, generally corresponding to relational operation. Converting these operations into standard SQL is difficult because SQL syntax doesn't express those operators naturally or composably. SQL generators typically generate subqueries for many operations since that's the easiest way to translate operations independently and reliably. The output queries are typically verbose and unreadable, with deeply nested subqueries and indirection through layers of aliasing.
Pipe syntax is easier to generate directly, since relational operations can be expressed with a single pipe operator and they can be stacked sequentially. Most wrapper subqueries and layers of aliasing aren't needed. (Subqueries are still useful for JOIN
right-hand-sides and expression subqueries.) The resulting queries can be much shorter, and are structured more like how hand-written versions of the same query would be expressed.
The same advantages apply in applications that rewrite SQL queries. For example, Google has applications that rewrite user queries to add differential privacy or other policy controls. SQL rewriting is usually done by resolving a query to algebraic form, applying transformations, and then regenerating a new query. Rewritten queries then have no connection to the original query, and are not human-readable. With pipe syntax, most rewrites involve adding or modifying individual operators (e.g. adding filtering and adjusting aggregations), and it's theoretically practical to apply these rewrites correctly while still preserving the original query structure, with its original names, comments, etc. This is an area for future work.
Applying AI, particularly large language models (LLMs), around SQL queries is an active research area, including query generation and query understanding. Standard SQL is a challenging language for LLMs[22], including for some of the same reasons SQL is difficult for human readers and writers. SQL has huge query statements with significant internal cross-referencing that don't break up naturally into smaller independent steps. This may be one reason SQL LLMs have been less successful so far than LLM assistants for other languages like C++ and Java.
Pipe syntax makes SQL more like typical imperative languages, with independent sequential operators, with clear state between operators, with less action-at-a-distance. Speculatively, an LLM that understands pipe SQL syntax could more clearly understand operations in a query, and should be able to generate queries more accurately from its "mental model" of desired query operations. Generating valid pipe SQL should be easier for LLMs, the same way it's easier for algorithmic query generators (and for humans). The same advantages apply for assistive SQL coding - since more edits can be applied locally, it's easier to make suggestions for useful local edits, making use of accurate metadata that says exactly what's in scope at any particular point.
Human validation of AI-generated queries in standard SQL is extremely difficult since generated SQL is so verbose and unreadable. Generating SQL with pipe syntax, or translating generated SQL to pipe syntax, can produce more concise and readable SQL, facilitating human evaluation and validation.
Automated translation of standard syntax to pipe syntax is useful for several reasons:
Automated translation is an area for future work, but seems straightforward, applying a series of algorithmic refactoring steps, that reorder clauses and convert to pipe operators, extract and flatten table subqueries, simplify or drop redundant SELECT
s, etc.
GoogleSQL implements SQL as a reusable component, shared by several query engines across Google, including F1, BigQuery, Spanner, Procella and others. GoogleSQL implements SQL parsing and language analysis, fully checking and resolving language semantics and producing a resolved algebraic representation that query engines consume and convert to physical execution plans. GoogleSQL also provides function libraries and other components query engines reuse in their implementations, and provides a comprehensive suite of compliance tests to validate behavior is correct and consistent across query engines.
This shared language analysis component allowed us to implement pipe syntax once, and then enable it in multiple query engines. Inside Google, we did initial development with F1, with early users, and then expanded to other internal SQL tools. We're now exploring enabling pipe syntax for use in public BigQuery and Spanner products in Google Cloud.
Pipe syntax is also in the open source release, ZetaSQL[12]. Pipe syntax is implemented entirely within GoogleSQL parsing and analysis code, and required minimal work for query engines to enable. This analysis produces exactly the same algebra, with the same operations as standard SQL, so query engines simply enable the language analysis feature. No new execution or optimization features are needed. (Our pipe operators so far aim for parity with existing SQL. Novel operators may be added later.)
This approach makes pipe syntax immediately a first-class feature in query engines that enable it, making the syntax available and fully interoperable everywhere SQL queries are used.
Supporting pipe syntax via query translation using a tool or proxy is also possible. Our initial implementation used translation, implemented in an Unpipefier library that analyzes queries with pipe syntax enabled and then uses our SQLBuilder library to generate a standard-syntax equivalent supported by target systems. While pipe syntax appears to add significant new language, with a well-structured implementation, the complexity is manageable. The new grammar mostly reuses fragments of existing SQL clauses, and analysis code similarly reuses existing logic. Most pipe operators were implemented by refactoring existing analysis code so logic could be shared across standard and pipe operators rather than creating parallel forks of similar logic. The underlying infrastructure for query resolving, like the type system, name scoping, and expression handling, are all reused as is.
Adding pipe syntax to a SQL implementation is drastically less work than implementing a new query language, particularly if trying to achieve parity with SQL functionality.
Our initial implementation and work with enthusiastic SQL users has been helpful while refining the language, adding features and making adjustments to cover more query patterns and make the language easier to use. We are now expanding to all SQL users in Google, including potential users who might now choose to use SQL given the improved syntax. We're also exploring supporting pipe syntax externally in BigQuery and Spanner. Over time, we will collect more data on usage and patterns. Collecting statistics on average query length, nesting depth, etc, will be interesting.
Our goal with this paper is to share what we've learned so far. We believe the pipe syntax extensions are useful and would be appealing to most SQL users, and could potentially be adopted more widely across the industry. Future standardization of pipe SQL syntax could be worth exploring if there is broad interest. We believe pipe syntax offers a transformative improvement to SQL syntax that could be a good step forwards for the second 50 years of SQL.
Our initial work focused on achieving parity with the expressibility of standard SQL. There are many interesting possible features beyond parity, including features in other data products or APIs that don't yet exist in SQL. Adding features in standard SQL syntax is difficult, but adding extensions as new pipe operators is easy. (See 4.3.) Pipe syntax creates a better platform for future experimentation and innovation in query functionality and expressibility. Figure 5 shows some of our experimental extensions.
Outside the language, we continue to explore tooling improvements across the SQL ecosystem. Pipe syntax facilitates better tooling and IDE support for SQL, including auto-completion, automated refactoring, AI assistants, and interactive debuggers. Our goal is to make SQL a first-class language, with SQL data engineering tools and workflows that work as well as software engineering tools and workflows optimized for C++, Java, etc.
SQL has been extremely successful over its 50-year history, as the one standard language for all databases and query engines. That success has come despite serious language design issues that make SQL more difficult to learn and use than it needs to be.
SQL has been criticized repeatedly, and many replacements have been proposed. No replacement languages have achieved broad success or broad market penetration. Despite its flaws, SQL has a number of features that make it uniquely well-suited for declarative query processing, and the SQL ecosystem and userbase is too large to easily disrupt. It's exceedingly unlikely that SQL will be replaced by a successor language any time soon.
But we don't need to live with SQL's flaws. The language is fixable! This paper shows how pipe-structured data flow, inspired by other languages and APIs, can be added to SQL with moderate effort. The resulting language is still SQL, but it's a better SQL. It's more flexible, more extensible, and easier to use. Proficient SQL users can learn and adopt pipe syntax with minimal effort, getting immediate productivity gains and transforming how they work with SQL. These improvements go beyond incremental convenience and cosmetic improvements. Pipe syntax unlocks entirely new ways to work with SQL, opportunities for improved SQL tooling, and future language innovation.
Our experience so far suggests pipe syntax is a compelling improvement to SQL syntax that could be worth adoption across the industry. Replacing SQL isn't necessary, desirable or practical. We can fix SQL's most serious challenges from within the language.
We'd like to thank Goetz Graefe and John Cieslewicz for their support and feedback, and to our early users who gave feedback. This work is possible thanks to the work of many current and former GoogleSQL team members, contributors, and partner teams building query engines and other SQL tools.
(A long list of references, which I won't reproduce here to save space.)