SQL with pipe syntax has also been implemented at Databricks since Jan 30, 2025: https://docs.databricks.com/en/sql/language-manual/sql-ref-s...
Still, the best is yet to come. Previously, SQL extensions were a pain: there was no good place for them, and table-valued functions were a mess.
Now, it would be possible to have higher-order functions such as enrichment, predictions, grouping or other data contracts. Example:
FROM orders
|> WHERE order_date >= '2024-01-01'
|> AGGREGATE SUM(order_amount) AS total_spent GROUP BY customer_id
|> WHERE total_spent > 1000
|> INNER JOIN customers USING(customer_id)
|> CALL ENRICH.APOLLO(EMAIL > customers.email)
|> AGGREGATE COUNT(*) high_value_customer GROUP BY company.country
This may be executed as one SQL query to determine distinct e-mail domains, then prepare an enrichment dataset, and later execute the final SQL with the JOIN. Iterative SQL with pipes may also work better with GenAI.
Thanks for sharing. I wasn't aware they pushed that out. The ordering of this makes so much more sense. My only real concern is that CTEs are so commonplace and part of ANSI SQL that you'll see people trade a standard for this. But I also gripe that Snowflake uses // for comments and people get confused when their code doesn't work on another DB. Oracle's join syntax is another example.
Pipe-based query syntax has turned out to be a better alternative to SQL in many databases, because it is easier to read, write and maintain. The only thing is that this syntax isn't standardized across databases yet.
BTW, we at VictoriaLogs have also used pipe-based syntax in the LogsQL query language [1] since its first release in 2023. Recently I wrote an SQL -> LogsQL conversion guide [2], which has turned out to be quite clear and easy to follow. The guide was written while converting the SQL queries in the ClickBench benchmark from ClickHouse [3] to LogsQL.
[1] https://docs.victoriametrics.com/victorialogs/logsql/
[2] https://docs.victoriametrics.com/victorialogs/sql-to-logsql/
[3] https://github.com/ClickHouse/ClickBench/tree/main/victorial...
If only it worked on Databricks serverless SQL warehouses...
This is coming to Serverless SQL warehouses soon!
When the pipe syntax was first published last year the SQLite team did a quick hack to try it out.
https://sqlite.org/forum/forumpost/5f218012b6e1a9db
(Note that the fiddle linked no longer supports the syntax)
It is very interesting that they found it completely unnecessary to actually require the pipe characters. The grammar works when the pipe characters are optional.
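For illustration, the earlier orders example might read something like this with the pipe tokens dropped (just a sketch; the exact grammar the SQLite experiment accepted may have differed):
FROM orders
WHERE order_date >= '2024-01-01'
AGGREGATE SUM(order_amount) AS total_spent GROUP BY customer_id
WHERE total_spent > 1000
INNER JOIN customers USING(customer_id)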
And, imo, looks a lot better!
Good. I vote for omitting the pipe lexemes. I hate them.
Of the alternatives linked (elsethread), PSQL is the least bad. I'm ambivalent about the curly brackets; will have to ponder.
Malloy would be improved by omitting the superfluous (JSON-esque) semicolons.
PRQL is a similar idea, and it compiles to SQL: https://prql-lang.org/
from invoices
filter invoice_date >= @1970-01-16
derive {
transaction_fees = 0.8,
income = total - transaction_fees
}
filter income > 1
SQL pipe syntax is a great step in the right direction but what I think sets PRQL apart (in my very biased opinion, contributor) is the ability to define functions.
Here's a simple example:
# define the "take_smallest" function
let take_smallest = func n col tbl<relation> -> (
from tbl
sort col
take n
)
# find smallest 3 tracks by milliseconds
from tracks
take_smallest 3 milliseconds
You can try this now in the online playground: https://prql-lang.org/playground/
That's simple enough and there's not that much gained there, but say you now want to find the 3 smallest tracks per album by bytes?
That's really simple in PRQL and you can just reuse the "take_smallest" function and pass a different column name as an argument:
from tracks
group album_id (
take_smallest 3 bytes
)
Here's a more complex example for sessionizing user events:
# specify the target SQL dialect
prql target:sql.duckdb
# define the "sessionize" function
let sessionize = func user_col date_col max_gap:365 tbl<relation> -> (
from tbl
group user_col (
window rows:-1..0 (
sort date_col
derive prev_date=(lag 1 (date_col|as date))
)
)
derive {
date_diff = (date_col|as date) - prev_date,
is_new_session = case [date_diff > max_gap || prev_date==null => 1, true => 0],
}
window rows:..0 (
group user_col (
sort {date_col}
derive user_session_id = (sum is_new_session)
)
sort {user_col, date_col}
derive global_session_id = (sum is_new_session)
)
select !{prev_date, date_diff, is_new_session}
)
# main query
from invoices
select {customer_id, invoice_date}
sessionize customer_id invoice_date max_gap:365
sort {customer_id, invoice_date}
You can also try that in the playground: https://prql-lang.org/playground/
SQL with pipe syntax can also do functions like that. A sequence of pipe operators can be saved as a table-valued function, and then reused by invoking it in other queries with CALL.
Example:
CREATE TEMP TABLE FUNCTION ExtendDates(input ANY TABLE, num_days INT64)
AS
FROM input
|> EXTEND date AS original_date
|> EXTEND max(date) OVER () AS max_date
|> JOIN UNNEST(generate_array(0, num_days - 1)) diff_days
|> SET date = date_add(date, INTERVAL diff_days DAY)
|> WHERE date <= max_date
|> SELECT * EXCEPT (max_date, diff_days);
FROM Orders
|> RENAME o_orderdate AS date, o_custkey AS user_id
|> CALL ExtendDates(7)
|> LIMIT 10;
from the script here: https://github.com/google/zetasql/blob/master/zetasql/exampl...
Oh nice! Thanks for setting that straight. Apologies, I must have either missed that or forgotten about it.
BigQuery has table-valued functions already, which can be used with pipes with a CALL clause.
I'm tempted to give it a try just for the "For HackerNews enthusiasts" section of their landing page, which states "The PRQL compiler is written in Rust" and "We talk about “orthogonal language features” a lot". Love me some banter in technical documentation :D
I love the design behind PRQL, but was a little disappointed when I tried using it with an MSSQL server and found there was no decent way to filter for `LIKE '%something%'`.
(PRQL uses regex for stuff like this, which sounds great but then isn't supported on some databases like MSSQL)
I'm very surprised to learn that PRQL does not natively support `like`, but you can add it yourself: https://github.com/PRQL/prql/issues/1123#issuecomment-135385...
Huh, it's super cool they have a built-in way of extending to support missing engine features!
Oh man that's awesome!
I think I like this new pipe syntax a lot better than PRQL. Feels like it has the potential to become officially supported by DBs like MySQL since it's a relatively small change to existing SQL syntax.
IIRC the SQL pipe Google paper references PRQL. I think both make sense in their own context. SQL pipe is a smaller leap for users familiar with SQL, while PRQL takes on more change. I do wonder whether the smaller leap might cause more deja-vu displacement initially than a bigger shift like PRQL. I don't know the answer other than to watch users of both over time and see how they do with the different syntaxes.
There's a DuckDB extension [0] for it too
0 - https://duckdb.org/community_extensions/extensions/prql.html
“Malloy” is one in the same spirit, but probably more comprehensive: https://github.com/malloydata/malloy
In general, for languages like PRQL / pipe syntax, conciseness is a big advantage, but do they help with query optimisation in any regard?
I like it, but I'm not sure it's worth the trouble of further expanding an ever-ballooning extended family of SQL syntaxes.
Of course, SQL has numerous issues, both in an absolute sense and relative to what we've come to expect. And the obvious course of action in each individual case is to add syntax to support the missing feature.
But as you keep adding syntax, SQL gets more and more complex, both in the individual variants and due to the complexities of multiple variants with different syntax support at different times.
A transpiling approach makes more sense to me. I'd like the SQL implementors to focus on source maps and other things to better support plugging in external, alternate syntaxes (is a standardized mechanism too much to ask for?).
Then individual projects/people/efforts can choose the SQL syntax variant that works for them, and it can be one that evolves separate from the host db.
Plain transpiling won’t work because when you write a SQL query, you have to know what engine is running it because they all do worse at some things than others.
For example, the same result could be fetched using a correlated subquery, a common table expression, a temporary table, a view, or very hackily using aggregation, and which one you pick would depend on whether you were using SQL Server, Postgres, MySQL, or SQLite, because they don't all do everything equally fast. …and you need to know the version.
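As a toy illustration (hypothetical customers/orders tables), the same "last order date per customer" result could be written with a correlated subquery or with a CTE, and which one is fast depends on the engine and version:
-- correlated subquery
SELECT c.customer_id,
       (SELECT MAX(o.order_date) FROM orders o WHERE o.customer_id = c.customer_id) AS last_order
FROM customers c;
-- common table expression
WITH last_orders AS (
  SELECT customer_id, MAX(order_date) AS last_order
  FROM orders
  GROUP BY customer_id
)
SELECT c.customer_id, lo.last_order
FROM customers c
LEFT JOIN last_orders lo ON lo.customer_id = c.customer_id;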
This might be solvable if the language were a brand new, higher-level one, though, and the compiler knew the intricacies of each engine.
Yes, any such transpiler would certainly need to be engine aware (for multiple reasons). Right off the bat, the output needs to be in a syntax the engine accepts, and there are all kinds of differences from engine to engine, version to version.
And, as you say, if the language contains higher-level abstractions, it might need to work quite differently depending on the target engine.
My thought is to avoid those higher-level abstractions as much as possible and just expose the capabilities of the engine as directly as possible, albeit with a different syntax. In my experience, developers who are willing and able to write SQL are fine with targeting a specific engine and optimizing to that. (Those that aren't get someone else to do it, or live with the sad consequences of dealing with an ORM.)
To summarize:
Normal Approach: you pick an engine, and get the syntax that comes with it. You need to know what the engine does well and doesn't do well. You write SQL accordingly, using the syntax the engine accepts.
Transpiler Approach: you pick an engine, and independently choose a syntax. You still need to know what the engine does well and doesn't do well. You still write SQL accordingly, but using the syntax of the language you chose.
I don’t know if the transpiler approach really helps when you still have to be aware of the underlying database engine. I’m fine with choosing something like PostgreSQL and writing code optimized for its specific SQL dialect.
But I’m not sure I want to deal with a transpiler that still suffers from a leaky abstraction. It starts to feel a lot like using an ORM. ORMs promise database-agnostic syntax across different engines, but that promise falls apart when you need to optimize a query and suddenly realize you must account for engine-specific features—like PostgreSQL’s BRIN indexes.
To me, abstracting away the engine is not a goal. (Generally, it's pointless -- the various engines work in different ways and offer different features. While there's a bunch of stuff that's widely common, that's also the stuff you don't really need to abstract.)
The point of transpiling is to allow you to pick a syntax independent of the engine. You still have to pick the engine and will deal with its peculiarities.
Today's post is about pipe syntax, which is syntax. Yesterday's was about trailing commas, which is syntax. I think there's an appetite for this.
One example is LookML, which is used to build semantic data models in an analytics layer: https://cloud.google.com/looker/docs/what-is-lookml
LookML does seem to have invested a lot in compilation to different SQL backends, generally using the best syntax for each.
Unfortunately the frontend is so tightly tied to the Looker BI stuff, and it can't really express arbitrary queries without going through lots of hassle.
It's unclear what Google is doing with Looker. It would be interesting to imagine what LookML would be like as an independent SQL toolchain.
>LookML does seem to have invested a lot in compilation to different SQL backends, generally using the best syntax for each.
To some degree, yes. Yet far and away, users of Looker use engines like RedShift, BigQuery, and Snowflake because they’re extremely effective at the types of queries that Looker sends at them — not because Looker spends a huge number of hours optimizing for each engine (that’s not to say none is done); these dbs are great at analytical queries.
Looker in its earlier days (early/mid 2010s) took a bet on analytical database engines getting better as opposed to other technologies; for example, Tableau had its own datastore and largely did not “push queries down to the database” for execution. In the end, BigQuery was radically faster than SparkSQL and was compelling for customers, for example; it was not that Looker spent a ton of time optimizing BigQuery as opposed to SparkSQL.
Source: I was an early engineer at Looker
The creator of LookML is actually working on that with an OSS project called Malloy.
It's also not just that -- it depends on the data.
A database can change how it decides to execute a query based on whether a table has 100 rows or 10,000,000 rows.
Yeah that's true. Also, you, as the query writer, know how much the table has and will have and sometimes you decide what to do based on that information too.
We transpile from Python to SQL using Calcite. Calcite works well as an engine that knows enough SQL dialects to be practical.
+1 for Calcite. Any more details on what you’re doing? Are you using a python library on top of calcite? Something else?
We run Calcite in the backend. Python library sends DSL as json to the backend that returns SQL for the target engine.
It seems like this is solved by just having the transpiler aware of the target system.
This is nonsense and the person who was advocating for implementation-aware transpilation is completely wrong.
It is the role of a database engine implementation to service useful queries. Sometimes those queries are very complex; and yes, pipe syntax allows users to more easily write queries that can challenge database performance.
Yet this totally misses the point. Technologies like LookML have long allowed for extremely useful yet complicated queries to be executed by a database engine (often against one that is exceptional at complex analytical queries, like BigQuery) with high performance.
We should never handicap a SQL user’s ability to do what they want in the most ergonomic way. Transpilation largely allows for this — and I am 100% certain that implementations that allow for pipe syntax will effectively merely transform an AST with pipes to an existing kind of AST (this is, of course, effectively transpilation).
It is the job of the database engine, not the one querying (within reason) to performantly execute a query — and ditto for a pipe->non-pipe SQL transpiler. If you disagree with this, you are ignoring the fact that BigQuery, Snowflake, Redshift, and other optimized analytical databases are collectively worth billions, largely for their ability to performantly service crazy yet useful queries.
There are a lot of transpilers, and they don't get much adoption because the people implementing them are not the people implementing the database.
The database and the query language are tightly coupled. The latter determines the kinds of optimizations that the former can do, and a transpiled language means that optimization info needs to be passed through in weird and inefficient ways; it also means that obvious optimizations aren't possible, because the people building the language are not building the db.
If you've ever tried using query hints, or deeply inspected a distributed query plan, you'll know exactly what I mean.
> A transpiling approach makes more sense to me.
That's the solution we've been using, oft referred to as an ORM, to work around the lack of this syntax. But then you had to live in a world with an almost-SQL-but-not-really language that doesn't always map perfectly to SQL, and all the pain that goes along with that.
Now instead of that song and dance you can compose queries with plain SQL and simple string concatenation. That is a huge win, even if troubling that it has taken so long (and still not widely supported).
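For instance (a sketch with hypothetical table and column names), the application can keep a base query as a string and append stages only when they're needed:
FROM orders
|> WHERE order_date >= '2024-01-01'
-- appended by the application only when a per-country rollup is requested:
|> AGGREGATE COUNT(*) AS order_count GROUP BY country
|> ORDER BY order_count DESC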
Generally, ORMs try to present a higher-level abstraction of your data and data model than the engine does natively.
That makes it the ORM's responsibility to implement the abstraction in a good way. Problems come in when the "good" way depends on factors the ORM doesn't consider or expose well.
I actually mean transpiler in a mainly syntactic sense, and would want it to avoid higher-level abstractions (after all, as you point out, ORMs have that ground covered).
While that is closer to ORM in the classical sense, the modern use – and why I used 'oft referred to' – is more accurately a query builder, but gets called ORM regardless. The reason people reach for this type of tool is because there hasn't been a good solution to this with SQL proper, necessitating another language. Until now.
Relatedly, PRQL is a lovely pipe-oriented syntax for SQL DBs: https://prql-lang.org/
Their syntax is a lot cleaner, because it's a new language -- it's not backwards-compatible with SQL. But then again, they don't have the resources of Google behind them.
If it's not backward compatible with SQL, it's not pipe syntax for SQL;
it's just a new query language.
Most DBMSs have proprietary extensions to SQL, and it's messy. Real success would be adding the new syntax to the standard and having it implemented across multiple DBMSs.
Well, OK. PRQL is "for SQL DBs" in the sense that you can translate it to SQL on the client side, and then execute it on regular Postgres, MySQL, etc.
That's exactly what qstudio offers: https://www.timestored.com/qstudio/prql-ide a desktop client that translates prql to run against many databases.
Also is in DuckDB https://github.com/ywelsch/duckdb-psql
and SQLite is waiting on Postgres: https://news.ycombinator.com/item?id=41385005
Really cool though typing ">" after "|" is a pain https://github.com/brimdata/super/blob/main/docs/language/pi...
Agreed! LogsQL uses `|` as pipe delimiter too - https://docs.victoriametrics.com/victorialogs/sql-to-logsql/
https://github.com/seancorfield/honeysql#vanilla-sql-clause-...
(-> (select :a :b :c)
(from :foo)
(where [:= :foo.a "baz"]))
=> {:select [:a :b :c] :from [:foo] :where [:= :foo.a "baz"]}
Effort: zero (0). That's what "simple made easy" is about.
I appreciate that somebody somewhere may appreciate and enjoy this, but I am not that person. I love SQL. I have always loved SQL. I know why others don't, but I do, and it's beautiful.
While SQL is great, complex analytical queries are much easier to write, read and modify when they are written in pipe-based syntax because of the reasons outlined in the announcement:
- Pipe-based query is executed in clear sequential stages according to the pipes written in the query. SQL execution stages aren't clear enough.
- SQL requires mentioning GROUP BY fields twice - in both the SELECT and the GROUP BY clause (see the small example after this list). This complicates modifying the query, since you need to change the set of GROUP BY fields in the SELECT too.
- There are no limits on the combination and the number of pipes that can be used in a single query. SQL has various limits. For example, you cannot use WHERE after GROUP BY (there is the HAVING workaround, but it looks unnatural). Another example - you cannot easily use GROUP BY multiple times with arbitrary additional calculations and transformations in the middle (of course, you can use subqueries as a workaround for these limitations, but this complicates the resulting query too much).
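For example, adding a grouping key in classic SQL touches two places, while in pipe syntax it is written once (a sketch with a hypothetical requests table):
-- classic SQL: "status" must be added to both SELECT and GROUP BY
SELECT host, status, COUNT(*) AS hits
FROM requests
GROUP BY host, status
-- pipe syntax: the grouping keys are written once
FROM requests
|> AGGREGATE COUNT(*) AS hits GROUP BY host, status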
I recommend reading the following short migration guide from SQL to pipe-based LogsQL (query language optimized for typical queries over logs), in order to feel the difference - https://docs.victoriametrics.com/victorialogs/sql-to-logsql/
I universally use column positions in the GROUP BY.
SELECT name, count(*)
FROM account
GROUP BY 1
That is still annoying, but requires less typing and modification.
I go back and forth on this a lot, but with fewer and fewer "in production" SQL queries in my life, number-based GROUP BY is definitely my default.
It's insufficient to say that something is objectively better. Often times user preferences are multi-modal, and people cluster around groups. Pipe-based database query syntax may be objectively "easier" by some mathematical deduction, but by the laws of statistics > 50% of the userbase may prefer a different syntax.
This isn't an alternative to it.
You can still write
SELECT country, COUNT(*) AS num_records FROM t GROUP BY country
And then append something like
|> EXTEND FLOOR(num_records / 10) * 10 AS num_records_bin
|> AGGREGATE COUNT(*) AS num_countries GROUP BY num_records_bin
to get a histogram without having to nest stuff
this is SUPER interesting actually. I definitely didn't realize this. I'm still probably going to use a small army of CTEs for this sort of stuff, but I'm very interested to give this a shot the next time I'm exploring a dataset.
The main value proposition is that it unnests the code. Constantly moving logic into CTEs is another way to do it but comparatively it feels very clunky after a while. For example to inspect intermediate stages of a raft of CTEs, you end up having to write a different final clause. You also can't easily skip a stage. With pipes you just comment out the part of the query you don't want, at the bottom or in the middle, and you still have a syntactically valid query.
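For instance (hypothetical orders table), each stage below can be toggled with a line comment and the remaining query is still valid:
FROM orders
|> WHERE order_date >= '2024-01-01'
-- |> AGGREGATE SUM(order_amount) AS total_spent GROUP BY customer_id
-- |> WHERE total_spent > 1000
|> LIMIT 100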
With today's pipe syntax CTEs are still needed to deal with teeing situations (multiple references to a single table) but for just a chain of transforms it is a lot simpler.
Though not a pipe syntax, the Malloy language has some similar analytic sugar in its syntax.
Malloy is from Lloyd Tabb, a co-founder of Looker.
Ever since using Kusto Query Language, I have looked forward to SQL getting something like this. Maybe there is hope if enough DBs have it as an extension.
I heavily use KQL on a daily basis and want nothing more than to use it for Postgres and SQLite as well.
I have a project that's still very much at the experimental stage, where I try to get something similar to this pipe syntax by allowing users to chain "SQL snippets" together. That is, you can use standalone statements like `where col1 > 10` because the `select * from ...` is implied. https://ajfriend.github.io/duckboat/
import duckboat as uck
csv = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'
uck.Table(csv).do(
"where sex = 'female' ",
'where year > 2008',
'select *, cast(body_mass_g as double) as grams',
'select species, island, avg(grams) as avg_grams group by 1,2',
'select * replace (round(avg_grams, 1) as avg_grams)',
'order by avg_grams',
)
I still can't tell if it's too goofy, or if I really like it. :)
I write a lot of SQL anyway, so this approach is nice in that I find I almost never need to look up function syntax like I would with Pandas, since it is just using DuckDB SQL under the hood, but removing the need to write `select * from ...` repeatedly. And when you're ready to exit the data exploration phase, it's easy to gradually translate things back to "real SQL".
The whole project is pretty small, essentially just a light wrapper around DuckDB to do this expression chaining and lazy evaluation.
The first time I looked at Elixir, I thought it was hideous (I was really into Python and Ruby, and still admire them while liking Elixir much more). But I kept hearing about the virtues of Elixir (and of the BEAM) and coming back. Finally, it clicked, and I’m now in my 3rd year of doing Elixir full-time. I hope to never again use a language that doesn’t have pipes.
The moral of the story? Let’s give this new SQL syntax a chance.
This is beautiful.
I first encountered this style of data manipulation in R's tidyverse. Some say C# (or F#?) had similar pipes and influences, but I haven't seen specifics.
Some other data libraries like Polars have sort of similar parse trees, but they also have lots of extra cruft because of limitations to the underlying language. Python's functional calling semantics are extremely powerful, but not quite as powerful as R's.
The C# and F# bit is LINQ (language integrated query) and it's a lovely piece of engineering.
The LINQ syntax was an interesting experiment but the value of that syntax always felt minimal to me. Like it looked out of place in the middle of my C# code, so instead I just use the extension methods.
It grows on you after you start using F# too
It should have always worked this way. Without this feature you take the algebra out of relational algebra. That's the root of most of the composition issues in SQL.
Sadly it's a few decades too late though, and sadly this just fragments the "ecosystem" further.
One great feature of the SQL pipe syntax is that it works much better together with programming languages (generating, or modifying existing SQL queries).
It would be great to have it standardized fast and implemented everywhere.
If this helps people write queries more easily, I think that is unequivocally a great thing and have no reservations.
Personally, I continue to prefer CTEs because it allows me to write queries in a more modular way, and in conjunction with descriptive names helps me to keep less context in my head when working out the logic. When I look at a pipe syntax query, it almost feels like the difference between reading a long paragraph vs bullet points?
I don’t think this impacts your usage of CTEs; it’s a reorganization of the SELECT statement, and presumably enables the use of multiple operators and ordering of the SELECT clauses (eg you can have multiple WHERE clauses and it’ll do the right thing) and unifying the clauses (WHERE/HAVING/QUALIFY is now simply WHERE for all 3 cases)
You’d still be using CTEs on top of this to deal with deduplicating multiple SELECT/SUBSELECTs
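A small sketch of that unification (hypothetical sales table): the same WHERE keyword covers what would otherwise be WHERE, HAVING and QUALIFY, depending on where it appears in the pipeline.
FROM sales
|> WHERE sale_date >= '2024-01-01'                   -- row filter (WHERE)
|> AGGREGATE SUM(amount) AS revenue GROUP BY region
|> WHERE revenue > 10000                              -- post-aggregate filter (HAVING)
|> EXTEND RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
|> WHERE revenue_rank <= 5                            -- post-window filter (QUALIFY)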
Interesting - the examples I've seen seem to have multiple layers of FROM > WHERE > AGGREGATE > WHERE > JOIN...[1], where the goal seems to be to remove the need to have nested CTEs?
I mean that pattern would be a single select statement, just using the psychotic inconsistencies of normal SQL
SELECT
FROM
WHERE
GROUP BY
HAVING
JOIN (
SELECT
FROM
GROUP BY
)
ON
Only reason to use a CTE here would be that JOIN-SELECT is ugly, so it competes in that sense, but the usual usage of deduplication-queries-without-making-views isn't resolved by the new syntax. So you'd still be using CTEs in either syntax.
If you imagine it in normal code, the pipe syntax is just a function call chain, and CTEs are variables. You can mix and match as needed for clarity. And you can define your CTE using the function-call chain.
So for me, the example query in pipe syntax (up to the enrich step):
FROM orders
|> WHERE order_date >= '2024-01-01'
|> AGGREGATE SUM(order_amount) AS total_spent GROUP BY customer_id
|> WHERE total_spent > 1000
|> INNER JOIN customers USING(customer_id)
|> CALL ENRICH.APOLLO(EMAIL > customers.email)
|> AGGREGATE COUNT(*) high_value_customer GROUP BY company.country
is easier to parse (for me) as something like:
WITH
customer_order_totals AS (
SELECT customer_id, SUM(order_amount) AS total_spent
FROM orders
GROUP BY 1
)
,enriched_customers AS (
SELECT cus.customer_id, ENRICH.APOLLO(EMAIL > cus.email) enrichment
FROM customer cus
)
SELECT ec.country, COUNT(*) high_value_customer
FROM customer_order_totals cot
JOIN enriched_customers ec
ON ec.customer_id = cot.customer_id
WHERE cot.total_spent > 1000
GROUP BY ec.country
largely because I can understand the components in isolation first, before bringing them together. Yes, it's more verbose, but I am of the opinion that the brevity comes at the expense of having to keep more information in my working memory.
That's what I mean by CTE being variable-naming. There's nothing stopping you from using CTEs, and you'd probably make the same decision under this syntax, using CTEs to give "names" to the partial solution.
And like function-call chaining, you would presumably keep it longer where the items being manipulated are “obvious” — such that giving names is more noise than it’s worth — and CTE/variables where it becomes complicated.
That is, there’s not really a conflict of features, pipe syntax just lets you keep extending the SELECT. How much you stuff into a single unbroken pipe chain is a matter of taste, just as much as stuffing a normal SELECT with a series of subselects. You break it down the same way — with CTEs.
It only competes with CTEs in that you’re more likely to find scenarios where you could reasonably get away without it, and still be sufficiently clear and obvious, if not more so
Pipe syntax just lets you go further without having to breakdown simply due to the amount of syntax noise
WITH
customer_order_totals AS (
FROM orders
AGGREGATE customer_id, SUM(order_amount) as total_spent
GROUP BY 1
)
,enriched_customers AS (
FROM customer cus
SELECT cus.customer_id, ENRICH.APOLLO(EMAIL > cus.email) enrichment
)
FROM customer_order_totals cot
JOIN enriched_customers ec
ON ec.customer_id = cot.customer_id
WHERE cot.total_spent > 1000
AGGREGATE ec.country, COUNT(*) high_value_customer
GROUP BY ec.country
First hit on Google for “sql cte” gives me: CTE (Common Table Expression) and this page: https://www.geeksforgeeks.org/cte-in-sql/
For anyone who doesn't know, CTE = common table expression, which is a part of a query representing a temporary result set that can be referenced from other parts of a query.
This is great. It's a more linear workflow that mimics how we filter data in our minds. How long until it's available in MySQL (never, probably)
We made pql.dev, which works with the different SQL syntaxes by translating Kusto-like queries to SQL (using CTEs). It's worked really well thus far, and I wish someone would make a standard pipelined query language that gets supported across most databases.
I know PRQL exists, but the syntax is pretty cumbersome and not something I enjoyed writing, though I do understand why folks would gravitate towards it.
I love it. Why did it take us so long to make SQL easier to reason about?
Hopefully, it gets adopted in Postgresql too.
IMO pipe syntax is so much better. Not necessarily this version, but in general
Somehow Microsoft (my employer) has totally failed in evangelizing Kusto to the masses. KQL is by far the best query language you’ll ever use.
This syntax from Google is nice but it’s still just SQL.
I remain unimpressed by KQL. Comparing SQL to KQL is approximately like comparing Java to C#. Yeah it's better in many ways, but at the end of the day it doesn't make a huge difference. I want to go from Java to Lisp.
Consider how many of these complaints are equally valid against KQL: https://www.scattered-thoughts.net/writing/against-sql/
Totally disagree, I've used KQL for about 10 years now, and SQL for 20. Given the choice, I'll always prefer KQL.
Sorry, I don't have time for a thorough rebuttal of all the topics mentioned in the link you provided, but if I had to bring up a few counterpoints:
1. (Can't be expressed) KQLs dynamic datatype handles JSON much better than SQLs language additions.
2. (variables/Fragile structure/Functions) KQL fixes many of the orthogonality issues in SQL. (Specifically: both variable assignments and function parameters can accept scalar and tabular values in a similiar way, where-as SQL uses different syntax for each)
(disclaimer, msft employee)
Eh, I don’t need Lisp. I need to do my job. KQL allows me to write queries intuitively and others to easily understand what I’m doing.
The summarize operator is also a game changer. I use something like “ | summarize count() by bin(TIMESTAMP, 1h) “ multiple times daily. Getting the same result in SQL is disgusting.
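For reference, a rough plain-SQL equivalent of that summarize (a sketch in the BigQuery flavor; the table and column names are made up, and the truncation function differs per engine):
SELECT TIMESTAMP_TRUNC(event_time, HOUR) AS hour_bin, COUNT(*) AS cnt
FROM logs
GROUP BY hour_bin
ORDER BY hour_bin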
Clickhouse now has KQL too, https://clickhouse.com/docs/en/guides/developer/alternative-...
Plus one for Kusto. I’ve never had this much success teaching people data analysis. It’s good for simple obvious and powerful composed queries. Though I work at GitHub so still too close to the mothership for an unbiased opinion.
There’s some Apache energy around KQL though and a few open source parsers so there’s hope yet.
The Unix pipe paradigm was honestly way ahead of its time.
I'm all for such an approach, but I also like Firebolt's approach with Lambda functions focusing on arrays. [0]
This is so exciting. I hope Postgres and MySQL get some level of support for it, too.
When I first started working in Elixir I loved Ecto and the pipe syntax, but now I want a SQL sigil that won't care about where I put my where or from clauses.
Why are SQL parsers even concerned with that? Isn't that why we have recursive-descent parsers?
This is such an improvement. Can’t wait for it to get into Postgres
It seems unfortunate that InfluxDB had the pipe operator with Flux in version 2 and dropped it in the new version due to low adoption. Now it seems to be becoming more popular.
As far as I know, the Flux query language from InfluxDB v2 was too complex compared to the simple pipe-based SQL from the BigQuery announcement. That is the main reason why Flux was abandoned in InfluxDB v3.
This reminds me a bit of MSFT's Kusto language. Such an immensely useful way to slice and dice large amounts of structured data.
As a user of Kusto query language in Azure I can highly recommend pipe syntax. The autocomplete is very good.
Who invented pipe syntax? Elixir? Or is there another runtime or language that made this prevalent beforehand?
Originally? Douglas McIlroy for Unix in 1973: https://en.wikipedia.org/wiki/Pipeline_(Unix)
I agree. And then functional languages pick up the pipe, as it is roughly "poor man's method chaining" (so the object oriented languages don't need it).
The |> syntax? Perhaps F#. Pipes have been around since at least Unix v3.
Many such things from F# come from Haskell (given the influence from MSR UK, specifically Simon Peyton Jones before he left a few years ago), so likely Haskell or earlier imo (at least in terms of "complex" "pipe" operations that have higher-level operations than "bytes in, bytes out" of Unix).
Worth noting that F# started out life as an implementation of OCaml for the .NET runtime [1], so most likely the pipe syntax was taken from there, although the pipeline-of-functions construction is much older than that [2]
[1] https://softwareengineering.stackexchange.com/questions/2099...
[2] https://retrocomputing.stackexchange.com/questions/17335/wha...
OCaml took the '|>' pipe symbol from F#. And F# was the language that made the '|>' pipe symbol popular in mainstream programming (as opposed to the unix '|' pipe symbol), afaik. According to Don Syme, it was used in F# in 2003 (see "Early History of F#", section 9.1, [1] which references [2]).
Here's his full comment:
/quote
Despite being heavily associated with F#, the use of the pipeline symbol in ML dialects actually originates from Tobias Nipkow, in May 1994 (with obvious semiotic inspiration from UNIX pipes) [archives 1994; Syme 2011].
... I promised to dig into my old mail folders to uncover the true story behind |> in Isabelle/ML, which also turned out popular in F#...
In the attachment you find the original mail thread of the three of us [ Larry Paulson; Tobias Nipkow; Marius Wenzel], coming up with this now indispensable piece of ML art in April/May 1994. The mail exchange starts as a response of Larry to my changes.
...Tobias ...came up with the actual name |> in the end...
/endquote
Haskell has had "$" or "backwards pipe" for ages, but that is just another way of doing function application and it does not feel the same as (and is not used the same way as) the unix-style piping paradigm.
[1] https://fsharp.org/history/hopl-final/hopl-fsharp.pdf
[2] https://web.archive.org/web/20190217164203/https://blogs.msd...
Elixir took the pipe symbol from F# (See Jose's comment at [1]). For more details see my other comment below.
[1] https://elixirforum.com/t/which-language-first-introduced-th...
Shell would at least qualify as prior art, I’d assume
First thing I thought is "this is a SELECT * in disguise", which is something that you SHOULD never do in BigQuery. If you can combine it with CTEs, seems good because it adds legibility for complex queries. Also looks easier for the planner. Not for the optimizer, though.
KQL has this as well. Quite helpful and can make it easier to show intent.
I think the syntax is awesome. Too bad I’m using snowflake at the $DAYJOB
I’ve been using this for a bit more than a week already, I would say this is a great feature for iteration and interactive queries. Here’s my review:
Doing data exploration, analysis, and cleaning, this is way more productive than just SQL. As an example, fusing the aggregate functions with the group by keys creates a much faster workflow. I like that it unifies WHERE/HAVING/QUALIFY, and the set/extend/drop functions help (though Snowflake still wins for being able to declare a column and use it in the same select). Ultimately this gives me a one liner for situations where I’m creating hard to name intermediate CTEs, and that’s awesome!
Iterating on array columns or parsing non-trivial JSON is much better with this syntax too.
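For example, something along these lines (a sketch in the BigQuery flavor; events, payload and items are made-up names):
FROM events
|> EXTEND JSON_VALUE(payload, '$.user.id') AS user_id
|> JOIN UNNEST(JSON_QUERY_ARRAY(payload, '$.items')) item
|> AGGREGATE COUNT(*) AS item_count GROUP BY user_id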
This is a shift back to the data frame API of a few years ago, but we’re still missing typing helpers and support that data frame APIs could provide. It would be nice to have a system like this which plugs into language servers, or can list fields at each step with a mouse over/cursor hover, and getting editor completions (the BQ console does an okay job).
This syntax is great for DBT macros. You can just drop in entire transforms without worrying about polluting the query namespace or working around existing columns on your reference tables.
There's a dark side to this syntax. The imperative coding style comes with a tradeoff: the reader needs to track internal state through many steps. It's the same reason why SELECT * is often not recommended.
As a best practice I like to throw a `|> SELECT X,Y,Z` at the end of these pipe blocks to reinforce to the reader what the output looks like.
I should mention that it’s not as portable, but frankly all the major DBs aren’t portable, and other DBs also now support this syntax.
tl;dr: I like this feature, but use it sparingly. Avoid overuse in model files, definitely lean on it in analyses.
Just a question: if you're adding stuff at the end of the query... do the previous steps get cached?
I mean, if you type first:
FROM customers
|> WHERE signup_date >= '2024-01-01T00:00:00Z'
and then you add (and execute)
|> COUNT() AS tally GROUP BY channel
Does the first expression get cached by BQ?
No, I don't think BQ does a lot of caching for you even if you re-run the same query.