Sep 18 2024 01:31 AM - edited Sep 18 2024 01:45 AM
Microsoft SQL Server is a database engine based on SQL, which this year turns 50. SQL is known as a "set-oriented" language, meaning it can manipulate sets of data rather than traditional variables used in standard programming languages.
A SELECT statement represents a set. In this historical view of SQL, the "ORDER BY" clause poses a problem because sets are not "orderable." A set A is defined by its elements, regardless of order; for instance, the set {a, b, c} is the same as {b, a, c}. This is not the only issue with the set-based conception of SQL. In a set, elements are ALWAYS distinct; however, in SQL, the DISTINCT clause is optional. In general, a SELECT can produce indistinguishable elements.
The error is fundamental and carries significant consequences. The point is that a SELECT is not a set at all but something more: a SELECT is a stream of elements from a set, which is a substantial difference. If we represent a stream with the same syntax {}, what happens is that:
Alright, some theoretical blah blah, but does it make a difference? Yes, it changes a lot.
In a stream-based approach, the ROW_NUMBER operator would have appeared 50 years ago, not 20 years later. Why did it take so long? Clearly, because a set-based approach does not foresee any numbering of elements, while in a stream, it is the most obvious and natural thing. In SQL Server, to this day, there is no possibility, albeit incredibly powerful, to operate recursively on the previous record, something like:
-- (Returns the previous x * 2 and uses 0 as the initial value)
SELECT x = previous(x, 0) * 2
Why doesn't this operator exist, so obvious and capable of solving countless problems? Because its definition conflicts with the [incorrect] set-based vision of SQL.
Another evident case is given by window functions, based on the OVER clause. The OVER clause is powerful, but its implementation is baroque, again due, in my opinion, to the fundamental misunderstanding: SQL does not handle sets but streams!
Thus, ROW_NUMBER(), in its current implementation, always requires an OVER clause, an absurdity linked to the idea that without the OVER clause, there is a set. Sets do not have an order, so the row numbering does not make sense. This is nonsense. In reality, a SELECT is a stream, and streams ALWAYS have an order. ROW_NUMBER() does not require any OVER clause, and the OVER clause becomes a potent optional tool to manipulate streams.
SELECT ROW_NUMBER(), ... FROM a
This is more than legitimate, it is obvious and powerful. Many theorists turn up their noses at indeterminism, but it is just an absurd obsession with control. Yes, many queries generate indeterministic results, so what? The query
SELECT a FROM t
Generates a repetition of a for each occurrence of every value, so what’s the problem? To obtain deterministic queries, it would be necessary to enforce that:
And I think we can agree that this is absurd. It’s not absurd in the sense that "it would theoretically be nice to have only such well-defined queries but is practically unfeasible"; it’s absurd in the sense that it would be awful to reduce the power of relational databases to this ridiculous subset of possibilities.
The widespread internalization that SQL is a stream-based, not set-based language, would lead to immediate [beautiful] changes in a future version of SQL, specifically:
I have been in this field for many years, and for many years, I have believed that relational database theory contains this fundamental error.
Will I ever see a version of SQL that overcomes this foolish misunderstanding?
marc.