“In any dispute the intensity of feeling is inversely proportional to the value of the issues at stake.“
-- Sayre’s law
As data scientists, our work sways elections; our models move markets; our recommendations launch and dismantle businesses. In positions of such influence, we haven’t the time to waste on petty distractions of little import. Which is why I believe it’s critical to step back from our conversations on the ethical implications of mass data collection or the role of algorithmic decision-making—the trivial stuff—and take on a far more consequential matter: Should SQL queries use trailing or leading commas?
While the correct answer is clear—it’s trailing commas; I’m not trying to pretend to be a fair juror here—the issue remains divisive. So how do we settle it?
The ultimate arbiter of positional punctuation, the New Yorker, runs a regular column, about actual commas, called “Comma Queen.” They have not yet considered SQL. We collectively look forward to their judgment on the subject. Until then, though, we have an interim solution: Hundreds of thousands of people have created nearly a billion SQL queries in Mode. If other data scientists aren’t swayed by aesthetics—I mean, just look at those leading commas, awkwardly hanging out like a drawer that won’t quite close—maybe they’ll be swayed by hard empirical evidence.
Down the rabbit hole we go.
The queries we write
Most queries use trailing commas. Of all the distinct queries in Mode that were edited by one person and have at least five lines in their select statement, 82.3% use trailing commas, compared to 16.3% that use leading commas (which, yes, leaves 1.4% that, hideously, use both).
If we assign comma preferences to people, determined by how they use commas in their queries, trailing commas continue to dominate: 80% of people use only trailing commas. Just over 5%, however, exclusively use leading commas, leaving nearly 14% of query authors in a disturbing middle, as tragic wafflers, acquiescent centrists, or intentional vandals, more anarchist than analyst.
These "undecideds" aside, this first result is clear: a strong preference for trailing commas; a point for the good guys. To be fair to the opposition though, millions of people also watch Love Island every day. Just because people do something doesn't mean that they should.
Troublingly, the more someone should know what to do, the less they prefer trailing commas. 76% of people who've written over 100 queries favor trailing commas, compared to 89% of people who've authored ten or fewer.
Still, what do the experts know? The experts told us the 2007 economy was great. The experts told us that Theranos was great. The experts told us that The Shape of Water was great. Forget the experts—what do the queries themselves tell us?
The queries we should write
Leading comma advocates typically argue that leading commas are better because you can easily comment out lines in a query without having to edit other lines. With trailing commas, that edit is necessary to prevent errors and, they say, easy to forget.
If this theory is true, then we'd expect queries that use leading commas to fail less frequently than those that use trailing commas.
This, regrettably, is true. Among people who display a consistent preference for leading commas, 12.7% of their queries fail because of a syntax error. The error rate for people who use trailing commas is 13.4%.
This rate could be explained by other factors though. As we saw earlier, people who use leading commas tend to be more experienced analysts. That experience, not their disdain for sensible punctuation, might be what’s causing their rates to be lower.
But it’s not: Though error rates improve as people write more queries, leading comma error rates are lower than trailing comma rates at every experience level.
Because error rates differ by database as well, database type could be another confounding variable. Still no—leading comma error rates are also lower for every major database (except MySQL) for which syntax errors can be easily identified.
No matter how else we cut it—by type of query, by query age—the same story emerges: leading commas perform better. The evidence is insurmountable, the data undeniable. The horror is real.
An act of rebellion
And yet—80% of us still refuse to act accordingly. And it’s not just because we haven’t discovered leading commas: 36% of those who use leading commas abandon it. Of the people who use trailing commas, only 12% are lured away. Perhaps most tellingly, of the people in the tortured, undecided middle who eventually choose a side, most choose trailing commas.
And so will I. I’ll do with this analysis what any good data scientist would: stubbornly ignore what the data says, convince myself it’s probably wrong anyway (I had to write regexp to parse SQL queries; who can trust that?), and refuse to change because I care about more about how my queries look than how they perform.
Some might call it vanity, foolishly trading efficiency for beauty. I call it art. Like Frank Gehry, we trailing comma defenders care about function and form; like William Carlos Williams, we care about what our creations evoke, not the rules they abide by; like Maeve, we care about being humans, not machines. Our commas are silent acts of rebellion against the ruthless optimizations—the bullet journals, the working lunches, the life hack listicles, the Medium posts about the habits of highly successful people - that trade our humanity for marginal productivity gains. With every trailing comma we write, we cry out: We are writers; we are creators. We are human; we are large; we contain multitudes.
So, the next time you find yourself questioning where commas go in SQL query select statements, the answer is simple:
While leaders lead with leading commas, and trailing commas are leading signs of failing lines, and the tale aligns no matter the database breed, we’re not agreed that it's best to concede to lead because the more we scale our query kneading, the more we follow the trail to trailing from leading.
Because it’s people, who do the reading.