Emerging Architectures for Modern Data Infrastructure

“The growth of the data infrastructure industry has continued unabated since we published a set of reference architectures in late 2020. Nearly all key industry metrics hit record highs during the past year, and new product categories appeared faster than most data teams could reasonably keep track.”-Future

A Very Big Deal

Snowflake goes shopping, and buys the store.-benn.substack

The FTC’s New Enforcement Weapon Spells Death for Algorithms

The Federal Trade Commission might have a new standard for penalizing tech companies that violate privacy and use deceptive data practices: make them destroy their algorithms.-Protocol

Disclose Your Angel Investments

The community has given us a lot. We should be transparent about it.-benn.substack

The Next Billion Programmers

The next product Mode’s Chief Analytics Officer would build? Excel, for everything.-benn.substack

Business in the Back, Party in the Front

Over the last decade, we’ve reached consensus on how the back of the data stack should look: get the data in with ELT, store and transform the data in cloud data warehouses, etc. How we handle the front of the data stack — the consumption layer — is still very much up for debate.-benn.substack

The Metadata Money Corporation

Selling the potential to be a standard lets companies spin stories about how their growth curves can go vertical. But standards that are only standards until a better idea comes along aren’t really standards at all.-benn.substack

Data's Trillion Dollar Question Mark

How a data warehouse could become a data platform—and an organizational brain.-benn.substack

Lies, Damned Lies, and Rankings: the Problem With Bloomberg's COVID Resilience Ranking

Despite embedded bias, scores aren’t going away. What’s important is recognizing the embedded bias and regularly reviewing the choice of factors and weights to ensure the bias is aligned with your goals and minimizes unintended consequences.-Zata Novo

File Not Found

There’s a generational divide in how we access information. Professors organize files with folders. Students search. But directory structure remains incredibly important in tech and STEM fields, leading new coders to constantly come up against “file not found” errors.-The Verge

What Really Happened When Google Ousted Timnit Gebru

This is the most in-depth piece we’ve seen about Google’s unceremonious dismantling of its Ethical AI team and the tensions inherent in an industry’s efforts to research the downsides of its favorite technology.-WIRED

Google Is Poisoning Its Reputation With AI Researchers

“Google has worked for years to position itself as a responsible steward of AI... But now its reputation has been badly, perhaps irreversibly damaged, just as the company is struggling to put a politically palatable face on its empire of data.”-The Verge

Your Local Police Department Might Have Used This Facial Recognition Tool To Surveil You. Find Out Here.

This database shows if the police department in your community is among the hundreds of taxpayer-funded entities that used Clearview AI’s facial recognition.-BuzzFeed News

Why the Pandemic Experts Failed

Data-driven thinking isn’t necessarily more accurate than other forms of reasoning, and if you do not understand how data are made, their seams and scars, they might even be more likely to mislead you.-The Atlantic

Python Developers Survey 2020 Results

Here’s one of many interesting tidbits: “Only 32% of the Python developers involved in Data analysis and Machine learning consider themselves to be Data Scientists.”-JetBrains

Developing a Database of Structural Racism–Related State Laws for Health Equity Research and Practice in the United States

“Although U.S. state laws shape population health and health equity, few studies have examined how state laws affect the health of marginalized racial/ethnic groups (e.g., Black, Indigenous, and Latinx populations) and racial/ethnic health inequities.”-SAGE Journals

Data Feminism

“Illustrating data feminism in action, D'Ignazio and Klein show how challenges to the male/female binary can help challenge other hierarchical (and empirically wrong) classification systems. They explain how, for example, an understanding of emotion can expand our ideas about effective data visualization, and how the concept of invisible labor can expose the significant human efforts required by our automated systems.”-Catherine D'Ignazio and Lauren F. Klein

This Is How We Lost Control of Our Faces

Over the last 43 years, facial-recognition researchers gradually abandoned asking for people’s consent. Now, more and more personal photos are used in datasets without their owners’ knowledge.-MIT Technology Review

What Is Data Justice? The Case for Connecting Digital Rights and Freedoms Globally

“This paper posits that just as an idea of justice is needed in order to establish the rule of law, an idea of data justice – fairness in the way people are made visible, represented and treated as a result of their production of digital data – is necessary to determine ethical paths through a datafying world.”-Big Data & Society

COVID-19 Vaccine Distribution Algorithms May Cement Health Care Inequalities

Many of the algorithms used by federal and state governments rely on data from the U.S. Census. The U.S. Census regularly undercounts vulnerable populations.-VentureBeat

How Our Data Encodes Systematic Racism

“I’ve often been told, ‘The data does not lie.’ However, that has never been my experience. For me, the data nearly always lies.”-MIT Technology Review

Google Employees Say Scientist's Ouster Was 'Unprecedented Research Censorship'

Until last Wednesday, Timnit Gebru was a co-lead of the Ethical AI team at Google. She is one of the few Black women working in this field. Her firing brings up an often-raised question: can a company be trusted to hold its technology accountable?-NPR

Emerging Architectures for Modern Data Infrastructure

“In the last two years, we talked to hundreds of founders, corporate data leaders, and other experts – including interviewing 20+ practitioners on their current data stacks – in an attempt to codify emerging best practices and draw up a common vocabulary around data infrastructure.”-Andreessen Horowitz

Towards Decolonising Computational Sciences

“We see this struggle as requiring two basic steps: a) realisation that the present-day system has inherited, and still enacts, hostile, conservative, and oppressive behaviours and principles towards women of colour (WoC); and b) rejection of the idea that centering individual people is a solution to system-level problems.”-arXiv.org

‘People of Colour Aren’t Empowered to Make Changes They’re Brought in to Make’

Inioluwa Deborah Raji of the AI Now Institute talks about how she got started in AI ethics and why tech companies aren’t doing enough to address systemic bias in their products.-Silicon Republic

IBM Walked Away from Facial Recognition. What About Amazon and Microsoft?

While this decision comes amidst the nationwide focus on police brutality, the folks at the Algorithmic Justice League have been beating the drum about facial recognition bias for years.-VentureBeat

Don’t Be Fooled by America’s Flattening Curve

At first glance, the national and state-wide new COVID-19 cases appear to be leveling off. But removing major metropolitan areas (where cases are declining) from the calculation reveals that a series of regional “mini-epidemics” is on the rise.-The New York Times

Female Pioneers in Computer Science You May Not Know

These women paved the way for computer and data science as we know it today.-Re-work

New Research Suggests the US Unemployment Rate is About to Become Useless

For this crisis, the employment to population ratio may be a better measure to assess the job market.-Quartz

Data on COVID-19 Testing

Comparing confirmed cases across countries is a complicated task because there’s no unified definition of what a confirmed case is. In Germany, it’s samples tested. In the U.K., it’s people. And in some countries, the units are unclear or inconsistent.-Our World in Data

Data Centers Are the New Oil

What connects politics, Utah, and Matthew McConaughey? Rooms upon rooms of servers.-Normcore Tech

Data Science Careers for Baltimore’s Underserved Community Members

This inspiring initiative offers a viable model for providing data science training to those who might not be able to access it otherwise.-Hopkins Bloomberg Public Health Magazine

Racial Bias in a Medical Algorithm Favors White Patients Over Sicker Black Patients

“Correcting the bias would more than double the number of black patients flagged as at risk of complicated medical needs...”-The Washington Post

Estimating the success of re-identifications in incomplete datasets using generative models

“Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.”-Nature

How R-Ladies made data science inclusive

Just 14% of R users are women, but that’s actually unusually high for a programming language. And we have R-Ladies to thank!-Quartz

A Turbulent Year: The 2019 Data & AI Landscape

“In a world where data-driven automation becomes the rule (automated products, automated cars, automated enterprises), what is the new nature of work? How do we handle the social impact? How do we think about privacy, security, freedom?”-Matt Turck

Why Hadoop Failed and Where We Go from Here

Hadoop was excellent at economically harnessing data types that were constantly evolving. Managing the core data of an enterprise? Not so much.-Teradata

Python's Caduceus syndrome

What happens when a programming language grows up?-Normcore Tech

Grocery Bills Can Predict Diabetes Rates by Neighborhood

Dietary habits are notoriously difficult to monitor. By analyzing sales figures from London’s biggest grocer, data scientists were able to link eating patterns with local rates of high blood pressure, high cholesterol, and high blood sugar.-MIT Technology Review

An Algorithm Wipes Clean the Criminal Pasts of Thousands

When you see “criminal” and “algorithm” in a headline together, it's usually a sign the article will be about unfair bias. But not in this case! Code for America used an algorithm to automatically remove cannabis convictions from Californians' records, reducing a process that would have taken months to mere minutes.-BBC

A Weather Tech Startup Wants to Do Forecasts Based on Cell Phone Signals

Speaking of making your own data... ClimaCell is developing a new mathematical model that turns cell phone signals into weather data that's way more accurate than your local weatherperson.-MIT Technology Review

Writing a Letter to DataCamp

In the wake of a case of sexual misconduct at DataCamp, one instructor reflects on her relationship with the company, and why she doesn't want you to take her courses.-Julia Silge

Coding Is for Everyone—as Long as You Speak English

If people can translate programming languages easily enough into esoteric versions like LOLCODE, why are there only four programming languages widely available in multilingual versions?-Wired

Scientists rise up against statistical significance

“We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence.”-Nature

Facial recognition's 'dirty little secret': Millions of online photos scraped without consent

Earlier this year IBM released a dataset of 1 million photos of people's faces designed to reduce bias in facial recognition software. These photos were obtained from Flickr, without users' knowledge or consent.-NBC News

How Your Health Information Is Sold and Turned Into ‘Risk Scores’

Companies such as LexisNexis have collected personal data to help doctors make informed decisions about prescribing opioids. And since no law prohibits collecting such data or using it in the exam room, it's happening without patient consent.-Politico

Why There Will Be No Data Science Job Titles By 2029

“The only thing that is certain is change, and there are changes coming to data science. One way to be on top of this trend is to not only invest in data science and machine learning skills but to also embrace soft skills.”-Forbes

Demand and Salaries for Data Scientists Continue to Climb

Data science job openings are expanding faster than the number of technologists looking for them.-IEEE

I Gave a Bounty Hunter $300. Then He Located Our Phone

T-Mobile, Sprint, and AT&T are selling access to their customers’ location data, and that data is ending up in the hands of bounty hunters and others not authorized to possess it, letting them track most phones in the country.-Motherboard

Amazon scraps secret AI recruiting tool that showed bias against women

Don't just read this article—read the discussions around it too. Peter Aldhous makes a great point that “This is being reported as a problem with machine learning, but there's another way of looking at it: The algorithm exposed bias in their existing hiring practices.”-Reuters

Who needs democracy when you have data?

“As far as we know, there is no single master blueprint linking technology and governance in China. But there are several initiatives that share a common strategy of harvesting data about people and companies to inform decision-making and create systems of incentives and punishments to influence behavior.”-MIT Technology Review

To work for society, data scientists need a hippocratic oath with teeth

Guess who was totally unsurprised by the unfolding data scandals surrounding Cambridge Analytica and Facebook? Cathy O’Neil, author of Weapons of Math Destruction. In this interview, O’Neil shares her vision for combatting the silent, society-wide bureaucracy governed by algorithms and big data.-Wired

A Code of Ethics for Data Science

Speaking of the responsible use of data… the former U.S. Chief Data Scientist has issued a rallying cry for the data science community to band together and take a leadership role in defining right from wrong. If you’re interested in contributing to the conversation, join the Data for Democracy Slack group.-DJ Patil

Fitness tracking app Strava gives away location of secret US army bases

In a case of content marketing gone wrong, fitness tracker Strava shared a heatmap of every single user activity ever uploaded to the app. Although pretty, the map is detailed enough for someone to clearly identify internal layouts of foreign US army bases in countries such as Afghanistan, Djibouti, and Syria.-The Guardian

What is the Future of Pandas

A must-watch talk for any pandas developer.-PyData

Five ways to fix statistics

The debate rumbles on and on: how much is bad statistics to blame for poor reproducibility? Nature asked influential statisticians to recommend one change to improve science and found the problem is not numbers, but ourselves.-Nature

Why is this company tracking where you are on Thanksgiving?

A data study of how political divisions affected 2016's Thanksgiving celebrations is raising some eyebrows within the data science community, including those of former U.S. Chief Data Scientist DJ Patil. SafeGraph provided the researchers with 17 trillion very specific location markers for 10 million smartphones, despite claiming that the data they collect is anonymized.-The Outline

When Data Science Destabilizes Democracy and Facilitates Genocide

Last week’s Senate Intelligence hearing with Facebook, Twitter, and Google shined a bright light on the ethical responsibility of tech companies—and their data scientists.-fast.ai

Not a revolution (yet): Data journalism hasn’t changed that much in 4 years, a new paper finds

Exploring the news through an interactive visualization can feel cutting-edge, but data journalism's labor intensity and reliance on officially collected data make it “more likely to complement traditional reporting than to replace it on a broad scale.”-NiemanLab

R for Journalists

This site is a great launch pad for anyone who's new to R, journalist or not. Each post provides step-by-step instructions and code for making a visualization with data about a current event.-R for Journalists

The ‘Nate Silver Effect’ Is Changing Journalism. Is That Good?

“Political journalism has become infatuated with opinion polls... and yet news organizations remain ill-equipped to make sense of the flood of data.”-Politico Magazine

The Media Has A Probability Problem

In the final installment of a series reviewing news coverage of the 2016 general election, Nate Silver explores the challenges of calculating, interpreting, and communicating probabilities to the public.-FiveThirtyEight

A Tale of Two Industries: How Programming Languages Differ Between Wealthy and Developing Countries

The latest analysis from Stack Overflow found correlations between certain technologies and GDP per capita. Particularly interesting: questions regarding two data science powerhouses, R and Python, are asked more frequently in high-income countries.-Stack Overflow

Data On Drug Use Is Disappearing Just When We Need It Most

“We’re simply flying blind when it comes to data collection, and it’s costing lives.”-FiveThirtyEight

Dissecting Trump’s Most Rabid Online Following

Come for the “subreddit math,” stay for the latent semantic analysis methodology.-FiveThirtyEight

Airbnb’s worst problems are confirmed by its own data

While roughly 71 percent of hosts rented out their home for three months or less, there were still thousands of 'whole units', meaning an entire house or apartment, which were rented for six months or more during the last year.-The Verge

Airbnb Says Data Dump Shows Misuse of Service Is Rare

With its release of a trove of data this week, the short-term rental company Airbnb sought to underscore how the majority of its hosts in New York City are playing by the rules.-New York Times

Hans Rosling: An Appreciation

Hans Rosling championed the idea of showing people what the world was really like – and how it was different from their preconceptions – using data and visualization.-eagereyes

Remembering Hans Rosling, the visualization pioneer who made data dance

Rosling's work was a driver of some of the explosion of interest in data visualization in the news and nonprofit sectors starting in the early 2000s. His BBC special and TED Talks sparked an interest in 'storytelling with data,' rather than just with words.-Wonkblog

What It Takes to Truly Delete Data

Can an entire dataset of important information really be deleted, just like that?-FiveThirtyEight

States Move to Protect Their Immigration Data from the Trump Administration

Washington’s governor has asked staff to figure out how to keep data from being used for mass deportations.-The Verge

How statistics lost their power – and why we should fear what comes next

“Not only are statistics viewed by many as untrustworthy, there appears to be something almost insulting or arrogant about them. Reducing social and economic issues to numerical aggregates and averages seems to violate some people’s sense of political decency.”-Guardian

Finally, Uber Releases Data to Help Cities With Transit Planning

But it’s not the highly coveted numbers cities need. How helpful is the company’s new data tool?-CityLab

Uber Extends an Olive Branch to Local Governments: Its Data

The ride-hailing company Uber and local governments often do not play well together. But now, with a new data-focused product, Uber is offering a tiny olive branch to its municipal critics.-New York Times

A non-comprehensive list of awesome things other people did in 2016

Here’s a good year-in-review for all you stats lovers out there.-Simply Statistics

Scientists are frantically copying U.S. climate data, fearing it might vanish under Trump

Alarmed that decades of crucial climate measurements could vanish under a hostile Trump administration, scientists have begun a feverish attempt to copy reams of government data onto independent servers in hopes of safeguarding it from any political interference.-Washington Post

How Trump’s White House Could Mess With Government Data

Outright manipulation may be unlikely, but there are subtler things the administration could do.-FiveThirtyEight

White House Special with DJ Patil, US Chief Data Scientist

In this interview, DJ talks about the government’s relationship with Silicon Valley, the White House’s position on data ethics, and why George Washington was actually the first U.S. Chief Data Scientist.-Partially Derivative

2016: A Year of Data-Driven Confusion

“We need strong mechanisms for ethical and fair practices within teams and organisations, and a culture where pushing back on conclusions is well-received and seen as a sign of strength, not of defiance.”-Model View Culture

Yes, the election polls were wrong. Here's why

We treat polls like weather forecasts – but voters are inherently unpredictable. A hunger for certainty sets expectations that are impossible to meet.-Guardian

Meet a Polling Analyst Who Got the 2016 Election Totally Wrong

Sam Wang opens up about political forecasting, eating crickets on live television, and what we can all learn from Hillary Clinton’s shocking loss.-Pacific Standard

How Data Failed Us in Calling an Election

It was a rough night for number crunchers. And for the faith that people in every field — business, politics, sports and academia — have increasingly placed in the power of data.-New York Times

Why are we so surprised?

In theory, we should not be surprised by the outcome of the 2016 presidential election, but in practice we are.-Probably Overthinking It

Data Sets Are The New Server Rooms

As Foursquare has proven, collecting proprietary data from the get-go can lead to a major competitive advantage in the long run. But doing so requires cash, and lots of it.-John Nussbaum

Ethics for powerful algorithms

Contrary to a ProPublica investigation, COMPAS—a proprietary algorithm used to predict recidivism and inform parole—isn’t statistically biased against black people. However, that doesn’t mean COMPAS isn’t deeply unfair. This is the first of four posts digging into data science ethics.-Abe Gong

The Genomics Inflection Point: Implications for Healthcare

Genomics has the potential to massively improve on our collective health. Although cost has dropped significantly and technology has improved, genomics hasn’t yet been widely adopted by the public. This survey of 1,000 consumers sheds light on the challenges genomics faces before becoming a normal part of everyday healthcare.-Rock Health

Data Journalism Awards 2016: what the winners tell us about the state of the data nation

The Data Journalism Award winners were announced last Thursday. The director of the awards reflects on what these winners reveal about the state of data journalism.-Simon Rogers

Uber Checks Into Foursquare’s Massive Location Database

Uber will now tap into Foursquare's location data, especially its "point of interest" data (restaurants, stores, landmarks, etc.), to enhance its database of locations.-Fortune

Uber taps Foursquare’s Places data so you never have to type an address again

Foursquare is providing points of interest data to Uber so that riders can type in venue names to specify their pick-up and drop-off locations.-TechCrunch

What’s driving Silicon Valley to become ‘radicalized’

The fallout from Apple vs. the FBI has the tech industry rattled. More and more companies are upping security—collecting less information, investing in tougher encryption, and giving customers the keys to their own data.-Washington Post

When newsrooms don’t own their data, other companies profit

Companies like Foursquare have proven that there’s power in building proprietary datasets. And that raises the question: how might news publishers aggregate information to create enterprise data models of their own?-Poynter

An unlikely source predicted Chipotle's disastrous quarter, and it says a lot about the future of investing

Not everyone was caught off guard by the scale of the drop in same-store sales at Chipotle. Using foot traffic data, Foursquare called it.-Business Insider

Microsoft’s Tay is an Example of Bad Design

Or Why Interaction Design Matters, and so does QA-ing.-Caroline Sinders

Here's How We Prevent The Next Racist Chatbot

Tay.ai is the consequence of poor training.-Popular Science

Why Microsoft Accidentally Unleashed a Neo-Nazi Sexbot

It’s not surprising that Microsoft’s chatbot spewed racist invective, but here’s how it could have been avoided.-MIT Technology Review

Moneyball for Book Publishers: A Detailed Look at How We Read

Publishers are now using reader behavior data collected from e-readers to inform decisions about advertising budgets and marketing tactics. Obviously, the impact of reading analytics presents concerns for authors and readers alike.-New York Times

We Now Have Algorithms To Predict Police Misconduct

You’ve probably heard of predictive policing, but what about predictive policing for the police? One police department teamed up with researchers to test an algorithm that detects troublesome behavior of officers early on.-FiveThirtyEight

Why data journalism tries, and fails, to go global

With the success of data blogs like The Upshot and data publications like FiveThirtyEight, it feels like data journalism is making a big impact. But in countries where data journalism could do the most good, there are obstacles that bootcamps and hackathons can’t overcome.-Sunlight Foundation

The Ethical Data Scientist

Even though the ethics of data science have been bubbling up in conversation lately, we don’t talk about them nearly as much as we should. Why is that? And how can we go about fixing it?-Slate

Let’s Move Beyond Open Data Portals

Open data portals have been integral to making government more transparent. So why is a man who spent much of his career opening data now arguing that we should abandon open data portals altogether?-Abhi Nemani

On research parasites and internet mobs - let’s try to solve the real problem.

The New England Journal of Medicine recently published an editorial about data sharing which referred to people who use data secondhand as “research parasites.”-Simply Statistics

The Experiment Experiment

When psychologist Brian Nosek tried to reproduce the results of 100 studies published in the top peer-reviewed scientific journals, only 39 could be replicated. Might the scientific community have an unconscious bias toward publishing positive results? Find out.-Planet Money

The Future of Big Data and Analytics in K-12 Education

At edtech startup AltSchool’s private campuses, student actions are recorded every day. AltSchool’s software and algorithms search this data for patterns and make suggestions for how to improve student performance. If you only read one article today, this is it.-Education Week

Georgia Tech Researchers Demonstrate How the Brain Can Handle So Much Data

Random projection is frequently used in machine learning to make sense of big, diverse data. It turns out this method could be one of the ways that humans learn, too.-Georgia Tech

Your Doctor Doesn’t Want to Hear About Your Fitness-Tracker Data

While your Fitbit or Apple Watch can be great for tracking your activity and weight loss, it might not help your doc too much. From these doctors’ perspectives, the most promising wearables are yet to come.-MIT Technology Review

