There's an interesting discussion happening in the Silicon Valley data science community.
Last week about 100 data scientists gathered for the first ever WrangleConf, hosted by Cloudera. Good things start small.
And while the big conferences like Strata are largely about technology (Spark!) and the regular local meetups focus on specific tactics (recruiting, a company's tech stack, etc.), last Thursday we just couldn't stop talking about people.
Nearly every speaker and panelist touched on the importance of humans throughout the data science process, no matter how “smart” the machines get. We heard this concept bubbling up a few weeks ago at CrowdFlower's Rich Data Summit, too.
What's better than data-driven?
Data-informed. By now we all know “data-driven” as the popular phrase to show reverence for data in the decision-making process. But folks are starting to suggest that being driven by data can lead to over-reliance. The pendulum is swinging back toward common sense and domain expertise.
Several attendees pointed to this article by Andrew Chen as perfectly describing a better balance: “being data-informed means that you acknowledge the fact that you only have a small subset of the information that you need.”
Kevin Novak from Uber's data science team used surge pricing algorithms to demonstrate the appropriate balance between product vision and data. In the early days, Uber CEO Travis Kalanick insisted that the company needed dynamic pricing. Kevin pushed back—they didn't have any data to prove it would work. The company took the risk, allowing product vision to lead the way. As data began to flow in, data scientists and product folks worked closely to optimize the vision with data and vice versa. By now, there's so much data that the data science team leads the charge (get it?), perfecting dynamic pricing far beyond even the original vision.
We're biased. So too are our algorithms.
One of the more thought-provoking talks came from Clare Corthell, who pointed out what so many of us forget:
When bias is deeply baked into your data, it's your job to construct fairness into your approach. -@clarecorthell #wrangleconf
— Daniel Tunkelang (@dtunkelang) October 22, 2015
It takes people to build models. Models affect people.
When an investor commissioned an algorithm to identify future founders, then invited them to dinner, his algorithm became self-fulfilling: those selected gained an advantage over other would-be founders.
The problem is that the people who get funded by VCs today are already a biased sample. By training his algorithm on a data set that said a certain type of person gets funded, the only possible outcome was to predict that same type of person getting funded in the future. And then the dinner gave them another leg up.
Training based on biased data leads to biased models and predictions, so it's up to us to be thoughtful about our data and processes from the very beginning—the people at the other end rely on our judgment.
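To make that concrete, here's a minimal sketch in Python. It is not the investor's actual algorithm (the talk didn't share one); the records, the `insider`/`outsider` labels, and the function names are all made up for illustration. The "model" simply learns the historical funding rate for each background, which is about the simplest way to show a predictor echoing the bias in its training data.

```python
from collections import defaultdict

# Hypothetical, hand-made records: (background, skill_score, was_funded).
# The funded label tracks background rather than skill, mimicking a biased sample.
HISTORY = [
    ("insider", 0.4, True),
    ("insider", 0.6, True),
    ("insider", 0.5, True),
    ("insider", 0.3, False),
    ("outsider", 0.9, False),
    ("outsider", 0.7, False),
    ("outsider", 0.8, True),
    ("outsider", 0.6, False),
]


def fit_funding_rates(records):
    """Learn the historical funding rate for each background."""
    funded = defaultdict(int)
    total = defaultdict(int)
    for background, _skill, was_funded in records:
        total[background] += 1
        funded[background] += int(was_funded)
    return {background: funded[background] / total[background] for background in total}


def predict_funded(rates, background, threshold=0.5):
    """Predict funding purely from the learned rate for that background."""
    return rates.get(background, 0.0) > threshold


rates = fit_funding_rates(HISTORY)
outsider_prediction = predict_funded(rates, "outsider")
insider_prediction = predict_funded(rates, "insider")

print(rates)
print("outsider predicted funded:", outsider_prediction)
print("insider predicted funded:", insider_prediction)
```

In this toy data the outsiders have the higher skill scores, yet the predictor still favors insiders, because the only thing it learned is who got funded before. A real model would be more sophisticated, but the same pattern can show up whenever the labels themselves encode the bias.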
Training...humans
While poorly trained algorithms can lead to poor outcomes, the same is true of poorly trained humans. Pinterest's Andrea Burbank was not shy about the human element as she presented her talk on A/B testing best practices:
@arburbank #wrangleconf pic.twitter.com/seUWFJvUD9
— Gabriela de Queiroz (@gdequeiroz) October 22, 2015
Data scientist Mike Bernico lamented the way product builders often behave: “Find me a data scientist that will prove me right.” Josh Wills of Slack went one step further to say “the human capacity for post-hoc rationalization is basically infinite.”
Just as we continually train and optimize the work we do with data, people talked about keeping their teams on the cutting edge. Knowledge management is hard, but especially hard in the data world where we have to maintain both data knowledge (schema changes, metric definitions) and domain knowledge (answers to business problems and reasons for solving them).
Elena Grewal provided insight into how the data science team at Airbnb tackles ongoing knowledge sharing. At one point, a data scientist committed a markdown file to a Git repo, simply to explain the findings of a test. Other team members followed suit, and before they knew it, they had a lightweight knowledge repository. It turned out that this information was useful beyond just the data science team, and it is now referenced by teams across the company (as a personal aside, I hear this all the time: data scientists turn to Mode to solve this same challenge, and end up seeing engagement across their entire companies).
The tenor of WrangleConf was one of excitement, and it seems like this conversation is just getting started. It's knowledge sharing like this that helps us recognize blind spots and become even better champions for the strategic use of data.