As it has in previous years, WrangleConf 2017 brought to the foreground some of the most pressing issues for data scientists today. While there were certainly tactical discussions about how analysts ply their trade, the most powerful threads focused on the dangers of bias in all its forms and the need to meet audiences where they are when presenting data.
The danger of “implicitly think[ing] you’re right”
Drew Conway (Alluvium) kicked off the day with a sentiment that resonated throughout the room: that Nate Silver's very public “failure” in predicting the outcome of the 2016 presidential election led to “a creep and fear and misunderstanding of what we [data scientists] do.” Conway's call for more education around how data scientists do their work was echoed by Facebook's Sean Taylor, who contended that “we need to acknowledge and report the precision with which we know things.” Audiences who don't understand the fundamentals of data science may be easily swayed by its impressive accuracy (like Silver's 2012 streak), not understanding the uncertainty that's an inherent part of the process.
The need to embrace that uncertainty was underscored by Grant Ingersoll (Lucidworks), who warned against placing complete faith in the data you've collected. He said that it's dangerous when “you implicitly think you're right” because data scientists may not have many fail-safes to check their work. Conway echoed the need for caution, defining the process of data science in these terms: “We collect biased data and use biased tools to make inferences about a world that is impossible to measure.”
Even the systems data scientists work within can impact data teams. Jasmine Tsai (Clover Health) outlined the complexities of working in healthcare, saying that the fragmented nature of the field produces equally fragmented data and, in turn, unwieldy tables of barely comprehensible information. Only after Clover Health began transforming data before analysts worked with it could the team see past the mess and produce more valuable insights.
Avoiding the hubris of certitude requires two related exercises on the part of data scientists. First, they have to embrace a skeptical view of their own work. It's not enough to be aware of the limitations that method, tooling, and interpretation all add to the process; they have to be wary of them. Second, they have to acknowledge and mitigate the misconceptions that others may have about the work that they do. Only by communicating the potential promises and pitfalls will the recipients of their analysis be in a position to make well-informed decisions about the data they've wrangled.
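One way to act on Taylor's call to “report the precision with which we know things” is to attach an interval to every estimate instead of publishing a bare number. The following is a minimal sketch, not something any speaker presented: it assumes a simple random sample, uses the Wilson score interval as one reasonable choice among several, and the poll figures and the function name `wilson_interval` are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return (low, high) bounds of the Wilson score interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence under a normal approximation.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical poll: 520 of 1,000 respondents favor a candidate.
low, high = wilson_interval(520, 1000)
print(f"Estimated support: {520 / 1000:.1%} (95% interval: {low:.1%} to {high:.1%})")
```

In this made-up example the point estimate is above 50%, but the interval straddles it, which is exactly the kind of context a headline number hides. The interval captures sampling error only; the biased data and biased tools Conway described are not reflected in it.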
Map out the “Black Mirror scenario”
Beyond the question of what and how much data science can really tell us about the world, being thoughtful about the systems and metrics that drive analysts' work is essential for reasons of ethics and efficacy. Tyler Schnoebelen (integrate.ai) urged data scientists to map out the potential “Black Mirror scenario,” the worst or most dystopian repercussions of a system. Equally important is the metric by which those systems will be deemed successful. Taylor reminded us: “You choose to design your metric in a way that represents success to you.” Optimizing for those metrics means prioritizing some measures over others, a process that needs careful consideration in order to avoid unintended consequences.
Once metrics become enshrined, Taylor encouraged data scientists to be aware of how they, and especially the y-axis, can become rote and unnoticed as time goes by. Data scientists can provide guidance in determining the salience of metrics, ensuring that what's being measured is what's best for the business and the public at large. Likewise, Derek Steer (Mode) pointed out that while increasingly user-friendly tools have lowered barriers to entry in analytics, data scientists are in a position to provide essential statistical context when interpreting “the inputs and outputs” of analysis.
Together, these speakers point to the necessity of understanding the full breadth of analytical projects. Looking only at reports can obscure the decisions that led to the numbers, resulting in unfortunate outcomes for businesses basing decisions on metrics that no longer matter and catastrophic ones for communities impacted by biased programs or algorithms. Taking the time to reevaluate those metrics and the decisions behind them makes good business and ethical sense.
“Publishing data isn't enough”
Never has so much data been so freely available. And while it seems like that should mean audiences from the C-suite to Capitol Hill would take advantage, that's not necessarily the case. Discussing open data in government, Trey Causey (Socrata) noted that “publishing data isn't enough.” He found that putting data online didn't lead to meaningful public engagement because it wasn't terribly transparent or compelling to non-analysts. Instead, he suggested building apps focusing on issues like performance management and progress on public works projects that guide end users in understanding the data's importance. Likewise, data scientists looking to make an impact can't just hand their colleagues raw data. Guidance, whether in the form of data visualizations, projections, or narrative, helps business users understand the implications of that data and get the full benefit from it.
Once data is widely available and easily understood, the question becomes how receptive audiences are to it. In Conway's case, the answer was: not very. During his work in national intelligence, he found that motivated reasoning—that is, bias on the part of colleagues reviewing the data team's findings—meant that some were more interested in “defending budget, turf, and their own reputations” than in taking advantage of the insights that the data provided.
While Conway didn't offer concrete suggestions for overcoming audience resistance, Nell Thomas (Facebook) addressed the challenge of conveying insights to users with non-analytical backgrounds in the context of her work on Hillary Clinton's 2016 presidential campaign. She indicated that her team used two techniques. First, they relied heavily on carefully worded memos, standardizing the format and using peer review to hone the language in each one and ensure that information was conveyed clearly. Second, they invested in data visualization and interactive dashboards to support those messages and make them more compelling to end users. This focus on conveying data in a clear and convincing manner was all the more crucial in a campaign environment where communication (with the public, with the press, and with each other) was job one.
Data scientists can make their work more valuable to those who consume it by knowing who their audience is, understanding what motivates that audience, and creating reports designed to speak to less data-savvy colleagues where they are. Simply churning out reports is an efficient use of time only if report creation is the success metric. If, instead, data scientists want their work to make an impact, they will get much further by meeting audiences where they are, explaining not just the numbers but how they were arrived at, and couching insights in terms of the benefits audience members stand to gain.
This year's WrangleConf challenged data scientists to recognize and communicate the limits of the field while wielding analytical power in ethical and impactful ways. It's a mind-bending task, one that requires data practitioners to pause amid the press of deadlines and the desire to deliver and to reflect—to reflect on themselves, their tooling, and their responsibility in creating paths forward in business, technology, and public life.