
NYT Test

The past week hasn’t been a very good one for Google. Just as they thought they were laying the groundwork for finally catching up to OpenAI with their Gemini releases, they now find themselves in the midst of an even bigger storm: accusations of bias (which, frankly, will be hard to deny) and a loss of trust that will be hard to recover from. It’s possible to give them the benefit of the doubt and assume that the issue was caused by a few overzealous but well-intentioned AI ethics researchers. And that might well be the case.

However, in this post, I want to share a far more common practice at large tech companies that I think is more to blame – if not directly, then at least in establishing a shared culture that helps such bias thrive. It’s called “the NYT test”, a concept meant to guide employees ethically by asking them: “If the New York Times were to write about your work, would they cover it positively or negatively?”

Let’s illustrate how the NYT Test can harm even the simplest of efforts. You’re building something new. For the sake of this post, let’s assume it’s something small: a solution to help Gmail users avoid getting duped into sending money to bad actors, a problem for which you’ve seen a recent surge in reports. Your solution is to build an ML model that labels which emails are requesting money, and those labels are in turn used by the spam filter algorithm. This approach allows trusted sources – such as Venmo requests or invoices – to typically pass through, while less trusted sources are more likely to get filtered out. You run offline tests and see that the model would have done a reasonably good job of mitigating the risk for your users.
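To make the hypothetical concrete, here’s a minimal sketch of how such a signal might feed into a spam-filter decision. Everything in it – the `money_request_score` stand-in model, the `TRUSTED_SENDERS` list, the threshold – is an illustrative assumption for this post, not a description of how Gmail actually works.

```python
# Hypothetical sketch: a money-request label combined with sender trust.
# All names and thresholds here are made up for illustration.

from dataclasses import dataclass

# Assumed allowlist of senders we treat as trusted sources of legitimate money requests.
TRUSTED_SENDERS = {"venmo.com", "quickbooks.intuit.com"}


@dataclass
class Email:
    sender_domain: str
    subject: str
    body: str


def money_request_score(email: Email) -> float:
    """Stand-in for an ML model that scores how likely an email is to be
    asking the recipient for money (0.0 to 1.0). A real system would use a
    trained classifier; this toy heuristic just counts keywords."""
    keywords = ("send money", "wire transfer", "payment due", "gift card")
    text = f"{email.subject} {email.body}".lower()
    hits = sum(kw in text for kw in keywords)
    return min(1.0, hits / 2)


def should_filter(email: Email, threshold: float = 0.5) -> bool:
    """Combine the money-request label with sender trust: trusted sources
    typically pass through, less trusted ones get filtered when the score is high."""
    if email.sender_domain in TRUSTED_SENDERS:
        return False  # trusted sources pass through
    return money_request_score(email) >= threshold


# Example: an untrusted sender asking for a wire transfer gets filtered out.
suspicious = Email("example-lottery.biz", "Payment due", "Please send money via wire transfer")
print(should_filter(suspicious))  # True
```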

Everything sounds good, right?

Someone on the team asks a question: “Would this filter out donation requests from political campaigns?” You answer “possibly” – after all, bad actors have pretended to be speaking on behalf of political campaigns in the past. They invoke the NYT Test.

“What if NYT writes a post about Gmail’s new spam filter filtering out donation requests from pro-choice campaigns?”

Your leadership is now freaked out about what is otherwise an outlier case, and you have no choice but to make modifications. You either half-ass it, because you know the NYT only cares about some of these issues, and only in some parts of the world. Or you put in a ton of effort to handle this globally across all issues, which likely reduces the overall efficacy of your solution because some bad actors now slip through.

Here's the biggest problem with the NYT test, though: a motivated writer and an editor who fails to uphold the highest standards mean that even your most well-intentioned work might still get called out. I know this well: the NYT wrote a piece in September 2022 in which they accused a partner team at LinkedIn of being unfair to job seekers and harming their chances "because they ran A/B tests". LinkedIn runs hundreds of A/B tests every year, which help them understand how they can help job seekers get the outcomes they want. The only way to verify that changes are having the intended effect is to run A/B tests. Ultimately, hundreds of thousands of LinkedIn members land jobs every year that they wouldn’t have without the innovation of these teams.

At the heart of this cultural rot that we’re seeing play out is the basic mistake we in Silicon Valley have made post-2016 – we focused too much on preventing potential harm, even when it meant sacrificing what's best for our users. The media convinced us that our work had unintended social consequences, and we believed them because we believed in the societal impact of our work. We thought it was our responsibility to prevent any harm that could come from our work, including any harm caused by bad actors misusing what we built.

What do we, Silicon Valley at large and not just Google, do from here?

The solution isn't to abandon safety and ignore unintended consequences. Instead, we need to put them in perspective. Our ethical guide shouldn’t be what the NYT (or Fox or Twitter or Dribbble) would say about our work, but what our customers say. The questions we should be asking ourselves are: are we helping them with their objectives? How do they feel using our products? How do we ensure that they trust us enough that other third parties can’t influence their opinions?

It’s also important to be aware that we will make mistakes, despite our best efforts. However, we shouldn’t overreact to those mistakes and create an environment of fear – which I assume also played a role at Google, given the recent layoffs. The focus should instead be on mitigating harm when it’s detected and limiting its reach as quickly as possible. For that reason, Google deserves kudos for taking down the ability to generate images of people until they have a solution.

Ironically, the entire exercise also shows the value of shipping versus debating theoretical harm – without allowing their users to try out Gemini, Google would have had no idea just how harmful their internal biases could be to their mission. The hard part – iterating on the product and fixing the foundational cultural issues – starts now.

Thanks to Aman Alam, Parul Soi, Sidu Ponnappa, Alex Patry, Yacine and Mustafa Ali for reading and giving feedback on past versions of this post.