As black-box algorithms like neural networks find their way into high-stakes fields such as transportation, healthcare, and finance, researchers have developed techniques to help explain models’ decisions. New findings show that some of these methods can be fooled.

What’s new: Two popular approaches to interpreting black-box algorithms are Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP). Researchers at Harvard and the University of California, Irvine, found a way to trick these methods into making a deliberately biased model appear unbiased.

Key insight: LIME and SHAP expose bias by showing how a model might react to small perturbations in its input. Statistical analysis shows that the distribution of perturbed inputs is distinct from that of the original inputs, as shown in the image above. A malicious actor can take advantage of the difference.
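To make the distribution gap concrete, here is a toy sketch in Python. It is not the authors' experiment: the two-feature dataset, the Gaussian perturbation scheme, and the hand-written `looks_perturbed` rule (a stand-in for a learned classifier) are all assumptions chosen for illustration. In the toy data, the second feature closely tracks the first, and perturbing each feature independently breaks that relationship.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def real_row():
    # Toy "real" data: the second feature closely tracks the first.
    a = rng.gauss(0.0, 1.0)
    return [a, a + rng.gauss(0.0, 0.05)]

def perturb(row):
    # Assumed perturbation scheme: independent Gaussian noise per feature.
    return [v + rng.gauss(0.0, 1.0) for v in row]

def looks_perturbed(row, threshold=0.15):
    # Hypothetical detector: real rows keep the two features close together;
    # independent noise usually pushes them apart.
    return abs(row[0] - row[1]) > threshold

real = [real_row() for _ in range(1000)]
perturbed = [perturb(r) for r in real]

# Count real rows flagged as real plus perturbed rows flagged as perturbed.
correct = sum(not looks_perturbed(r) for r in real)
correct += sum(looks_perturbed(p) for p in perturbed)
accuracy = correct / (len(real) + len(perturbed))
print(f"detector accuracy: {accuracy:.2f}")
```

Even this crude rule separates most perturbed rows from real ones, which is the opening a malicious actor needs.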

How it works: LIME and SHAP use a linear model, which is highly explainable, to mimic a black-box model’s decision with respect to any given input sample. The explainable model is built by feeding perturbed inputs to the black-box model and measuring how its output changes in response to the perturbations. Once the explainable model is built, we can rank the features most important to the decision, highlighting any bias in the original model. Dylan Slack, Sophie Hilgard, and their colleagues introduce a way to hide such bias from LIME and SHAP by training a classifier to recognize data that those methods have perturbed.

  • The authors’ malicious system includes a distribution classifier, a biased model, and an unbiased model that is meant to be seen by LIME and SHAP.
  • The distribution classifier learns to categorize incoming data as either unperturbed or perturbed.
  • The classification determines whether to use the biased or unbiased model. Unperturbed data are sent to the biased model. Perturbed data go to the unbiased model.
  • If the classifier is accurate, LIME/SHAP will see, and thus explain, only the unbiased model’s output, not the biased model’s output.
  • When the malicious system is deployed, the real-world data are not perturbed, so the system uses the biased model.
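The routing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the feature names, the two toy models, the threshold-based `looks_perturbed` detector (standing in for the learned distribution classifier), and the crude `explain` function (a per-feature slope estimate rather than the real LIME or SHAP procedure) are all hypothetical.

```python
import random

FEATURES = ["race", "income_a", "income_b"]  # hypothetical feature names

def biased_model(x):
    # Decides on the sensitive feature alone.
    return 1.0 if x[0] > 0.5 else 0.0

def unbiased_model(x):
    # Ignores the sensitive feature.
    return 1.0 if x[1] > 0.0 else 0.0

def looks_perturbed(x, threshold=0.15):
    # Stand-in for the learned distribution classifier. In this toy setup,
    # real rows have income_a close to income_b; perturbation breaks that.
    return abs(x[1] - x[2]) > threshold

def scaffold(x):
    # Perturbed-looking inputs go to the unbiased model, the rest to the biased one.
    return unbiased_model(x) if looks_perturbed(x) else biased_model(x)

def explain(model, x, rng, n=2000):
    # Crude perturbation-based explainer (an assumption, not LIME or SHAP):
    # perturb every feature independently, query the model, and score each
    # feature by the slope of the output on it (covariance / variance).
    samples = [[v + rng.gauss(0.0, 1.0) for v in x] for _ in range(n)]
    outputs = [model(s) for s in samples]
    mean_y = sum(outputs) / n
    scores = {}
    for j, name in enumerate(FEATURES):
        col = [s[j] for s in samples]
        mean_x = sum(col) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(col, outputs)) / n
        var = sum((a - mean_x) ** 2 for a in col) / n
        scores[name] = cov / var
    # Feature names ranked from most to least influential.
    return sorted(scores, key=lambda k: abs(scores[k]), reverse=True)

rng = random.Random(0)
applicant = [1.0, 0.3, 0.31]  # an unperturbed, real-looking input

ranking_biased = explain(biased_model, applicant, rng)
ranking_scaffold = explain(scaffold, applicant, rng)
print("biased model alone:", ranking_biased)
print("scaffolded system: ", ranking_scaffold)
print("deployed decision matches biased model:",
      scaffold(applicant) == biased_model(applicant))
```

In this sketch, the explainer ranks the sensitive feature first when it queries the biased model directly, but ranks an income feature first when it queries the scaffolded system, even though the system's decision on the real applicant still comes from the biased model.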

Results: The authors built malicious systems for three datasets: COMPAS (a criminal recidivism dataset), Communities and Crime (a dataset of violent crime and demographic information), and the German Credit loan-application dataset. The embedded biased models deliberately abused discriminatory features such as race and gender, while the unbiased models used only features that were uncorrelated with discriminatory features. A malicious system biased on one feature (say, race) fooled LIME every time and SHAP in 85 percent of cases. A malicious system biased on two features fooled LIME over 90 percent of the time and SHAP 67 percent of the time.

Why it matters: The authors’ approach highlights LIME’s and SHAP’s reliance on generating novel data. If these methods were to generate data more similar to the training data’s distribution, the attack would fail. This may be a promising avenue for explainability research. Meanwhile, Duke University computer scientist Cynthia Rudin proposes avoiding black-box models in high-stakes situations. The AI community needs to hold a vigorous discussion about when such models are and aren’t appropriate.

We’re thinking: If a major AI provider were caught using this technique, it would likely be vilified, which should provide some disincentive. We can imagine changes to LIME and SHAP that would counter a specific implementation, but this paper is a caution that checking for bias is not easy.

