How do we know when AI tools are safe to use? When should we trust them in high-stakes situations? And given their complexity and opacity, how can we be sure they’re doing what we think they are doing?
These questions aren’t new, but they’ve become even more pressing since OpenAI released ChatGPT late last year and triggered an outpouring of other Generative AI models. One way to answer them is to conduct audits that aim to establish whether models are indeed performing as intended. Such reviews offer an opportunity to identify and mitigate risks in AI tools before they can cause real-world harm.
But there’s an urgent challenge at the heart of this approach that still needs to be addressed. Today’s AI principles and governance frameworks typically don’t provide enough actionable guidance about how to conduct an audit or what quantitative and qualitative standards tools should meet. One relevant example is the “AI Ethics Framework for the Intelligence Community”, a preliminary version of which was made public in June 2020. This includes a section on mitigating undesired bias and ensuring objectivity, but it doesn’t specify how to test models for biases or when auditors should flag biases as a significant risk.
IQT Labs was focused on the challenge of making AI audits more effective long before the buzz around Generative AI began. In the spring of 2021—18 months ahead of ChatGPT’s release—we stood up an AI Assurance initiative which aims to fill in the gaps in the guidance offered in high-level AI frameworks by developing a pragmatic approach to auditing tools. This blog post summarizes some of the most important lessons that have emerged from several in-depth audits we’ve conducted since the initiative was launched.
Testing a trio of tools
Those audits, which took place between June 2021 and January 2023 using the AI Ethics Framework for the Intelligence Community as a guide, focused on three different types of AI tools: a deepfake-detection tool called FakeFinder; a pretrained Large Language Model (LLM) called RoBERTa; and SkyScan, which collects and automatically labels images of aircraft. During our audits, we examined each tool from four perspectives (ethics, bias, security, and the user experience), employing both quantitative and qualitative methods to characterize a variety of risks.
We have already published in-depth reports on each of the three audits, which can be accessed via the hyperlinks above, but here we take a step back and draw some broader conclusions inspired by our work. The five lessons summarized below aren’t meant to be exhaustive. Instead, we’ve chosen them because they highlight critical aspects of AI auditing that are often overlooked, and because they are not tool-specific: they are relevant to any type of audit, whether conducted internally or by a third party, and whether completed prior to deployment or after an incident has occurred.
While we do make some specific technical recommendations, the lessons are not technical in nature. Rather, they emphasize that auditing AI demands critical thinking and a healthy dose of skepticism about emerging technologies—qualities that mean it pays to think especially carefully when composing an audit team:
Lesson #1: Avoid Model Groupthink.
AI models reflect the biases that are present in their training data and can affect real-world outcomes, such as whether a bank customer gets a loan. Not all biases are harmful, but identifying them is a crucial component of AI assurance work. Often, organizations entrust audits to data scientists and machine-learning (ML) experts because of their expertise in the field, but this can be problematic. Many data scientists have a positive bias towards AI and ML models, and when team members’ backgrounds and world views are too similar, they may succumb to ‘Model Groupthink’: prioritizing consensus over accurately describing legitimate concerns.
IQT Labs set out to counter the risk of Groupthink by including a wide range of experts, from UX designers to legal specialists, in audit teams. This definitely helped. For example, during our audit of SkyScan, two team members collaborated to design and implement attacks, such as GPS-spoofing, that relied on the interplay of both hardware and software. Their work enabled a fuller characterization of the potential attack surface against SkyScan and was only possible because one of the team members had significant hardware expertise.
Yet simply having diverse perspectives and expertise on a team isn’t enough. The three audits taught us that it’s also essential to cultivate an adversarial mindset. We set out to achieve this by assuming from the start of each audit that (a) the tool in question was plagued by risks and vulnerabilities, and (b) any AI tool (including the one being reviewed) was capable of causing real-world harm. This created the motivation to invent elaborate means of looking for problems others might not see.
Lesson #2: Audit the use case, not (just) the model.
Audits need to be grounded in a specific use case to enable a meaningful discussion of risks and consequences. Since a model is a general capability, the risks associated with it only fully come to light when it is considered in the context of a specific purpose, like solving a problem or automating a task. For instance, it’s possible to compute the likelihood that a model will produce a false positive or a false negative, two different types of error, but will a false negative create real-world harm? How many false negatives are too many? And is a false negative more or less concerning than a false positive?
Without considering specific use cases, there’s no way to answer these questions. Risk is a function both of the probability that something will happen and the cost (including recovery costs) of that thing happening. Applying a model to a specific problem or task makes the cost dimension clear. For example, when we audited the RoBERTa LLM we envisioned an analyst using the model for a task called Named Entity Recognition, which involves identifying entities (people, organizations, etc.) that appear within a corpus of unstructured text. This allowed us to meaningfully assess the cost of an error, such as RoBERTa failing to identify an entity in the text (a false negative)—a failure whose impact could be far-reaching, from undermining trust in the tool to compromising the accuracy of an intelligence assessment.
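To make the relationship between probability and cost concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ErrorProfile` and `expected_cost` names, the error rates, the document counts, and the cost units are hypothetical and are not figures from our RoBERTa audit. It simply shows that the same model, with the same error rates, can carry very different risk once a use case assigns a cost to each type of error.

```python
from dataclasses import dataclass


@dataclass
class ErrorProfile:
    """Hypothetical error rates for a model plus per-error costs for one use case."""
    false_negative_rate: float  # chance the model misses a real entity
    false_positive_rate: float  # chance the model flags something that is not an entity
    cost_false_negative: float  # illustrative cost units, set by the use case
    cost_false_positive: float  # illustrative cost units, set by the use case


def expected_cost(profile: ErrorProfile, positives: int, negatives: int) -> float:
    """Expected cost: for each error type, expected number of errors times cost per error."""
    fn_cost = profile.false_negative_rate * positives * profile.cost_false_negative
    fp_cost = profile.false_positive_rate * negatives * profile.cost_false_positive
    return fn_cost + fp_cost


# Same model (identical error rates), two hypothetical use cases with different costs.
low_stakes = ErrorProfile(0.05, 0.10, cost_false_negative=1.0, cost_false_positive=1.0)
high_stakes = ErrorProfile(0.05, 0.10, cost_false_negative=50.0, cost_false_positive=1.0)

low_stakes_cost = expected_cost(low_stakes, positives=1000, negatives=9000)
high_stakes_cost = expected_cost(high_stakes, positives=1000, negatives=9000)

print(f"low-stakes use case:  {low_stakes_cost:.0f} cost units")
print(f"high-stakes use case: {high_stakes_cost:.0f} cost units")
```

In this toy setup, nothing about the model changes between the two runs; only the cost assigned to a missed entity does. That is why questions such as "how many false negatives are too many?" can only be answered once the use case is fixed.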
Lesson #3: Go beyond accuracy.
Just because a model is accurate does not mean it should be trusted in real-world applications. For instance, a model can be very accurate but also very biased. Our experience with FakeFinder, the deepfake-detecting AI tool, makes this risk clear. To assess whether an image or video has been manipulated algorithmically, FakeFinder aggregates predictions from several underlying “detector” models. These models came out on top in terms of their accuracy at spotting deepfakes in a public competition run by Meta (at the time, Facebook) that saw more than 35,000 models submitted.
As part of our audit process, we subjected FakeFinder’s underlying detector models to a battery of bias tests developed in consultation with Luminos.Law (formerly BNH.AI), a law firm specializing in AI liability and risk assessment. The tests included Adverse Impact Ratio (AIR), which assesses the rate at which faces in protected groups are detected as deepfakes compared with faces in a control group; Differential Validity, a breakdown and comparison of system performance by protected class; and Statistical Significance, which also checks for differences in outcomes across protected and control groups. The results of our testing revealed significant biases. As an example, one of the models was over six times more likely to produce a false positive (i.e., to incorrectly flag a video as a deepfake) if the video showed an East Asian face than if it showed a White one.
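For readers who want a feel for what such tests involve, the following Python sketch computes a rate ratio between a protected group and a control group, in the spirit of the AIR comparison described above, along with a standard pooled two-proportion z-test as one possible significance check. This is a sketch under stated assumptions, not the test suite used in our audit: the `GroupCounts` type, the function names, and the counts are all hypothetical, and the z-test is a generic textbook choice that we are using here purely for illustration.

```python
import math
from typing import NamedTuple


class GroupCounts(NamedTuple):
    """Counts for real (non-manipulated) videos in one demographic group."""
    flagged: int  # real videos wrongly flagged as deepfakes (false positives)
    total: int    # all real videos in the group


def flag_rate(group: GroupCounts) -> float:
    """Fraction of the group's real videos that were flagged as deepfakes."""
    return group.flagged / group.total


def rate_ratio(protected: GroupCounts, control: GroupCounts) -> float:
    """Ratio of the protected group's flag rate to the control group's flag rate."""
    return flag_rate(protected) / flag_rate(control)


def two_proportion_p_value(a: GroupCounts, b: GroupCounts) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (a.flagged + b.flagged) / (a.total + b.total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / a.total + 1 / b.total))
    z = (flag_rate(a) - flag_rate(b)) / std_err
    # For a standard normal variable, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up counts for illustration only; these are not results from the FakeFinder audit.
protected = GroupCounts(flagged=60, total=1000)
control = GroupCounts(flagged=10, total=1000)

ratio = rate_ratio(protected, control)
p_value = two_proportion_p_value(protected, control)
print(f"rate ratio: {ratio:.1f}, p-value: {p_value:.2g}")
```

A ratio far from 1 combined with a very small p-value suggests that the gap between groups is unlikely to be due to chance. Deciding whether a gap of that size is acceptable still depends on the use case, as discussed in Lesson #2.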
Lesson #4: Look for vulnerabilities across the ML stack.
While our third lesson emphasizes that AI tools can cause unintentional harm, this lesson focuses on intentional attacks against them by bad actors. Today, many conversations about AI security are focused on sophisticated ways to fool models into making erroneous predictions. These attacks are novel and require substantial expertise to implement. However, when attackers want to cause harm, they often take the easiest way into a system. So, when assessing the security risks of an AI system, it’s vital to look beyond the model and consider vulnerabilities across the entire ML stack—the infrastructure and software components needed to build, deploy, and access a model. In many cases, choices about where a model is hosted and how it is accessed present more pressing concerns than the model itself.
IQT Labs’ team came across an example of this while penetration-testing RoBERTa. The team assumed the model might be accessed through a Jupyter Notebook, an open-source tool that is commonly used to access models. However, during the audit, team members uncovered a previously unknown vulnerability that, under certain circumstances, enabled them to use Jupyter’s API to view or change files that should be hidden.
By exploiting this newly discovered flaw, the team demonstrated how a malicious actor could gain access to RoBERTa and gather sensitive information. It’s a reminder that data-science tools may not have the same security posture as conventional enterprise-software tools—and that attackers may well seek to profit from this to find an easy way in.
Lesson #5: Don’t be blind to models’ blind spots.
This final lesson emphasizes that it’s also important to consider the perspective of someone using the tool. Like any other software, even the most effective AI tools aren’t foolproof. No dataset is a perfect representation of the world and when models are trained on imperfect data, limitations in the data filter through to the models. This is not necessarily a problem, so long as people using a tool are aware of its limitations. But if they are blind to a model’s blind spots, that’s a significant issue.
Our work on FakeFinder illustrates why taking a user’s perspective matters. As noted in Lesson #3, FakeFinder’s detector models came from a competition hosted by Meta; those models were trained on a dataset provided by the company. During our audit of FakeFinder, we realized that all the labeled examples of deepfakes in that dataset were in fact instances of a single deepfake-creation technique known as “face swap”, where the face of one person is transposed onto the body and actions of another. This was not disclosed to those taking part in the competition. Unsurprisingly, while many models trained on the dataset were good at finding instances of face-swapping, they failed to find other kinds of deepfakes. FakeFinder’s use of the models from the competition meant this limitation propagated through to that tool.
If FakeFinder had been advertised as a “face swap” detector rather than a “deepfake” detector—either in supporting documentation or (better still) in its user interface—this limitation would not have posed a concern. It’s also likely that many of the detector-model developers were unaware of the limitation themselves and so it may not have been obvious to the FakeFinder team that this was an issue. Still, unless users of FakeFinder conducted their own, in-depth audit of the system, they would not be aware of this blind spot. This could lead them to overlook deepfakes, while maintaining a misguided confidence in the system.
Audit teams can identify these types of risks by looking for disconnects between what AI tools claim to do and the available data used to train their underlying models. Once again, an adversarial mindset is helpful here. In the same way that we assumed all software has vulnerabilities (see Lesson #1), in the user-experience portion of our audits we assumed that (1) all available training data had limitations and (2) the nuances of those limitations were (probably) not advertised in the high-level sales pitch aimed at convincing people to buy a tool. Then we assessed the user interface from the perspective of someone using the tool for the first time. Audit teams need to do this because busy users of AI tools won’t necessarily have the time, inclination, or patience to probe for blind spots in models themselves.
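One simple way to look for such a disconnect is to compare what a tool claims to cover with what its training data actually contains. The Python sketch below is illustrative only and assumes that per-example technique labels are available for the training data, which may not be the case in practice. The `uncovered_claims` function and the technique names other than face swap are hypothetical and were not part of our audit.

```python
from collections import Counter
from typing import Iterable, List


def uncovered_claims(claimed: Iterable[str], training_labels: Iterable[str]) -> List[str]:
    """Return claimed techniques that have no labeled examples in the training data."""
    counts = Counter(training_labels)
    return sorted(technique for technique in claimed if counts[technique] == 0)


# Hypothetical claim: the tool is described as detecting three kinds of manipulation.
claimed_techniques = ["face_swap", "lip_sync", "full_synthesis"]

# Hypothetical training metadata: every labeled deepfake uses a single technique.
training_labels = ["face_swap"] * 500

blind_spots = uncovered_claims(claimed_techniques, training_labels)
print(blind_spots)
```

Any technique returned by a check like this is a candidate blind spot that should be disclosed in the documentation or, better still, in the user interface.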
A vital and never-ending quest
AI tools will never be perfect, and that’s OK. We don’t need perfection for AI to be extremely useful. We do, however, need to be clear-eyed about the risks associated with AI tools. We also need to ensure those using the tools understand their limitations because, as the benefits of AI scale, so will the costs of errors. By reflecting more deeply on how to audit the technology, organizations can get a sharper picture of potential issues prior to its deployment, so that risks can either be mitigated or accepted willingly, with an accurate understanding of what is at stake.
IQT Labs is committed to advancing its own efforts in this crucial area. By sharing our auditing methods and findings with a broad community of interest, we hope other teams can learn from our work and share their own insights more widely, with the ultimate goal of developing better ways of ensuring AI tools we come to rely on really are doing what they were intended to do.