The software development and cybersecurity communities have become painfully aware that modern software package registries, repositories of free (for the user) source code such as the Python Package Index (PyPI), are high-value targets susceptible to typosquatting, one form of software supply chain attack. A 2016 undergraduate thesis by Nikolai Tschacher demonstrated the viability of this attack vector. After creating software packages with names that mimic popular package names (i.e., typosquatting) and uploading the ersatz packages to popular package repositories including PyPI, Tschacher observed over 17,000 different computers downloading and executing his code, code that could have been malicious. (Military readers: this is not just a civilian hazard. Two of these hosts had a .mil domain!) More recent analysis by HashiCorp’s William Bengston, who has defensively registered thousands of PyPI package names to prevent typosquatting against popular packages, offers an even more cautionary tale: his anti-typosquatting packages were downloaded over 540,000 times over the past couple of years, downloads that, once again, could have caused widespread harm.
A less-researched area (with the exception of one related analysis) concerns the patterns found in actual typosquatting attacks on PyPI. Outstanding questions include:
- How many total known instances of typosquatting on PyPI are there?
- Are there sub-types of typosquatting?
- Do typosquatting packages only squat on the most downloaded packages?
- To what extent can edit distance algorithms detect typosquatting?
To answer these questions, this post uses a novel dataset of typosquatting attacks found on PyPI from 2017 to 2020 and, borrowing a page from the information security metrics community, presents an analysis of the frequency and nature of typosquatting on PyPI. We hope that answers to these questions aid the ecosystem integrity and namespace management efforts of the PyPI package manager community along with parties interested in open source software supply chain security, such as the Linux Foundation.
And for those who simply want to know the main finding: typosquatting attacks are about much more than typos! Typosquatters appear to prey both on those who misspell a package name and on users who are confused about which package they want to download. While initial PyPI typosquatting defenses should probably focus on misspelling attacks, anti-typosquatting defenders will eventually need to address this second, arguably more devious, form of typosquatting.
How Many Typosquatting Attacks Have There Been On PyPI?
Drawing on public reporting and our own efforts at finding typosquatters, we found 40 typosquatting attacks against PyPI users between 2017 and 2020 (Figure 1). We define a typosquatting attack as a package uploaded to PyPI that:
- Has a name similar to another existing package,
- Contains malicious code, and
- Was identified and removed from PyPI.
The actual number of typosquatters is likely higher given that this definition relies on known instances of typosquatting.
Are There Sub-types of Typosquatting?
An examination of the 40 PyPI typosquatting attacks suggests that there are at least two broad attack categories. The most obvious sub-type is misspelling attacks. These attacks take advantage of typos made by users when they try to download a package. For example, a package called ‘urlib3’ sought to mimic the popular ‘urllib3’ package. Confusion attacks, in contrast, do not depend on the victim misspelling a package name. Instead, confusion attacks prey on user uncertainty about the correct name of the desired package. For instance, one attacker created a package called ‘nmap-python’ when the real package is ‘python-nmap.’ Sixteen of the 40 PyPI typosquatting attacks are misspelling attacks; 26 are confusion attacks. Two of the attacks fit in both categories.
The confusion attack category can be further sub-divided into four categories. Separator attacks take advantage of user confusion about whether to separate words with dashes, underscores, or not at all. For instance, one attack involved ‘easyinstall’ squatting on ‘easy_install’. Relatedly, William Bengston’s work, mentioned earlier, could be viewed as measuring the susceptibility, i.e., the vulnerability, of the PyPI user base to separator attacks. His defensively typosquatted packages simply remove dashes or underscores found in popular package names. Separator attacks account for only three of the 26 confusion attacks in this dataset though, suggesting that Bengston’s already frightening estimate of PyPI user susceptibility to typosquatting is a lower bound of overall user susceptibility to typosquatting attacks.
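A separator attack can be checked for mechanically. The following is a minimal sketch, not the method used in this analysis: the function names are hypothetical, and treating dots as separators alongside dashes and underscores is our assumption.

```python
import re


def strip_separators(name: str) -> str:
    """Lowercase a package name and drop dashes, underscores, and dots.

    Counting dots as separators is an assumption made for this sketch.
    """
    return re.sub(r"[-_.]", "", name.lower())


def is_separator_variant(candidate: str, legitimate: str) -> bool:
    """Return True when two distinct names differ only in their separators."""
    if candidate.lower() == legitimate.lower():
        return False  # identical names are not a squat
    return strip_separators(candidate) == strip_separators(legitimate)
```

With these definitions, `is_separator_variant("easyinstall", "easy_install")` is true, while a misspelling such as ‘urlib3’ versus ‘urllib3’ is not matched and needs a different check.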
Order attacks switch the order of words in a title, for instance the ‘nmap-python’ attack mentioned above. There were four order attacks in this dataset.
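One way to test for an order attack is to split each name into words and compare the words while ignoring their order. This is an illustrative sketch only; the helper names are hypothetical, and splitting on dashes, underscores, and dots is our assumption about what counts as a word boundary.

```python
import re


def name_tokens(name: str) -> list:
    """Split a package name into lowercase words on dashes, underscores, and dots."""
    return [token for token in re.split(r"[-_.]+", name.lower()) if token]


def is_order_variant(candidate: str, legitimate: str) -> bool:
    """Return True when both names use the same words in a different order."""
    cand, legit = name_tokens(candidate), name_tokens(legitimate)
    return cand != legit and sorted(cand) == sorted(legit)
```

Here `is_order_variant("nmap-python", "python-nmap")` is true. Names written without any separator cannot be split this way, so the sketch would miss them.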
Accounting for three attacks, py attacks involve adding or removing the word ‘python’ or a derivative phrase from a package name to generate user confusion. One attack involved the package ‘smb’ squatting on ‘pysmb’ and another involved the package ‘pyscrapy’ squatting on ‘scrapy.’
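A py attack can be approximated by generating the names that result from adding or removing a ‘py’ or ‘python’ marker at either end of a legitimate name. This is a rough sketch under our own assumptions: the marker list, the separator list, and the function names are all hypothetical rather than taken from the dataset.

```python
PY_MARKERS = ("python", "py")   # assumed marker words
SEPARATORS = ("", "-", "_")     # assumed joiners between marker and name


def py_variants(name: str) -> set:
    """Names formed by adding or removing a 'py'/'python' prefix or suffix."""
    base = name.lower()
    variants = set()
    for marker in PY_MARKERS:
        for sep in SEPARATORS:
            prefix, suffix = marker + sep, sep + marker
            variants.add(prefix + base)   # marker added in front
            variants.add(base + suffix)   # marker added at the end
            if base.startswith(prefix) and len(base) > len(prefix):
                variants.add(base[len(prefix):])    # leading marker removed
            if base.endswith(suffix) and len(base) > len(suffix):
                variants.add(base[:-len(suffix)])   # trailing marker removed
    variants.discard(base)
    return variants


def is_py_variant(candidate: str, legitimate: str) -> bool:
    """Return True when candidate is the legitimate name plus or minus a py marker."""
    return candidate.lower() in py_variants(legitimate)
```

Both examples above are caught: `is_py_variant("smb", "pysmb")` and `is_py_variant("pyscrapy", "scrapy")` are true. A name such as ‘python-mongo’ is not a simple add-or-remove variant of ‘pymongo’, so this check does not match it.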
Similarity attacks, which account for 14 confusion attacks, use a deceptively similar name such as ‘python-mongo’ instead of ‘pymongo.’ These similarity attacks, which don’t involve typos, are both common and, unfortunately, difficult to defend against because their attack strategy takes advantage of the free association capability of human comprehension and, arguably, parablepsis. (Not a typo. Check your dictionary.)
See Figure 1 for a graphical depiction of the different attack types and the assignment of all the attacks to each attack category.
Figure 1. Typosquatting Taxonomy, Count, and Associated Attacks
Are Only the Most Downloaded Packages Victims of Typosquatting?
The attacks that met our criteria are concentrated on the most downloaded packages. Figure 2 displays the percentage of documented typosquatters by the download count tier of the package on which these attacks are squatting. For instance, 11 of the 40 typosquatting attacks, or 28% of attacks, were squatting on PyPI packages that are among the 50 most downloaded. Download count was calculated using data from August 2020.
Figure 2. Percentage of Typosquatters by Popularity Tier of the Legitimate Package
To What Extent Can Edit Distance Algorithms Flag Typosquatters?
Those who want to combat typosquatting often turn to a technique called Levenshtein distance. This method measures the “edit distance” between two character sequences. For example, ‘cat’ and ‘bat’ have an edit distance of one (since replacing ‘c’ with ‘b’ suffices to transform ‘cat’ to ‘bat’); ‘moon’ and ‘spoon’ have an edit distance of two. Those thinking of employing Levenshtein distance to counter typosquatting often implicitly assume that these attacks have a Levenshtein distance of one or two.
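For readers who have not seen it, Levenshtein distance can be computed with a short dynamic-programming routine. The version below is a generic textbook sketch in plain Python, not code from any particular anti-typosquatting tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

This reproduces the examples above: `levenshtein("cat", "bat")` is 1 and `levenshtein("moon", "spoon")` is 2.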
The utility of using Levenshtein distance for positively detecting attacks depends on the attack type. All the misspelling attacks we collected have an edit distance of two or less, suggesting that edit distance can greatly aid in detecting misspelling attacks. The edit distance of the confusion attacks, however, ranges from one to 13, which reduces the usefulness of Levenshtein distance for finding these attacks. See Figure 3 for quantitative evidence on the relationship between attack type and Levenshtein distance.
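The contrast can be seen by applying a distance threshold to names mentioned in this post. The sketch below is illustrative only: the threshold of two, the function names, and the short list of legitimate packages are our assumptions, and the compact recursive distance function is suitable only for short strings such as package names.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via memoized recursion (fine for short names)."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    if a[0] == b[0]:
        return edit_distance(a[1:], b[1:])
    return 1 + min(
        edit_distance(a[1:], b),      # deletion
        edit_distance(a, b[1:]),      # insertion
        edit_distance(a[1:], b[1:]),  # substitution
    )


def flag_lookalikes(candidate, legitimate_names, max_distance=2):
    """Return (name, distance) pairs within max_distance of the candidate."""
    hits = []
    for name in legitimate_names:
        distance = edit_distance(candidate.lower(), name.lower())
        if 0 < distance <= max_distance:
            hits.append((name, distance))
    return sorted(hits, key=lambda pair: pair[1])


# Hypothetical watch list built from packages named in this post.
LEGITIMATE = ["urllib3", "pymongo", "python-nmap"]
```

With this watch list, `flag_lookalikes("urlib3", LEGITIMATE)` flags ‘urllib3’ at distance one, while ‘python-mongo’ and ‘nmap-python’ return no hits at all, since both sit more than two edits from every name on the list.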
Figure 3. Count of Attacks by Edit Distance for Misspelling versus Confusion Attacks
What Does This Analysis Mean for the Python Community’s Anti-Typosquatting Efforts?
Although initial efforts to counter typosquatting probably ought to focus on misspelling attacks given the ability of a straightforward edit distance algorithm to combat them, comprehensive anti-typosquatting measures employed by the Python community will need to recognize that typosquatting is about more than typos. The Python security team already implicitly recognizes this fact given the effort it took to prevent packages from using standard library names. The next step is for those interested in PyPI anti-typosquatting and anti-malware efforts to build approaches and tools that counter both misspelling attacks and confusion attacks. And for those up to the challenge, countering similarity attacks, a particularly pernicious form of confusion attack, will be an especially knotty but worthwhile problem.
To be sure, some ecosystem maintainers have already taken up the anti-typosquatting cause and, more generally, the malware problem on PyPI and other package managers. For instance, Georgia Tech professor Wenke Lee and his colleagues built a sophisticated anti-malware analysis pipeline that repositories could employ to find malicious software, including typosquatters, hiding in repositories. Another team of researchers, largely from the University of Kansas, created an alternative approach in which a package manager (such as pip) helps protect users from typosquatting packages.
In parallel, University of Bonn Ph.D. student Marc Ohm and his colleagues published colorfully titled research, “Backstabber’s Knife Collection,” that analyzes malware found on package managers to aid anti-malware efforts. Crucially, in February 2020 PyPI launched a malware check system to automate the detection of malicious uploads. We encourage others to join and build on these efforts. For our part, IQT Labs is building a tool called pypi-scan that scans PyPI for possible typosquatters. We’ll explain more in a future post. In the meantime, remember this: typosquatting on PyPI is about more than typos!
Thank you to Josh Bailey, Peter Bronez, Mike Chadwick, Kinga Dobolyi, Vishal Sandesara, and George P. Sieniawski for thoughtful review and critique.