pypi-scan: A Tool for Scanning the Python Package Index for Typosquatters

Oct. 16, 2020

The Python Package Index (PyPI) is among the reasons that the Python programming language has become a lingua franca of modern software development and data science. It’s a software registry for Python packages—files containing code—that makes it free and easy for software developers to reuse Python code, which boosts productivity. Anyone can upload code for anyone to download. For some, it’s downright utopian.

But reusing code, particularly someone else’s code, has risks. Not only are there technical risks, but there are documented security risks: our analysis indicates that at least 55 malicious packages have been reported and removed from PyPI. For anyone who had the misfortune of downloading them, these packages did nasty things such as stealing credentials or recording keystrokes. To trick users into downloading these packages, attackers often prey on user typos and user confusion via typosquatting, that is, mimicking popular package names. For instance, one attacker previously created a package named ‘colourama’ to trick speakers of British English intending to download ‘colorama.’

IQT Labs therefore recently engaged in an exploratory research effort to scan PyPI for typosquatting packages. The resulting command line tool, pypi-scan, identifies PyPI packages with similar names or similar package metadata relative to the most downloaded packages or a package of your choice. Below we describe the tool’s uses, the tool itself, including its strengths and limitations, and relate our discovery of a malicious package found with the tool.

Why Use pypi-scan?

There are two types of pypi-scan users: hunters and defenders.

A hunter could be an information security researcher or a PyPI administrator scanning PyPI for typosquatters, especially typosquatters on the most downloaded packages. For the hunter, pypi-scan outputs a list of potential typosquatters for each package. Potential typosquatters with suspicious metadata are further flagged. Figure 1 is an example of terminal output associated with this mode.

Figure 1. Screenshot of pypi-scan Terminal Output for Hunter Mode

Note: Red text indicates similar metadata, another indicator of potential maliciousness

Defenders only care about a particular package, potentially a package they maintain. For defenders, pypi-scan can determine which packages, if any, are typosquatting the package the defender protects. Figure 2 demonstrates command line output for a user defending the package ‘pandas.’ A defender could then investigate such packages for signs of malicious code.

Figure 2. Screenshot of pypi-scan Terminal Output for Defender Mode

Those interested in the “defender” use case might also be interested in Amazon software engineer Matt Bullock’s pypi-parker, which “parks” an empty package on PyPI in a namespace chosen by the defender to protect PyPI users from particular typosquatting packages.

How Does pypi-scan Work?

Written in Python, pypi-scan loops over all package names in PyPI, checking if each package’s name is suspiciously close to any of the most downloaded packages or to a package name selected by the user.
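The outer loop described above can be sketched roughly as follows. This is a minimal illustration, not pypi-scan's actual code: the function name `find_candidates`, the `is_close` predicate, and the choice to skip names that are themselves targets are all assumptions made for the example, and the package lists are assumed to have been fetched already.

```python
from typing import Callable, Dict, Iterable, List


def find_candidates(
    all_names: Iterable[str],
    targets: Iterable[str],
    is_close: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Map each target package to the names that look suspiciously close to it."""
    targets = list(targets)
    target_set = set(targets)
    results: Dict[str, List[str]] = {target: [] for target in targets}
    for name in all_names:
        if name in target_set:
            # Assumption: a target package is never reported as a squatter.
            continue
        for target in targets:
            if is_close(name, target):
                results[target].append(name)
    return results


def one_substitution(a: str, b: str) -> bool:
    """Toy stand-in for a real similarity check: same length, one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


# Hypothetical, hand-written package lists; a real scan would cover all of PyPI.
names = ["pandas", "pandar", "numpy", "requests"]
print(find_candidates(names, ["pandas", "requests"], one_substitution))
# {'pandas': ['pandar'], 'requests': []}
```

Passing the similarity check in as a function keeps the loop independent of any particular definition of “suspiciously close.”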

“Suspiciously close,” for the purposes of pypi-scan, has multiple meanings. First and foremost, the edit distance (Levenshtein distance) between the two package names might be less than a user-defined threshold. For instance, pypi-scan would flag the package ‘colourama’ as typosquatting on the legitimate package ‘colorama,’ given that only one letter separates these names. pypi-scan also checks whether an attacker has switched the order of the words in a package name, say ‘nmap-python’ vs ‘python-nmap.’ Finally, pypi-scan searches for homophones, packages with names that are spelled differently but sound the same. All suspicious packages then undergo a second check related to package metadata, such as package description and package author. pypi-scan compares the metadata of the suspicious package to the metadata of the legitimate package, assessing whether any metadata has been copied and alerting the pypi-scan user if so. Copied metadata indicates that the typosquatter might be trying to camouflage itself as the legitimate package.
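To make these checks concrete, here is a minimal sketch of three of them: edit distance, word-order switching, and copied metadata. It is an illustration under stated assumptions rather than pypi-scan's implementation. The function names are hypothetical, the metadata is assumed to arrive as a plain dictionary with `summary` and `author` keys, and the homophone check is left out because it needs a pronunciation resource.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def is_order_swap(candidate: str, target: str) -> bool:
    """True if both names contain the same dash/underscore-separated words in a different order."""
    candidate_parts = re.split(r"[-_]", candidate)
    target_parts = re.split(r"[-_]", target)
    return (candidate_parts != target_parts
            and sorted(candidate_parts) == sorted(target_parts))


def copied_metadata(candidate_meta: dict, target_meta: dict,
                    fields=("summary", "author")) -> list:
    """Return the metadata fields whose non-empty values are identical in both packages."""
    return [
        field for field in fields
        if candidate_meta.get(field)
        and candidate_meta.get(field) == target_meta.get(field)
    ]


print(edit_distance("colourama", "colorama"))       # 1
print(is_order_swap("nmap-python", "python-nmap"))  # True

# Made-up metadata values, for illustration only.
legit = {"summary": "Example summary text", "author": "Original Author"}
squat = {"summary": "Example summary text", "author": "Someone Else"}
print(copied_metadata(squat, legit))                # ['summary']
```

A caller would compare the result of `edit_distance` against the user-defined threshold and run the metadata comparison only on names that one of the name checks has already flagged.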

The Strengths and Limitations of pypi-scan

pypi-scan excels at finding misspelling attacks, a type of typosquatting attack in which attackers rely on a user misspelling a package name. Confusion attacks, a typosquatting attack that depends on a user being confused about which package they want to download, present a serious challenge to pypi-scan. The tool can find homophones and also attacks that switch the order of words separated by dashes or underscores, but there are many other confusion attacks that remain beyond the reach of pypi-scan. An analysis of 40 observed typosquatting attacks on PyPI reveals that pypi-scan can detect only 27 of the attacks (68%), assuming the edit distance threshold is set at two.

Perhaps the most significant limitation of pypi-scan is that its detection capabilities only find potential typosquatters. A package with a similar name and even near-identical metadata does not merit removal from PyPI, at least according to the current practices of PyPI administrators. For instance, ‘requestsaa’ has a similar name and near-identical metadata to one of the most downloaded PyPI packages, ‘requests.’ (pypi-scan found this match.) But because a manual examination of the code associated with requestsaa (version 0.1.2) does not reveal any malicious functionality, the PyPI administrator allows this package to remain. Consequently, pypi-scan users will need to use their own expertise and other tools to determine if a typosquatter is truly malicious.

Discovering a Malicious Package

To demonstrate the hunter usage pattern for pypi-scan, we used the tool to identify typosquatters on the approximately 50 most downloaded packages. pypi-scan returned roughly 150 suspicious packages. One suspicious package, named pandar, advertised itself as containing “crazy maths and more” and seemed to be squatting on pandas, the data analysis library. Manual inspection of the code revealed malicious functionality: key-logging and email exfiltration. Crazy maths and more indeed! Of note, the GitHub repository associated with pandar did not contain the malicious functionality contained in the PyPI tarball. We reported this package to PyPI, and the PyPI administrators promptly yanked the package. Finally, we provided a copy of pandar’s code to a research repository so future analysts and tool builders can better understand malicious packages associated with open source software supply chain attacks.

So What? What’s Next?

Here’s the moral of the story: there’s more than one reason that you should consider subscribing to programming guru Rob Pike’s admonition that “a little copying is better than a little dependency.” Code reuse has risks, including security risks such as typosquatting. We are not, however, encouraging developers to abandon the practice of code reuse, via PyPI or other registries. Because of the productivity gains associated with code reuse, that genie is out of the bottle. We are, however, trying to envision and create a safer form of code reuse.

There’s a lot to be done. We at IQT Labs are considering and weighing the merits of different technical approaches and associated applied research and development projects related to software supply chain security, secure code reuse, and software assurance. Expect a future post outlining and comparing different schools of thought. In the meantime, consider re-using code the way one ought to buy a used car: with eyes wide open!

Thank you to Josh Bailey, Kinga Dobolyi, and John B. Meyers for helpful and thoughtful review.
