Analyzing PyPI package downloads

This section covers how to use the public PyPI download statistics dataset to learn more about downloads of a package (or packages) hosted on PyPI. For example, you can use it to discover the distribution of Python versions used to download a package.

Background

PyPI does not display download statistics for a number of reasons: 1

  • Inefficient to make work with a Content Distribution Network (CDN): Download statistics change constantly. Including them in project pages, which are heavily cached, would require invalidating the cache more often, and reduce the overall effectiveness of the cache.

  • Highly inaccurate: A number of things prevent the download counts from being accurate, some of which include:

    • pip’s download cache (lowers download counts)

    • Internal or unofficial mirrors (can both raise or lower download counts)

    • Packages not hosted on PyPI (for comparisons sake)

    • Unofficial scripts or attempts at download count inflation (raises download counts)

    • Known historical data quality issues (lowers download counts)

  • Not particularly useful: Just because a project has been downloaded a lot doesn’t mean it’s good; Similarly just because a project hasn’t been downloaded a lot doesn’t mean it’s bad!

In short, because it’s value is low for various reasons, and the tradeoffs required to make it work are high, it has been not an effective use of limited resources.

Public dataset

As an alternative, the Linehaul project streams download logs from PyPI to Google BigQuery 2, where they are stored as a public dataset.

Getting set up

In order to use Google BigQuery to query the public PyPI download statistics dataset, you’ll need a Google account and to enable the BigQuery API on a Google Cloud Platform project. You can run the up to 1TB of queries per month using the BigQuery free tier without a credit card

For more detailed instructions on how to get started with BigQuery, check out the BigQuery quickstart guide.

Data schema

Linehaul writes an entry in a the-psf.pypi.downloadsYYYYMMDD table for each download. The table contains information about what file was downloaded and how it was downloaded. Some useful columns from the table schema include:

Column

Description

Examples

file.project

Project name

pipenv, nose

file.version

Package version

0.1.6, 1.4.2

details.installer.name

Installer

pip, bandersnatch

details.python

Python version

2.7.12, 3.6.4

Useful queries

Run queries in the BigQuery web UI by clicking the “Compose query” button.

Note that the rows are stored in separate tables for each day, which helps limit the cost of queries. These example queries analyze downloads from recent history by using wildcard tables to select all tables and then filter by _TABLE_SUFFIX.

Counting package downloads

The following query counts the total number of downloads for the project “pytest”.

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pytest'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

num_downloads

2117807

To only count downloads from pip, filter on the details.installer.name column.

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pytest'
  AND details.installer.name = 'pip'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

num_downloads

1829322

Package downloads over time

To group by monthly downloads, use the _TABLE_SUFFIX pseudo-column. Also use the pseudo-column to limit the tables queried and the corresponding costs.

#standardSQL
SELECT
  COUNT(*) AS num_downloads,
  SUBSTR(_TABLE_SUFFIX, 1, 6) AS `month`
FROM `the-psf.pypi.downloads*`
WHERE
  file.project = 'pytest'
  -- Only query the last 6 months of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY `month`
ORDER BY `month` DESC

num_downloads

month

1956741

201801

2344692

201712

1730398

201711

2047310

201710

1744443

201709

1916952

201708

Caveats

In addition to the caveats listed in the background above, Linehaul suffered from a bug which caused it to significantly under-report download statistics prior to July 26, 2018. Downloads before this date are proportionally accurate (e.g. the percentage of Python 2 vs. Python 3 downloads) but total numbers are lower than actual by an order of magnitude.

Additional tools

Besides using the BigQuery console, there are some additional tools which may be useful when analyzing download statistics.

google-cloud-bigquery

You can also access the public PyPI download statistics dataset programmatically via the BigQuery API and the google-cloud-bigquery project, the official Python client library for BigQuery.

from google.cloud import bigquery

# Note: depending on where this code is being run, you may require
# additional authentication. See:
# https://cloud.google.com/bigquery/docs/authentication/
client = bigquery.Client()

query_job = client.query("""
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pytest'
-- Only query the last 30 days of history
AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
        '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())""")

results = query_job.result()  # Waits for job to complete.
for row in results:
    print("{} downloads".format(row.num_downloads))

pypinfo

pypinfo is a command-line tool which provides access to the dataset and can generate several useful queries. For example, you can query the total number of download for a package with the command pypinfo package_name.

Install pypinfo using pip.

pip install pypinfo

Usage:

$ pypinfo requests
Served from cache: False
Data processed: 6.87 GiB
Data billed: 6.87 GiB
Estimated cost: $0.04

| download_count |
| -------------- |
|      9,316,415 |

pandas-gbq

The pandas-gbq project allows for accessing query results via Pandas.