CNK's Blog

Image Audit

I am upgrading a very old Wagtail project - initially built on Wagtail 1.3, long before Wagtail introduced the ReferenceIndex or began using file_hash to detect duplicate images during upload. After upgrading to Wagtail 6.3, and as part of moving the site to new hosting, I decided to clean up some of the duplicates.

Duplicate Images

Here is the basic query for duplicate images:

    SELECT id, file, file_hash FROM core_customimage
    WHERE file_hash IN (
        SELECT file_hash FROM core_customimage GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

Is it safe to delete all but one of the duplicates? We can’t tell from that query alone. We need to find out which of these (if any) are in use. To do that, we join out to the reference index. In my install, my custom image model has the content type id of 37. Any rows that have NULL for the reference index id column are NOT referenced anywhere else in the site. Those can safely be deleted.

    SELECT core_customimage.id, file, file_hash, wagtailcore_referenceindex.id
    FROM core_customimage
    LEFT OUTER JOIN wagtailcore_referenceindex
         ON wagtailcore_referenceindex.to_object_id = core_customimage.id
         AND to_content_type_id =37
    WHERE file_hash IN (
      SELECT file_hash FROM core_customimage GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;
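Before deleting anything, I like to pull the exact list of ids that are safe to remove. Here is a sketch combining the two queries above: unreferenced duplicate rows, excluding the lowest id of each hash group so at least one copy of every file always survives (content type 37 and the table names are specific to my install):

```sql
SELECT core_customimage.id
FROM core_customimage
LEFT OUTER JOIN wagtailcore_referenceindex
     ON wagtailcore_referenceindex.to_object_id = core_customimage.id
     AND to_content_type_id = 37
WHERE file_hash IN (
    SELECT file_hash FROM core_customimage GROUP BY file_hash HAVING count(*) > 1
)
AND wagtailcore_referenceindex.id IS NULL
AND core_customimage.id NOT IN (
    SELECT MIN(id) FROM core_customimage GROUP BY file_hash
);
```

Note that deleting rows with raw SQL would bypass Wagtail's post_delete file cleanup and rendition handling, so I used this list to drive deletions through the admin or the ORM rather than issuing a DELETE statement directly.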

Once I had deleted all the duplicate images that were not used anywhere, I had a few where both copies of an image were in use. Since there was just a handful, I used the Wagtail admin UI to locate where the images were being used and edited the pages to use only one of the two copies. Then I could safely delete the other, now unused, copies.

Missing Image Files

I also had some situations where I thought the original image might be missing or corrupt. In a previous project, I had used code like this to check that each image was still in S3 where the database thinks it should be:

    from django.core.files.storage import default_storage

    from core.models import CustomImage  # my custom image model lives in the core app

    for image in CustomImage.objects.all():
        if not default_storage.exists(image.file.name):
            print(f"Image {image.title} (id: {image.id} collection: {image.collection_id}) is missing {image.file}")

However, because of the way I have my S3 bucket configured, exists was returning False for all images - even ones I could see in the browser. This appears to have something to do with how boto3 makes HEAD requests - and perhaps I didn’t have my credentials configured correctly in the shell I was using for testing. In any case, since my image urls are public, instead of fighting with exists, I used the Python requests library to check that the images exist and are publicly available.

    import requests

    bucket_name = 'my-bucket-name'  # substitute your actual bucket name

    for img in CustomImage.objects.all():
        url = f"https://{bucket_name}.s3.amazonaws.com/{img.file.name}"
        response = requests.head(url)
        if response.status_code == 200:
            # print(f"found {url}")
            continue
        else:
            print(f"File check failed: {img.id}, {response.status_code}, {img.file.name}")

Document checks

We can do the same things to identify duplicate documents. Again I hard-coded the content type id; you will need to figure out what it should be for your installation.

    SELECT id, file, file_hash FROM wagtaildocs_document
    WHERE file_hash IN (
        SELECT file_hash FROM wagtaildocs_document GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

    SELECT wagtaildocs_document.id, file, file_hash, wagtailcore_referenceindex.id
    FROM wagtaildocs_document
    LEFT OUTER JOIN wagtailcore_referenceindex
            ON wagtailcore_referenceindex.to_object_id = wagtaildocs_document.id
            AND to_content_type_id = 5
    WHERE file_hash IN (
        SELECT file_hash FROM wagtaildocs_document GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;
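Rather than guessing at those hard-coded content type ids, you can look them up in the table Django stores them in (model names are lowercased class names; customimage is my custom model, so adjust for yours):

```sql
SELECT id, app_label, model FROM django_content_type
WHERE model IN ('document', 'customimage');
```

The same ids are available from Python via ContentType.objects.get_for_model() in a Django shell.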

Because documents are private and live in the default storage, we can use the exists check I had trouble with for images.

    from django.core.files.storage import default_storage

    from wagtail.documents.models import Document

    for doc in Document.objects.all():
        file_path = doc.file.name
        if default_storage.exists(file_path):
            # print(f"{file_path} exists in S3.")
            pass
        else:
            print(f"The file does NOT exist in S3: {file_path}.")

S3 bucket configuration for Wagtail

We host our websites in Docker containers using Fargate on AWS. This means we don’t have a permanent file system, so we need to use S3 to store the media files our users upload. Fortunately there is a Django package, django-storages, that adds a variety of file storage backends, including S3. Setting it up is pretty straightforward - install the package, configure storages.backends.s3.S3Storage as the storage backend, and include your AWS access information in environment variables.

Configuring the S3 Bucket

The new default for S3 buckets is to block all public access. This is appropriate for documents, which we may need to keep private. But an authentication token in the query string interferes with browser caching of images. So I would like image files to be public while still keeping documents private.

The first step is to turn off parts of the S3 public access block so we can attach a bucket policy that makes images public. We like to manage our AWS resources via Terraform, so we need to create the bucket and configure the access block.

    resource "aws_s3_bucket" "example" {
      bucket = "my-tf-test-bucket"
    }

    resource "aws_s3_bucket_public_access_block" "example" {
      bucket = aws_s3_bucket.example.id

      block_public_acls       = true
      block_public_policy     = false  # Temporarily turn off this block until we add a bucket policy
      ignore_public_acls      = true
      restrict_public_buckets = false  # This needs to be false so the bucket policy works
    }

Then we add a bucket policy.

    # Allow public read access to images in our s3 bucket
    data "aws_iam_policy_document" "images_public_read_policy" {
      # First our normal 'all access from account'
      statement {
        actions = ["s3:*"]

        principals {
          type        = "AWS"
          identifiers = ["arn:aws:iam::${local.account_id}:root"]
        }

        resources = [
          "${module.storage-bucket.s3-bucket-arn}",
          "${module.storage-bucket.s3-bucket-arn}/*",
        ]
      }

      statement {
        actions = ["s3:GetObject"]

        principals {
          type        = "AWS"
          identifiers = ["*"]  # Allow access to everyone (public)
        }

        # Block Public Access settings can allow public access to specific
        # resources, but not the entire bucket. Set restrict_public_buckets = false
        # to allow a policy that grants access to specific resources.
        resources = [
          "${module.storage-bucket.s3-bucket-arn}/images/*",
          "${module.storage-bucket.s3-bucket-arn}/original_images/*",
        ]
      }
    }

    resource "aws_s3_bucket_policy" "public_images_policy" {
      bucket = aws_s3_bucket.example.id
      policy = data.aws_iam_policy_document.images_public_read_policy.json
    }

Once that bucket policy is in place, we can change block_public_policy back to true again to prevent changes.

    resource "aws_s3_bucket_public_access_block" "example" {
      bucket = aws_s3_bucket.example.id

      block_public_acls       = true
      block_public_policy     = true
      ignore_public_acls      = true
      restrict_public_buckets = false  # This needs to be false so the bucket policy works
    }

Configuring Django STORAGES

The Terraform / AWS code above gives us S3 objects that behave as we want them to - objects inside the images or original_images directories can be viewed without authentication, but objects anywhere else in the bucket need a token. However, my Django project was still creating urls with authentication tokens for both documents and images. Not what I wanted.

To get the image and document urls to behave differently, I need to configure two different kinds of storage: the default one stays private, so Django gives us urls with authentication query strings, and we add an images storage that produces public urls.

    # settings.py
    AWS_STORAGE_BUCKET_NAME = env('AWS_STORAGE_BUCKET_NAME', default=None)
    if AWS_STORAGE_BUCKET_NAME:
        AWS_S3_REGION_NAME = env('AWS_DEFAULT_REGION', default='us-west-2')
        STORAGES = {
            "default": {
                "BACKEND": "storages.backends.s3.S3Storage"
            },
            "images": {
                "BACKEND": "storages.backends.s3.S3Storage",
                'OPTIONS': {
                    'querystring_auth': False,
                }
            },
            "staticfiles": {
                "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
            },
        }
    else:
        STORAGES = {
            "default": {
                "BACKEND": "django.core.files.storage.FileSystemStorage",
            },
            "images": {
                "BACKEND": "django.core.files.storage.FileSystemStorage",
            },
            "staticfiles": {
                "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
            },
        }

Then we need to make Wagtail’s images use this images storage. To do this, we create a custom image class and set the storage in the file field definition.

    from django.core.files.storage import storages
    from wagtail.images.models import AbstractImage, WagtailImageField, get_upload_to

    class CustomImage(AbstractImage):
        # Get the 'images' storage from the storages defined in settings.py
        image_storage = storages['images']

        file = WagtailImageField(
            verbose_name="file",
            storage=image_storage,
            upload_to=get_upload_to,
            width_field="width",
            height_field="height",
        )

However, most image urls are actually for renditions, so what we really need is for renditions to use the images storage too. Wanting to serve renditions from a different location isn’t uncommon, so there is a setting for that: WAGTAILIMAGES_RENDITION_STORAGE.

    # in settings.py
    WAGTAILIMAGES_IMAGE_MODEL = 'core.CustomImage'
    # Use the images storage so we don't get auth querystrings!!
    WAGTAILIMAGES_RENDITION_STORAGE = 'images'

Build Wagtail from a Fork

Sometimes you need to run a forked version of Wagtail - for example if you are waiting for a pull request to get merged. The JavaScript parts of the package are committed as source files so you need to build the JS assets before packaging the Python code for distribution.

Install an appropriate node version. Look in the .nvmrc file to see which major version is currently being used. I use nvm to manage my node versions.

  > cat .nvmrc
    22
  > nvm use v22
    Now using node v22.12.0 (npm v10.9.0)

Use the development instructions to install the prerequisites and build the assets.

  > npm ci
  > npm run build

Then build the python package:

  > python ./setup.py sdist

This will create a file in the dist directory, in my case, wagtail-6.4a0.tar.gz. Put this file somewhere you can access it for repeated installs and then reference that location in your requirements.txt.
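For example, if you upload the tarball to a web server, a PEP 508 direct reference in requirements.txt points pip at it (the URL here is hypothetical - use wherever you actually host the file):

```
wagtail @ https://packages.example.com/dist/wagtail-6.4a0.tar.gz
```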

All these instructions are in the Wagtail docs but somehow I can never find them when I need them.

Hostnames and Aliases

Our multitenant Wagtail setup has made a lot of things smoother compared to the multi-instance setup it replaced. But there are a couple of situations that were easier in the old setup. For example, how do we set up a new site while the existing site is still live? Or, when an organization changes its name and wants its url updated to the new acronym - but still wants the old one to work?

Site Aliases

Our solution to these problems is twofold. First, create all sites as subdomains of the installation’s hostname; that takes care of building a new site while the current one is still live. But then we need to be able to assign the real name to the new site. To take care of that, we have a SiteAlias model to associate additional names with a site.

So if the hostname for our install is sites.example.com, then we create new sites as xyz.sites.example.com, etc. We have a wildcard DNS mapping and a wildcard SSL certificate for *.sites.example.com, so once we create a site, it is available on the public internet as https://xyz.sites.example.com. When the customer is ready for this site to be live with their preferred names, e.g. xyz.example.com, we can add the new name to our SiteAliases, request a DNS mapping and a new SSL certificate. Then the site is live with both names. Below is the code we use for mapping requests to sites.

    def match_site_to_request(request):
        """
        Find the Site object responsible for responding to this HTTP request object. Try in this order:
        * unique hostname
        * unique site alias

        If there is no matching hostname or alias for any Site, Site.DoesNotExist is raised.

        This function returns a tuple of (match_type_string, Site), where match type can be 'hostname' or 'alias'.
        It also pre-selects as much as it can from the Site and Settings, to avoid needless separate queries for things
        that will be looked at on most requests.

        This function may throw either MissingHostException or Site.DoesNotExist. Callers must handle those appropriately.
        """
        query = Site.objects.select_related('settings', 'root_page', 'features')

        if 'HTTP_HOST' not in request.META:
            # If the HTTP_HOST header is missing, this is an improperly configured test; any on-spec HTTP client will include it.
            raise MissingHostException()

        # Get the hostname. Strip off any port that might have been specified, since this function doesn't need it.
        hostname = split_domain_port(request.get_host())[0]
        try:
            # Find a Site matching this specified hostname.
            return ('hostname', query.get(hostname=hostname))
        except Site.DoesNotExist:
            # This catches "no Site exists with this canonical hostname"; now check whether 'hostname' matches an alias.
            # Site.DoesNotExist will be raised if 'hostname' doesn't match an alias either.
            return ('alias', query.get(settings__aliases__domain=hostname))

So all problems solved, right? Not quite.

Preferred Domains

Now that we have multiple domain names mapped to the same site, we have an SEO problem. Ideally each site should have one and only one canonical name, so we designate one of our aliases as the preferred domain name - and then have a site middleware that redirects requests to the https version of that name.

    class MultitenantSiteMiddleware(MiddlewareMixin):

        def process_request(self, request):
            """
            Set request._wagtail_site to the Site object responsible for handling this request. Wagtail's version of this
            middleware only looks at the Sites' hostnames. Ours must also consider the Sites' lists of aliases.
            """
            try:
                # We store the Site in request._wagtail_site to avoid having to patch Wagtail's Site.find_for_request
                match_type, request._wagtail_site = match_site_to_request(request)
            except Site.DoesNotExist:
                # This will trigger if no Site matches the request. We raise a 404 so that the user gets a useful message.
                # We provide the default site as request._wagtail_site, though, just in case a template(tag) that gets
                # rendered on the 404 page expects Site.find_for_request() to actually return a Site (rather than None).
                request._wagtail_site = Site.objects.get(is_default_site=True)
                raise Http404()
            except MissingHostException:
                # If no hostname was specified, we return a 400 error. This only happens in tests.
                return HttpResponseBadRequest("No HTTP_HOST header detected. Site cannot be determined without one.")

            # Grab the site we just assigned, using the Wagtail method, to ensure the Wagtail method will work later.
            current_site = Site.find_for_request(request)

            # Determine how the user arrived here, so we can redirect them as needed.
            arrival_domain, arrival_port = split_domain_port(request.get_host())
            # If an empty port was returned from split_domain_port(), we know it's either 80 or 443.
            if not arrival_port:
                arrival_port = 80 if not request.is_secure() else 443

            # If a user visits any site via http://, and we can be 100% sure that an https-compatible version of that site
            # exists, redirect to it automatically.
            if not request.is_secure() and is_ssl_domain(arrival_domain, request):
                target_domain = get_public_domain_for_site(current_site)
                target_url = f'https://{target_domain}{request.get_full_path()}'
                logger.info(
                    'https.auto-redirect',
                    arrival_url=f'http://{arrival_domain}{request.get_full_path()}',
                    target_url=target_url
                )
                # Issue a permanent redirect, so that search engines know that the http:// URL isn't valid.
                return HttpResponsePermanentRedirect(target_url)

    def get_public_domain_for_site(site):
        """
        Returns the public-facing domain for this site
        """
        return site.settings.preferred_domain or site.hostname


    def is_ssl_domain(domain, request):
        """
        Returns True if the given domain name is guaranteed to match our SSL certs after being run through
        get_public_domain_for_site().
        """
        # We know the domain matches our SSL certs post-get_public_domain_for_site() if one of two things is true:

        # 1. The current site has a preferred_domain set. We know this implies SSL compatibility because the Site Settings
        # form prevents a non-SSL-compatible preferred_domain from being set.
        current_site = Site.find_for_request(request)
        if current_site.settings.preferred_domain:
            return True

        # 2. If the given domain matches any of our SSL wildcard domains in the way that SSL counts as a match,
        # e.g. *.example.com matches xyz.example.com but not www.xyz.example.com.
        for wildcard_domain in settings.SSL_WILDCARD_DOMAINS:
            if re.match(rf'^[^.]+\.{wildcard_domain}$', domain):
                return True

        return False
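That wildcard rule can be checked in isolation - a single extra label matches, a nested subdomain does not, exactly as SSL certificates count a wildcard match:

```python
import re

# The same single-label wildcard match used in is_ssl_domain() above
wildcard_domain = 'sites.example.com'
pattern = rf'^[^.]+\.{wildcard_domain}$'

print(bool(re.match(pattern, 'xyz.sites.example.com')))      # True
print(bool(re.match(pattern, 'www.xyz.sites.example.com')))  # False
```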

Relative urls

So now all requests will end up on the preferred domain, right? Sadly, no. Despite all advice to use page and document choosers when creating links within a site, our content editors often copy and paste links from the browser’s address bar instead. Unfortunately, that often produces links with the xyz.sites.example.com domain name - particularly for sites that are built while an old site is still live. So we have code that ensures we always store relative urls for links within a site. When a page is saved, a new Revision is created. Before that revision is saved, we convert any links to “our” domains into relative links and store those instead.

    @receiver(pre_save)
    def postprocess_links(sender, instance, raw, using, update_fields, **kwargs):
        """
        To ensure that copy-pasted URLs always point to the correct site, we remove the scheme and domain from any URLs
        which include the site's hostname or any of its aliases, converting them into relative URLs.
        """
        if sender == Revision:
            domains = get_domains_for_current_site()
            content_string = json.dumps(instance.content, cls=DjangoJSONEncoder)
            updated_string = domain_erase(domains, content_string)
            instance.content = json.loads(updated_string, object_hook=_decode_revision_datetimes)


    def get_domains_for_current_site():
        """
        Returns the list of domains associated with the current site. If there is no current site, returns empty list.
        """
        request = get_current_request()
        site = Site.find_for_request(request)
        alias_domains = []
        if site:
            alias_domains.append(site.hostname)
            try:
                alias_domains.extend([alias.domain for alias in site.settings.aliases.all()])
            except ObjectDoesNotExist:
                # This is a generic "except" because core can't know which settings class's DoesNotExist might get thrown.
                pass
        return alias_domains


    def domain_erase(domains, text):
        """
        Removes all instances of the specified domains from the links in the given text.
        """
        # Do nothing if given an empty list of domains. Otherwise, we'll get mangled output.
        if not domains:
            return text

        # Create a regular expression from the domains, e.g. (https?://blah\.com/?|https?://www\.blue\.com/?).
        escaped_domains = [f"https?://{re.escape(domain)}/?" for domain in domains]
        regex = '(' + "|".join(escaped_domains) + ')'
        # Replace each match with a /. This regex will convert https://www.example.com/path to /path and
        # https://www.example.com into /. Since this is just about converting local URLs, that's the appropriate conversion
        # for path-less ones.
        replaced = re.sub(regex, '/', text)
        return replaced

Running TOX locally

TL;DR: Install the tox and virtualenv-pyenv Python packages. In the shell, export VIRTUALENV_DISCOVERY=pyenv. And in tox.ini, add VIRTUALENV_DISCOVERY=pyenv to the setenv section.


I needed to do some long-deferred maintenance on the wagtail-hallo package we still use. I have neglected it for so long that I felt I should test it on several different versions of Wagtail. The package’s CI setup already runs some basic python tests for several combinations of Python, Django, Wagtail, and 2 databases. So I thought I would start by running those automated tests locally - and then move on to browser testing.

So how do I run tox locally? As usual, there is the official documentation, which is thorough and overwhelming. But this one-pager from the OpenAstronomy Python Packaging Guide was much closer to what I needed to get started. It explained the structure of my existing tox.ini file and showed me how to run individual test environments - or all of them. I didn’t want to set up postgres, so the first thing I did was remove postgres from the database options - leaving only the sqlite environments. Then I did pip install tox into the virtual environment I use for developing Wagtail - currently Python 3.12.5, Django 5.0, and wagtail plus its dependencies from sha a5761bc2a961a8c91e5482d2e301191f617fe3d4.

The first time I ran tox, some of the environments ran, but most reported that they couldn’t find an appropriate python - even for versions I know I have installed in my pyenv setup. ChatGPT told me I needed to install tox-pyenv, but that project’s GitHub README says it is archived and points to a different package: virtualenv-pyenv.

So I installed virtualenv-pyenv with pip install virtualenv-pyenv, added the required environment variable to my bash profile (VIRTUALENV_DISCOVERY=pyenv), and then edited the tox.ini file to add the same variable:

    setenv =
        postgres: DATABASE_URL={env:DATABASE_URL:postgres:///wagtail_hallo}
        VIRTUALENV_DISCOVERY=pyenv

So at this point, I have the following tox-related packages in my virtual environment:

    tox==4.23.2
    virtualenv==20.28.0
    virtualenv-pyenv==0.5.0

And now when I ran tox again, most of my tests ran:

    python3.8-django3.2-wagtail3.0-sqlite: SKIP (0.01 seconds)
    python3.9-django3.2-wagtail3.0-sqlite: SKIP (0.00 seconds)
    python3.10-django3.2-wagtail3.0-sqlite: OK (15.19=setup[10.32]+cmd[4.87] seconds)
    python3.9-django4.1-wagtail4.2-sqlite: SKIP (0.00 seconds)
    python3.10-django4.1-wagtail4.2-sqlite: OK (15.41=setup[8.17]+cmd[7.25] seconds)
    python3.11-django4.1-wagtail4.2-sqlite: OK (15.46=setup[8.02]+cmd[7.44] seconds)
    python3.10-django4.2-wagtail5.2-sqlite: OK (14.53=setup[7.81]+cmd[6.72] seconds)
    python3.10-django4.2-wagtail6.1-sqlite: OK (14.67=setup[7.50]+cmd[7.17] seconds)
    python3.10-django5.0-wagtail5.2-sqlite: OK (14.43=setup[7.42]+cmd[7.01] seconds)
    python3.10-django5.0-wagtail6.1-sqlite: OK (15.05=setup[7.38]+cmd[7.67] seconds)
    python3.11-django4.2-wagtail5.2-sqlite: OK (14.03=setup[7.36]+cmd[6.66] seconds)
    python3.11-django4.2-wagtail6.1-sqlite: OK (14.32=setup[7.08]+cmd[7.24] seconds)
    python3.11-django5.0-wagtail5.2-sqlite: OK (14.19=setup[7.07]+cmd[7.12] seconds)
    python3.11-django5.0-wagtail6.1-sqlite: OK (14.74=setup[7.11]+cmd[7.63] seconds)
    python3.12-django4.2-wagtail5.2-sqlite: OK (6.48=setup[0.01]+cmd[6.47] seconds)
    python3.12-django4.2-wagtail6.1-sqlite: OK (6.81=setup[0.01]+cmd[6.81] seconds)
    python3.12-django5.0-wagtail5.2-sqlite: OK (6.67=setup[0.01]+cmd[6.66] seconds)
    python3.12-django5.0-wagtail6.1-sqlite: OK (7.46=setup[0.01]+cmd[7.45] seconds)
    congratulations :) (189.50 seconds)

Locally I don’t care about python 3.8 and 3.9, so I am just going to ignore them. The messages in the console indicate that pyenv doesn’t have interpreters installed for those older pythons:

    skipped because could not find python interpreter with spec(s): python3.9
    only CPython is currently supported