CNK's Blog

Review: Django Views — The Right Way

I finally set aside some time to read Django Views — The Right Way. Luke Plant makes a very persuasive argument for function-based views (FBVs) - especially that they are more explicit, particularly in the transparency of passing request into each view function. He is quite negative about class-based views (CBVs). I kind of agree about CBVs hiding details and making it harder to know what to change. I would be lost without the Classy Class-Based Views site, but Luke’s critiques are more about CBVs creating layers of indirection and actually introducing extra boilerplate compared to FBVs.

Most of his examples are simple enough that either approach is fine. But his example of a detail view plus a list of related items is rather persuasive - largely because the CBV version feels a little weird. And the story about a conversion to CBVs introducing a security issue is sobering. But the really scary section is “Mixins do not Compose”, though I am not sure how much of that is fundamental to mixins vs the details of UserPassesTestMixin. Perhaps it is fundamental. In the section Helpers vs mixins he enumerates issues with mixins, including:

  • Mixins hide the source of the changes; you have to look at all of them to get the full picture.

  • Mixins do not have a well-defined interface. Because mixins work through inheritance, each mixin affects the definition of the class and spreads the class definition out into multiple places. In the FBV versions, each function has a specific signature - we know all the parameters that go in and we can see what it returns. If you add type hinting, your IDE can help you check for plausibility. The next thing I need to read is this essay on Composition over Inheritance.

The last really interesting part of this essay is the Thin Views section. This is the Django version of Rails’ “fat models, skinny controllers”. He argues that the view should really only handle the request/response cycle and the business logic should live in separate methods - with a preference for making them part of the model layer rather than creating a services layer. The business logic doesn’t necessarily have to be part of the ORM model class, but it should be in a class or function that is independently testable. The most interesting example was a demo of how to implement Rails’ model scopes by creating custom QuerySet classes. This is how Wagtail adds its page tree methods to the Page model.
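
As a rough sketch of the custom QuerySet idea (the model and field names below are my own invention, not taken from the book), each method on the QuerySet becomes a chainable, individually testable filter - much like a Rails scope:

    from django.db import models

    class ArticleQuerySet(models.QuerySet):
        # Each method is a reusable, chainable filter - roughly a Rails scope.
        def published(self):
            return self.filter(status="published")

        def by_author(self, author):
            return self.filter(author=author)

    class Article(models.Model):
        status = models.CharField(max_length=20)
        author = models.ForeignKey("auth.User", on_delete=models.CASCADE)

        # as_manager() makes the QuerySet the model's default manager, so a view
        # can call Article.objects.published().by_author(user) and stay thin.
        objects = ArticleQuerySet.as_manager()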

Cloudflare Proxy to S3

At work we have seen a big uptick in traffic - enough that the egress costs for S3 are exceeding a reasonable budget. We use Cloudflare to cache our HTML, JS, CSS, etc. but have always just served our images straight from our S3 bucket. Time to change that and use Cloudflare for caching and perhaps to block some of the excess traffic.

We had already added a CloudFront distribution in front of our S3 bucket, so our initial idea was just to point a Cloudflare name at the CloudFront url. Unfortunately that didn’t work. We could get to the images when using the CloudFront url, but when using the Cloudflare proxy to that url, we got Error 525: SSL handshake failed. We tried a bunch of variations - including going straight to the CloudFront distribution name rather than the nicer domain name we had assigned to it. But no dice. We are paying customers so we filed a ticket to get help.

The SSL handshake problem is because the domain names don’t match. To be able to proxy to the nice CloudFront domain name, we need to add a rule that sets the Host header on each request (a Terraform sketch of the same rule follows the steps below). Steps:

  1. In the zone where you want to create the proxy, navigate to “Rules” and click “Create rule”.
  2. For the type of rule to create, choose “Origin Rule”.
  3. Use the Change HTTP host header template to start creating your rule:
    • Create a Custom filter expression that matches “Hostname” to the host you are setting up in Cloudflare
    • Set a Host Header to rewrite to the hostname of your CloudFront distribution
    • Preserve the DNS Record and Destination Port (the other fields in that form).
  4. Deploy.
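
For reference, here is roughly what that Origin Rule looks like if you manage Cloudflare with Terraform, as we do for most of our infrastructure. This is only a sketch using the v4 provider’s cloudflare_ruleset resource; the zone id, hostnames, and CloudFront domain below are placeholders, not our real values:

    resource "cloudflare_ruleset" "origin_rules" {
      zone_id = var.cloudflare_zone_id
      name    = "Origin rules"
      kind    = "zone"
      phase   = "http_request_origin"

      rules {
        action = "route"
        action_parameters {
          # Rewrite the Host header so CloudFront sees a name it recognizes.
          host_header = "dxxxxxxxxxxxx.cloudfront.net"
        }
        expression  = "(http.host eq \"images.example.com\")"
        description = "Send images.example.com requests to CloudFront with a matching Host header"
        enabled     = true
      }
    }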

Our Cloudflare rep had also told us to rewrite the SNI header, but that appears to be optional since the proxy to CloudFront works without it. That being optional surprised me a little - especially since one of the other configuration parameters they had us play with was the encryption mode.

When we had trouble with the SSL handshake, we had tried backing off from the default of “Full” to “Flexible” but were advised to be more strict. For this configuration, we have the encryption mode between Cloudflare and the origin set to Strict (SSL-Only Origin Pull).

If you want to rewrite the SNI header, you need to use a domain name you control. If you try to use the url to your CloudFront distribution, <somestring>.cloudfront.net, you get the error message that <somestring>.cloudfront.net does not belong to your account. Cloudflare doesn’t have the same restriction for the Host header field. You can use the direct url for your CloudFront distribution or any alternate domain name you have set up in CloudFront. Both worked.

If you try to configure the Host header to point to the S3 bucket url directly, e.g. mybucket.s3.us-west-1.amazonaws.com, you still get the SSL handshake failed message, and you are prevented from trying to get around that by adding an SNI header.

So in summary, if you want to use CloudFront + Cloudflare, you will need both a DNS record and an Origin Rule to proxy requests to CloudFront and thus to S3.

Cloud Connector

Or, you can dispense with CloudFront and the extra rule and use Cloud Connector to do the setup for you. Cloud Connector is a Cloudflare feature for connecting to other cloud providers. We use Terraform for any configuration we can, so we followed the Cloudflare documentation to connect directly to the S3 bucket url <mybucket>.s3.us-west-1.amazonaws.com. From what our tech rep said, this is a more automatic way of doing the header rewrites I was fooling with in the first section AND apparently will do the SNI header rewrite for us (or so I surmise).

Configuring Image Serving and Cache Clearing

For our web pages, we configure Cloudflare to obey whatever caching headers we set on the origin server. It’s possible to get S3 to send caching headers, but we decided it would be a lot easier to control image caching using a Cloudflare rule, so we created a custom Cloudflare zone just for serving images. Then we created a new domain name for each bucket we want to put behind Cloudflare and added a Cloud Connector rule for each. We use django-storages, so once we configured AWS_S3_CUSTOM_DOMAIN, our app started building image urls that use the new Cloudflare-backed domain.
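
The relevant django-storages setting is AWS_S3_CUSTOM_DOMAIN; something along these lines works (the env variable name and domain here are placeholders for whatever you set up in Cloudflare):

    # settings.py
    # Build media urls against the Cloudflare-proxied hostname instead of the
    # raw S3 bucket url. The domain below is a placeholder.
    AWS_S3_CUSTOM_DOMAIN = env('AWS_S3_CUSTOM_DOMAIN', default='images.example.com')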

A lot of the images we use are very stable so we can cache them for a long time. But when they change, we need to clear them from our cache. First I created a separate cache clearing backend for the images Cloudflare zone.

    def get_images_zone_backend():
        """
        Set up Cloudflare backend for clearing cache in images.example.com zone
        """
        cf_token = getattr(settings, "CLOUDFLARE_BEARER_TOKEN", None)

        if not cf_token:
            if not settings.DEBUG and not settings.TESTING:
                logger.error(
                    "cloudflare.configuration_error: ImproperlyConfigured caching: CLOUDFLARE_BEARER_TOKEN not configured"
                )
            return None

        zone_id = getattr(settings, "CLOUDFLARE_IMAGES_ZONE_ID", None)
        if not zone_id:
            # Some of our installs don't use the images zone yet, so don't log an error.
            return None

        return MultitenantCloudflareBackend({"BEARER_TOKEN": cf_token, "ZONE_ID": zone_id})

Then we need to configure cache clearing when images change.

    from django.db.models.signals import pre_delete, pre_save
    from django.dispatch import receiver

    def clear_image_cache(file_url):
        """
        Clear the Cloudflare cache for the given image file URL.
        """
        backend = get_images_zone_backend()
        if backend:
            logger.info(f"cloudflare: Purging image file: {file_url}", zone_id=backend.cloudflare_zoneid)
            backend.purge(file_url)


    @receiver(pre_save, sender=CustomImage)
    def clear_cache_when_image_saved(sender, instance, **kwargs):
        """
        Clear the Cloudflare cache if we are saving the entire image (the 'not
        update_fields' clause) or if the collection or file are changed.
        """
        update_fields = kwargs.get("update_fields")
        if not update_fields or "collection" in update_fields or "file" in update_fields:
            clear_image_cache(instance.file.url)


    # Delete the source image file when an image is deleted.
    @receiver(pre_delete, sender=CustomImage)
    def image_delete(sender, instance, **kwargs):
        clear_image_cache(instance.file.url)
        # Tell delete() not to save the instance, since we're in the middle of a delete operation for it.
        instance.file.delete(save=False)


    # Delete the rendition image file when a rendition is deleted.
    @receiver(pre_delete, sender=CustomRendition)
    def rendition_delete(sender, instance, **kwargs):
        # Tell delete() not to save the instance, since we're in the middle of a delete operation for it.
        clear_image_cache(instance.file.url)
        instance.file.delete(save=False)

Caching for the win!

We set the edge TTL (how long Cloudflare caches an image) to 7 days and the browser TTL (how long we tell browsers to cache the image before checking for updates) to 4 hours. With those settings, we have a cache hit ratio of 99% and the S3 hosting costs are back down below what they were before the traffic increase.
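
If you prefer to keep those cache settings in Terraform rather than the dashboard, a cache rule along these lines should be roughly equivalent - again a sketch against the v4 provider, with the zone id and hostname as placeholders:

    resource "cloudflare_ruleset" "image_cache_settings" {
      zone_id = var.cloudflare_zone_id
      name    = "Image cache settings"
      kind    = "zone"
      phase   = "http_request_cache_settings"

      rules {
        action = "set_cache_settings"
        action_parameters {
          cache = true
          edge_ttl {
            # Cloudflare keeps cached copies for 7 days (in seconds).
            mode    = "override_origin"
            default = 604800
          }
          browser_ttl {
            # Browsers re-check after 4 hours.
            mode    = "override_origin"
            default = 14400
          }
        }
        expression  = "(http.host eq \"images.example.com\")"
        description = "Cache images aggressively at the edge"
        enabled     = true
      }
    }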

Image Audit

I am upgrading a very old Wagtail project - initially built on Wagtail 1.3, which is way before Wagtail introduced the ReferenceIndex or started using file_hash to check for duplicate images during the upload process. After upgrading to Wagtail 6.3, and as part of moving the site to new hosting, I decided to clean up some of the duplicates.

Duplicate Images

Here is the basic query for duplicate images:

    SELECT id, file, file_hash FROM core_customimage
    WHERE file_hash IN (
        SELECT file_hash FROM core_customimage GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

Is it safe to delete all but one of the duplicates? We can’t tell from just that query. We need to find out which of these (if any) are in use. To do that we need to join out to the reference index. In my install, my custom image model has a content type id of 37. Any rows that have NULL for the reference index id column are NOT referenced anywhere else on the site; those can safely be deleted.
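
If you don’t know the content type id for your image model, you can look it up in the django_content_type table; the app label and model name below are from my install, so adjust them for yours:

    SELECT id FROM django_content_type
    WHERE app_label = 'core' AND model = 'customimage';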

    SELECT core_customimage.id, file, file_hash, wagtailcore_referenceindex.id
    FROM core_customimage
    LEFT OUTER JOIN wagtailcore_referenceindex
         ON wagtailcore_referenceindex.to_object_id = core_customimage.id
         AND to_content_type_id = 37
    WHERE file_hash IN (
      SELECT file_hash FROM core_customimage GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

Once I deleted all the duplicate images that were not used anywhere, I had a few where both copies of an image were in use. Since there was just a handful, I used the Wagtail admin UI to locate where the images were being used and edited the pages to use only one of the 2 copies. Then I could safely delete the other, now unused, copies.

Missing Image Files

I also had some situations where I thought the original image might be missing or corrupt. In a previous project, I had used code like this to check that each image was still in S3 where my database thought it should be:

    from django.core.files.storage import default_storage

    for image in CustomImage.objects.all():
        if not default_storage.exists(image.file.name):
            print(f"Image {image.title} (id: {image.id} collection: {image.collection_id}) is missing {image.file}")
            continue

However, because of the way I have my S3 bucket configured, exists was returning False for all images - even ones I could see in the browser. This appears to have something to do with the details of HEAD requests in boto3 - or perhaps I didn’t have my credentials configured correctly in the shell I was using for testing. In any case, since my image urls are public, instead of fighting with exists I used the Python requests library to check that the images exist and are publicly available.

    import requests

    # bucket_name is the name of your S3 bucket, e.g. settings.AWS_STORAGE_BUCKET_NAME
    for img in CustomImage.objects.all():
        url = f"https://{bucket_name}.s3.amazonaws.com/{img.file.name}"
        response = requests.head(url)
        if response.status_code == 200:
            # print(f"found {url}")
            continue
        else:
            print(f"File check failed: {img.id}, {response.status_code}, {img.file.name}")

Document checks

We can do the same checks to identify duplicate documents. Again I hard coded the content type id; the django_content_type query above works here too - just change the model name to match your document model.

    SELECT id, file, file_hash FROM wagtaildocs_document
    WHERE file_hash IN (
        SELECT file_hash FROM wagtaildocs_document GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

    SELECT wagtaildocs_document.id, file, file_hash, wagtailcore_referenceindex.id
    FROM wagtaildocs_document
    LEFT OUTER JOIN wagtailcore_referenceindex
            ON wagtailcore_referenceindex.to_object_id = wagtaildocs_document.id
            AND to_content_type_id = 5
    WHERE file_hash IN (
        SELECT file_hash FROM wagtaildocs_document GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

Because documents are private and use the default storage, we can use the exists check that I had trouble with for images.

    from django.core.files.storage import default_storage
    from wagtail.documents.models import Document

    for doc in Document.objects.all():
        file_path = doc.file.name
        if default_storage.exists(file_path):
            # print(f"{file_path} exists in S3.")
            pass
        else:
            print(f"The file does NOT exist in S3: {file_path}.")

S3 bucket configuration for Wagtail

We host our websites in Docker containers using Fargate on AWS. This means we don’t have a permanent file system, so we need to use S3 to store the media files our users upload. Fortunately, the django-storages package adds a variety of file storage options to Django, including S3. Setting up S3 via django-storages is pretty straightforward - install the package, configure storages.backends.s3.S3Storage as the storage backend, and include your AWS access information in environment variables.
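
Roughly, the install and environment look like this; the exact variable names depend on how your settings read them (ours use env(), as shown further down):

  > pip install django-storages boto3
  > export AWS_STORAGE_BUCKET_NAME=my-bucket-name
  > export AWS_DEFAULT_REGION=us-west-2
  > export AWS_ACCESS_KEY_ID=...          # or rely on the Fargate task role
  > export AWS_SECRET_ACCESS_KEY=...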

Configuring the S3 Bucket

The new default for S3 buckets is to block all public access. This is appropriate for documents, which we may need to keep private. But an authentication token in the query string interferes with browsers caching images. So I would like image files to be public while still keeping documents private.

The first step is to turn off parts of the S3 public access block so we can install a bucket policy that makes images public. We like to manage our AWS resources via Terraform, so we need to create the bucket and configure the access block.

    resource "aws_s3_bucket" "example" {
      bucket = "my-tf-test-bucket"
    }

    resource "aws_s3_bucket_public_access_block" "example" {
      bucket = aws_s3_bucket.example.id

      block_public_acls       = true
      block_public_policy     = false  # Temporarily turn off this block until we add a bucket policy
      ignore_public_acls      = true
      restrict_public_buckets = false  # This needs to be false so the bucket policy works
    }

Then we add a bucket policy.

    # Allow public read access to images in our s3 bucket
    data "aws_iam_policy_document" "images_public_read_policy" {
      # First our normal 'all access from account'
      statement {
        actions = ["s3:*"]

        principals {
          type        = "AWS"
          identifiers = ["arn:aws:iam::${local.account_id}:root"]
        }

        resources = [
          "${module.storage-bucket.s3-bucket-arn}",
          "${module.storage-bucket.s3-bucket-arn}/*",
        ]
      }

      statement {
        actions = ["s3:GetObject"]

        principals {
          type        = "AWS"
          identifiers = ["*"]  # Allow access to everyone (public)
        }

        # Block Public Access settings can allow public access to specific
        # resources, but not the entire bucket. Set restrict_public_buckets = false
        # to allow a policy that allows access to specific resources.
        resources = [
          "${module.storage-bucket.s3-bucket-arn}/images/*",
          "${module.storage-bucket.s3-bucket-arn}/original_images/*",
        ]
      }
    }

    resource "aws_s3_bucket_policy" "public_images_policy" {
      bucket = aws_s3_bucket.example.id
      policy = data.aws_iam_policy_document.images_public_read_policy.json
    }

Once that bucket policy is in place, we can change block_public_policy back to true to prevent further changes to the policy.

    resource "aws_s3_bucket_public_access_block" "example" {
      bucket = aws_s3_bucket.example.id

      block_public_acls       = true
      block_public_policy     = true
      ignore_public_acls      = true
      restrict_public_buckets = false  # This needs to be false so the bucket policy works
    }

Configuring Django STORAGES

The Terraform / AWS code above gives us S3 objects that behave the way we want - objects inside the images or original_images directories can be viewed without authentication, but objects anywhere else in the bucket need a token. However, my Django project was still generating urls with authentication tokens for both documents and images. Not what I wanted.

To get the image and document urls to behave differently, I need to configure two different kinds of storage: the default one stays private, so Django gives us urls with authentication query strings, and we add an images storage that produces public urls.

    # settings.py
    AWS_STORAGE_BUCKET_NAME = env('AWS_STORAGE_BUCKET_NAME', default=None)
    if AWS_STORAGE_BUCKET_NAME:
        AWS_S3_REGION_NAME = env('AWS_DEFAULT_REGION', default='us-west-2')
        STORAGES = {
            "default": {
                "BACKEND": "storages.backends.s3.S3Storage"
            },
            "images": {
                "BACKEND": "storages.backends.s3.S3Storage",
                "OPTIONS": {
                    "querystring_auth": False,
                }
            },
            "staticfiles": {
                "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
            },
        }
    else:
        STORAGES = {
            "default": {
                "BACKEND": "django.core.files.storage.FileSystemStorage",
            },
            "images": {
                "BACKEND": "django.core.files.storage.FileSystemStorage",
            },
            "staticfiles": {
                "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
            },
        }

Then we need to make Wagtail’s images use this images storage. To do this, we create a custom image class and set the storage in the file field definition.

    from django.core.files.storage import storages
    from wagtail.images.models import AbstractImage, WagtailImageField, get_upload_to

    class CustomImage(AbstractImage):
        # Get the 'images' storage from the storages defined in settings.py
        image_storage = storages['images']

        file = WagtailImageField(
            verbose_name="file",
            storage=image_storage,
            upload_to=get_upload_to,
            width_field="width",
            height_field="height",
        )

However, most image urls are actually for renditions, so what we really need is for renditions to use the images storage too. Wanting to serve renditions from a different location isn’t uncommon, so there is a setting for that: WAGTAILIMAGES_RENDITION_STORAGE.

    # in settings.py
    WAGTAILIMAGES_IMAGE_MODEL = 'core.CustomImage'
    # Use the images storage so we don't get auth querystrings!!
    WAGTAILIMAGES_RENDITION_STORAGE = 'images'

Build Wagtail from a Fork

Sometimes you need to run a forked version of Wagtail - for example if you are waiting for a pull request to get merged. The JavaScript parts of the package are committed as source files so you need to build the JS assets before packaging the Python code for distribution.

Install an appropriate node version. Look in the .nvmrc file to see which major version is currently being used. I use nvm to manage my node versions.

  > cat .nvmrc
    22
  > nvm use v22
    Now using node v22.12.0 (npm v10.9.0)

Use the development instructions to install the prerequisites and build the assets.

  > npm ci
  > npm run build

Then build the Python package:

  > python ./setup.py sdist

This will create a file in the dist directory, in my case, wagtail-6.4a0.tar.gz. Put this file somewhere you can access it for repeated installs and then reference that location in your requirements.txt.
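
For example, pip can install straight from a url, so the requirements entry can look something like this (the url below is a placeholder for wherever you host the built sdist):

    # requirements.txt
    wagtail @ https://packages.example.com/wagtail/wagtail-6.4a0.tar.gz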

All these instructions are in the Wagtail docs but somehow I can never find them when I need them.