GCS Storage Optimization #

With data volumes continuously growing, optimizing Google Cloud Storage usage can lead to significant cost savings. To tackle this challenge, I developed a Python utility that helps summarize and analyze the stored data, making it easier to identify large files and folders on GCS.

While identifying the total storage cost of a bucket is relatively straightforward using the GCP billing report, identifying large files and folders within buckets can be tedious. This utility helps to quickly identify large blobs/files and folders.

Key Features #

  1. Storage Summarization: It summarizes storage usage up to a specified directory level, allowing you to see how storage is distributed across your GCS Bucket.
  2. Filtering: You can filter the output by prefix, ensuring only paths that match a specific pattern are included.
  3. Threshold: A size threshold can be set to exclude smaller entries, focusing on the most significant storage consumers.
  4. Readable Output: The sizes are formatted in human-readable units.
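As an illustration of the last point, the human-readable formatting can be sketched as follows (decimal SI units, which is how GCS reports byte counts; the PB fallback is an addition for very large inputs):

```python
def format_size(size):
    # Walk up through decimal (SI) units until the value fits.
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size < 1000:
            return f"{size:.2f} {unit}"
        size /= 1000
    return f"{size:.2f} PB"

print(format_size(10_000_000_000))  # 10 decimal gigabytes
```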

How it works #

First, we need to export the blob-level storage consumption. This can be done using the gsutil du command [1]. Unfortunately, this command currently does not support folder-level aggregations.

You can generate this file using a command like the following:

gsutil du "gs://your-bucket-name/**" > all_objects.txt

This command lists the disk usage of your storage bucket and saves the output to a text file, which the script will then process.
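Each line of the resulting file pairs a byte count with a full object path (the path below is hypothetical). The script splits each line on the first run of whitespace, so object names containing spaces stay intact:

```python
# A hypothetical line as produced by `gsutil du`:
line = "5368709120   gs://your-bucket-name/exports/2024/data.parquet"

# Split only on the first whitespace run; the remainder is the path.
size, path = line.split(maxsplit=1)
print(int(size), path)
```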

Please note that this can take a while, especially for large buckets. You can speed up the export by limiting its scope, e.g. by exporting only a sub-folder such as gs://your-bucket-name/directory/**.

The Script #

import argparse

def parse_input_file(file_path):
    """Parse `gsutil du` output into a list of (size, path) tuples."""
    data = []
    with open(file_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the first whitespace run so paths containing
            # spaces remain intact.
            size, path = line.split(maxsplit=1)
            data.append((int(size), path))
    return data

def summarize_storage(data, level, prefix):
    summary = {}

    for size, path in data:
        if prefix and not path.startswith(prefix):
            continue

        parts = path.split("/")

        # Splitting "gs://bucket/a/b" on "/" yields
        # ["gs:", "", "bucket", "a", "b"], so the first three parts
        # cover the scheme and bucket name; `level` further path
        # components are kept for the aggregation key.
        key = "/".join(parts[: level + 3])
        summary[key] = summary.get(key, 0) + size

    # Return sorted summary list descending by size
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)

def filter_by_threshold(summary, threshold):
    return [(key, size) for key, size in summary if size >= threshold]

def format_size(size):
    # Decimal (SI) units, matching how GCS reports storage.
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size < 1000:
            return f"{size:.2f} {unit}"
        size /= 1000
    return f"{size:.2f} PB"

def main():
    parser = argparse.ArgumentParser(
        description="Summarize storage data."
    )
    parser.add_argument(
        "file", 
        help="The input file with size and filename data."
    )
    parser.add_argument(
        "-L", 
        type=int,
        default=0,
        help="The level of directories to summarize."
    )
    parser.add_argument(
        "--prefix", 
        type=str,
        help="The prefix to filter paths."
    )
    parser.add_argument(
        "--threshold",
        type=int,
        default=0,
        help="Minimum size threshold in bytes to include in the summary.",
    )

    args = parser.parse_args()

    data = parse_input_file(args.file)
    summary = summarize_storage(data, args.L, args.prefix)
    summary = filter_by_threshold(summary, args.threshold)

    for key, size in summary:
        print(f"{key}: {format_size(size)}")

if __name__ == "__main__":
    main()
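To see how the -L level maps onto aggregation keys, here is a stripped-down sketch of the summarization step on a few hypothetical entries (`summarize` is a simplified stand-in for the script's summarize_storage):

```python
data = [
    (100, "gs://bucket/a/x.bin"),
    (200, "gs://bucket/a/y.bin"),
    (50,  "gs://bucket/b/z.bin"),
]

def summarize(data, level):
    # Group sizes by the first `level` path components below the bucket,
    # then sort descending by aggregated size.
    summary = {}
    for size, path in data:
        key = "/".join(path.split("/")[: level + 3])
        summary[key] = summary.get(key, 0) + size
    return sorted(summary.items(), key=lambda kv: kv[1], reverse=True)

print(summarize(data, 0))  # level 0: the whole bucket as one entry
print(summarize(data, 1))  # level 1: one entry per top-level folder
```

With level 0 everything collapses into a single gs://bucket entry; with level 1 the sizes are aggregated per top-level folder, largest first.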

General Usage #

usage: storage_summary.py [-h] [-L L] [--prefix PREFIX] [--threshold THRESHOLD] file

Summarize storage data.

positional arguments:
  file                  The input file with size and filename data.

optional arguments:
  -h, --help            show this help message and exit
  -L L                  The level of directories to summarize.
  --prefix PREFIX       The prefix to filter paths.
  --threshold THRESHOLD
                        Minimum size threshold in bytes to include in the summary.

Examples #

Return the full bucket size:

python storage_summary.py all_objects.txt

List all files and folders in the root:

python storage_summary.py all_objects.txt -L 1

List all files and folders in the root folder that are larger than 10 GB:

python storage_summary.py all_objects.txt -L 1 --threshold 10000000000

List all files and folders on the third nested level inside a folder that are larger than 1 GB:

python storage_summary.py all_objects.txt -L 4 --prefix gs://your-bucket-name/directory/ --threshold 1000000000
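Since --threshold expects raw bytes, decimal unit multipliers make the threshold values above easier to derive:

```python
# Decimal multipliers, matching the script's human-readable formatting.
GB = 1000 ** 3

# 10 GB expressed in bytes, as passed to --threshold above:
print(10 * GB)
```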

References #

[1] https://cloud.google.com/storage/docs/gsutil/commands/du

Copyright (c) 2025 Nico Hein