GCS Storage Optimization #
With data volumes continuously growing, optimizing Google Cloud Storage usage can lead to significant cost savings. To tackle this challenge, I developed a Python utility that summarizes and analyzes the stored data, making it easier to identify large files and folders on GCS.
While determining the total storage cost of a bucket is relatively straightforward using the GCP billing report, identifying large files and folders within a bucket can be tedious. This utility helps to quickly locate large blobs/files and folders.
Key Features #
- Storage Summarization: It summarizes storage usage up to a specified directory level, allowing you to see how storage is distributed across your GCS Bucket.
- Filtering: You can filter the output by prefix, ensuring only paths that match a specific pattern are included.
- Threshold: A size threshold can be set to exclude smaller entries, focusing on the most significant storage consumers.
- Readable Output: Sizes are formatted in human-readable units, as in the sample output below.
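To give an impression of the result, a run against a bucket might print something like the following (bucket name and sizes are purely illustrative):

gs://your-bucket-name/archive: 1.25 TB
gs://your-bucket-name/raw-data: 310.40 GB
gs://your-bucket-name/exports: 12.87 GB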
How it works #
First, we need to export the blob-level storage consumption. This can be done using the gsutil du command [1]. Unfortunately, this command currently does not support folder-level aggregations.
You can generate the export file using a command like the following:
gsutil du "gs://your-bucket-name/**" > all_objects.txt
This command lists the disk usage of your storage bucket and saves the output to a text file, which the script will then process.
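The resulting file contains one line per object: the size in bytes followed by the object path. The entries below are made-up examples of what the script expects as input:

1024         gs://your-bucket-name/directory/file1.txt
5368709120   gs://your-bucket-name/archive/backup.tar.gz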
Please note that this can take a while, especially for large buckets. You can of course speed up the export by limiting its scope, as in the example below.
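For example, to restrict the export to a single top-level folder (directory is a placeholder name):

gsutil du "gs://your-bucket-name/directory/**" > directory_objects.txt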
The Script #
import argparse


def parse_input_file(file_path):
    """Parse gsutil du output into a list of (size, path) tuples."""
    data = []
    with open(file_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            size, path = line.split(maxsplit=1)
            data.append((int(size), path))
    return data


def summarize_storage(data, level, prefix):
    """Aggregate sizes up to the given directory level, optionally filtered by prefix."""
    summary = {}
    for size, path in data:
        if prefix and not path.startswith(prefix):
            continue
        # Splitting "gs://bucket/a/b" yields ["gs:", "", "bucket", "a", "b"],
        # so level + 3 keeps the scheme and bucket plus `level` path segments.
        parts = path.split("/")
        key = "/".join(parts[: level + 3])
        summary[key] = summary.get(key, 0) + size
    # Return the summary sorted descending by size
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)


def filter_by_threshold(summary, threshold):
    """Drop entries smaller than the threshold (in bytes)."""
    return [(key, size) for key, size in summary if size >= threshold]


def format_size(size):
    """Format a byte count using decimal (1000-based) units."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size < 1000:
            return f"{size:.2f} {unit}"
        size /= 1000
    return f"{size:.2f} PB"  # fallback for very large sizes


def main():
    parser = argparse.ArgumentParser(description="Summarize storage data.")
    parser.add_argument(
        "file",
        help="The input file with size and filename data.",
    )
    parser.add_argument(
        "-L",
        type=int,
        default=0,
        help="The level of directories to summarize.",
    )
    parser.add_argument(
        "--prefix",
        type=str,
        help="The prefix to filter paths.",
    )
    parser.add_argument(
        "--threshold",
        type=int,
        default=0,
        help="Minimum size threshold in bytes to include in the summary.",
    )
    args = parser.parse_args()

    data = parse_input_file(args.file)
    summary = summarize_storage(data, args.L, args.prefix)
    summary = filter_by_threshold(summary, args.threshold)
    for key, size in summary:
        print(f"{key}: {format_size(size)}")


if __name__ == "__main__":
    main()
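To see how the level parameter maps onto paths, here is a quick sanity check (assuming the script's functions are in scope; the bucket and file names are invented):

# "gs://bucket/a/b/file.txt".split("/") yields ["gs:", "", "bucket", "a", "b", "file.txt"],
# so -L 1 groups everything under the first folder below the bucket root.
data = [
    (100, "gs://bucket/a/b/file.txt"),
    (50, "gs://bucket/a/c.txt"),
]
print(summarize_storage(data, 1, None))
# prints: [('gs://bucket/a', 150)]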
Save the script as storage_summary.py (the file name used in the examples below). General usage:

usage: storage_summary.py [-h] [-L L] [--prefix PREFIX] [--threshold THRESHOLD] file

Summarize storage data.

positional arguments:
  file                  The input file with size and filename data.

optional arguments:
  -h, --help            show this help message and exit
  -L L                  The level of directories to summarize.
  --prefix PREFIX       The prefix to filter paths.
  --threshold THRESHOLD
                        Minimum size threshold in bytes to include in the summary.
Examples #
Return the full bucket size:
python storage_summary.py all_objects.txt
List all files and folders in the root:
python storage_summary.py all_objects.txt -L 1
List all files and folders in the root that are larger than 10 GB:
python storage_summary.py all_objects.txt -L 1 --threshold 10000000000
List all files at the third nested level inside a folder that are larger than 1 GB:
python storage_summary.py all_objects.txt -L 4 --prefix gs://your-bucket-name/directory/ --threshold 1000000000
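This last command might print something like the following (paths and sizes are invented for illustration):

gs://your-bucket-name/directory/a/b/c: 42.17 GB
gs://your-bucket-name/directory/a/d/e: 3.90 GB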
References #
[1] https://cloud.google.com/storage/docs/gsutil/commands/du