Guide

The Hidden Cloud Tax: Building Delta-Scanning Architectures for DSPM

Vendors pitch frictionless, continuous, agentless scanning. What they leave out of the pitch deck: reading multi-terabyte data objects and passing them through a classification engine costs real cloud compute and inter-region egress. A naive full-estate re-scan every 24 hours can raise an infrastructure bill by 15 to 25 percent before anyone on the security team understands why it changed.

Why full re-scans are the default, and why that's expensive

Most DSPM platforms scan broadly because broad is simple to build and simple to sell. A re-scan of the entire data estate on a fixed schedule, every day or every few hours, doesn't require the platform to track what changed since last time. It just reads everything again. That simplicity is invisible to the buyer until the cloud bill arrives.

Classifying a bucket means reading its contents. Reading at petabyte scale, especially across regions or across cloud accounts, triggers real API call volume and, depending on the architecture, real egress charges. This cost is entirely separate from the DSPM subscription fee, and it scales with the size of the data estate and the frequency of scanning, not with how much of that data actually changed. An organization running a full re-scan nightly across a multi-region, multi-account environment is paying to re-read data that hasn't moved or changed in months, every single night.

The delta-scanning alternative: only read what changed

The architecture that solves this is straightforward in concept: stop re-reading everything, and only read what actually changed since the last scan. The mechanism is object-level storage event triggers wired to a lightweight hashing pipeline. Cloud storage services emit events when objects are created or overwritten, such as the s3:ObjectCreated:* event types on Amazon S3 or the equivalent in other cloud providers. Subscribing a hashing function to those events means the system is notified the moment something changes, instead of having to go looking for changes on a fixed schedule.
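A minimal sketch of such a subscribing function, assuming the Amazon S3 event notification record layout (Records, eventName, s3.bucket.name, s3.object.key). The fetch_bytes callable is hypothetical and stands in for whatever reads the object body, and SHA-256 is an assumed choice; the approach does not depend on a particular hash.

```python
import hashlib
from typing import Callable, Dict
from urllib.parse import unquote_plus


def hash_changed_objects(
    event: dict,
    fetch_bytes: Callable[[str, str], bytes],
) -> Dict[str, str]:
    """Return {"bucket/key": sha256 hex digest} for created or overwritten objects.

    fetch_bytes(bucket, key) is a hypothetical callable that returns the
    object's content; in a real deployment it would wrap an SDK read call.
    """
    hashes: Dict[str, str] = {}
    for record in event.get("Records", []):
        # Inside a notification record the event name has no "s3:" prefix,
        # e.g. "ObjectCreated:Put". Other event types are ignored here.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        digest = hashlib.sha256(fetch_bytes(bucket, key)).hexdigest()
        hashes[f"{bucket}/{key}"] = digest
    return hashes
```

Injecting fetch_bytes keeps the parsing and hashing logic testable without cloud credentials; for very large objects a streaming hash would be the more realistic choice.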

That hashing function computes a hash of each new or modified object and records it in a lightweight state table, commonly something like DynamoDB given its low operational overhead at this scale. On each scan cycle, the DSPM tool is granted read access only to objects whose current hash differs from the hash recorded when the object was last classified. Everything else is skipped, not because it's assumed safe, but because it's already been classified and hasn't changed since.
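The selection step can be sketched as a comparison between freshly observed hashes and the recorded ones. Plain dictionaries stand in for the state table here, and the function name is illustrative rather than part of any DSPM product.

```python
from typing import Dict, List


def select_objects_to_scan(
    observed_hashes: Dict[str, str],
    state_table: Dict[str, str],
) -> List[str]:
    """Return keys that are new or whose content hash differs from the record.

    observed_hashes: {object key: hash computed from the latest event}
    state_table:     {object key: hash recorded at last classification}
    """
    return sorted(
        key
        for key, digest in observed_hashes.items()
        # A missing entry means the object has never been classified.
        if state_table.get(key) != digest
    )
```

Only the returned keys would be handed to the classification engine; unchanged keys never leave the state table lookup.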

The practical effect: instead of granting a DSPM engine blanket read rights to re-index an entire estate on every cycle, the access pattern narrows to exactly the objects that are new or modified. Compute and egress costs scale with the rate of change in the data estate, not with its total size. For most organizations, the volume of genuinely new or modified data on any given day is a small fraction of the total estate, and the cost savings follow directly from that ratio.
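To make the ratio concrete, here is a rough cost model. The per-gigabyte and per-request rates are placeholder parameters, not real cloud pricing, and the model ignores the cost of the hashing pipeline itself.

```python
from typing import Tuple


def scan_cost(
    bytes_read: float,
    objects_read: float,
    usd_per_gb: float,
    usd_per_1k_requests: float,
) -> float:
    """Cost of one scan pass: data read plus API requests (hypothetical rates)."""
    return (bytes_read / 1e9) * usd_per_gb + (objects_read / 1000) * usd_per_1k_requests


def full_vs_delta(
    estate_bytes: float,
    estate_objects: float,
    daily_change_fraction: float,
    usd_per_gb: float,
    usd_per_1k_requests: float,
) -> Tuple[float, float]:
    """Return (full re-scan cost, delta scan cost) for one daily cycle."""
    full = scan_cost(estate_bytes, estate_objects, usd_per_gb, usd_per_1k_requests)
    delta = scan_cost(
        estate_bytes * daily_change_fraction,
        estate_objects * daily_change_fraction,
        usd_per_gb,
        usd_per_1k_requests,
    )
    return full, delta
```

Because both terms are linear in the volume read, the delta cost in this model is simply the full cost multiplied by the daily change fraction, whatever rates are plugged in.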

What the state table actually needs to track

A minimal state table needs the object key, a content hash, a last-scanned timestamp, and a reference to the most recent classification result for that object. That's enough to answer the basic question of whether an object's content has changed since it was last classified.
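One way to model that record, as a sketch: the field names below are illustrative, and a real table would map them onto whatever key schema the chosen datastore uses.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ObjectState:
    """One state-table row per object (field names are assumptions)."""

    object_key: str                            # e.g. "bucket/path/to/object"
    content_hash: str                          # hash at last classification
    last_scanned_at: datetime                  # when classification last ran
    classification_ref: Optional[str] = None   # pointer to the stored result

    def needs_rescan(self, observed_hash: str) -> bool:
        """True when the object's content differs from what was classified."""
        return observed_hash != self.content_hash

    def mark_classified(
        self,
        observed_hash: str,
        classification_ref: str,
        now: Optional[datetime] = None,
    ) -> "ObjectState":
        """Return an updated row after a successful classification."""
        return replace(
            self,
            content_hash=observed_hash,
            last_scanned_at=now or datetime.now(timezone.utc),
            classification_ref=classification_ref,
        )
```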

What hash-only tracking misses is anything that isn't a content change. A permission change on an object, such as an ACL update that suddenly makes a previously private object publicly readable, doesn't touch the object's bytes and won't trigger a content-hash mismatch. But it is exactly the kind of posture change DSPM exists to catch. A delta-scanning architecture built purely around content hashing will silently miss this category of risk, because nothing about the object's content changed, only who is allowed to reach it.

The fix is tracking permission and ACL state alongside content hash, with its own change-detection path. Most cloud providers emit separate events for permission changes distinct from object writes, and those need their own trigger into the pipeline, evaluated independently from the content-hash comparison.
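A sketch of that second change-detection path, assuming ACL grants shaped like the S3 ACL model (a Grantee with an optional URI, plus a Permission). The function names are illustrative, and bucket policies or IAM changes would need their own handling beyond this example.

```python
import hashlib
import json
from typing import Dict, List

# Predefined S3 group that represents anyone on the internet.
ALL_USERS_URI = "http://acs.amazonaws.com/groups/global/AllUsers"


def acl_fingerprint(grants: List[Dict]) -> str:
    """Order-independent hash of an object's ACL grants."""
    canonical = sorted(json.dumps(grant, sort_keys=True) for grant in grants)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def is_publicly_readable(grants: List[Dict]) -> bool:
    """True if any grant gives read access to the AllUsers group."""
    return any(
        grant.get("Grantee", {}).get("URI") == ALL_USERS_URI
        and grant.get("Permission") in {"READ", "FULL_CONTROL"}
        for grant in grants
    )


def needs_posture_review(stored_fingerprint: str, grants: List[Dict]) -> bool:
    """True when the ACL differs from the recorded one, even if bytes did not."""
    return acl_fingerprint(grants) != stored_fingerprint
```

The fingerprint is stored next to the content hash but compared separately, so an ACL change flags an object for review even when its content hash still matches.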

Where this breaks if built naively

Treating delta-scanning as purely content-triggered, and not also wiring in permission and ACL change events, is the most common gap. It produces an architecture that looks complete, passes initial testing against content modifications, and then misses the exact class of exposure, a bucket quietly becoming public, that matters most.

Cross-account copies are a second failure point. When an object is copied from one account into another using an assumed role, the event behavior can differ from a same-account write, and depending on how the pipeline is wired, the destination account's copy of the object may not reliably fire the expected creation event in a way the hashing pipeline picks up. An architecture tested only against single-account writes can have a blind spot here that doesn't surface until an actual cross-account data movement happens.

State table drift is the quieter, longer-term risk. If the hashing pipeline itself fails, for example when an event gets dropped or a Lambda function errors out silently, the state table falls out of sync with reality without anyone noticing immediately. The system continues operating as though everything is current, because nothing in the basic architecture distinguishes between "no changes happened" and "we stopped detecting changes." A periodic reconciliation pass, meaning a slower, lower-frequency full comparison between the state table and the actual object inventory, is the necessary backstop against this kind of silent drift. Without it, a delta-scanning architecture can degrade gradually and invisibly until a security review discovers objects that haven't actually been classified in months despite the dashboard showing continuous coverage.
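A reconciliation pass can be sketched as a three-way comparison. Both inputs are plain dictionaries here; in practice the inventory side would come from a bucket listing or inventory report, and the category names are illustrative.

```python
from typing import Dict, List, NamedTuple


class Drift(NamedTuple):
    untracked: List[str]  # exist in storage, missing from the state table
    stale: List[str]      # tracked, but the recorded hash no longer matches
    orphaned: List[str]   # tracked, but no longer present in storage


def reconcile(inventory: Dict[str, str], state_table: Dict[str, str]) -> Drift:
    """Compare the actual object inventory against the state table.

    inventory:   {object key: hash of the object as it exists now}
    state_table: {object key: hash recorded by the event-driven pipeline}
    """
    inventory_keys = set(inventory)
    tracked_keys = set(state_table)
    return Drift(
        untracked=sorted(inventory_keys - tracked_keys),
        stale=sorted(
            key
            for key in inventory_keys & tracked_keys
            if inventory[key] != state_table[key]
        ),
        orphaned=sorted(tracked_keys - inventory_keys),
    )
```

Any non-empty untracked or stale list means events were missed somewhere upstream, which is the signal that separates "no changes happened" from "we stopped detecting changes."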

Where this fits

The landscape page names this gap directly: classifying data requires reading it, and reading terabytes of cloud storage generates real compute and egress charges that most procurement conversations don't budget for. Delta-scanning is the practitioner-level architecture that addresses it, rather than treating it as an unavoidable cost of doing DSPM at scale.