I spent three weeks debugging an analytics pipeline that calculated average bearing temperatures for a fleet of crushers at a mining site. The numbers were wrong — not dramatically wrong, but consistently off by 2–4 degrees compared to what the process engineers said they should be. The SQL was correct. The data warehouse was correct. The pipeline was correct.
The PI Historian was lying to us.
What swinging door compression actually does
PI stores time-series data using an algorithm called Swinging Door Trending. The idea is elegant: instead of storing every value from a sensor polling every second, it stores only the points that meaningfully change. If a temperature sensor reads 85.2, 85.3, 85.2, 85.1, 85.2 for ten minutes, PI stores the first and last value and discards the rest, because you can draw a straight line between them without losing meaningful information.
The storage savings are enormous — compression ratios of 10:1 to 100:1 are normal. For a mining operation with 50,000 tags across crushers, conveyors, mills, and pumps, that’s the difference between terabytes and gigabytes.
The problem is that “meaningfully change” is controlled by two settings that most sites configure once and never revisit: Exception Deviation and Compression Deviation.
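To make the idea concrete, here is a minimal Python sketch of a swinging-door style filter. It is my own simplified illustration, not PI’s actual implementation: the function name `swinging_door`, the `(timestamp, value)` tuple format, and the slope-window formulation are all assumptions, and it ignores any time-based limits a real historian applies. It also assumes strictly increasing timestamps.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (timestamp in seconds, value)


def swinging_door(points: List[Point], comp_dev: float) -> List[Point]:
    """Keep only the points needed so that a straight line between kept
    points stays within comp_dev of every discarded reading."""
    if len(points) <= 2:
        return list(points)

    archived = [points[0]]
    anchor_t, anchor_v = points[0]
    slope_max = float("inf")   # upper "door"
    slope_min = float("-inf")  # lower "door"
    prev = points[0]

    for t, v in points[1:]:
        dt = t - anchor_t
        # Range of slopes from the anchor that pass within comp_dev of (t, v)
        upper = (v + comp_dev - anchor_v) / dt
        lower = (v - comp_dev - anchor_v) / dt
        new_max = min(slope_max, upper)
        new_min = max(slope_min, lower)

        if new_min > new_max:
            # No single line fits any more: archive the previous reading
            # and restart the doors from it.
            archived.append(prev)
            anchor_t, anchor_v = prev
            dt = t - anchor_t
            slope_max = (v + comp_dev - anchor_v) / dt
            slope_min = (v - comp_dev - anchor_v) / dt
        else:
            slope_max, slope_min = new_max, new_min
        prev = (t, v)

    archived.append(points[-1])  # always keep the most recent reading
    return archived


# The near-flat readings from the example above, one per second
flat = [(0.0, 85.2), (1.0, 85.3), (2.0, 85.2), (3.0, 85.1), (4.0, 85.2)]
print(swinging_door(flat, comp_dev=0.2))
```

With a `comp_dev` of 0.2, the flat run collapses to its first and last points. A reading that jumps well outside the band forces the points around it into the archive, which is the behaviour the two deviation settings are tuning.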
How it breaks your analytics
Exception Deviation controls what reaches the snapshot, the real-time current value. If a new reading is within the exception deviation of the last value that was sent, it’s discarded before it ever gets near the archive. Compression Deviation then controls which of the surviving values are written to the archive for historical queries.
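The exception step can be pictured as a simple dead band. The sketch below is a deliberately simplified illustration, not the real interface logic: `exception_filter` and the `(timestamp, value)` tuples are hypothetical names, and it leaves out the time-based limits and any extra values a real interface may forward alongside an exception.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (timestamp in seconds, value)


def exception_filter(points: Iterable[Point], exc_dev: float) -> List[Point]:
    """Pass a reading on only if it differs from the last passed value
    by more than exc_dev; everything else is dropped."""
    passed: List[Point] = []
    for t, v in points:
        if not passed or abs(v - passed[-1][1]) > exc_dev:
            passed.append((t, v))
    return passed


readings = [(0.0, 85.0), (1.0, 85.4), (2.0, 85.9), (3.0, 86.2), (4.0, 85.3)]
print(exception_filter(readings, exc_dev=1.0))
```

With an `exc_dev` of 1.0, only the first reading and the 86.2 reading get through. The three readings in between never reach the compression step at all.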
Here’s the gotcha that took me three weeks to find: PI’s compressed data gives you interpolated values that are accurate to within the compression deviation at any single point in time, but it can give misleading results for aggregate calculations over time ranges.
When you ask PI “what was the average temperature between 8am and 4pm?”, it doesn’t average the raw readings — those were discarded. It reconstructs a line between the stored points and calculates the average of that line. If the compression was aggressive (wide deviation bands), the line misses the peaks and valleys that actually occurred. Your average is the average of the compression algorithm’s approximation, not the average of what the sensor actually measured.
For a sensor with a true range of 82–88°C over a shift, aggressive compression might store only 83 and 87. The interpolated average is 85. The actual average, calculated from all readings, might be 83.8 — because the sensor spent most of its time in the lower range with brief spikes. The distribution information is gone.
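The arithmetic in that example is easy to reproduce. The readings below are invented purely for illustration (ten evenly spaced values that spend most of the shift near 83°C and spike late), and I’m assuming the archive kept only the first and last point. `time_weighted_average` is a hypothetical helper that averages the straight line between stored points using the trapezoid rule.

```python
from statistics import mean
from typing import List, Tuple

Point = Tuple[float, float]  # (hours since shift start, deg C)


def time_weighted_average(points: List[Point]) -> float:
    """Average of the piecewise-linear line through the given points."""
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        area += (v0 + v1) / 2 * (t1 - t0)  # trapezoid for each segment
    return area / (points[-1][0] - points[0][0])


# Invented readings: mostly low, with a brief spike near the end
raw_values = [83, 82, 83, 84, 83, 82, 83, 83, 88, 87]
raw = [(float(t), float(v)) for t, v in enumerate(raw_values)]

# Assumption: aggressive compression kept only the first and last reading
archived = [raw[0], raw[-1]]

raw_mean = mean(v for _, v in raw)
archive_avg = time_weighted_average(archived)
print(f"raw mean: {raw_mean:.1f}, average of archived line: {archive_avg:.1f}")
# prints: raw mean: 83.8, average of archived line: 85.0
```

The archived line runs from 83 to 87, so its average is 85 no matter where the sensor actually spent its time.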
The specific settings that cause this
Two values on every PI point control the damage:
ExcDev (Exception Deviation) is set in engineering units. For a temperature tag with a span of 0–200°C, an ExcDev of 1.0 means readings within 1°C of the last sent value are discarded. That sounds reasonable until you realise it means all the micro-variations that define the true distribution are lost.
CompDev (Compression Deviation) applies the swinging door algorithm to the exception-filtered values. If CompDev is also 1.0, the two tolerances can stack: in the worst case you’ve effectively created a dead band of up to 2°C, and any variation smaller than that can be invisible to historical queries.
The rule of thumb I now use: ExcDev should be half the instrument’s accuracy, and CompDev should be half the ExcDev. For a temperature sensor accurate to ±0.5°C, that’s ExcDev = 0.25 and CompDev = 0.125. Most sites I’ve seen have these set to 1.0 or higher because the defaults were never tuned.
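That rule of thumb is simple enough to turn into a quick check. The sketch below is only an illustration of the arithmetic: `TagSettings`, `suggest_settings`, `tags_to_review`, and the tag names are all hypothetical, and the configured values would need to come from whatever export of point attributes you have.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TagSettings:
    exc_dev: float   # exception deviation, engineering units
    comp_dev: float  # compression deviation, engineering units


def suggest_settings(instrument_accuracy: float) -> TagSettings:
    """Rule of thumb from the text: ExcDev = accuracy / 2, CompDev = ExcDev / 2."""
    exc_dev = instrument_accuracy / 2
    return TagSettings(exc_dev=exc_dev, comp_dev=exc_dev / 2)


def tags_to_review(
    configured: Dict[str, TagSettings], accuracy: Dict[str, float]
) -> List[str]:
    """Return tags whose configured deviations are wider than the rule of thumb."""
    flagged = []
    for tag, current in configured.items():
        target = suggest_settings(accuracy[tag])
        if current.exc_dev > target.exc_dev or current.comp_dev > target.comp_dev:
            flagged.append(tag)
    return flagged


# Hypothetical tags, both on sensors accurate to +/-0.5 deg C
configured = {
    "CRUSHER01.BRG.TEMP": TagSettings(exc_dev=1.0, comp_dev=1.0),
    "CRUSHER02.BRG.TEMP": TagSettings(exc_dev=0.25, comp_dev=0.125),
}
accuracy = {"CRUSHER01.BRG.TEMP": 0.5, "CRUSHER02.BRG.TEMP": 0.5}

print(suggest_settings(0.5))
print(tags_to_review(configured, accuracy))
```

Here the first tag is flagged because its 1.0 settings are four to eight times wider than the suggested 0.25 and 0.125.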
What this means for your data pipeline
If you’re extracting PI data into a data warehouse or analytics platform — which is increasingly common as mining operations build cloud data platforms — you need to understand that the data you’re getting is already lossy. The compression happened before your pipeline ever saw it.
Three things to check:
- Audit the compression settings for your critical tags. Use the PI SMT (System Management Tools) to review ExcDev and CompDev on the tags that feed your analytics. If they’re set to the interface defaults, they’re probably too aggressive for analytical use.
- Use PI’s summary functions, not raw queries. When you query PI for aggregate values, use the built-in summary functions (TagAvg, TagTot, TagMin, TagMax) rather than extracting raw values and aggregating yourself. The summary functions use a time-weighted calculation that accounts for the uneven spacing of compressed data. Your SQL average of extracted values doesn’t.
- Request recorded values, not interpolated values. PI’s API offers both. Recorded values are the actual stored points. Interpolated values are reconstructed from the compressed data. For analytics pipelines, recorded values with timestamps give you the real data, sparse as it is, rather than a smooth fiction.
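If you do end up aggregating recorded values yourself, the uneven spacing has to be accounted for. The sketch below contrasts a plain mean (what a SQL `AVG` over extracted rows gives you) with a time-weighted mean. The rows are invented, `time_weighted_mean` is a hypothetical helper, and it assumes linear interpolation between recorded points, which may not match how your tags are configured or how PI’s own summary functions compute their result.

```python
from datetime import datetime
from statistics import mean
from typing import List, Tuple

Row = Tuple[datetime, float]  # (timestamp, recorded value)


def time_weighted_mean(rows: List[Row]) -> float:
    """Mean of the straight-line signal through unevenly spaced rows."""
    area = 0.0
    for (t0, v0), (t1, v1) in zip(rows, rows[1:]):
        seconds = (t1 - t0).total_seconds()
        area += (v0 + v1) / 2 * seconds  # trapezoid for each segment
    total = (rows[-1][0] - rows[0][0]).total_seconds()
    return area / total


# Invented recorded values for an 8am-4pm shift: flat, with one brief spike
recorded = [
    (datetime(2024, 1, 1, 8, 0), 83.0),
    (datetime(2024, 1, 1, 14, 0), 83.0),
    (datetime(2024, 1, 1, 14, 30), 88.0),
    (datetime(2024, 1, 1, 15, 0), 83.0),
    (datetime(2024, 1, 1, 16, 0), 83.0),
]

naive = mean(v for _, v in recorded)
weighted = time_weighted_mean(recorded)
print(f"naive mean: {naive:.2f}, time-weighted mean: {weighted:.2f}")
# prints: naive mean: 84.00, time-weighted mean: 83.31
```

The spike is one row out of five, so the plain mean gives it a fifth of the weight even though it covers only a small part of the eight hours.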
The uncomfortable realisation
The data quality problem in your analytics pipeline might not be in your pipeline at all. It might be in the historian that’s been silently discarding data points for years. And unlike a pipeline bug, you can’t fix it retroactively — the discarded readings are gone.
For new tag configurations, tune the compression settings before the data starts flowing. For existing tags, audit, adjust, and accept that your historical analytics for those tags have a compression-shaped error margin baked in.
The historian was built for operations, not analytics. When you repurpose it as a data source for a modern analytics platform, you inherit its assumptions — and compression deviation is the biggest one nobody tells you about.