Background
The client runs a large-scale manufacturing operation — multiple plants, hundreds of sensors, thousands of data points flowing off the factory floor every minute. Equipment telemetry, production line throughput, defect rates, temperature readings, cycle times. The kind of data that, if you can actually query it in real time, gives operations managers a meaningful edge.
Their data engineering team had built exactly that. Kinesis Data Streams was ingesting the raw sensor feed. Firehose was landing it in S3 as JSON. Athena was sitting on top of that S3 data, and QuickSight dashboards were querying Athena to give plant managers live visibility into what was happening on the floor. The dashboards were set to the fastest available refresh rate — as close to real time as QuickSight supports.
It was genuinely impressive work. The kind of architecture that gets presented at re:Invent.
It was also generating an Athena bill that had gone from approximately $4,800/month at baseline to just under $47,000 in a single billing cycle — a spike the finance team flagged with the kind of urgency usually reserved for production outages.
Nobody on the data team had seen it coming. From their perspective, the pipeline was working perfectly.
The Problem
What $47,000 of Athena Looks Like in Practice
Athena charges $5 per terabyte of data scanned. That's the entire pricing model — you don't pay for compute time, you don't pay per query, you pay for how much data each query touches in S3. It's an elegantly simple pricing structure that becomes an extremely expensive one if your queries are scanning more data than they need to.
When the Athena bill hit $47,000, that implied roughly 9,400 terabytes of data scanned in a single month. The team estimated the monthly raw ingest at around 800GB, and the full dataset at under 10TB. Something was scanning vastly more data than actually existed, and doing it repeatedly.
The QuickSight dashboard refresh rate was the first thing to look at. Set to one-minute intervals, the dashboards were triggering Athena queries every 60 seconds, around the clock, seven days a week. That's approximately 43,800 query executions per month, per dashboard. They had six dashboards running.
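Both figures fall straight out of arithmetic. A quick sanity check, using the approximate numbers above:

```python
ATHENA_PRICE_PER_TB = 5.00  # Athena's published price per TB of data scanned

# Implied scan volume behind the bill
monthly_bill = 47_000
implied_tb_scanned = monthly_bill / ATHENA_PRICE_PER_TB
print(implied_tb_scanned, "TB scanned in one month")

# Query volume at a 1-minute refresh, running around the clock
queries_per_dashboard = 60 * 24 * 365 // 12  # per month
dashboards = 6
total_queries = queries_per_dashboard * dashboards
print(queries_per_dashboard, "queries/month per dashboard,",
      total_queries, "total")
```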
Query Volume — Monthly Estimate at 1-Minute Refresh

| Metric | Value |
| --- | --- |
| Dashboards | 6 |
| Refresh interval | 1 minute |
| Queries per dashboard per month | ~43,800 |
| Total queries per month | ~262,800 |

* Figures anonymized. Per-query scan sizes derived from Athena query execution logs.
The query volume alone wasn't the issue: roughly 262,800 queries per month across six dashboards is a lot, but not unreasonable for a production analytics system. The issue was what each query was scanning.
Two Compounding Problems
Every query was scanning hundreds of megabytes when it should have been scanning a fraction of that. Two architectural decisions, made innocently at build time, were responsible.
Problem 1: Raw JSON/CSV landing directly in S3, no format conversion.
Kinesis Firehose was delivering raw data off the sensor feed directly into S3 — uncompressed JSON files, exactly as they arrived from the factory floor. JSON is a row-oriented format. When Athena queries a JSON file, it has to read the entire file to find the columns it needs, even if the query only touches two or three fields out of fifty. For a dashboard query like SELECT machine_id, throughput_rate, timestamp FROM sensor_data WHERE plant = 'Plant_A', Athena was reading every single field on every single record — temperature, pressure, humidity, error codes, calibration offsets, all of it — just to return three columns.
Columnar formats like Apache Parquet work the opposite way: Athena reads only the columns referenced in the query. The difference in data scanned between JSON and Parquet for a typical dashboard query is not marginal — it's often 85–95% less data touched.
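The effect is easy to simulate without any AWS services. A minimal sketch, with hypothetical records whose field names echo the article's examples: a row layout forces every field of every record through the scanner, while a columnar layout touches only the columns the query asks for.

```python
import json

# Hypothetical sensor records with 50 fields each, as Firehose would land them
records = [
    {"machine_id": f"M{i}", "throughput_rate": 1.0, "timestamp": i,
     **{f"extra_field_{j}": j for j in range(47)}}
    for i in range(100)
]

# Row-oriented (JSON lines): every byte of every record is read and parsed
row_bytes = sum(len(json.dumps(r)) for r in records)

# Columnar: the same data stored column-by-column; a query touching three
# columns only reads those three
columns = {k: [r[k] for r in records] for k in records[0]}
wanted = ["machine_id", "throughput_rate", "timestamp"]
col_bytes = sum(len(json.dumps(columns[k])) for k in wanted)

print(f"row scan: {row_bytes} bytes, columnar scan: {col_bytes} bytes")
assert col_bytes < row_bytes * 0.15  # well over 85% less data touched
```

Real Parquet adds compression and encoding on top of this, which is why the observed savings often exceed what column pruning alone would predict.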
Problem 2: No partitioning on the S3 data.
The Firehose delivery was landing files in a flat S3 prefix structure — essentially a single bucket of time-stamped files with no logical organisation by plant, date, machine type, or any other query-relevant dimension. Every Athena query had to scan the entire dataset to find the records it needed.
The sensor data at this point was accumulating at roughly 800GB per month — twelve months of history meant the Athena table was backed by approximately 9.6TB of raw data. A dashboard query asking for the last hour of production line data was scanning all 9.6TB to find the ~200MB of relevant records.
Why It Took a Year to Surface
The architecture had been running for about a year at manageable — if wasteful — cost levels. The bill spiked when the data team added three new dashboards in the same month and expanded historical coverage from 6 months to 12. That doubled both the query volume and the per-query scan size simultaneously. A slow burn became a fire.
The Question the Data Team Actually Asked
When we presented the diagnosis, the data team's immediate response was reasonable and technically sharp: "Fine, we understand Parquet and partitioning. But our data is coming in off a live sensor feed via Kinesis. We can't stop the stream to convert it. How do you convert data that's arriving in real time?"
This is the question worth answering in detail, because it's the one most teams get stuck on.
Analysis
Before designing the solution, we needed to understand the data shape and query patterns precisely. Optimising an Athena pipeline without knowing how it gets queried is just expensive guesswork.
We pulled 30 days of Athena query execution logs and analysed every query the dashboards were running:
Query Pattern Analysis — 30-Day Sample

| What we measured | Finding | What it told us |
| --- | --- | --- |
| Filter dimension used in WHERE clause | 96% of queries | Nearly every query filtered by plant_id, line_id, or both. These were the obvious partition candidates. |
| Time window of data requested | Last 1–24 hours | No dashboard query looked back more than 24 hours. All 9.6TB of historical data was being scanned every minute for queries that only needed the last few hours. |
| Columns actually used per query | Avg 4.2 of 52 fields | The raw JSON schema had 52 fields. Dashboard queries used an average of 4. In JSON format, Athena was scanning all 52 every time. Parquet would reduce that to exactly 4. |
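The analysis itself doesn't need to be sophisticated. A sketch of the approach, with a few hypothetical log entries standing in for the real 30-day sample (in practice the SQL text and per-query scan statistics come from Athena's GetQueryExecution API):

```python
import re

# Hypothetical entries standing in for 30 days of Athena query-execution logs
query_log = [
    "SELECT machine_id, throughput_rate FROM sensor_data WHERE plant_id = 'Plant_A'",
    "SELECT defect_count FROM sensor_data WHERE line_id = 'Line_03'",
    "SELECT cycle_time_ms FROM sensor_data WHERE plant_id = 'Plant_B' AND line_id = 'Line_01'",
    "SELECT temperature_c FROM sensor_data",  # rare query with no useful filter
]

def filters_on(sql: str, column: str) -> bool:
    """True if the query's WHERE clause references the given column."""
    match = re.search(r"\bWHERE\b(.*)", sql, re.IGNORECASE | re.DOTALL)
    return bool(match) and column in match.group(1)

candidates = ["plant_id", "line_id"]
hits = sum(any(filters_on(q, c) for c in candidates) for q in query_log)
print(f"{hits}/{len(query_log)} queries filter on a partition candidate")
```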
The numbers made the solution obvious. Partition by plant_id, line_id, and date. Convert to Parquet. The live-streaming constraint was the only real engineering challenge.
Solution
Answering the Real-Time Question
The data team's concern — how do you convert a live stream to Parquet? — has a clean answer that lives inside the existing Kinesis architecture. No new services required, no pipeline rebuilds, no batch jobs running overnight.
Amazon Kinesis Data Firehose has native Parquet conversion and dynamic partitioning built in.
This is a feature that many teams don't know exists because Firehose is usually introduced as a delivery mechanism — a pipe that moves data from a stream to S3. It's actually much more than that. Firehose can transform the format of every record in flight, converting JSON to Parquet before the file ever lands in S3. It can also dynamically partition the output based on fields within the data itself — routing records to S3 prefixes like plant_id=Plant_A/line_id=Line_03/date=2025-01-15/ as they stream through.
The result: every file that lands in S3 is already Parquet, already partitioned. Athena doesn't need to scan anything it doesn't need to.
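In Firehose this routing is configuration rather than code, but the mapping it performs is simple to state. A sketch, with field names mirroring the article's examples; in the real stream, Firehose's JQ engine evaluates equivalent expressions in-flight:

```python
def partition_prefix(record: dict) -> str:
    """Hive-style S3 prefix a record would be routed to under dynamic
    partitioning on plant_id, line_id, and event date."""
    return (
        f"plant_id={record['plant_id']}/"
        f"line_id={record['line_id']}/"
        f"date={record['event_timestamp'][:10]}/"  # ISO timestamp -> YYYY-MM-DD
    )

record = {"plant_id": "Plant_A", "line_id": "Line_03",
          "event_timestamp": "2025-01-15T08:42:00Z"}
print(partition_prefix(record))
```

Because the prefix is Hive-style (`key=value/`), Athena can map each path segment directly to a partition key and skip every prefix that doesn't match the query's filters.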
Architecture — Before vs After
Before

Factory Sensors
→ Kinesis Data Streams
→ Firehose → S3 (raw JSON, flat prefix; no format conversion, no partitioning)
→ Athena (full table scan on every query; up to ~9.6 TB scanned)
→ QuickSight (1-min refresh)

After

Factory Sensors
→ Kinesis Data Streams
→ Firehose → S3 (Parquet + dynamic partitioning; in-flight conversion into plant_id / line_id / date prefixes)
→ Athena (partition + column pruning; ~18 MB scanned per query, a reduction of more than 99.8%)
→ QuickSight (1-min refresh, unchanged)
Implementation
The changes broke down into four parts.
Part 1: Define the schema in AWS Glue Data Catalog
Firehose's Parquet conversion needs to know the schema upfront — it can't infer it from the JSON on the fly. We created a Glue table with the sensor data schema, with plant_id, line_id, and event_date defined as partition keys.
CREATE EXTERNAL TABLE sensor_data_parquet (
  machine_id      STRING,
  timestamp_utc   TIMESTAMP,
  throughput_rate DOUBLE,
  defect_count    INT,
  cycle_time_ms   INT,
  temperature_c   DOUBLE,
  pressure_bar    DOUBLE,
  power_kw        DOUBLE
  -- remaining fields...
)
PARTITIONED BY (
  plant_id   STRING,
  line_id    STRING,
  event_date DATE
)
STORED AS PARQUET
LOCATION 's3://manufacturing-analytics/sensor-data-parquet/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
Part 2: Reconfigure the Firehose delivery stream
In the existing Firehose delivery stream configuration, we made two changes:
- Record format conversion: Enabled, output format set to Apache Parquet, Glue table reference pointed to the schema above
- Dynamic partitioning: Enabled, with the S3 prefix configured to extract plant_id and line_id from the incoming JSON records via JQ expressions; event_date comes from Firehose's !{timestamp} namespace (the record's ingestion time), not from the payload
S3 prefix: sensor-data-parquet/plant_id=!{partitionKeyFromQuery:plant_id}/line_id=!{partitionKeyFromQuery:line_id}/event_date=!{timestamp:yyyy-MM-dd}/
Buffer configuration set to 64MB or 300 seconds (whichever comes first) — large enough to produce reasonably sized Parquet files rather than thousands of tiny ones, which is its own Athena performance problem.
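For concreteness, here is roughly what those two changes look like as a boto3 configuration payload. This is a simplified sketch: the ARNs, bucket, Glue database, and role names are placeholders, and a production stream also needs an error-output prefix, IAM permissions, and dynamic-partitioning retry options.

```python
# Sketch of the relevant parts of a Firehose ExtendedS3DestinationConfiguration.
# All ARNs and names below are placeholders.
extended_s3_config = {
    "BucketARN": "arn:aws:s3:::manufacturing-analytics",
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "Prefix": (
        "sensor-data-parquet/"
        "plant_id=!{partitionKeyFromQuery:plant_id}/"
        "line_id=!{partitionKeyFromQuery:line_id}/"
        "event_date=!{timestamp:yyyy-MM-dd}/"
    ),
    "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
    "DynamicPartitioningConfiguration": {"Enabled": True},
    "ProcessingConfiguration": {
        "Enabled": True,
        "Processors": [{
            # MetadataExtraction pulls the partition keys out of each record
            "Type": "MetadataExtraction",
            "Parameters": [
                {"ParameterName": "MetadataExtractionQuery",
                 "ParameterValue": "{plant_id: .plant_id, line_id: .line_id}"},
                {"ParameterName": "JsonParsingEngine",
                 "ParameterValue": "JQ-1.6"},
            ],
        }],
    },
    "DataFormatConversionConfiguration": {
        "Enabled": True,
        "OutputFormatConfiguration": {
            "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}},
        "SchemaConfiguration": {
            # Points Firehose at the Glue table defined in Part 1
            "DatabaseName": "manufacturing",
            "TableName": "sensor_data_parquet",
            "Region": "us-east-1",
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-glue-role",
        },
    },
}
```

Passed as the ExtendedS3DestinationConfiguration in a create_delivery_stream call (or the corresponding update), this enables both the format conversion and the partitioned routing in one place.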
Part 3: Rewrite historical data
The existing 9.6TB of raw JSON needed to be converted. We ran a one-time AWS Glue ETL job that read the raw JSON from S3, converted it to Parquet with SNAPPY compression, and wrote it to the new partitioned prefix structure. The Glue job ran over a weekend — approximately 14 hours for the full historical dataset.
Once the historical backfill was complete, the old raw JSON prefix was archived to S3 Glacier Instant Retrieval (retained for compliance, accessed essentially never).
Part 4: Update Athena queries with partition filters
The existing Athena queries didn't include WHERE plant_id = ... or WHERE event_date >= ... clauses — they didn't need to, because the data wasn't partitioned. Adding those filters was a 10-minute job but the single most important change for query cost. Athena's partition pruning only works if the query actually specifies the partition key.
-- Before: scans entire table
SELECT machine_id, throughput_rate, timestamp_utc
FROM sensor_data
WHERE line_id = 'Line_03'
ORDER BY timestamp_utc DESC
LIMIT 1000;
-- After: scans only matching partition (~18 MB vs ~9.6 TB)
SELECT machine_id, throughput_rate, timestamp_utc
FROM sensor_data_parquet
WHERE plant_id = 'Plant_A'
AND line_id = 'Line_03'
AND event_date >= DATE_ADD('day', -1, CURRENT_DATE)
ORDER BY timestamp_utc DESC
LIMIT 1000;
Outcome
The dashboard refresh rate stayed at one minute. The operations managers kept their real-time visibility. The data team didn't have to change their tooling, their workflow, or anything about how they build and publish dashboards. The only thing that changed was what happened inside the pipeline between Kinesis and S3 — and how the Athena queries specified their filters.
Beyond the direct cost reduction, query execution time dropped from an average of 14 seconds per query to under 2 seconds. The dashboards were faster, not just cheaper.
Lessons Learned
1. Athena's pricing model punishes the path of least resistance. Landing raw JSON directly in S3 and querying it with Athena is the easiest way to get a pipeline working. It's also one of the most expensive ways to run it at scale. The effort to add Parquet conversion and partitioning at build time is small — the cost of retrofitting it after the bill has already exploded is much larger.
2. Kinesis Firehose is not just a pipe. Most teams use Firehose as a delivery mechanism and nothing more. The native Parquet conversion and dynamic partitioning features eliminate the most common reason teams build separate ETL layers between their stream and their analytics queries. If your architecture has a Glue or Lambda job converting raw S3 files to a better format, that step can probably live inside Firehose instead.
3. The real-time constraint doesn't prevent format conversion — it just changes where it happens. The data team's instinct was to treat "live data" and "Parquet" as incompatible requirements. They're not. Firehose converts in-flight, and the conversion adds latency measured in milliseconds, not minutes. The data still lands in S3 within the configured buffer window (in this case, within 5 minutes at most). For a dashboard refreshing every minute, that latency is irrelevant.
4. Partition keys should be chosen based on query patterns, not data structure. The temptation is to partition by whatever fields seem natural in the data — in this case, that might have been machine_id or sensor_type. The right answer was plant_id and line_id because those were what 96% of queries actually filtered on. Partitioning by a key that doesn't appear in WHERE clauses does nothing.
5. Query costs compound with dashboard refresh rates. A single poorly-optimised query running once is a nuisance. The same query running every minute, across six dashboards, 24 hours a day, is a $47,000 monthly bill. Before setting any dashboard to its maximum refresh rate, it's worth calculating the implied monthly query cost and confirming the architecture can support it efficiently.
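That calculation is worth making routine before any dashboard goes to a one-minute refresh. A small helper, assuming Athena's $5/TB price; the per-query scan sizes passed in below are illustrative, and in practice would come from your own query execution stats:

```python
ATHENA_USD_PER_TB = 5.00
HOURS_PER_MONTH = 730  # average

def implied_monthly_cost(refresh_minutes: float, dashboards: int,
                         scan_gb_per_query: float) -> float:
    """Rough monthly Athena cost for dashboards polling at a fixed refresh rate."""
    queries = (HOURS_PER_MONTH * 60 / refresh_minutes) * dashboards
    return queries * (scan_gb_per_query / 1024) * ATHENA_USD_PER_TB

# Six dashboards at a 1-minute refresh: hypothetical unoptimised scans of
# tens of GB per query vs. pruned scans of ~18 MB per query
print(round(implied_monthly_cost(1, 6, 35.0)), "USD/month unoptimised")
print(round(implied_monthly_cost(1, 6, 0.018)), "USD/month pruned")
```

Plugging in scan sizes of those two orders of magnitude brackets the problem neatly: tens of thousands of dollars a month on one side, tens of dollars on the other.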
Running Athena on top of a streaming data pipeline and not sure whether your query costs are where they should be? Get in touch — this is usually a faster audit than teams expect.