Blogs

Notes on systems, storage, and AI infrastructure.

ObjectCache and the KV Cache Memory Wall

June 2026

When I think about ObjectCache, I do not see it as a replacement for GPU memory or local CPU memory. What I expect is a three-tier KV cache management system: GPU HBM remains the runtime tier for active KV cache during generation, local CPU DRAM absorbs short-lived KV cache offloads that are likely to be reused soon, and ObjectCache provides the object-storage tier for long-lived, reusable KV state.
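To make the three tiers concrete, here is a minimal sketch of how a lookup could fall through them. It is not ObjectCache's actual implementation: the class name, the LRU eviction policy, the write-through `put`, and the dictionary standing in for an S3-compatible bucket are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier KV block cache: HBM -> DRAM -> object store (all hypothetical)."""

    def __init__(self, hbm_slots, dram_slots):
        self.hbm = OrderedDict()    # hot runtime tier, LRU-ordered
        self.dram = OrderedDict()   # short-lived offload tier, LRU-ordered
        self.object_store = {}      # stand-in for an S3-compatible bucket
        self.hbm_slots = hbm_slots
        self.dram_slots = dram_slots

    def put(self, key, block):
        # Assumption: write-through. Blocks are immutable after prefill,
        # so the object tier can always serve as the durable copy.
        self.object_store[key] = block
        self._admit_hbm(key, block)

    def get(self, key):
        """Return (block, tier_name); promote the block to HBM on a lower-tier hit."""
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key], "hbm"
        if key in self.dram:
            block = self.dram.pop(key)
            self._admit_hbm(key, block)
            return block, "dram"
        if key in self.object_store:
            block = self.object_store[key]
            self._admit_hbm(key, block)
            return block, "object"
        return None, "miss"

    def _admit_hbm(self, key, block):
        self.hbm[key] = block
        self.hbm.move_to_end(key)
        while len(self.hbm) > self.hbm_slots:
            old_key, old_block = self.hbm.popitem(last=False)
            self._admit_dram(old_key, old_block)  # demote the coldest HBM block

    def _admit_dram(self, key, block):
        self.dram[key] = block
        self.dram.move_to_end(key)
        while len(self.dram) > self.dram_slots:
            self.dram.popitem(last=False)  # dropped here, still durable in the object tier


cache = TieredKVCache(hbm_slots=1, dram_slots=1)
for name in ("a", "b", "c"):
    cache.put(name, name.upper().encode())
print(cache.get("c")[1], cache.get("b")[1], cache.get("a")[1])
```

With one slot per tier, the three lookups are served from HBM, DRAM, and the object tier respectively, which is the fall-through behavior the hierarchy is meant to provide.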

The reason I care about object storage is not simply capacity. Prefix KV blocks are immutable after prefill, naturally content-addressable, and reusable across requests, sessions, users, and compute nodes. That makes an S3-compatible object interface a promising cloud-native abstraction for the KV cache footprint that is growing with long-context and agentic workloads. Instead of treating S3 as a cold archive, ObjectCache asks whether object storage can become part of the serving path.
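One way to picture content addressing for prefix KV blocks is a hash chain over token chunks, where each chunk's key commits to everything before it. The sketch below assumes SHA-256, a tiny chunk size, and hypothetical function names; the post does not specify ObjectCache's actual key scheme.

```python
import hashlib
import struct

CHUNK_TOKENS = 4  # hypothetical chunk size, kept small for readability


def chunk_keys(token_ids, chunk_tokens=CHUNK_TOKENS):
    """Return one hex key per full chunk; each key depends on the entire prefix."""
    keys = []
    parent = b""
    full = len(token_ids) - len(token_ids) % chunk_tokens  # ignore a trailing partial chunk
    for start in range(0, full, chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        # Chain the parent digest so identical tokens under a different prefix get a new key.
        payload = parent + struct.pack(f"<{len(chunk)}I", *chunk)
        digest = hashlib.sha256(payload).digest()
        keys.append(digest.hex())
        parent = digest
    return keys


def shared_prefix_chunks(keys_a, keys_b):
    """Count how many leading chunks two requests could reuse from each other."""
    count = 0
    for a, b in zip(keys_a, keys_b):
        if a != b:
            break
        count += 1
    return count


request_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
request_b = [1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]
print(shared_prefix_chunks(chunk_keys(request_a), chunk_keys(request_b)))
```

Because the key is derived only from token content, any node that computes the same prefix arrives at the same object name, which is what makes the blocks shareable across requests, sessions, and compute nodes without a central index of who wrote what.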

The key design idea is layerwise retrieval. An ordinary object-store request moves objects, but an inference engine consumes cached KV in transformer-layer order. ObjectCache keeps fine-grained, hash-addressed chunks for prefix reuse, then lets the storage side gather many chunks and deliver layer-major payloads in the order the GPU consumes them. If the next layer's transfer can be overlapped with current-layer GPU compute, object-storage round-trip latency can be hidden for long-context workloads.
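A small model can illustrate both halves of this idea: regrouping per-chunk data into layer-major payloads, and the overlap of transfer with compute. Everything here is a simplification under stated assumptions: the function names are hypothetical, transfers are serialized on a single link, prefetch buffering is unlimited, and the per-layer times are made-up numbers rather than measurements.

```python
def layer_major(chunks, num_layers):
    """chunks[c][l] holds the KV bytes of layer l for chunk c.

    Yields (layer, payload) in layer order, concatenating across all chunks,
    so the consumer receives data in the order the GPU would use it.
    """
    for layer in range(num_layers):
        yield layer, b"".join(chunk[layer] for chunk in chunks)


def serial_time(fetch, compute):
    """Fetch everything first, then compute: no overlap at all."""
    return sum(fetch) + sum(compute)


def pipelined_time(fetch, compute):
    """Layer i computes once its payload has arrived and layer i-1 has finished."""
    fetch_done = 0.0
    compute_done = 0.0
    for f, c in zip(fetch, compute):
        fetch_done += f  # assumption: one link, transfers back to back
        compute_done = max(compute_done, fetch_done) + c
    return compute_done


chunks = [[b"a0", b"a1"], [b"b0", b"b1"]]  # two chunks, two layers each
print(list(layer_major(chunks, num_layers=2)))

fetch = [2, 2, 2, 2]    # hypothetical per-layer transfer times
compute = [3, 3, 3, 3]  # hypothetical per-layer GPU compute times
print(serial_time(fetch, compute), pipelined_time(fetch, compute))
```

In this model, when each layer's compute takes at least as long as the next layer's transfer, only the first layer's fetch remains exposed; when transfer is slower than compute, the total is bounded by the link instead.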

The broader memory-wall discussion makes this direction feel even more important. Agentic workloads keep more state alive, reuse longer prefixes, and create KV cache footprints that GPU HBM alone cannot economically retain. My view is that the durable, cloud-native answer should look like a hierarchy: HBM for hot runtime state, CPU DRAM for short-lived offloads, and an S3-compatible ObjectCache layer for scalable, persistent KV reuse across the serving cluster.

arXiv PDF Related discussion

Publications

Google Scholar

"ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse". arXiv preprint arXiv:2605.22850, 2026.
"RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs". arXiv preprint arXiv:2507.20412, 2025.
"Multi-Tenant SmartNICs for In-Network Preprocessing of Recommender Systems". arXiv preprint arXiv:2501.12032, 2025.
"An RDMA-First Object Storage System with SmartNIC Offload". Proceedings of the SC'25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025.
"ACCL+: an FPGA-Based Collective Engine for Distributed Applications". 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2024), 2024.
"Efficient Tabular Data Preprocessing of ML Pipelines". arXiv preprint arXiv:2409.14912, 2024.
"Co-design Hardware and Algorithm for Vector Search". Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2023), 2023.
"Distributed Recommendation Inference on FPGA Clusters". International Conference on Field-Programmable Logic and Applications (FPL 2021), 2021.
