Yu Zhu

4th-year PhD Student

Biography

I am a fourth-year PhD student in Computer Science at ETH Zurich, working in the Systems Group under the supervision of Prof. Gustavo Alonso. My research focuses on ML systems and AI infrastructure, with an emphasis on storage, networking, and accelerator-based data pipelines. My recent work explores offloading prefix KV cache to S3-compatible object storage, with the aim of making long-lived LLM context reusable across serving nodes while relieving pressure on GPU HBM and local DRAM. I received my M.S. in Electrical Engineering and Information Technology from ETH Zurich in 2022 and my B.E. from Southeast University in 2019. Here is my resume.

Education

PhD in Computer Science, from 2022 to present (expected 2027)

ETH Zurich, Switzerland

MSc in Electrical Engineering and Information Technology, from 2019 to 2022

ETH Zurich, Switzerland

B.Eng. in Electronic Science and Technology, from 2015 to 2019

Southeast University, China

Blogs

Notes on systems, storage, and AI infrastructure.

ObjectCache and the KV Cache Memory Wall

June 2026

When I think about ObjectCache, I do not see it as a replacement for GPU memory or local CPU memory. What I have in mind is a three-tier KV cache management system: GPU HBM should remain the runtime tier for active KV cache during generation, local CPU DRAM should absorb short-lived KV cache offloads when reuse is expected soon, and ObjectCache should provide the object-storage tier for long-lived, reusable KV state.
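
To make the tiering concrete, here is a minimal sketch of a lookup that falls through the three tiers. It is an illustration only, not ObjectCache's implementation: the class name `TieredKVCache`, the plain dictionaries standing in for HBM, DRAM, and the object store, the promote-on-hit policy, and the `long_lived` flag are all assumptions, and capacity limits and eviction are left out.

```python
from typing import Dict, Optional, Tuple


class TieredKVCache:
    """Toy three-tier KV cache; each tier is a dict keyed by a block id (hypothetical)."""

    def __init__(self) -> None:
        self.hbm: Dict[str, bytes] = {}           # active runtime state
        self.dram: Dict[str, bytes] = {}          # short-lived offloads
        self.object_store: Dict[str, bytes] = {}  # long-lived, reusable KV state

    def offload(self, key: str, long_lived: bool) -> None:
        """Move a block out of HBM into DRAM or the object tier."""
        value = self.hbm.pop(key)
        target = self.object_store if long_lived else self.dram
        target[key] = value

    def lookup(self, key: str) -> Tuple[Optional[bytes], str]:
        """Search the tiers from fastest to slowest and report where the block was found."""
        tiers = (("hbm", self.hbm), ("dram", self.dram), ("object", self.object_store))
        for name, tier in tiers:
            if key in tier:
                value = tier[key]
                self.hbm[key] = value  # assumption: promote to HBM on every hit
                return value, name
        return None, "miss"
```

A block offloaded with `long_lived=True` is found in the object tier on the first lookup and in HBM on the next one, while the object-tier copy stays in place for other readers.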

The reason I care about object storage is not simply capacity. Prefix KV blocks are immutable after prefill, naturally content-addressable, and reusable across requests, sessions, users, and compute nodes. That makes an S3-compatible object interface a promising cloud-native abstraction for the KV cache footprint that is growing with long-context and agentic workloads. Instead of treating S3 as a cold archive, ObjectCache asks whether object storage can become part of the serving path.
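
One way to picture content addressing for prefix KV blocks is a hash chain over fixed-size token chunks, so that a chunk's key depends on every token before it. The sketch below is an assumption for illustration, not the key scheme ObjectCache uses: the chunk size, the `kv/` key prefix, the choice of SHA-256, and both function names are hypothetical.

```python
import hashlib
from typing import List, Sequence

CHUNK_TOKENS = 4  # hypothetical chunk size, kept tiny for readability


def prefix_chunk_keys(token_ids: Sequence[int], chunk_tokens: int = CHUNK_TOKENS) -> List[str]:
    """Return one object key per full chunk; each key covers the whole prefix up to that chunk."""
    keys: List[str] = []
    parent = b""  # digest of the previous chunk, empty for the first chunk
    full = len(token_ids) - len(token_ids) % chunk_tokens  # ignore a trailing partial chunk
    for start in range(0, full, chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        h = hashlib.sha256()
        h.update(parent)  # chaining makes the key depend on all earlier tokens
        h.update(b",".join(str(t).encode() for t in chunk))
        parent = h.digest()
        keys.append("kv/" + h.hexdigest())
    return keys


def shared_prefix_chunks(a: Sequence[str], b: Sequence[str]) -> int:
    """Count how many leading chunk keys two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Two requests that share their first eight tokens produce the same first two keys under this scheme, so either request, on any node, could reuse blocks the other one wrote. Because the blocks never change after prefill, a key match is enough to decide reuse.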

The key design idea is layerwise retrieval. An ordinary object-store request moves objects, but an inference engine consumes cached KV in transformer-layer order. ObjectCache keeps fine-grained, hash-addressed chunks for prefix reuse, then lets the storage side gather many chunks and deliver layer-major payloads in the order the GPU consumes them. If the next layer's transfer can be overlapped with current-layer GPU compute, object-storage round-trip latency can be hidden for long-context workloads.
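
The overlap described above can be sketched as a one-deep prefetch pipeline: while layer i is being computed, the transfer for layer i+1 is already in flight. This is a simplified illustration under stated assumptions, not the ObjectCache data path: `fetch_layer` and `compute_layer` are hypothetical callables standing in for the storage-side gather and the GPU work, and a single background thread stands in for the transfer engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

P = TypeVar("P")  # layer payload type
R = TypeVar("R")  # per-layer result type


def run_layerwise(
    num_layers: int,
    fetch_layer: Callable[[int], P],
    compute_layer: Callable[[int, P], R],
) -> List[R]:
    """Consume payloads in layer order while prefetching the next layer in the background."""
    results: List[R] = []
    if num_layers <= 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_layer, 0)  # the first fetch cannot be hidden
        for layer in range(num_layers):
            payload = pending.result()  # waits only if the transfer is slower than compute
            if layer + 1 < num_layers:
                pending = pool.submit(fetch_layer, layer + 1)  # overlaps with compute below
            results.append(compute_layer(layer, payload))
    return results
```

In this model, a layer's transfer is fully hidden whenever it takes no longer than the previous layer's compute. Only the first layer's fetch stays on the critical path.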

The broader memory-wall discussion around ObjectCache makes this direction feel even more important. Agentic workloads keep more state alive, reuse longer prefixes, and create KV cache footprints that no single GPU HBM tier can economically retain. My view is that the durable, cloud-native answer should look like a hierarchy: HBM for hot runtime state, CPU DRAM for short offloads, and an S3-compatible ObjectCache layer for scalable, persistent KV reuse across the serving cluster.

arXiv PDF Related discussion

Publications

Google Scholar

Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso. "ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse". arXiv preprint arXiv:2605.22850, 2026.

MJ Heer, Benjamin Ramhorst, Yu Zhu, Luyang Liu, Zheyuan Hu, Jonas Dann, Gustavo Alonso. "RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs". arXiv preprint arXiv:2507.20412, 2025.

Yu Zhu, Wenqi Jiang, Gustavo Alonso. "Multi-Tenant SmartNICs for In-Network Preprocessing of Recommender Systems". arXiv preprint arXiv:2501.12032, 2025.

Yu Zhu, Aditya Dhakal, Pedro Bruel, Gourav Rattihalli, Yunming Xiao, Johann Lombardi, Dejan Milojicic. "An RDMA-First Object Storage System with SmartNIC Offload". Proceedings of the SC'25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025.

Zhenhao He, Dario Korolija, Yu Zhu, Benjamin Ramhorst, Tristan Laan, Lucian Petrica, Michaela Blott, Gustavo Alonso. "ACCL+: an FPGA-Based Collective Engine for Distributed Applications". 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2024), 2024.

Yu Zhu, Wenqi Jiang, Gustavo Alonso. "Efficient Tabular Data Preprocessing of ML Pipelines". arXiv preprint arXiv:2409.14912, 2024.

Wenqi Jiang, Shigang Li, Yu Zhu, Johannes de Fine Licht, Zhenhao He, Runbin Shi, Cedric Renggli, Shuotao Zhang, Gustavo Alonso. "Co-design Hardware and Algorithm for Vector Search". Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2023), 2023.

Yu Zhu, Zhenhao He, Wenqi Jiang, Kai Zeng, Jingren Zhou, Gustavo Alonso. "Distributed Recommendation Inference on FPGA Clusters". International Conference on Field-Programmable Logic and Applications (FPL 2021), 2021.
