I am a fourth-year PhD student in Computer Science at ETH Zurich, working in the Systems Group under the supervision of Prof. Gustavo Alonso. My research focuses on ML systems and AI infrastructure, with an emphasis on storage, networking, and accelerator-based data pipelines. My recent work studies offloading prefix KV caches to S3-compatible object storage, with the aim of making long-lived LLM context reusable across serving nodes while relieving pressure on GPU HBM and local DRAM. I received my MSc in Electrical Engineering and Information Technology from ETH Zurich in 2022 and my B.Eng. in Electronic Science and Technology from Southeast University in 2019. Here is my resume.
PhD in Computer Science, from 2022 to present (expected 2027)
ETH Zurich, Switzerland
MSc in Electrical Engineering and Information Technology, from 2019 to 2022
ETH Zurich, Switzerland
B.Eng. in Electronic Science and Technology, from 2015 to 2019
Southeast University, China
Notes on systems, storage, and AI infrastructure.
June 2026
When I think about ObjectCache, I do not see it as a replacement for GPU memory or local CPU memory. What I expect is a three-tier KV cache management system: GPU HBM should remain the runtime tier for active KV cache during generation, local CPU DRAM should absorb short-lived KV cache offloads when reuse is expected soon, and ObjectCache should provide the object-storage tier for long-lived, reusable KV state.
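To make the three tiers concrete, here is a minimal placement sketch. It is purely illustrative and not ObjectCache's actual policy: the names `Tier`, `KVBlockInfo`, and `place`, the `expected_reuse_s` estimate, and the 30-second DRAM window are all assumptions made up for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HBM = "gpu_hbm"          # active KV cache during generation
    DRAM = "cpu_dram"        # short-lived offloads, reuse expected soon
    OBJECT = "object_store"  # long-lived, reusable KV state


@dataclass
class KVBlockInfo:
    in_active_generation: bool  # block belongs to a request that is decoding now
    expected_reuse_s: float     # hypothetical estimate of seconds until next reuse


def place(block: KVBlockInfo, dram_window_s: float = 30.0) -> Tier:
    """Pick a tier for one KV block (dram_window_s is an assumed threshold)."""
    if block.in_active_generation:
        return Tier.HBM
    if block.expected_reuse_s <= dram_window_s:
        return Tier.DRAM
    return Tier.OBJECT
```

A real policy would also have to account for capacity pressure and eviction in each tier; the sketch only captures the intended role of each one.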
The reason I care about object storage is not simply capacity. Prefix KV blocks are immutable after prefill, naturally content-addressable, and reusable across requests, sessions, users, and compute nodes. That makes an S3-compatible object interface a promising cloud-native abstraction for the KV cache footprint that is growing with long-context and agentic workloads. Instead of treating S3 as a cold archive, ObjectCache asks whether object storage can become part of the serving path.
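One way to picture content-addressable prefix blocks is a hash chain over fixed-size token chunks, so that each object key depends on the entire prefix up to and including that chunk. The sketch below is an assumption-laden illustration, not ObjectCache's actual key scheme: the chunk size, the `kv/` key prefix, SHA-256, and the function names are all hypothetical.

```python
import hashlib
from typing import List, Sequence

CHUNK_TOKENS = 4  # hypothetical chunk size, kept tiny for illustration


def prefix_chunk_keys(token_ids: Sequence[int],
                      chunk_tokens: int = CHUNK_TOKENS) -> List[str]:
    """Return one object key per full chunk of the prompt.

    Each key hashes the previous chunk's digest together with the current
    chunk's tokens, so equal keys imply equal prefixes up to that chunk.
    Token ids are assumed to be non-negative and to fit in 32 bits.
    """
    keys: List[str] = []
    parent = b""  # digest of the previous chunk; empty for the first chunk
    full = len(token_ids) - len(token_ids) % chunk_tokens  # drop the partial tail
    for start in range(0, full, chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        h = hashlib.sha256()
        h.update(parent)
        h.update(b"".join(t.to_bytes(4, "little") for t in chunk))
        parent = h.digest()
        keys.append("kv/" + parent.hex())
    return keys


def shared_prefix_chunks(a: Sequence[str], b: Sequence[str]) -> int:
    """Count how many leading chunk keys two requests have in common."""
    n = 0
    for ka, kb in zip(a, b):
        if ka != kb:
            break
        n += 1
    return n
```

Because the keys are derived only from token content, any node that computes the same keys for a new request can look up the matching objects without coordinating with the node that wrote them.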
The key design idea is layerwise retrieval. An ordinary object-store request moves objects, but an inference engine consumes cached KV in transformer-layer order. ObjectCache keeps fine-grained, hash-addressed chunks for prefix reuse, then lets the storage side gather many chunks and deliver layer-major payloads in the order the GPU consumes them. If the next layer's transfer can be overlapped with current-layer GPU compute, object-storage round-trip latency can be hidden for long-context workloads.
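The overlap described above can be sketched as a one-layer-ahead prefetch loop. This is a simplified illustration under stated assumptions, not the ObjectCache implementation: an in-memory dict stands in for the object store, `gather_layer` imitates a storage-side gather that turns chunk-major objects into a layer-major payload, and `compute_layer` is a hypothetical callback standing in for the GPU work on one layer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")

# Stand-in store: chunk key -> per-layer KV payloads for that chunk.
Store = Dict[str, List[bytes]]


def gather_layer(store: Store, chunk_keys: Sequence[str], layer: int) -> bytes:
    """Concatenate one layer's KV bytes across all chunks, in prefix order."""
    return b"".join(store[key][layer] for key in chunk_keys)


def run_layerwise(store: Store,
                  chunk_keys: Sequence[str],
                  num_layers: int,
                  compute_layer: Callable[[int, bytes], T]) -> List[T]:
    """Consume KV in layer order while prefetching the next layer's payload."""
    outputs: List[T] = []
    if num_layers == 0:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(gather_layer, store, chunk_keys, 0)
        for layer in range(num_layers):
            payload = pending.result()  # waits only if the transfer is behind
            if layer + 1 < num_layers:
                # Start fetching layer + 1 before computing on the current layer.
                pending = pool.submit(gather_layer, store, chunk_keys, layer + 1)
            outputs.append(compute_layer(layer, payload))
    return outputs
```

In this kind of pipeline, only the first layer's fetch is fully exposed; later fetches are hidden to the extent that each layer's transfer finishes within the time spent computing the previous layer.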
The broader memory-wall discussion around ObjectCache makes this direction feel even more important. Agentic workloads keep more state alive, reuse longer prefixes, and create KV cache footprints that no single GPU HBM tier can economically retain. My view is that the durable, cloud-native answer should look like a hierarchy: HBM for hot runtime state, CPU DRAM for short offloads, and an S3-compatible ObjectCache layer for scalable, persistent KV reuse across the serving cluster.