LLM Inference Optimization
High-performance LLM serving built on vLLM, covering KV Cache management, Speculative Decoding, Continuous Batching, and quantized deployment for low-latency online inference at scale.
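As a taste of one of the topics listed, continuous batching, here is a minimal sketch of the scheduling idea: new requests join the running batch at every decode step, and finished sequences free their slot immediately instead of waiting for the whole batch to drain. This is an illustrative toy, not vLLM's actual scheduler; all names (`MAX_BATCH`, `decode_step`, `serve`) are invented for the sketch.

```python
from collections import deque

MAX_BATCH = 4  # illustrative cap on concurrent sequences

def decode_step(seq):
    """Stand-in for one forward pass emitting one token for a sequence."""
    seq["generated"] += 1
    return seq["generated"] >= seq["max_tokens"]  # True when finished

def serve(requests):
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Continuous batching: admit new requests into any free slots
        # before every step, rather than once per static batch.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        # One decode step across the whole running batch.
        done = [seq for seq in running if decode_step(seq)]
        for seq in done:
            running.remove(seq)      # slot is reusable on the next step
            finished.append(seq)
    return finished

requests = [{"id": i, "generated": 0, "max_tokens": t}
            for i, t in enumerate([3, 5, 2, 7, 4])]
completed = serve(requests)
```

The key property the sketch shows: a short request (here `max_tokens=2`) exits early and its slot is immediately handed to the waiting fifth request, which is what keeps GPU utilization high under mixed-length traffic.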
Kyrie Chen
I write about machine learning, LLM systems, and AI infrastructure, alongside essays on the cities and journeys that stay with me after the trip is over.
Engineer, researcher, traveler, husband, and father: this site is where technical notes and personal essays live together.
The technical writing traces each step of learning and building through school and industry, with deep dives into LLMs, inference systems, training strategies, and production engineering; the travel and life essays capture cities, coastlines, alleyways, and the everyday moments worth revisiting.
Large-scale distributed training infrastructure and model architecture research, including 5D Parallelism (DP/TP/PP/SP/EP), MoE sparse architectures, Scaling Law validation, and training stability tuning.
Design and implementation of LLM Agent systems, including hierarchical Memory management for multi-turn conversations, Tool-use orchestration, MCP protocol integration, and RAG pipelines.
Production-grade LLM serving platform covering model deployment orchestration, GPU cluster scheduling, Triton inference gateway, and end-to-end observability.