Kyrie Chen

Hi, I'm Kyrie

I write about machine learning, LLM systems, AI infrastructure, and the cities and journeys that stay with me after the trip is over.

Engineer, researcher, traveler, husband, and father. This site is where technical notes and personal essays live together.

About

Where scaling laws meet the open road

My technical writing traces what I've learned and built through school and industry; my travel and life essays capture the places and moments worth revisiting.

Here you'll find deep dives into LLMs, inference systems, training strategies, and production engineering — alongside travel stories about cities, coastlines, alleyways, and the everyday moments worth remembering.

  • Machine Learning / Deep Learning
  • LLM / AI Infra / NLP / CV
  • Travel writing / long-form notes
More about me →
Posts: 55
Focus: ML, LLM, Infra
Also writing: Travel & Life
Tech Stack

Skills & Tools

Python · C++ · C · Java · PyTorch · Transformers · DeepSpeed · Megatron-LM · CUDA · MoE · vLLM · KV Cache · TensorRT · Triton · Docker · Kubernetes · LangChain · RAG · MCP · Memory System · Prompt Engineering
What I Work On

Focus Areas

View all on GitHub →

LLM Inference Optimization

High-performance LLM serving built on vLLM, covering KV Cache management, Speculative Decoding, Continuous Batching, and quantized deployment for low-latency online inference at scale.

vLLM KV Cache TensorRT CUDA

Distributed Training & Model Architecture

Large-scale distributed training infrastructure and model architecture research, including 5D Parallelism (DP/TP/PP/SP/EP), MoE sparse architectures, Scaling Law validation, and training stability tuning.

DeepSpeed Megatron-LM PyTorch MoE

Agent & Memory System

Design and implementation of LLM Agent systems, including hierarchical Memory management for multi-turn conversations, Tool-use orchestration, MCP protocol integration, and RAG pipelines.

LangChain RAG MCP Memory System

LLM Serving Infrastructure

Production-grade LLM serving platform covering model deployment orchestration, GPU cluster scheduling, Triton inference gateway, and end-to-end observability.

Kubernetes Docker Triton Python
Blogs

Selected writing

View archive →

How Claude Code Works

I recently worked through a snapshot of the Claude Code source code end to end. My biggest takeaway: it is not a "CLI that calls a few tools," nor merely a chat program that "plugs an LLM into the terminal." More precisely, it is an agent runtime that runs in the terminal. Many people, on first using Claude Code, attribute the experience to the model itself. But from the co...

A Visual Guide to Attention Variants in Modern LLMs

This post is a translation of Sebastian Raschka's article A Visual Guide to Attention Variants in Modern LLMs, originally published on March 22, 2026. All images are taken from the original article and its references, and technical terms are kept in their original English.

How the KV Cache Shapes LLM Inference

In recent years, mainstream LLM architectures have been undergoing a paradigm shift from standard Multi-Head Attention (MHA) toward Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA). The core driver of this evolution is the memory-capacity and bandwidth bottleneck (the "memory wall") that the KV Cache imposes during autoregressive decoding; by reducing memory-access overhead, these variants aim to significantly improve inference throughput...

Context, RAG and Memory

Context, RAG, and Memory are not mutually exclusive but complementary: context engineering optimizes the live session, RAG injects authoritative documents into generation, and long-term memory personalizes across sessions. Comparing Context, RAG, and Memory: Dimension | Context Engineering ...

Into AI Agent

In today's LLM applications, the Agent is a crucial concept: it helps an LLM carry out complex tasks such as code generation, question answering, and multi-turn conversation. Yet among the many LLM applications out there, the most successful ones tend not to rely on elaborate architectures or special libraries, but on simple, general agent patterns. What is an Agent? "Agent" can be defined in many ways...

Thailand's Mountains and Sea

Ever since we started bringing our kid along on trips, I've found that the range of destinations my wife and I can choose from keeps shrinking. Everything has to revolve around a short journey, convenient transport, a pleasant environment, and food the kid can handle. On top of all that, one thing is the trump card: a beach. For a preschooler, a beach is a wonderful place; they can mess around in the sand all day, while the abundant seafood satisfies those of us who grew up in...