Xianwei

> Publications

[DBLP, ORCID]

Thesis/Tutorial/Patent
§ [PhD]. Addressing Prolonged Restore Challenges in Further Scaling DRAMs [slides (pdf), pptx], Pittsburgh, July 2017.
§ [Tutorial]. A. Gutierrez, X. Zhang, T. Ta and B. Beckmann, AMD gem5 APU Simulator: Modeling GPUs Using the Machine ISA, The 45th International Symposium on Computer Architecture (ISCA), Los Angeles, California, USA, June 2018.
§ [Patent]. X. Zhang, J. Kalamatianos and B. Beckmann, GPU Cache Management based on Lightweight Locality Type Detection. US11487671B2, 2019.
§ [Patent]. S. Puthoor, K. Punniyamurthy, O. Kayiran, X. Zhang, Y. Eckert, J. Alsop and B. Beckmann, Memory Request Priority Assignment Techniques for Parallel Processors. US11507522B2, 2019.
§ [Patent]. M. Seyedzadeh, X. Zhang, B. Beckmann and S. Das, Data Compression System Using Base Values and Methods Thereof. US11740791B2, 2019.
§ [Patent]. A. Gutierrez, S. Blagodurov, S. Moe, X. Zhang, J. Yin and M. Sinclair, Selecting a Precision Level for Executing a Workload in an Electronic Device. US11150899B2, 2018.

arXiv (To-be-published)
§ [A4]. Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement.
§ [A3]. Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration.
§ [A2]. VecTrans: LLM Transformation Framework for Better Auto-vectorization on High-performance CPU.
§ [A1]. gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling.

Paper (Conference/Journal)
Note: Supervised Student, Corresponding^#
2025 § [C29][CCF-A]. Hongxin Xu, Tianyu Guo and Xianwei Zhang^#, DynaPipe: Dynamic Layer Redistribution for Efficient Serving of LLMs with Pipeline Parallelism, The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), San Diego, CA, United States, December 2025.
§ [C28][CCF-A]. Yuhao Gu, Haoquan Chen, Xianjie Chen, Jiangsu Du, Zhiguang Chen, Nong Xiao^#, Xianwei Zhang^# and Yutong Lu, coMtainer: Compilation-assisted HPC Container Images with Enhanced Adaptability, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.
§ [C27][CCF-A]. Tianyu Guo, Xianwei Zhang^#, Jiangsu Du, Zhiguang Chen^#, Nong Xiao and Yutong Lu, gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.
§ [C26][CCF-A]. Han Huang, Jiabin Xie, Guangnan Feng, Xianwei Zhang, Dan Huang, Zhiguang Chen and Yutong Lu^#, HStencil: Matrix-Vector Stencil Computation with Interleaved Outer Product and MLA, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), St. Louis, MO, United States, November 2025.
§ [C25][CCF-A]. Xuanteng Huang, Jiangsu Du, Nong Xiao and Xianwei Zhang^#, PaSK: Cold Start Mitigation for Inference with Proactive and Selective Kernel Loading on GPUs, The 62nd ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2025.
§ [C24][CCF-A]. Kan Wu, Zejia Lin, Mengyue Xi, Zhongchun Zheng, Wenxuan Pan, Xianwei Zhang^# and Yutong Lu^#, GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving, The 62nd ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2025.
§ [C23][CCF-A]. Yuhao Gu, Chunyu Chen, Jiangsu Du, Xiaoxi Zhang and Xianwei Zhang^#, ORFA: Exploring WebAssembly as a Turing Complete Query Language for Web APIs, The ACM Web Conference (WWW), Sydney, NSW, Australia, April 2025.
§ [C22][CCF-B]. Mengyue Xi, Jingyi He and Xianwei Zhang^#, CacheC: LLM-based GPU Cache Management to Enhance Kernel Concurrency, The 31st International European Conference on Parallel and Distributed Computing (Euro-Par), Dresden, Germany, August 2025.
§ [C21][CCF-B]. Tianyu Guo, Hande Dong^#, Yichong Leng, Feng Liu, Cheater Lin, Nong Xiao and Xianwei Zhang^#, EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse, The 31st International European Conference on Parallel and Distributed Computing (Euro-Par), Dresden, Germany, August 2025.
§ [C20][CCF-C]. Mengyue Xi, Tianyu Guo, Xuanteng Huang, Zejia Lin and Xianwei Zhang^#, Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs, The 30th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo Odaiba Miraikan, Japan, January 2025.
§ [J6][CCF-C]. Hengzhong Liang, Han Huang and Xianwei Zhang^#, SuCL: Supply Unified Communication Layer to Improve SYCL-based Heterogeneous Computing, CCF Transactions on High Performance Computing (THPC), 2025.
§ [J5][CCF-C]. Pin Chen, Qing Mo, Zexin Xu, Xianwei Zhang and Yutong Lu^#, Star-gen: An HPC-AI Framework for Constructing Large-scale Computational Materials Database, CCF Transactions on High Performance Computing (THPC), 2025.
2024 § [C19][CCF-A]. Tianyu Guo, Xuanteng Huang, Kan Wu, Xianwei Zhang^# and Nong Xiao, SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism, The 61st ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, United States, June 2024.
§ [C18][CCF-A]. Yuanxin Wei, Jiangsu Du^#, Jiazhi Jiang, Xiao Shi, Xianwei Zhang, Dan Huang^#, Nong Xiao and Yutong Lu, APTMoE: Affinity-aware Pipeline Tuning for MoE Models on Bandwidth-constrained GPU Nodes, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Atlanta, GA, United States, November 2024.
§ [C17][CCF-B]. Zejia Lin, Aoyuan Sun, Xianwei Zhang^# and Yutong Lu, MixPert: Optimizing Mixed-precision Floating-point Emulation on GPU Integer Tensor Cores, The 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), Copenhagen, Denmark, June 2024.
§ [C16][CCF-C]. Zhaowen Shan, Xuanteng Huang, Zheng Zhou and Xianwei Zhang^#, openLG: A Tunable and Efficient Open-source LSTM on GPUs, The International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, June 2024.
§ [C15]. Zhongchun Zheng, Yuan Wu and Xianwei Zhang^#, mLOOP: Optimize Loop Unrolling in Compilation with a ML-based Approach, The 17th International Conference on Networking, Architecture, and Storage (NAS), Guangzhou, China, November 2024.
2023 § [C14][CCF-B]. Zejia Lin, Zewei Mo, Xuanteng Huang, Xianwei Zhang^# and Yutong Lu, KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications, The IEEE 41st International Conference on Computer Design (ICCD), Washington DC, United States, November 2023.
§ [J4]. Xuanteng Huang, Xianwei Zhang^#, Panfei Yang^# and Nong Xiao, Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS, Applied Sciences, December 2023.
§ [J3][CCF-C]. Xi Zhang^#, Xiaohu Guo, Yue Weng, Xianwei Zhang, Yutong Lu and Zhong Zhao, Hybrid MPI and CUDA Paralleled Finite Volume Unstructured CFD Simulations on a Multi-GPU System, Future Generation Computer Systems 139 (2023), February 2023.
§ [W5]. Lianghong Huang, Zejia Lin, Wei Liu^# and Xianwei Zhang^#, Hay: Enhancing GPU Sharing Performance With Two-Level Scheduling for Ray (short), The 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Hainan, China, December 2023.
2022 § [C13][CCF-B]. Tianao Ge, Zewei Mo, Kan Wu, Xianwei Zhang^# and Yutong Lu, RollBin: Reducing Code-size via Loop Rerolling at Binary Level, The 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), San Diego, California, United States, June 2022.
§ [C12][CCF-C]. Zewei Mo, Zejia Lin, Xianwei Zhang^# and Yutong Lu, moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators, The 19th ACM International Conference on Computing Frontiers (CF), Turin, Piedmont, Italy, May 2022.
§ [C11][CCF-C]. Yue Weng, Tianao Ge, Xianwei Zhang^# and Yutong Lu, RAISE: Efficient GPU Resource Management via Hybrid Scheduling, The 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Taormina (Messina), Italy, May 2022.
2021 § [J2]. Yue Weng, Xi Zhang, Xiaohu Guo, Xianwei Zhang^#, Yutong Lu and Yang Liu, Effects of Mesh Loop Modes on Performance of Unstructured Finite Volume GPU Simulations, Advances in Aerodynamics 3(21), 2021.
2020 § [W4]. Xianwei Zhang and Evgeny Shcherbakov, DELTA: Validate GPU Memory Profiling with Microbenchmarks (short), The International Symposium on Memory Systems (MemSys), Washington D.C., USA, October 2020.
2019 § [C10]. Tuan Ta, Xianwei Zhang, Anthony Gutierrez and Brad Beckmann, Autonomous Data-Race-Free GPU Testing, IEEE International Symposium on Workload Characterization (IISWC), Orlando, Florida, USA, November 2019.
§ [C9][CCF-C]. Xianwei Zhang, Rujia Wang, Youtao Zhang and Jun Yang, Boosting Chipkill Capability under Retention-error Induced Reliability Emergency, The 24th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, January 2019.
§ [W3]. John Alsop, Matt Sinclair, Srikant Bharadwaj, Anthony Gutierrez, Xianwei Zhang, Brad Beckmann, Alex Dutu, Onur Kayiran, Michael LeBeane, Brandon Potter, Sooraj Puthoor and Tsung Tai Yeh, Optimizing GPU Cache Policies for MI Workloads (short), IEEE International Symposium on Workload Characterization (IISWC), Orlando, Florida, USA, November 2019.
§ [A1]. John Alsop, Matt Sinclair, Srikant Bharadwaj, Alexandru Dutu, Anthony Gutierrez, Onur Kayiran, Michael LeBeane, Sooraj Puthoor, Xianwei Zhang, Tsung Tai Yeh and Bradford M. Beckmann, Optimizing GPU Cache Policies for MI Workloads, arXiv, October 2019.
2018 § [C8][CCF-A]. Anthony Gutierrez, Brad Beckmann, Alexandru Dutu, Joseph Gross, Michael LeBeane, John Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matt Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain and Tim Rogers, Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level, The 24th IEEE International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018.
2017 § [C7][CCF-B]. Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang, DrMP: Mixed Precision-aware DRAM for High Performance Approximate and Precise Computing, The 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Portland, Oregon, USA, September 2017.
§ [J1][CCF-B]. Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang, On the Restore Time Variations of Future DRAM Memory, ACM Trans. on Design Automation of Electronic Systems (TODAES), 22(2), February 2017.
2016 § [W2]. Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang, AWARD: Approximation-aWAre Restore in Further Scaling DRAM (extended abstract), The International Symposium on Memory Systems (MemSys), Washington D.C., USA, October 2016.
§ [C6][CCF-A]. Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang, Restore Truncation for Performance Improvement in Future DRAM Systems, The 22nd IEEE International Symposium on High-Performance Computer Architecture (HPCA), Barcelona, Spain, March 2016.
2015 § [C5][CCF-B]. Xianwei Zhang, Youtao Zhang, Bruce R. Childers and Jun Yang, Exploiting DRAM Restore Time Variations in Deep Sub-micron Scaling, The IEEE Conference on Design, Automation and Test in Europe (DATE), Grenoble, France, March 2015.
§ [C4][CCF-B]. Xianwei Zhang, Youtao Zhang and Jun Yang, DLB: Dynamic Lane Borrowing for Improving Bandwidth and Performance in Hybrid Memory Cube, The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.
§ [C3][CCF-B]. Xianwei Zhang, Youtao Zhang and Jun Yang, TriState-SET: Proactive SET for Improved Performance in MLC Phase Change Memories, The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.
§ [C2][CCF-B]. Xianwei Zhang, Lei Zhao, Youtao Zhang and Jun Yang, Exploit Common Source-Line to Construct Energy Efficient Domain Wall Memory based Caches, The 33rd IEEE International Conference on Computer Design (ICCD), New York City, USA, October 2015.
§ [W1]. Xianwei Zhang, Youtao Zhang and Jun Yang, Adaptive Lane Borrowing of Hybrid Memory Cube (WIP), The 52nd ACM/IEEE Design Automation Conference (DAC), San Francisco, California, USA, June 2015.
2013 § [C1][CCF-C]. Xianwei Zhang, Lei Jiang, Youtao Zhang, Chuanjun Zhang and Jun Yang, WoM-SET: Lowering Write Power of Proactive-SET based PCM Write Strategy Using WoM Code, The International Symposium on Low Power Electronics and Design (ISLPED), Beijing, China, September 2013. (Best Paper Award)