Work Packages

JHPC-quantum

03.

Research and Development of Scheduler and Coupling Method of Software/Applications for Quantum-HPC Hybrid Collaboration

“Dream-like Software” that realizes hybrid collaboration between quantum computers and supercomputers

Overview

The Information Technology Center, The University of Tokyo (ITC/UTokyo) has developed an innovative software framework, “h3-Open-BDEC”, for the integration of "Simulation/Data/Learning (S+D+L)" on the Wisteria/BDEC-01 system, which has a heterogeneous configuration. Wisteria/BDEC-01 with h3-Open-BDEC has been applied to various applications and has been leading to groundbreaking scientific discoveries. Building upon these achievements, our research focuses on developing a suite of software tools to efficiently facilitate hybrid collaboration between quantum computers (QC) and supercomputers (HPC). Specifically, we will create a job scheduler (QHScheduler) that allows simultaneous utilization of multiple distributed computing resources across remote locations. Additionally, we will develop coupling tools (h3-Open-BDEC/QH) to efficiently manage real-time communication, coordination, and data transfer between QC and HPC systems. Our goal is to pave the way for new scientific frontiers through the synergy of QC-HPC hybrid collaboration.

Detail

Purpose

To efficiently implement collaboration between quantum computers (QC) and supercomputers (HPC) (QC-HPC hybrid collaboration), two things are needed: a job scheduler that can simultaneously use multiple computer resources distributed across remote locations, and a coupling technology that efficiently implements and integrates communication and data transfer online and in real time. In the present work, we will first conduct research and development of a job scheduler (QHScheduler) to efficiently implement QC-HPC hybrid collaboration. Furthermore, we will develop an application development framework for QC-HPC hybrid collaboration (h3-Open-BDEC/QH) that realizes high-speed data transfer between QCs and HPCs (Fig. 1).

The target application is an "AI for HPC" type workload that aims to improve simulation efficiency by linking quantum machine learning and simulations of computational science.

Fig.1 Multiple Quantum Computers (QC), Supercomputers (HPC), and Quantum Simulators on Supercomputers (QC Sim. on HPC) distributed in remote locations

Background

Since 2015, the Information Technology Center, The University of Tokyo (ITC/UTokyo) has been engaged in research and development for a supercomputer system that integrates simulations, data analytics, and machine learning (Simulation/Data/Learning, S+D+L). This effort is known as the Big Data & Extreme Computing (BDEC) project, and our first BDEC system, Wisteria/BDEC-01, commenced operations in May 2021. It comprises the Simulation Node Group (Odyssey), with the same general-purpose CPU (A64FX) as “Fugaku,” and the Data/Learning Node Group (Aquarius), equipped with NVIDIA A100 Tensor Core GPUs. The system’s total peak performance is 33.1 petaflops. Wisteria/BDEC-01 is the world’s first supercomputer to combine different architectures for simulations and data analysis/machine learning. Additionally, each node of Aquarius can directly connect to an external network via a high-capacity communication line, enabling real-time acquisition of observation data through SINET. “h3-Open-BDEC”, which is also developed by ITC/UTokyo, is an innovative software platform that integrates S+D+L by minimizing computation and energy consumption and maximizing effectiveness on heterogeneous supercomputers such as Wisteria/BDEC-01.

Fig.2 Wisteria/BDEC-01 (Information Technology Center, The University of Tokyo)

The functions of h3-Open-BDEC include h3-Open-SYS/WaitIO, which supports communication and data transfer between simulation nodes (Odyssey) and data/learning nodes (Aquarius) using an MPI-like interface, and h3-Open-UTIL/MP, which is a coupler for real-time integration of simulation and machine learning workloads on Wisteria/BDEC-01. These functions enable advanced simulations on Wisteria/BDEC-01, such as integration of earthquake simulation and real-time data assimilation, and global cloud physics simulation with machine learning.
Moreover, a unique job scheduler has been developed on Wisteria/BDEC-01 that can simultaneously execute jobs for Odyssey and Aquarius. In the present work, we will expand this idea and develop a scheduler (QHScheduler) and an application development framework (h3-Open-BDEC/QH) for QC-HPC hybrid collaboration.

Details of Development

QHScheduler and h3-Open-BDEC/QH provide an environment for seamlessly realizing real-time hybrid collaboration between HPCs based on various architectures (e.g., general-purpose CPUs, GPUs) and various types of QCs, including simulators. They also enable collaboration between multiple HPCs and multiple QCs. Each HPC and QC is assumed to be installed at an independent site and to communicate via SINET or a similar network. However, to cover various situations, the same interface will also be used for network communication between nodes within a single system and for data transfer via the file system.
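
The idea of one interface over several transfer paths can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Channel`, `MemoryChannel`, `FileChannel`, and `hybrid_step` are hypothetical and are not part of h3-Open-BDEC/QH, and an in-process queue merely stands in for network communication. The sketch only shows that application code written against a common `send`/`recv` interface does not change when the transport does.

```python
import json
import os
import queue
import tempfile
from abc import ABC, abstractmethod


class Channel(ABC):
    """Hypothetical common interface; NOT the h3-Open-BDEC/QH API."""

    @abstractmethod
    def send(self, tag: str, payload: dict) -> None: ...

    @abstractmethod
    def recv(self, tag: str) -> dict: ...


class MemoryChannel(Channel):
    """In-process queues, standing in for network communication."""

    def __init__(self):
        self._queues = {}

    def send(self, tag, payload):
        self._queues.setdefault(tag, queue.Queue()).put(json.dumps(payload))

    def recv(self, tag):
        # Raises KeyError if nothing was ever sent under this tag.
        return json.loads(self._queues[tag].get(timeout=1.0))


class FileChannel(Channel):
    """Transfers messages through a directory on a (shared) file system."""

    def __init__(self, directory):
        self._dir = directory
        self._sent = {}       # tag -> number of messages written
        self._received = {}   # tag -> number of messages read

    def _path(self, tag, seq):
        return os.path.join(self._dir, f"{tag}.{seq}.json")

    def send(self, tag, payload):
        seq = self._sent.get(tag, 0)
        tmp = self._path(tag, seq) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(payload, f)
        os.replace(tmp, self._path(tag, seq))  # publish the file atomically
        self._sent[tag] = seq + 1

    def recv(self, tag):
        # Does not wait: raises FileNotFoundError if the message is not there yet.
        seq = self._received.get(tag, 0)
        with open(self._path(tag, seq)) as f:
            payload = json.load(f)
        self._received[tag] = seq + 1
        return payload


def hybrid_step(channel: Channel, params):
    """One request/reply exchange; both roles run in this process for brevity."""
    channel.send("hpc_to_qc", {"params": params})          # "HPC" side sends
    request = channel.recv("hpc_to_qc")                    # "QC" side receives
    energy = sum(p * p for p in request["params"])         # dummy "QC" result
    channel.send("qc_to_hpc", {"energy": energy})          # "QC" side replies
    return channel.recv("qc_to_hpc")["energy"]             # "HPC" side receives


if __name__ == "__main__":
    print(hybrid_step(MemoryChannel(), [1.0, 2.0]))
    with tempfile.TemporaryDirectory() as d:
        print(hybrid_step(FileChannel(d), [1.0, 2.0]))
```

In this sketch `hybrid_step` is identical for both transports; only the object passed in differs. A real implementation would additionally need blocking receives, error handling, and a wide-area transport.
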
In FY2023 and FY2024, we develop and verify prototypes of QHScheduler and h3-Open-BDEC/QH on Wisteria/BDEC-01. Odyssey nodes of Wisteria/BDEC-01 are treated as HPCs and Aquarius nodes as QCs, and various experiments using h3-Open-BDEC are carried out on Wisteria/BDEC-01 (Fig. 3).

Fig.3 Environment for Development and Verification of the Prototype of QHScheduler and h3-Open-BDEC/QH (treating Wisteria/BDEC-01's Simulation Node Group (Odyssey) as HPCs and its Data/Learning Node Group (Aquarius) as QCs)

QHScheduler can be started from QCs, HPCs, or dedicated servers, and acts as a metascheduler that controls the job schedulers of QCs and HPCs. We plan to introduce a resource group dedicated to QC-HPC hybrid collaboration on each HPC, but we are also considering implementing the resource management method shown in Fig. 4 to achieve more flexible and efficient operation. In this method, the QC-HPC hybrid collaboration workload is treated as a priority job, and if the computer resources on the HPC are insufficient, lower-priority jobs are temporarily stopped. Stopped jobs are recovered from checkpoint files when the QC-HPC hybrid collaboration workload has finished. To achieve this, each application must support checkpoint restart, so it is also necessary to develop a library that makes this function easy to implement.
h3-Open-BDEC/QH-MP is a coupler that controls multiple applications that work together on QCs and HPCs, coordinates multiple components, and efficiently transfers data. It calls h3-Open-BDEC/QH-WaitIO internally, while h3-Open-BDEC/QH-WaitIO can also be called directly from each application.

Fig.4 Flexible Resource Management Method for Supercomputers using Checkpoint Restart
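
The resource management idea (priority jobs preempt lower-priority jobs, which later restart from checkpoints) can be sketched as a toy model in Python. This is only an illustration under stated assumptions: `Job` and `ToyCluster` are hypothetical names, a "checkpoint" is reduced to a saved step counter kept in memory, and the victim-selection policy (lowest priority first) is one plausible choice rather than the policy QHScheduler will use.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    priority: int            # larger value = higher priority
    step: int = 0            # progress; this is what a checkpoint saves
    state: str = "pending"   # pending | running | suspended | done


class ToyCluster:
    """Hypothetical model of one HPC system; NOT QHScheduler itself."""

    def __init__(self, total_nodes: int):
        self.total_nodes = total_nodes
        self.jobs = []
        self.checkpoints = {}  # job name -> saved step

    def free_nodes(self) -> int:
        used = sum(j.nodes for j in self.jobs if j.state == "running")
        return self.total_nodes - used

    def submit(self, job: Job) -> bool:
        """Start the job, suspending lower-priority jobs only if needed."""
        victims = sorted(
            (j for j in self.jobs
             if j.state == "running" and j.priority < job.priority),
            key=lambda j: j.priority,
        )
        reclaimable = self.free_nodes() + sum(v.nodes for v in victims)
        if job.nodes > reclaimable:
            return False  # cannot run even after preempting every candidate
        for victim in victims:
            if self.free_nodes() >= job.nodes:
                break
            self.checkpoints[victim.name] = victim.step  # "write" checkpoint
            victim.state = "suspended"
        job.state = "running"
        self.jobs.append(job)
        return True

    def advance(self, steps: int = 1) -> None:
        """Let every running job make progress."""
        for j in self.jobs:
            if j.state == "running":
                j.step += steps

    def finish(self, job: Job) -> None:
        """Mark the job done and restart suspended jobs that now fit."""
        job.state = "done"
        suspended = sorted(
            (j for j in self.jobs if j.state == "suspended"),
            key=lambda j: -j.priority,
        )
        for j in suspended:
            if j.nodes <= self.free_nodes():
                j.step = self.checkpoints.pop(j.name)  # restart from checkpoint
                j.state = "running"


if __name__ == "__main__":
    cluster = ToyCluster(total_nodes=8)
    cfd, md = Job("cfd", 4, priority=1), Job("md", 4, priority=2)
    cluster.submit(cfd)
    cluster.submit(md)
    cluster.advance(3)
    hybrid = Job("qc_hpc", 4, priority=10)
    cluster.submit(hybrid)   # suspends "cfd", the lowest-priority job
    cluster.advance(2)
    cluster.finish(hybrid)   # "cfd" resumes from its checkpoint
    print([(j.name, j.state, j.step) for j in cluster.jobs])
```

In this toy model only as many low-priority jobs are suspended as the hybrid workload actually needs, which is why a checkpoint-restart capability in each application is a prerequisite for the approach.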

In the present work, we will focus on "AI for HPC" type workloads, which aim to improve simulation efficiency by linking computational science simulations on HPCs and quantum machine learning on QCs in real time. Moreover, various observation data will also be made available for efficient use.
Furthermore, QHScheduler and h3-Open-BDEC/QH-MP will be deployed to the HPCs of Fugaku and HPCI centers, and various QCs. In addition to "AI for HPC" type workloads that link computational science simulations and quantum machine learning, we will also consider error mitigation/correction on NISQ machines, and applications to quantum physics simulations and materials simulations, which have already been attempted in numerous cases.

This development is innovative and is the world's first attempt to link multiple HPCs and QCs installed at different sites in real time.

By releasing QHScheduler and h3-Open-BDEC/QH and deploying them to HPCs in Japan and overseas, many researchers and engineers will be able to easily use QCs, and QC-HPC collaboration will be promoted.

Schedule

(4-1) QHScheduler: Flexible Scheduler for QC-HPC Hybrid Collaboration

FY.2023: Basic software design and preliminary evaluation on Wisteria/BDEC-01
FY.2024: Based on the results of the preliminary evaluation in FY.2023, we will complete the design, prototype development, and evaluation on Wisteria/BDEC-01, and develop an environment that simulates the operation of multiple HPCs and multiple QCs.
FY.2025: We will conduct operational tests and trial operation in a QC-HPC hybrid collaboration environment (multiple HPCs, one real QC). From the second half of this fiscal year, we will provide the scheduler to application developers and begin its actual operation and evaluation.
FY.2026: We will continue to operate the system in the QC-HPC hybrid collaborative environment and evaluate and improve it together with application developers.
FY.2027: We will continue to operate the system, and evaluation and improvement will be carried out. Tests will also be conducted in conjunction with multiple QCs (one real QC and one QC Simulator) and multiple HPCs (RIKEN, University of Tokyo, Osaka University).
FY.2028: We will continue to operate the system, and evaluation and improvement will be carried out.

(4-2) h3-Open-BDEC/QH: Application Development Framework for QC-HPC Hybrid Collaboration

FY.2023: Basic software design and preliminary evaluation on Wisteria/BDEC-01
FY.2024: We will conduct prototype development and verification of a communication library and coupler that simulate the operation of multiple HPCs and multiple QCs on Wisteria/BDEC-01. We will also conduct and complete an operation test in conjunction with QHScheduler.
FY.2025: Applications using h3-Open-BDEC/QH will be linked with QHScheduler to conduct and complete operation tests in a QC-HPC hybrid collaboration environment (multiple HPCs, one real QC).
FY.2026: The actual operation and evaluation will be carried out in a QC-HPC hybrid collaboration environment linked with QHScheduler, and the verification by applications combining quantum machine learning and simulation will be carried out in the SINET environment using a QC Simulator. We will conduct a preliminary evaluation of linking multiple QCs (one real QC + QC Simulators) and multiple HPCs (RIKEN, University of Tokyo, and Osaka University).
FY.2027: In collaboration with Work Package (WP) 2, we will continue to improve the software and its collaboration with QHScheduler. We will also conduct tests that link multiple QCs (one real QC + QC Simulators) and multiple HPCs.
FY.2028: In conjunction with WP.2, we will continue to improve the software, and we will continue to verify collaboration with QHScheduler in a QC-HPC hybrid collaboration environment.

(4-3) Other Development Items

FY.2024-2027: In cooperation with WP.4, we will develop skeleton applications that can simulate the data input/output and communication of applications for QC-HPC hybrid collaboration. For this purpose, we will develop simulators that have the same interface as superconducting and ion-trap QCs. Interfaces will be provided for programs written in C/C++, Fortran, and Python, including QC simulators.
FY.2028: The results will be consolidated, for example by packaging them as open-source software, and preparations will be made for deployment after the project is completed.
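
As a rough illustration of the skeleton applications and interface-compatible simulators mentioned under (4-3), the following Python sketch may help. All names (`QuantumBackend`, `RandomSimulator`, `skeleton_app`) and the `run(n_qubits, shots)` signature are assumptions invented for this example; the actual interfaces of the project are not specified here. The stand-in simulator returns random bitstring counts and performs no quantum simulation; the point is only that a skeleton application can reproduce the input/output and call pattern of a hybrid workload against any backend exposing the same interface.

```python
import random
from typing import Protocol


class QuantumBackend(Protocol):
    """Hypothetical interface assumed to be shared by real QCs and simulators."""

    def run(self, n_qubits: int, shots: int) -> dict: ...


class RandomSimulator:
    """Stand-in backend: returns uniformly random bitstring counts."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)

    def run(self, n_qubits: int, shots: int) -> dict:
        counts = {}
        for _ in range(shots):
            bits = format(self._rng.getrandbits(n_qubits), f"0{n_qubits}b")
            counts[bits] = counts.get(bits, 0) + 1
        return counts


def skeleton_app(backend: QuantumBackend, n_steps: int, n_qubits: int, shots: int):
    """Mimic the I/O and communication pattern of a hybrid application."""
    log = []
    state = [0.0] * 4  # placeholder for simulation state on the HPC side
    for step in range(n_steps):
        state = [x + 1.0 for x in state]        # dummy "simulation" step
        counts = backend.run(n_qubits, shots)   # call to the QC or simulator
        log.append({
            "step": step,
            "values_sent": len(state),
            "shots_received": sum(counts.values()),
            "distinct_outcomes": len(counts),
        })
    return log


if __name__ == "__main__":
    for entry in skeleton_app(RandomSimulator(seed=1), n_steps=3, n_qubits=2, shots=100):
        print(entry)
```

Because `skeleton_app` depends only on the `run` method, a wrapper around a real device could be substituted for `RandomSimulator` without changing the application code, under the assumption that it exposes the same method.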

Project Members

The University of Tokyo

Project Leader

Kengo Nakajima
Supercomputing Division, Information Technology Center
Lysenko Artem
Supercomputing Division, Information Technology Center
Yo Ko
Supercomputing Division, Information Technology Center
Shinji Sumimoto
Supercomputing Division, Information Technology Center
Tatsuhiko Tsunoda
Supercomputing Division, Information Technology Center
Kazuya Yamazaki
Supercomputing Division, Information Technology Center