Name: GPU Day 2021
Start: 2021-11-10T08:50:00+01:00
End: 2021-11-11T14:00:00+01:00
Location: Hotel Mercure Budapest Castle Hill

GPU Day 2021

from Wednesday, 10 November 2021 (08:50) to Thursday, 11 November 2021 (14:00)

Wednesday, 10 November 2021

09:00 Opening Talk and Welcome by the Director
Opening Talk and Welcome by the Director
09:00 - 09:20
09:20 Space-ready FPGA hardware acceleration for .NET software - Hastlayer - Erno David (MTA Wigner FK), Zoltán Lehóczky (Lombiq Technologies Ltd.)
Space-ready FPGA hardware acceleration for .NET software - Hastlayer
- Erno David (MTA Wigner FK)
- Zoltán Lehóczky (Lombiq Technologies Ltd.)
09:20 - 09:40
Hastlayer (https://hastlayer.com/) by Lombiq Technologies is an easy-to-use high-level synthesis tool aimed at .NET software developers, with the goal of accelerating applications. It converts standard .NET Common Intermediate Language (CIL) bytecode into equivalent Very High Speed Integrated Circuit Hardware Description Language (VHDL) constructs, which can be implemented in hardware using FPGAs. After cloud-available FPGA platforms, we've made Hastlayer compatible with the Zynq 7000 family of FPGA SoC devices. The primary goal is to be able to utilize the onboard computers of satellites built with the same hardware, which is readily available from NewSpace manufacturers. In this talk, we'll introduce Hastlayer and how it can be used, the challenges and experiences of making it compatible with Zynq devices, and our results showing speed and power efficiency increases of up to two orders of magnitude.
09:40 200+ GPUs in one HPC - available in months - Zoltan Kiss (KIFÜ)
200+ GPUs in one HPC - available in months
- Zoltan Kiss (KIFÜ)
09:40 - 10:00
A new 5 PF HPC system is being built in Hungary with more than 200 Nvidia A100 GPUs. It will offer dedicated partitions: a CPU-only partition with almost 20 000 CPU cores, a GPU partition with 200+ Nvidia A100 GPUs, a Big Data partition with 9 TB RAM, and an AI partition with 8 GPU nodes. The system will be complemented by advanced HPC software and a portal system open to both SMEs and academia. This talk will go into the details of the new machine and the offerings of the HPC Competence Centre, including future plans. We are ready to support Hungarian and international research, including quantum simulators. Apply for resources today!
10:00 Standards in HPC - Máté Ferenc Nagy-Egri (MTA Wigner FK)
Standards in HPC
- Máté Ferenc Nagy-Egri (MTA Wigner FK)
10:00 - 10:40
10:40 Coffee Break
Coffee Break
10:40 - 11:00
11:00 The GUARDYAN code for high fidelity nuclear reactor calculations - David Legrady (Dr.)
The GUARDYAN code for high fidelity nuclear reactor calculations
- David Legrady (Dr.)
11:00 - 11:30
GUARDYAN (GPU Assisted Reactor Dynamic Analysis) is a continuous-energy Monte Carlo (MC) neutron transport code developed at the Budapest University of Technology and Economics. It aims to solve time-dependent problems related to fission reactors, with the main focus on simulating and analyzing short transients. The key idea of GUARDYAN is a massively parallel execution structure that makes use of the advanced programming possibilities available on CUDA-enabled GPUs. Compared to similar code systems, GUARDYAN is the first to scale up to nuclear power plant levels, targeting the simulation and analysis of severe accident scenarios for reactor safety analysis. Recent advances include coupling with thermal-hydraulics solvers and comparison to actual measurements at the Paks Nuclear Power Plant.
11:30 Solving the Kuramoto Oscillator Model of Power Grids on GPU - Lilla Barancsuk (Budapest University of Technology and Economics)
Solving the Kuramoto Oscillator Model of Power Grids on GPU
- Lilla Barancsuk (Budapest University of Technology and Economics)
11:30 - 11:50
Power grids are large complex networks whose dynamics, stability and vulnerability are intensively studied; new challenges arise with the increase of distributed renewable energy resources. The dynamics of electrical grids are strongly affected by desynchronization between nodes, which can start an avalanche-like cascade of line failures causing massive outages. Modelling power systems in detail increases the computational cost, as a much larger number of nodes (on the order of millions) needs to be handled than in traditional power grid models. The Kuramoto model is a set of coupled nonlinear ordinary differential equations that describes the power grid as an ensemble of coupled oscillators, and it is widely used for investigating the synchronization properties of networks. Each equation corresponds to one node of the power grid, so modelling a detailed grid means solving a system of millions of equations. To handle the model efficiently, we numerically solved the second-order Kuramoto equations on a GPU and simulated cascades as threshold line failures. In this talk, we present our solution, in which a special memory layout for the network graph has been introduced for an effective implementation. We studied different numerical solvers supplied by *boost*’s *odeint* library and compared them in terms of precision and performance.
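The second-order Kuramoto equations mentioned in the abstract can be illustrated with a minimal CPU sketch. It assumes the common swing-equation form d²θ_i/dt² + α dθ_i/dt = P_i + K Σ_j sin(θ_j − θ_i), a compressed sparse row (CSR) adjacency layout, and a hand-written classical Runge-Kutta step. None of these choices are taken from the talk: the authors' memory layout, parameters and *odeint* solvers are not described here, and all names below (`kuramoto_rhs`, `simulate`, `row_ptr`, `col_idx`) are hypothetical.

```python
import math

def kuramoto_rhs(theta, omega, power, row_ptr, col_idx, coupling, damping):
    """Time derivatives of phases and frequencies for every node.

    The graph is stored in CSR form: the neighbours of node i are
    col_idx[row_ptr[i]:row_ptr[i + 1]] (an assumed layout).
    """
    d_omega = []
    for i in range(len(theta)):
        interaction = sum(
            math.sin(theta[col_idx[k]] - theta[i])
            for k in range(row_ptr[i], row_ptr[i + 1])
        )
        d_omega.append(power[i] - damping * omega[i] + coupling * interaction)
    return list(omega), d_omega

def shifted(base, slope, h):
    """Return base + h * slope, element by element."""
    return [b + h * s for b, s in zip(base, slope)]

def combine(base, k1, k2, k3, k4, dt):
    """Classical RK4 weighted update."""
    return [b + dt / 6.0 * (a + 2.0 * c + 2.0 * d + e)
            for b, a, c, d, e in zip(base, k1, k2, k3, k4)]

def rk4_step(theta, omega, dt, params):
    k1t, k1w = kuramoto_rhs(theta, omega, *params)
    k2t, k2w = kuramoto_rhs(shifted(theta, k1t, dt / 2), shifted(omega, k1w, dt / 2), *params)
    k3t, k3w = kuramoto_rhs(shifted(theta, k2t, dt / 2), shifted(omega, k2w, dt / 2), *params)
    k4t, k4w = kuramoto_rhs(shifted(theta, k3t, dt), shifted(omega, k3w, dt), *params)
    return (combine(theta, k1t, k2t, k3t, k4t, dt),
            combine(omega, k1w, k2w, k3w, k4w, dt))

def simulate(power, row_ptr, col_idx, coupling, damping, dt, steps):
    """Integrate from rest (all phases and frequencies zero)."""
    n = len(power)
    theta, omega = [0.0] * n, [0.0] * n
    params = (power, row_ptr, col_idx, coupling, damping)
    for _ in range(steps):
        theta, omega = rk4_step(theta, omega, dt, params)
    return theta, omega

# Two-node grid: one generator (P = +0.5) and one consumer (P = -0.5).
theta, omega = simulate([0.5, -0.5], [0, 1, 2], [1, 0], 1.0, 1.0, 0.01, 4000)
print(theta[0] - theta[1])  # phase difference of the synchronized state
```

For this two-node example the synchronized state satisfies sin(θ_0 − θ_1) = 0.5, which gives a simple check on the integrator. In a GPU version the loop over nodes in `kuramoto_rhs` is the natural candidate for parallelization.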
11:50 Particle Simulation of Resonant Nanoantennas for Laser Driven Fusion - Istvan Papp (Wigner FK)
Particle Simulation of Resonant Nanoantennas for Laser Driven Fusion
- Istvan Papp (Wigner FK)
11:50 - 12:10
Nanoplasmonic Laser Induced Fusion Experiments were recently proposed as an improvement in achieving laser driven fusion [1]. This approach combines recent discoveries in heavy-ion collisions and optics. The existence of detonations with time-like normal on space-time hyper-surfaces, combined with absorption adjustment using nanoantennas, allows the possibility of heating the target in an opposing laser beam setup [2]. For tracking the time evolution of non-equilibrium plasma interacting with strong laser fields, kinetic modeling is the most appropriate approach. However, describing the absorption effects of gold nanoantennas inside a medium requires different approaches. Here we will present a particle-in-cell model of resonant nanoantennas using the capabilities of the EPOCH multi-component PIC code [3]. [1] L.P. Csernai, N. Kroó, & I. Papp, Radiation-Dominated Implosion with Nano-Plasmonics, Laser and Particle Beams 36, 171-178 (2018). [2] L.P. Csernai, M. Csete, I.N. Mishustin, A. Motornenko, I. Papp, L.M. Satarov, H. Stöcker & N. Kroó, Radiation-Dominated Implosion with Flat Target, Physics and Wave Phenomena, 28 (3) 187-199 (2020) in press, accepted February 3, 2020, (arXiv:1903.10896v3). [3] T. D. Arber et al., Contemporary particle-in-cell approach to laser-plasma modelling, Plasma Phys. Control. Fusion 57, 113001 (2015).
12:10 Accelerating Tridiagonal Solvers - István Reguly (PPKE ITK)
Accelerating Tridiagonal Solvers
- István Reguly (PPKE ITK)
12:10 - 12:30
In this talk, we present work recently done by our group on the parallel solution of multiple tridiagonal linear systems that typically arise during the solution of discretised partial differential equations. We briefly introduce the established serial (Thomas) and parallel (Parallel Cyclic Reduction) algorithms for individual systems, then discuss how multiple systems are formed and solved in a high-dimensional system, including shared memory, distributed memory, and pipeline parallelism, targeting recent many-core CPUs, GPUs and FPGAs. We demonstrate scalability up to 16k CPU cores or 32 GPUs for large systems representative of CFD applications. We also study computational and energy efficiency on GPUs and FPGAs on smaller problems representative of applications in computational finance, demonstrating that a Xilinx Alveo U280 can closely match an NVIDIA V100 GPU in terms of throughput, and significantly outperform it in terms of energy efficiency.
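For readers unfamiliar with the serial Thomas algorithm named in the abstract, the following is a minimal sketch for one system, plus a trivial batch wrapper. It is a textbook formulation, not the group's implementation; the function names and the array convention (`lower[0]` and `upper[-1]` are unused) are assumptions made for this example.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the Thomas algorithm.

    Row i reads: lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i].
    No pivoting is done, so the matrix should be, for example,
    diagonally dominant.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward sweep: eliminate the sub-diagonal.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Backward substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x

def solve_batch(systems):
    """Solve many independent systems; each one could map to its own thread."""
    return [thomas_solve(*system) for system in systems]

# A 4x4 system with 2 on the diagonal and -1 next to it.
lower = [0.0, -1.0, -1.0, -1.0]
diag = [2.0, 2.0, 2.0, 2.0]
upper = [-1.0, -1.0, -1.0, 0.0]
print(thomas_solve(lower, diag, upper, [0.0, 0.0, 0.0, 5.0]))
```

Each system is solved in O(n) operations but the sweeps are inherently sequential, which is why batches of independent systems, or algorithms such as Parallel Cyclic Reduction, are used to expose parallelism.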
12:30 Lunch break
Lunch break
12:30 - 14:00
14:00 Implementing Hierarchical Bayesian Networks on the GPU - László Dobos (Wigner FK)
Implementing Hierarchical Bayesian Networks on the GPU
- László Dobos (Wigner FK)
14:00 - 14:30
Designing spectroscopic follow-up observations in astronomy poses several challenges. Observing spectra is significantly more time-consuming than photometric imaging, yet interesting objects need to be selected based on images taken with only a few broad-band filters. Hierarchical Bayesian Networks are often used to estimate physical parameters of photometrically observed stars, a prerequisite to successful spectroscopic targeting. We present an implementation of a novel Bayesian model which can be used to fit parameters of mixtures of stellar populations to derive physical parameters as well as population membership probabilities for each star. Since the Adaptive Monte Carlo method used to integrate the model is computationally expensive, we rely heavily on GPUs.
14:30 AI application in stellar spectroscopy - Viska Wei
AI application in stellar spectroscopy
- Viska Wei
14:30 - 15:00
Artificial Neural Networks have been applied in many fields of science and are particularly successful in image processing. Here we outline the challenges of Deep Learning in stellar spectroscopy, since stellar spectra are fundamentally different from images. Although only one-dimensional, spectra show no translation invariance and important features appear on all scales: While the surface temperature of a star can be told either from the overall shape of the spectrum or the strengths of certain easily detectable absorption lines, other physical parameters, such as chemical element abundances, are encoded in many small features scattered at a multitude of wavelengths. We also consider applications other than physical parameter inference: denoising and normalization with autoencoders and surrogate modelling with generative networks.
15:00 Accelerating the solution of large number of delay differential equations with GPUs - Dániel Nagy (Budapest University of Technology and Economics, Department of Hydrodynamic Systems)
Accelerating the solution of large number of delay differential equations with GPUs
- Dániel Nagy (Budapest University of Technology and Economics, Department of Hydrodynamic Systems)
15:00 - 15:20
Delay differential equations (DDE) appear in several branches of science and engineering. Possible applications include the modelling and forecasting of epidemics, the modelling of stability loss in control systems and many more. The delays in the differential equations can be caused by the incubation time of a virus in epidemic models, or by the time the computer needs to carry out the necessary calculation in the case of computer control. The numerical solution of problems described by DDEs is unavoidable in several cases, and sometimes a large number of instances of the same delay differential equation must be solved due to the many possible parameter combinations or initial conditions. **The serial solution of millions of equations is usually not viable**, thus some kind of parallelization is necessary; however, **general purpose DDE solvers for GPUs do not exist at present**. Compared to ordinary differential equation (ODE) solvers, DDE solvers use values from the past because of the delay in the equation; thus, every timestep must be saved. However, it cannot be guaranteed that the required previous time instance is available in the global memory. Therefore, interpolation between the past values is necessary to maintain the order of the numerical method used. GPUs have been found to be extremely efficient for the numerical solution of large numbers of ODEs. **The most efficient method for parallelizing the numerical solution is assigning each equation to one thread** (also called the *per-thread* approach). In the present work, the same strategy is applied to the acceleration of DDE solvers. The traditional 4th order fixed-timestep explicit Runge-Kutta (ERK) method is implemented and extended with 3rd order Hermite interpolation. The code is written in CUDA C++. This solver can be applied to a wide range of problems on most GPUs.
However, an efficient implementation is difficult since it requires the intensive use of global memory, while the goal is to reach the maximum possible FLOP efficiency. The bottleneck of such problems is the memory bandwidth; thus, finding an optimal memory structure is necessary. Furthermore, the global memory size can also be a limiting factor, as each thread saves thousands of states in the global memory. This can lead to several gigabytes of global memory usage, limiting the total number of resident threads and throttling the performance. In the study, the basic idea of the solver is shown. **An efficient memory structure is proposed and tested, with an aligned and coalesced memory access pattern**. To minimise the global memory usage, each thread works on a fixed-size array; when this array fills up, the older saved timesteps are overwritten in a circular fashion. Problem specific codes with the previously described structure are tested on several simple test cases, and the FLOP efficiency, memory load/store efficiency and further important metrics are measured. **It is found that 40% FLOP efficiency can be reached with 95% memory efficiency.** The FLOP efficiency of the problem specific codes can be regarded as the maximal possible efficiency of my proposed solver. The implementation of a general purpose DDE solver in the already existing GPU ODE solver called MPGOS is presented, and the metrics of the implementation are measured. A general solver must work for every possible problem, which requires extra computations and memory operations; consequently, its efficiency in my case reaches only half of that of the problem specific solvers. Finally, the limitations of a fixed-timestep ERK method are presented, and the possibilities of adaptive ERK solvers for GPUs are discussed.
However, for adaptive ERK methods, aligned and coalesced memory access patterns are not possible with the *per-thread* approach, thus a heterogeneous CPU-GPU solver is proposed which may be a viable solution for later implementations.
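The combination described above (fixed-step 4th order Runge-Kutta, cubic Hermite interpolation of past values, and a fixed-size history array overwritten in a circular fashion) can be sketched on the CPU for a single scalar equation. This is an illustration under stated assumptions, not the MPGOS code: the function name `solve_dde_rk4`, the single constant delay `tau`, and the buffer size of `tau/dt + 2` slots are choices made for this example, and it assumes `tau >= dt`.

```python
def solve_dde_rk4(f, history, y0, tau, dt, n_steps):
    """Integrate y'(t) = f(t, y(t), y(t - tau)) from t = 0 with fixed-step RK4.

    history(t) supplies y(t) for t <= 0. Past states are kept in a circular
    buffer and delayed values are read by cubic Hermite interpolation.
    """
    if tau < dt:
        raise ValueError("this sketch assumes tau >= dt")
    capacity = int(round(tau / dt)) + 2   # slots kept per equation
    ys = [0.0] * capacity                 # saved states y_k
    fs = [0.0] * capacity                 # saved derivatives f_k

    def delayed(td, n):
        """Approximate y(td) using data saved up to step n."""
        if td <= 0.0:
            return history(td)
        k = max(0, min(int(td / dt), n - 1))   # interval [t_k, t_k + dt]
        s = (td - k * dt) / dt
        a, b = k % capacity, (k + 1) % capacity
        h00 = 2 * s**3 - 3 * s**2 + 1
        h10 = s**3 - 2 * s**2 + s
        h01 = -2 * s**3 + 3 * s**2
        h11 = s**3 - s**2
        return h00 * ys[a] + h10 * dt * fs[a] + h01 * ys[b] + h11 * dt * fs[b]

    y = y0
    for n in range(n_steps):
        t = n * dt
        k1 = f(t, y, delayed(t - tau, n))
        # Save the current step; this overwrites the oldest slot.
        ys[n % capacity] = y
        fs[n % capacity] = k1
        k2 = f(t + dt / 2, y + dt / 2 * k1, delayed(t + dt / 2 - tau, n))
        k3 = f(t + dt / 2, y + dt / 2 * k2, delayed(t + dt / 2 - tau, n))
        k4 = f(t + dt, y + dt * k3, delayed(t + dt - tau, n))
        y += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

# Test problem: y'(t) = -y(t - 1), with y(t) = 1 for t <= 0.
y_end = solve_dde_rk4(lambda t, y, yd: -yd, lambda t: 1.0, 1.0, 1.0, 0.01, 200)
print(y_end)  # value at t = 2; the method of steps gives y(2) = -1/2
```

In a per-thread GPU version every thread would own one such buffer, so the order in which the buffers are interleaved in global memory decides whether neighbouring threads read neighbouring addresses.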
15:20 Mixed precision: when is it worth it? - Bálint Siklósi (Pázmány Péter Catholic University - Hungary)
Mixed precision: when is it worth it?
- Bálint Siklósi (Pázmány Péter Catholic University - Hungary)
15:20 - 15:40
Mixing different precisions of floating point arithmetic and number representations may be a highly effective tool to tackle some of the main challenges of exascale computing. By lowering precision, we can reduce memory and network traffic, decrease the memory footprint, achieve more floating point operations per second by using less time to compute the same operations, and reduce energy consumption. Using recently introduced hardware features, the benefit can become even larger. NVIDIA Tensor Cores provide 2.5X speed-up in HPC by enabling mixed-precision computing, but they can also provide 10X speed-up in AI training with their 32-bit and 16-bit Tensor Float support. Using FPGAs with half precision, the advantage is further increased, since the operating area may decrease as well, and the frequency of the device may increase. On the flip side, changing the representation also degrades accuracy, so mixed representation can only be used with careful consideration, making it even more difficult to apply automatically. In 2017, a group of NVIDIA researchers published a study detailing how to reduce the memory requirements for neural network training with a technique called Mixed Precision Training. Weights, activations, and gradients are stored in IEEE FP16 format, but in order to match the accuracy of the FP32 networks, FP32 master copies of the weights are maintained. During one training step, the forward and backward passes are calculated using FP16 arithmetic, while the optimizer step and weight update are calculated using FP32 arithmetic. To avoid underflow of the gradient, they also introduce a loss scaling scheme, whereby the loss and therefore the gradient is scaled up by a constant factor. The GPUMixer (best paper at ISC 2019) is a performance-driven automatic tuner for GPU kernels. It uses static analysis to find a set of operations (FISet) to execute in lower precision, while data entering and leaving those sets are in high precision.
They try to maximize the ratio of low-precision arithmetic operations to type-casting operations to achieve better performance. They also apply "shadow" execution to determine the error and maintain a prescribed error bound. In our work, we want to achieve a similar automatic mixed-precision execution on unstructured mesh computations, using the OP2 domain specific language. The advantage of this system is that we can exploit further domain knowledge instead of focusing on an individual kernel. If we find a variable which acts like an accumulator, then we should keep it in higher precision. If we find one that stores only differences, then we can lower its precision. As an example, we measured mixed-precision execution on the Airfoil application (an industrially representative CFD code, a finite volume simulation that solves the 2D Euler equations): using two NVIDIA V100 GPUs the speed-up is 1.11X (using all FP32 it would be 1.44X), and using 64 Intel Xeon processors the speed-up is 1.13X (using all FP32 it would be 1.76X).
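Two of the effects described in this abstract, gradient underflow in FP16 and the need for a higher-precision master copy, can be reproduced with Python's standard `struct` module, which supports IEEE half precision through the `'e'` format. This is only a numerical illustration with made-up values (a gradient of 1e-8, a loss scale of 1024, a weight update of 1e-4); it is not code from the Mixed Precision Training study, GPUMixer or OP2, and the helper names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def recover_gradient(true_gradient, loss_scale):
    """Store a scaled gradient in FP16, then unscale it in higher precision."""
    stored = to_fp16(true_gradient * loss_scale)
    return stored / loss_scale

small_gradient = 1e-8

# Without scaling the gradient underflows: it is below half of the
# smallest FP16 subnormal (2**-24, about 6e-8).
print(to_fp16(small_gradient))

# With a constant loss scale the value survives the FP16 round trip.
print(recover_gradient(small_gradient, 1024.0))

# A small update is lost when added to an FP16 weight near 1.0, because
# the FP16 spacing there is 2**-10, but an FP32 master copy keeps it.
print(to_fp16(1.0 + 1e-4), to_fp32(1.0 + 1e-4))
```

The last line also illustrates the accumulator heuristic from the abstract: a variable that gathers many small contributions is the one that benefits from staying in higher precision.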
15:40 Coffee break
Coffee break
15:40 - 16:00
16:00 Laboratory observation of water surface polygon vortices - Adam Kadlecsik
Laboratory observation of water surface polygon vortices
- Adam Kadlecsik
16:00 - 16:20
It is well known that when a filled bucket is rotated around its axis, the water surface takes on a paraboloid shape. A less trivial case is when only the bottom of the bucket rotates and the walls are stationary. In this case, a velocity shear emerges between the liquid near the rotating bottom and the stationary wall, creating rotating polygon-like shapes. We reproduce this phenomenon and build a physical understanding of it.
16:20 Hydrolysis of N,N-dimethylindole-3-ethaniminium cation, the oxidized form of the endogenous psychedelic N,N-dimethyltryptamine - Károly Kubicskó (ELTE Faculty of Science, Institute of Chemistry)
Hydrolysis of N,N-dimethylindole-3-ethaniminium cation, the oxidized form of the endogenous psychedelic N,N-dimethyltryptamine
- Károly Kubicskó (ELTE Faculty of Science, Institute of Chemistry)
16:20 - 16:40
Monoamine oxidase (MAO) is a flavoenzyme which performs the oxidation of monoamine neurotransmitters such as serotonin, dopamine and norepinephrine, and of their structurally related neuromodulator compounds, usually called "trace amines" (TAs) in reference to their lower concentration compared to the main neurotransmitters. The latter group includes tryptamine (T) and phenylethylamine (PEA) as well as their derivatives. They did not receive much scientific interest before the discovery of the G protein coupled human trace amine associated receptors (TAARs). Irregularities in TA levels have been linked to numerous mental disorders such as schizophrenia, major depression, bipolar disorder, anxiety, attention deficit hyperactivity disorder (ADHD), and substance abuse disorders. MAO has two isoforms, MAO-A and MAO-B. Their primary structures (sequences of amino acids) share around 70% identity, but their distribution in tissues and their selectivity for substrates are different. MAOs have a crucial role in the breakdown/inactivation of monoamine compounds in the body; therefore, they are responsible for regulating the levels of these compounds. A compound belonging to the TA group, N,N-dimethyltryptamine (DMT), is a naturally occurring serotonergic indole alkaloid which has profound psychedelic (mind-altering) effects on the human psyche. Lately, it has been discovered that DMT is a natural ligand of sigma-1 receptors and that it has an important role in tissue protection, regeneration, and immunity. In vitro experiments revealed that DMT shows potent protective effects against hypoxia. We have investigated the metabolism of DMT by the monoamine oxidase A enzyme using multilayer QM:MM quantum chemical calculations. MAO converts DMT into a positively charged iminium ion form, namely the N,N-dimethylindole-3-ethaniminium cation (imDMT$^+$).
In order to examine the metabolism of endogenous DMT further, we decided to study in detail the hydrolysis of imDMT$^+$, which yields indole-3-acetaldehyde (IAL) and dimethylamine. Three different systems (or reaction paths) were examined, which include the imDMT$^+$ cation and one OH$^-$ ion with zero ($R_0$), one ($R_1$), and two ($R_2$) H$_2$O molecules, respectively. The largest system, containing two H$_2$O molecules, is shown in Figure 1. Our results demonstrate that the presence of water molecule(s) opens the possibility for an intermolecular proton transfer in the third step of the reaction (Figure 2) and dramatically reduces the corresponding barriers ($R_1$, $R_2$) compared to the intramolecular ($R_0$) case.
16:40 Parallel proton CT image reconstruction - Akos Sudar (MTA Wigner FK)
Parallel proton CT image reconstruction
- Akos Sudar (MTA Wigner FK)
16:40 - 17:00
Modern proton Computed Tomography (pCT) images are usually reconstructed by algebraic reconstruction techniques (ART). The Kaczmarz method and its variations, which are iterative solution techniques for linear problems with sparse matrices, are among the most widely used. One can ask whether statistically motivated iterations, which have been successfully used for emission tomography, can be applied to reconstruct pCT images as well. In my research, I developed a method based on Richardson–Lucy deconvolution, a statistically motivated fixed-point iteration. I implemented this algorithm in a parallel GPU code, with spline-based trajectory calculation and on-the-fly system matrix generation. My results show that the method works well and can be successfully applied in pCT applications.
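As background for the fixed-point iteration mentioned above, here is a minimal sketch of the generic Richardson–Lucy multiplicative update for a linear problem A x = b with non-negative entries (the same update is known as MLEM in emission tomography). It uses a tiny dense matrix in pure Python; the speaker's pCT-specific method, spline trajectories and on-the-fly system matrix are not reproduced, and the function name and example values are assumptions.

```python
def richardson_lucy(matrix, measured, n_iter):
    """Iterate x_j <- x_j / s_j * sum_i A_ij * b_i / (A x)_i, with s_j = sum_i A_ij.

    matrix is a list of rows with non-negative entries and measured holds
    the positive data b. The iterate stays positive if it starts positive.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    x = [1.0] * n_cols                      # flat positive initial guess

    for _ in range(n_iter):
        # Forward projection: what the current estimate would measure.
        projected = [sum(matrix[i][j] * x[j] for j in range(n_cols))
                     for i in range(n_rows)]
        ratio = [measured[i] / projected[i] for i in range(n_rows)]
        # Back-project the ratios and apply the multiplicative correction.
        x = [x[j] / col_sums[j] * sum(matrix[i][j] * ratio[i] for i in range(n_rows))
             for j in range(n_cols)]
    return x

# Consistent toy problem: the data were generated from x = [2, 1].
A = [[1.0, 0.5],
     [0.5, 1.0]]
b = [2.5, 2.0]
print(richardson_lucy(A, b, 500))
```

Both the forward projection and the back-projection are sums over independent rows or columns, which is what makes this kind of iteration a natural fit for GPUs.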
17:00 The challenges and methods of tuning the HIJING++ Monte Carlo event generator - Balázs Majoros
The challenges and methods of tuning the HIJING++ Monte Carlo event generator
- Balázs Majoros
17:00 - 17:20
17:20 AlphaFold2 transmembrane protein structure prediction shines - Tamas Hegedus (Semmelweis University)
AlphaFold2 transmembrane protein structure prediction shines
- Tamas Hegedus (Semmelweis University)
17:20 - 18:00
Transmembrane (TM) proteins are major drug targets, as indicated by the high percentage of prescription drugs acting on them. For rational drug design and an understanding of mutational effects on protein function, structural data at atomic resolution are required. However, hydrophobic TM proteins often resist experimental structure determination, and in spite of the increasing number of cryo-EM structures, the available TM folds are still limited in the Protein Data Bank. Recently, DeepMind’s AlphaFold2 machine learning method greatly expanded the structural coverage of sequences, with high accuracy. Since the employed algorithm did not take specific properties of TM proteins into account, the validity of the generated TM structures should be assessed. Therefore, we investigated the quality of structures at genome scales, at the level of ABC protein superfamily folds, and also in specific individual cases. We also tested template-free structure prediction with a new TM fold, dimer modeling, and stability in molecular dynamics simulations. Our results strongly suggest that AlphaFold2 performs astoundingly well in the case of TM proteins and that its neural network is not overfitted. We conclude that a careful application of its structural models will advance TM protein associated studies to an unexpected level. URL: http://alphafold.hegelab.org Acknowledgements: Cystic Fibrosis Foundation: HEGEDU20I0 and NRDIO: K127961 (TH); CCF LUKACS20G0, CIHR, CFI and Canada Research Chair Program (GLL); Swiss National Funds 310030_197563 (MG). Thanks to https://hpc.kifu.hu, https://www.mpibpc.mpg.de/grubmueller, http://gpu.wigner.mta.hu.
Thursday, 11 November 2021
09:00 Social biases in AI - Balázs Keszthelyi (TechnoLynx Ltd.)
Social biases in AI
- Balázs Keszthelyi (TechnoLynx Ltd.)
09:00 - 09:40
Fairness in AI is constantly evolving from a regulatory point of view, but the need for attention to this topic has become painfully clear after incidents in the past few years. In our presentation we will summarize examples of gender and racial bias in AI systems and touch upon the latest regulatory trends, including ALTAI. We will discuss some best practices of bias mitigation, as well as the challenges of detection and clear definition.
09:40 20 Years of Static Dataflow - Oskar Mencer
20 Years of Static Dataflow
- Oskar Mencer
09:40 - 10:20
10:20 Boson sampling simulation enhanced by FPGA based data-flow engines - Peter Rakyta (Department of Physics of Complex Systems, Eötvös Loránd University)
Boson sampling simulation enhanced by FPGA based data-flow engines
- Peter Rakyta (Department of Physics of Complex Systems, Eötvös Loránd University)
10:20 - 10:40
As was shown by the pioneering work of Scott Aaronson and Alex Arkhipov, bosonic systems are promising candidates to demonstrate quantum advantage. Due to the nature of quantum states describing indistinguishable bosons, the exact simulation of particle number resolved bosonic systems is computationally very hard. One of the main objectives of the Laboratory of Quantum Computer Simulators in Budapest (launched in collaboration between the Department of Physics of Complex Systems and the Department of Programming Languages and Compilers of the Eötvös Loránd University and the Department for Computational Sciences of the Wigner Research Centre for Physics) is to develop new methods to make the simulation of these systems more efficient. According to our recent experience, FPGA based data-flow engines (DFEs) seem to be promising architectures to enhance the simulation of bosonic systems on classical hardware. On platforms supporting the data-flow programming model, one has instant access to data generated during the computational process without the overhead of passing the data between the memory and the central processing units (CPUs). In particular, we argue that DFEs are suitable to evaluate, with high precision, matrix functions associated with the simulation of different variants of Boson Sampling. One such special matrix function is the permanent of a square matrix. In the talk I will present our DFE implementation to calculate the permanent of a unitary matrix describing a bosonic quantum interferometer using 128 bit fixed point arithmetic. We provide a benchmark of our implementation to calculate the permanent of a matrix up to a size of 28x28 on a single FPGA chip, and up to a matrix size of 40x40 on a dual FPGA chip configuration. Our results outperform previous benchmarks of permanent calculation both in performance and in numerical precision. We incorporated our DFE permanent calculator into the Piquasso bosonic quantum computer simulator.
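To make the matrix permanent concrete, the sketch below evaluates it with Ryser's inclusion-exclusion formula, perm(A) = (−1)^n Σ_S (−1)^{|S|} Π_i Σ_{j∈S} a_ij, where S runs over the non-empty subsets of columns. This is a plain reference version chosen for readability; it is not the DFE implementation from the talk, and it omits the usual Gray-code ordering that reuses row sums between subsets. The function name is an assumption.

```python
def permanent_ryser(matrix):
    """Permanent of a square matrix via Ryser's formula, in O(2^n * n^2) time.

    Entries may be ints, floats or complex numbers. With Python ints the
    result is exact, since Python integers have arbitrary precision.
    """
    n = len(matrix)
    if n == 0:
        return 1
    total = 0
    for mask in range(1, 1 << n):           # every non-empty column subset
        product = 1
        for row in matrix:
            product *= sum(row[j] for j in range(n) if (mask >> j) & 1)
        subset_size = bin(mask).count("1")
        sign = -1 if (n - subset_size) % 2 else 1
        total += sign * product
    return total

# 2x2 check by hand: perm([[1, 2], [3, 4]]) = 1*4 + 2*3.
print(permanent_ryser([[1, 2], [3, 4]]))
```

The cost doubles with every added row and column, which is why matrix sizes such as 28x28 and 40x40 are meaningful benchmark figures for dedicated hardware.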
10:40 Coffee Break
Coffee Break
10:40 - 11:00
11:00 CERN Quantum Technology Initiative unveils strategic roadmap shaping CERN’s role in next quantum revolution - Michele Grossi
CERN Quantum Technology Initiative unveils strategic roadmap shaping CERN’s role in next quantum revolution
- Michele Grossi
11:00 - 11:40
11:40 Application of Machine Learning tools in heavy-ion collisions at the Large Hadron Collider - Neelkamal Mallick (IIT Indore)
Application of Machine Learning tools in heavy-ion collisions at the Large Hadron Collider
- Neelkamal Mallick (IIT Indore)
11:40 - 12:00
12:00 Introduction to photonic quantum machine learning - Dániel Nagy (Wigner Research Centre for Physics)
Introduction to photonic quantum machine learning
- Dániel Nagy (Wigner Research Centre for Physics)
12:00 - 12:30
Possibly the most influential achievements of modern computer science are the inventions of different machine learning algorithms, especially deep neural networks, which were able to solve problems that were previously intractable for computers, for example recognizing different animals in photos. At the same time, the last few decades have seen enormous progress in quantum computing, especially in quantum hardware development. Combining classical machine learning methods with the power of quantum computing gives rise to a new field called quantum machine learning. We present a few quantum machine learning algorithms which use the continuous-variable paradigm of quantum computing.
12:30 Lunch
Lunch
12:30 - 14:00