The University of Manchester
Thanos Stratikopoulos, Juan Fumero, Christos Kotselidis
Heterogeneous hardware has become the norm in modern computer systems. Hardware accelerators, such as GPUs and FPGAs, are now commonly integrated as co-processors alongside the main CPUs, with the aim of balancing the trade-off between high performance and low power consumption. The relative advantages of GPUs and FPGAs have been studied in the literature, and the findings show that each type of accelerator suits different types of workloads [1, 2, 3]. For example, GPUs are well suited to high-throughput computing due to their fine-grain parallelism, while FPGAs excel in coarse-grain parallelism and pipelining.
As a result, the characteristics of an application (or of parts of an application) drive the decision of “which device is the better candidate for execution”. However, let’s take a step back and assume that this decision has been made and that there is a winning device among the available co-processors. The next challenging question is “how does the developer of the application write the source code to obtain hardware acceleration and orchestrate the execution?”. To address this question, several programming models (OpenCL, CUDA, OpenACC, oneAPI) have been introduced in recent years to help programmers by abstracting away hardware details.
Why heterogeneous hardware execution in Java?
These programming models are supported mostly in languages such as C, C++, and Python, and they come at the expense of programmability, since they require special coding skills and can fragment the programming language. This hinders the adoption of heterogeneous hardware in languages such as Java, which executes within a managed runtime system (i.e., the Java Virtual Machine (JVM)). To this end, TornadoVM is a technology that Java developers can use to write applications that execute on heterogeneous hardware, without having to write OpenCL or CUDA code. TornadoVM is a JVM plugin that automatically compiles Java methods to OpenCL C and PTX code, and it can target multiple types of accelerators, such as multi-core CPUs, GPUs (integrated and discrete) and FPGAs.
This ability to automatically compile Java code for heterogeneous hardware makes TornadoVM a candidate technology for accelerating Big Data frameworks (e.g., Apache Flink) and IoT frameworks (e.g., NebulaStream). These frameworks can generate highly parallel computations because of the commonly used “map-reduce” programming model, yet they currently do not harness hardware acceleration. The alternative to dynamic compilation is native GPU support (i.e., directly calling a CUDA kernel), which was introduced in Apache Spark 3.0 for accelerating machine learning and deep learning applications.
How does TornadoVM express loop parallelism?
All TornadoVM releases up to and including v0.8 have enabled Java developers to express loop parallelism for heterogeneous hardware. Java developers use the @Parallel annotation to tell the TornadoVM compiler that a loop can execute in parallel on multiple threads or compute units. TornadoVM also compiles multiple nested loops annotated as parallel, as long as the nesting depth does not exceed the number of thread dimensions supported by the hardware (typically at most three for GPUs). Figure 1 presents a code snippet of an annotated method that multiplies two square matrices by using the @Parallel annotation and two parallel hardware dimensions.
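The loop-parallel style can be sketched roughly as follows. This is a minimal illustration and not the code from Figure 1. The @Parallel annotation declared here is a local stand-in, so the example compiles and runs sequentially on a plain JVM; in a TornadoVM project the annotation comes from the TornadoVM API instead. The class name LoopParallelMatMul and the flat row-major array layout are assumptions made for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

// Stand-in for TornadoVM's @Parallel so that this sketch compiles on a plain JVM.
// It has no effect here; with TornadoVM, the real annotation marks loops that the
// compiler may map to parallel hardware threads.
@Target(ElementType.LOCAL_VARIABLE)
@interface Parallel {
}

// Hypothetical class name, chosen for this example only.
final class LoopParallelMatMul {

    // Multiplies two size x size matrices stored as row-major flat arrays.
    // The two annotated loops correspond to the two parallel hardware
    // dimensions; the innermost reduction loop stays sequential.
    static void matrixMultiplication(float[] a, float[] b, float[] c, int size) {
        for (@Parallel int i = 0; i < size; i++) {
            for (@Parallel int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += a[i * size + k] * b[k * size + j];
                }
                c[i * size + j] = sum;
            }
        }
    }
}
```

Each (i, j) iteration writes a distinct element of c and reads no value written by another iteration, which is the property that makes both outer loops safe to annotate.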
What is new in TornadoVM?
The latest version of TornadoVM (v0.9) was released in April 2021 and extends the API with a set of features that help developers express kernel parallelism, a term familiar to OpenCL and CUDA programmers. The new features give Java developers more control over the heterogeneous hardware. Java developers with OpenCL or CUDA experience can use them to express, in Java, complex algorithmic constructs such as stencil computations, histograms and scan operations, which can run significantly faster on GPUs. Figure 2 presents a code snippet of a method that multiplies two square matrices by using the globalIdx and globalIdy variables of KernelContext to represent two parallel hardware dimensions. This code is functionally equivalent to the annotated code illustrated in Figure 1 and achieves the same level of performance.
In fact, this implementation can be extended to use local memory in combination with loop tiling, in order to exploit data locality and increase performance. Figure 3 presents a code snippet of an implementation that follows the OpenCL implementation described at https://github.com/cnugteren/myGEMM. More information about exploiting kernel parallelism via TornadoVM is available here.
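The kernel-parallel style can be sketched roughly as follows. This is an illustration under stated assumptions and not the code from Figure 2. The KernelContext class below is a hypothetical stand-in that mimics only the two fields named in the text (globalIdx and globalIdy), and the launch method emulates a two-dimensional grid sequentially so that the example runs on a plain JVM. With TornadoVM, the real KernelContext and the runtime take over both roles.

```java
// Hypothetical stand-in (NOT the TornadoVM class): it exposes only the two
// thread-index fields mentioned in the text.
final class KernelContext {
    int globalIdx; // index of the current thread in the first dimension
    int globalIdy; // index of the current thread in the second dimension
}

// Hypothetical class name, chosen for this example only.
final class KernelParallelMatMul {

    // Kernel body: one invocation computes exactly one element of c.
    // There are no loops over rows and columns; the thread indices replace them.
    static void matrixMultiplication(KernelContext context, float[] a, float[] b, float[] c, int size) {
        int idx = context.globalIdx;
        int idy = context.globalIdy;
        float sum = 0.0f;
        for (int k = 0; k < size; k++) {
            sum += a[idx * size + k] * b[k * size + idy];
        }
        c[idx * size + idy] = sum;
    }

    // Sequential emulation of a size x size launch grid. On an accelerator,
    // these invocations would run concurrently rather than one after another.
    static void launch(float[] a, float[] b, float[] c, int size) {
        KernelContext context = new KernelContext();
        for (int x = 0; x < size; x++) {
            for (int y = 0; y < size; y++) {
                context.globalIdx = x;
                context.globalIdy = y;
                matrixMultiplication(context, a, b, c, size);
            }
        }
    }
}
```

Compared with the loop-parallel style, the two outer loops disappear from the kernel itself: the iteration space is described by the launch grid, and the kernel describes the work of a single thread.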
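The idea behind combining local memory with loop tiling can be sketched in plain Java as follows. This is a sequential emulation under stated assumptions, not the code from Figure 3 and not the TornadoVM API: the arrays tileA and tileB play the role of per-work-group local memory, the loops over row and col stand in for the threads of one work-group, and the class name TiledMatMul and the tile parameter are chosen for the example only.

```java
import java.util.Arrays;

// Hypothetical class name, chosen for this example only.
final class TiledMatMul {

    // Multiplies two size x size row-major matrices tile by tile.
    // 'tile' is the edge length of a square tile (one emulated work-group).
    static void matrixMultiplication(float[] a, float[] b, float[] c, int size, int tile) {
        if (tile <= 0 || size % tile != 0) {
            throw new IllegalArgumentException("size must be a positive multiple of tile");
        }
        float[] tileA = new float[tile * tile]; // emulates local memory for a tile of a
        float[] tileB = new float[tile * tile]; // emulates local memory for a tile of b
        float[] acc = new float[tile * tile];   // one accumulator per emulated thread

        for (int groupRow = 0; groupRow < size; groupRow += tile) {
            for (int groupCol = 0; groupCol < size; groupCol += tile) {
                Arrays.fill(acc, 0.0f);
                for (int t = 0; t < size; t += tile) {
                    // Phase 1: each emulated thread copies one element of a and b
                    // into the tiles. A GPU kernel needs a work-group barrier after
                    // this phase; the sequential emulation does not.
                    for (int row = 0; row < tile; row++) {
                        for (int col = 0; col < tile; col++) {
                            tileA[row * tile + col] = a[(groupRow + row) * size + (t + col)];
                            tileB[row * tile + col] = b[(t + row) * size + (groupCol + col)];
                        }
                    }
                    // Phase 2: each emulated thread accumulates a partial dot product
                    // by reading only from the tiles, not from the full matrices.
                    for (int row = 0; row < tile; row++) {
                        for (int col = 0; col < tile; col++) {
                            float sum = acc[row * tile + col];
                            for (int k = 0; k < tile; k++) {
                                sum += tileA[row * tile + k] * tileB[k * tile + col];
                            }
                            acc[row * tile + col] = sum;
                        }
                    }
                }
                // Write the finished tile of results back to c.
                for (int row = 0; row < tile; row++) {
                    for (int col = 0; col < tile; col++) {
                        c[(groupRow + row) * size + (groupCol + col)] = acc[row * tile + col];
                    }
                }
            }
        }
    }
}
```

The benefit on a GPU comes from reuse: every element loaded into a tile is read tile times in phase 2, so the number of accesses to the slower global memory drops by roughly that factor.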
Additionally, the new API of kernel parallelism is complementary to the previous API of loop parallelism, and they can be combined in the same program. Some examples are given here.
What about performance?
We will reuse the previous example of matrix multiplication on a GPU to compare the performance of the original Java implementation against two TornadoVM implementations:
(a) TornadoVM v0.8, which uses the @Parallel annotation to indicate that the two nested loops can be mapped to two parallel dimensions on the GPU; and
(b) TornadoVM_LM_LT, an implementation that uses local memory in combination with loop tiling to express kernel parallelism. This implementation follows the algorithm described here.
The following plot illustrates the performance improvement for the multiplication of two 2048x2048 matrices. As shown, both TornadoVM implementations reduce the overall execution time from 35 seconds to hundreds of milliseconds. In particular, the TornadoVM v0.8 implementation is 66x faster than the Java implementation, while TornadoVM_LM_LT is 153x faster. Both implementations were executed on a GeForce GTX 1050 Ti GPU, and the JVM is OpenJDK 8 with JVMCI support.
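As a quick sanity check of these figures, the execution times implied by the reported speedups can be computed directly. The sketch below assumes the Java baseline is exactly 35 seconds, whereas the text reports a rounded value, so the results are estimates only. The class and method names are chosen for the example.

```java
// Hypothetical helper, chosen for this example only.
final class SpeedupArithmetic {

    // Execution time implied by a baseline time and a speedup factor.
    static double acceleratedMillis(double baselineMillis, double speedup) {
        return baselineMillis / speedup;
    }

    public static void main(String[] args) {
        double javaMillis = 35_000.0; // assumed: the rounded 35 s reported in the text
        System.out.printf("@Parallel version:       ~%.0f ms%n", acceleratedMillis(javaMillis, 66.0));
        System.out.printf("Local memory and tiling: ~%.0f ms%n", acceleratedMillis(javaMillis, 153.0));
    }
}
```

Under that assumption the two TornadoVM versions take roughly 530 ms and 229 ms, which matches the stated range of hundreds of milliseconds, and the tiled version is about 2.3x faster than the annotated one (153 / 66).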
Role in the ELEGANT project?
TornadoVM will be the key technology asset for providing "acceleration-as-a-service" across the whole ELEGANT infrastructure (from IoT/Edge to Cloud). In this role, TornadoVM will be used as a compiler plugin that emits heterogeneous code, which can be executed even outside the JVM.
[1] S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach, "Accelerating Compute-Intensive Applications with GPUs and FPGAs," 2008 Symposium on Application Specific Processors, 2008, pp. 101-107. https://doi.org/10.1109/SASP.2008.4570793
[2] S. Asano, T. Maruyama and Y. Yamaguchi, "Performance comparison of FPGA, GPU and CPU in image processing," 2009 International Conference on Field Programmable Logic and Applications, 2009, pp. 126-131. https://doi.org/10.1109/FPL.2009.5272532
[3] D. H. Jones, A. Powell, C. Bouganis and P. Y. K. Cheung, "GPU Versus FPGA for High Productivity Computing," 2010 International Conference on Field Programmable Logic and Applications, 2010, pp. 119-124. https://doi.org/10.1109/FPL.2010.32
[4] J. E. Stone, D. Gohara and G. Shi, "OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems," Computing in Science & Engineering, vol. 12, no. 3, 2010, pp. 66-73. https://doi.org/10.1109/MCSE.2010.69
[5] S. Cook, CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs, 1st ed. Morgan Kaufmann Publishers Inc., 2012.