Jul 23, 2020 · 7 min read
It was 2003 and, in the computer engineering arena, making computer programs run faster was still mostly driven by using more performant generic processors (Central Processing Units, or CPUs). In simple terms, computer programs are made of instructions which, up until then, were mainly executed sequentially in cycles of a processor, so the higher the processor clock frequency (more cycles per unit of time), the faster the program would run.
But a new wave was about to break … as the processor clock frequency in a microchip kept increasing, dealing with variables such as power consumption and heat generation became an unmanageable task in the design of those ultra-performant chips. Some chip manufacturers, like Intel, facing some of those challenges, decided to rethink the hardware approach, moving from a single processing unit (single-core) to multiple processing units (multiple cores) in the same chip.
The following chart, from the computational scientist Karl Rupp of TU Wien, illustrates that shift by merging data on several microprocessor parameters over more than 40 years.
As you can see, clock frequency (in green) has reached a plateau, while the number of logical cores (in black) has been increasing. So, the new approach was no longer to increase the “power” of a single processor to perform more cycles in a given timeframe (thus executing more sequential instructions), but rather to split instructions (workloads) across different processing units working in parallel.
This was definitely a move that the industry fully adopted, with chip manufacturers progressively deploying HW architectures with an increasing number of processing units (cores) and of different types (from generic CPUs to specialized units like Graphics Processing Units, or GPUs). Quoting Karl Rupp: “… if you want to benefit from future processors over what you have now, make sure to have parallel workloads.”
And here we are now, in June 2020, with highly available distributed computing architectures (with multiple cores and of different types) deployed in both huge high performance computing centers (like the world’s top 500 supercomputers) and in portable embedded systems. One could think computer scientists are already taking full advantage of those architectures… But that’s not really the case! Well, a first layer has already been working for some time through the computer Operating System, which is able to manage concurrent applications: we can listen to music through a webapp and, simultaneously, work on our favorite productivity tool. But the market seems to be asking for much more in terms of the runtime and performance of applications, with a few examples being:
- A new class of very complex computer systems, with real-time decision windows, is being demanded by the market, with the preeminent case being autonomous driving systems, in which a decision is required in real-time after ingesting, merging and processing several data inputs. Very few systems on the market are able to fully meet all of those requirements, suggesting that a lot more is required in terms of application runtime;
- Several complex scientific models are using more advanced computer deep learning techniques that can correlate many more variables, from many more data sources. An example is the discovery process for new drugs. The problem is that the market asks, at the same time, for more complex models but also for faster results (critical in business terms), something that current applications struggle to deliver.
Additionally, note how the Enterprise High Performance Computing market, analyzed by Tractica in May 2018, is expected to grow significantly, at a CAGR of 29%, reaching a market value above $31b by 2025. This is probably going to be fueled by AI-driven applications, with a specific example being embedded real-time decision systems.
So, several data points seem to suggest that computer scientists need to go a step further in terms of taking advantage of the new distributed architectures to be able to cope with ever increasing market requirements. One of the most powerful tools at their hands is doing code parallelization inside each application (and not only doing concurrency between applications), so that applications’ instructions can be executed in parallel in the different processing units, truly taking advantage of the newly deployed architectures.
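As a minimal sketch of what parallelization inside a single application means, the Python snippet below splits one workload into independent chunks and hands each chunk to a pool of workers. The function names (`sum_of_squares`, `split`, `parallel_total`) and the workload itself are hypothetical, chosen only for illustration. It also assumes that threads are enough to show the structure: in CPython builds with the global interpreter lock, threads do not actually speed up CPU-bound work like this, so a real speedup would need a process pool or a compiled language.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """The unit of work: depends only on its own chunk of data."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Cut data into at most `parts` contiguous chunks of similar size."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def sequential_total(data):
    """Baseline: one worker processes everything in order."""
    return sum_of_squares(data)


def parallel_total(data, workers=4):
    """Each chunk is processed by a separate worker, then results are merged."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1000))
    print(sequential_total(data), parallel_total(data))
```

The pattern only works because each chunk can be processed without looking at any other chunk; the partial results are simply combined at the end.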
But parallelizing code is, to a large extent, a very complex task, for several reasons:
- Code parallelization brings a new class of potentially very complex programming errors, such as data races or deadlocks. As an example, imagine that we have two instructions accessing the same data block, with one reading the result of the other. If the developer misses the dependency and allows them to be executed in parallel, the read instruction might be executed first, thus reading incorrect (not updated) data;
- Software developers were essentially taught to code under a sequential approach. Even today, most parallel computation courses are only part of postgraduate degrees, creating a huge knowledge gap on this topic of code parallelization;
- Although a few new applications are already being developed under the new parallel computing paradigm, the fact is that the large majority of business applications and models currently in use were developed under the sequential approach. As such, there is a huge legacy of applications (always required to run faster) that are not taking any advantage of the newly deployed distributed computer architectures. The problem is particularly hard to tackle because we are talking about “old” code, programmed by different developers, using different coding practices and programming languages …
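The read-after-write example from the first point above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names (`run_in_order`, `run_with_threads`) and the value 42 are hypothetical, and the "wrong" schedule is simulated deterministically by swapping the order of the two steps rather than by relying on real thread timing, which would not be reproducible.

```python
import threading


def run_in_order(order):
    """Run the two steps in the given order; return what the reader saw."""
    shared = {"value": 0}  # the data block both instructions touch
    seen = {}

    def write_step():
        shared["value"] = 42  # instruction 1: produces the result

    def read_step():
        seen["value"] = shared["value"]  # instruction 2: consumes it

    steps = {"write": write_step, "read": read_step}
    for name in order:
        steps[name]()
    return seen["value"]


def run_with_threads():
    """Same two steps on real threads, with the dependency made explicit."""
    shared = {"value": 0}
    seen = {}
    ready = threading.Event()

    def writer():
        shared["value"] = 42
        ready.set()  # signal that the data is now up to date

    def reader():
        ready.wait()  # block until the writer has finished
        seen["value"] = shared["value"]

    # The reader is started first on purpose: the Event still protects it.
    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen["value"]


if __name__ == "__main__":
    print(run_in_order(["write", "read"]))  # intended order
    print(run_in_order(["read", "write"]))  # the read ran ahead: stale data
    print(run_with_threads())
```

When the read runs ahead, it returns the stale initial value instead of the updated one. The threaded version avoids that only because the dependency was spotted and encoded with a synchronization primitive, which is exactly the step a developer can miss.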
Existing approaches to deal with code parallelization inside an application rely mostly on two classes of tools, both of them with several intrinsic limitations:
- The first class of tools, named static code analysis, applies classical data dependency models, trying to look for dependencies among instructions to check if they could be executed in parallel. This is a traditional “brute-force”, long and complex approach that, in real-life applications, with millions of lines of code, is not feasible;
- The second class of tools, named dynamic code analysis, runs the code with different input conditions and is able to detect parallelization opportunities. The execution time of these tools depends on the code complexity (number of lines of code) and hardware complexity (number of processing units), which represents a huge challenge when dealing with real-life applications deployed in modern architectures like GPUs (with thousands of cores). Moreover, dynamic analysis is critically dependent on input test data, which is often biased and ineffective at testing all the cases in all the code.
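To give a feel for the kind of check a classical data dependency model performs, here is a toy sketch in Python. It assumes each statement has already been reduced to the set of variables it reads and the set it writes, and it applies Bernstein's conditions: two statements can run in parallel only if neither writes something the other reads or writes. The `Statement` type, the three-statement program and the function names are hypothetical, and this is not a description of how any particular tool works.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Statement:
    label: str
    reads: frozenset
    writes: frozenset


def independent(a, b):
    """Bernstein's conditions: no write/read, read/write or write/write overlap."""
    return not (a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def parallel_pairs(statements):
    """Check every pair of statements; return the labels of independent pairs."""
    return [
        (a.label, b.label)
        for a, b in combinations(statements, 2)
        if independent(a, b)
    ]


program = [
    Statement("s1", frozenset({"x"}), frozenset({"a"})),       # a = x + 1
    Statement("s2", frozenset({"y"}), frozenset({"b"})),       # b = y * 2
    Statement("s3", frozenset({"a", "b"}), frozenset({"c"})),  # c = a + b
]

if __name__ == "__main__":
    print(parallel_pairs(program))
```

Here only `s1` and `s2` are reported as safe to run together, since `s3` reads what both of them write. Note that this naive version compares every pair of statements, so the number of checks grows quadratically with the size of the program, which hints at why a brute-force approach struggles with millions of lines of code.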
This is where a Spanish deep tech company, Appentra, comes into play. Appentra is a spin-off of the University of Coruña, leveraging more than 10 years of R&D on parallel computing led by the company co-founder and CEO, Prof. Manuel Arenaz. The company has developed a unique technology, Appentra’s Parallelware, that is able to generate new versions of real-life applications using parallel code, enabling them to run faster and meet the required business goals, when deployed in distributed HW architectures.
In a nutshell, as illustrated in the chart below, Appentra works as a static code analysis tool that ingests either sequential code or already parallelized code and:
- Identifies, through an AI-powered engine, code sections that can be executed in parallel;
- Generates, for those code patterns and underlying memory and control flows, parallel code using standard directives according to the “offloading” HW architecture.
One important structural element is that Appentra is an evolving technology: (i) it currently handles an initial set of potentially parallelizable code patterns, but the AI-engine is already learning new mutations and additional code patterns, progressively enriching Appentra’s knowledge base; (ii) it currently works with an initial set of supported programming languages (C and C++), yet the underlying compiler infrastructure used by Appentra has the potential to deal with many other input programming languages, given that it is based on an abstract code representation scheme.
Let’s see now how Appentra is able to address the key market challenges discussed above:
- Software developers, even if not proficient in parallel computing topics, have a tool to automatically identify and generate (or correct) bug-free parallel code;
- Legacy applications, developed in a sequential coding paradigm, can be ingested by Appentra’s technology to generate new parallel versions, avoiding the need to manually re-write those applications;
- Complex applications (potentially with millions of lines of code) can now be parallelized according to different execution HW (particularly very complex ones like GPUs with thousands of cores).
In summary and through a business lens, Appentra brings to the enterprise market and to the high performance computing industry:
- A tool to accelerate the runtime of business applications: (i) enabling real-time use cases like autonomous driving systems or other unmanned robotic systems; (ii) allowing faster and more complex scientific models, as in drug discovery, new materials exploration or 5G infrastructure deployment studies. This provides a direct competitive edge for the enterprises exploring those applications;
- An automated tool, supporting software development teams in the process of generating fast and bug-free parallel code.
Several years have been invested in: (i) developing Appentra’s underlying technology; (ii) converting that technology into 2 commercial products; (iii) acquiring initial market validation and traction … yet Appentra’s journey is just about to begin … and we, at Armilar, are thrilled to have the opportunity to be part of it!
Article written by João Dias, Principal at Armilar Venture Partners