How embedded FPGAs fit AI applications
AI applications span diverse markets such as autonomous driving, medical diagnostics, home appliances, industrial automation, adaptive websites, financial analytics and network infrastructure.
These applications, especially when implemented on the edge, demand high performance and low latency to respond successfully to real-time changes in conditions. They also require low power consumption, rendering energy-intensive cloud-based solutions unusable. A further requirement is for these embedded systems to always be on and ready to respond even in the absence of a network connection to the cloud. This combination of factors calls for a change in the way that hardware is designed.
Many algorithms can be used for machine learning, but the most successful ones today are deep neural networks. Inspired by biological processes and structures, a deep neural network can employ ten or more layers in a feed-forward arrangement. Each layer uses virtual neurons that perform a weighted sum on a set of inputs and then pass the result to one or more neurons in the next layer.
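To make the weighted-sum step concrete, the following is a minimal sketch of a feed-forward pass in plain Python. The bias terms, the ReLU activation, the function names and the example weights are all assumptions added for illustration; the text above specifies only the weighted sum and the hand-off to the next layer.

```python
def relu(value):
    # Assumed activation function; the article does not name one.
    return max(0.0, value)


def dense_layer(inputs, weights, biases, activation):
    """Compute one layer: each neuron forms a weighted sum of all inputs.

    weights[j][i] is the (hypothetical) weight from input i to neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(activation(total))
    return outputs


def feed_forward(inputs, layers):
    """Pass the outputs of each layer to the next, front to back."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases, relu)
    return activations


# Illustrative two-layer network: 2 inputs -> 2 neurons -> 1 neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),
    ([[2.0, 1.0]], [-1.0]),
]
result = feed_forward([3.0, 1.0], layers)
print(result)  # [6.0]
```

Every neuron in a layer runs the same multiply-accumulate loop on the same inputs, which is why this workload maps naturally onto many parallel arithmetic units.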
Although there is a common-core approach to constructing most deep neural networks, there is no one-size-fits-all architecture for deep learning. Increasingly, deep-learning applications are incorporating elements that are not based on simulated neurons. As the technology continues to develop, many different architectures will emerge. Much like the organic brain itself, plasticity is a major requirement for any organization that aims to build machine learning into its product designs.
Training and inference
One important contrast between the organic brain and AI is that AI can separate training from the inferencing stage, in which the trained network is called upon to make decisions. In the mid-2000s, efficient techniques were discovered to allow the training of multiple layers at once. The techniques rely on enormous compute power, generally supplied by servers that use many processors for the task. The training process is performed in the background, often in the cloud, and does not require a result to be produced in real time.
For inferencing, the compute demand is lower than for training, but most real-world applications call for a real-time response. Energy-efficient parallel processing is a key requirement for inferencing systems because many will not have a permanent external power source.
Typically, training demands high precision in the floating-point arithmetic used to compute neural weights, to minimize the rounding errors that could accumulate during repeated passes backwards and forwards through the deep layered structure. In most cases, 32-bit floating point has been shown to be sufficient in terms of precision.
For inferencing, errors are less likely to accumulate and networks can tolerate much lower precision representations for many of the connections. Often 8-bit fixed-point arithmetic is sufficient, and, for some connections, 4-bit resolution does not increase errors significantly. Systems will benefit from the ability to reconfigure datapaths so they can process many streams in parallel at 4- or 8-bit precision. But designers will want to retain the ability to combine execution units for high-precision arithmetic where needed.
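The precision trade-off can be sketched with a few lines of Python. The example below uses symmetric quantization with a single scale factor, which is one common scheme and an assumption here rather than something the article prescribes; the weight values and function names are likewise illustrative.

```python
def quantize(values, bits):
    """Map floats to signed `bits`-bit integer codes with one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest difference between a value and its quantized version."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Illustrative weights, not taken from any real network.
weights = [0.82, -0.41, 0.07, -0.95, 0.33, 0.5]
err8 = max_abs_error(weights, 8)
err4 = max_abs_error(weights, 4)
print(f"8-bit max error: {err8:.4f}, 4-bit max error: {err4:.4f}")
```

The worst-case error is bounded by half of one quantization step, so it grows as bits are removed. Whether a given connection tolerates that error is something that has to be checked against the accuracy of the trained network.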
Clearly, machine-learning systems call for a hardware substrate that provides a combination of high performance and plasticity.
Substrates for machine learning
A number of processing fabrics are available to support high-performance machine learning. But for use in real-time embedded systems, some will be ruled out at an early stage because of their power consumption or performance.
In the first half of this decade, the general-purpose graphics processing unit (GPGPU) became a popular choice for both training and inferencing. The GPGPU provides hundreds of on-chip floating-point units, able to sum the inputs for many neurons in parallel much faster than clusters of general-purpose CPUs.
There are drawbacks with applying GPGPUs to deep-learning architectures. These devices are designed primarily for accelerating 2D and 3D graphics applications, which employ homogeneous and predictable memory access patterns. Their structure favors algorithms with arithmetically intensive operations on data that can easily be grouped closely together in memory, which means they can process convolutional neural-network layers reasonably efficiently. However, other types of layers can prove troublesome because they place a greater emphasis on data transfers between neurons, which makes the local-memory architecture less efficient and reduces both performance and energy efficiency.
An ASIC implementation with custom logic and memory managers can overcome the bottlenecks that challenge GPGPUs in the implementation of deep-learning systems. Memory management units that can be tuned for the different access patterns encountered in neural-network code can do a much better job of enhancing overall speed. In structures such as convolutional neural network (CNN) layers, power savings can be achieved by not transferring data in and out of local or intermediate memories. Instead, the fabric can adopt the structure of a systolic array and pump results directly to the execution units that need them.
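The systolic-array idea can be illustrated with a small cycle-by-cycle simulation. This is a sketch under stated assumptions: it models an output-stationary array performing a matrix multiply, with each processing element (PE) holding its own accumulator and forwarding operands to its right-hand and lower neighbours. The function name and dataflow choice are illustrative, not a description of any particular ASIC.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of PEs."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]     # accumulator held inside each PE
    a_reg = [[0] * m for _ in range(n)]   # operand each PE forwards rightward
    b_reg = [[0] * m for _ in range(n)]   # operand each PE forwards downward

    for cycle in range(n + m + k - 2):
        # Visit PEs from bottom-right to top-left so that each PE reads
        # the values its neighbours held at the end of the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    step = cycle - i      # row i enters the array i cycles late
                    a_in = a[i][step] if 0 <= step < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    step = cycle - j      # column j enters j cycles late
                    b_in = b[step][j] if 0 <= step < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply-accumulate in place
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
    return acc


product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19, 22], [43, 50]]
```

Operands are read from the outside only at the left and top edges of the grid. Everything else moves from one PE to its neighbour, which is the property that avoids round trips through local or intermediate memories.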
The problem that faces any ASIC implementation is its relative inflexibility compared with software-based processors. It is possible to prototype a wide range of deep-learning structures and then choose to optimize one for deployment in silicon. A particular application may need to deploy more convolution layers or increase the complexity of the filter kernels to handle a particular kind of input. Supporting this complexity may require an increased number of filter-kernel processors relative to other hardware accelerators. This structure can be accommodated by an ASIC, but it may prove to be a poor fit for a changing algorithm or an adjacent application.
FPGAs provide a way of achieving the benefits of custom processors and memory-management techniques without locking the implementation to a specific, immutable hardware structure. Many FPGA architectures today provide a mix of fully customizable logic and digital signal processing (DSP) engines that provide support for both fixed- and floating-point arithmetic. In many cases, the DSP engines employ a building-block approach, composed of 8- or 16-bit units, that allows them to be combined to support higher-precision data types. Low precision can be accommodated via logic implemented in the fabric’s look-up tables (LUTs).
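The building-block approach can be sketched arithmetically: a wider multiply decomposes into several narrow multiplies whose partial products are shifted and added. The example below assumes unsigned 8-bit units and hypothetical function names; real DSP blocks differ in width, signedness and how they chain partial products.

```python
def mul8(x, y):
    """Stand-in for one 8-bit x 8-bit unsigned multiplier unit."""
    assert 0 <= x < 256 and 0 <= y < 256
    return x * y


def mul16_from_mul8(x, y):
    """Build a 16-bit x 16-bit unsigned multiply from four 8-bit units."""
    assert 0 <= x < 65536 and 0 <= y < 65536
    x_hi, x_lo = x >> 8, x & 0xFF
    y_hi, y_lo = y >> 8, y & 0xFF
    # (x_hi*2^8 + x_lo) * (y_hi*2^8 + y_lo), expanded into partial products.
    return ((mul8(x_hi, y_hi) << 16)
            + ((mul8(x_hi, y_lo) + mul8(x_lo, y_hi)) << 8)
            + mul8(x_lo, y_lo))


wide = mul16_from_mul8(51234, 40321)
print(wide == 51234 * 40321)  # True
```

The same four units could instead perform four independent 8-bit multiplies in one step, which is the trade between parallelism and precision that reconfigurable datapaths allow.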
The ability to rework the logic array within an FPGA makes it easy to tune the structure of the parallel processors and the routing between them for the specific needs of the application. The freedom remains to make changes later when the results from training indicate ways in which layers can be expanded or rearranged to improve performance. However, the relative inefficiency of the programmable logic array may mean a user has to compromise on performance – sharing functions among different layers within the neural network when the application really demands dedicated functionality for some high-throughput parts of the network. One approach is to augment the FPGA with a smaller ASIC that provides acceleration for commonly used functions, such as convolution kernels or max-pooling calculations.
Embedding an FPGA fabric in a system on a chip (SoC) provides a solution to the drawbacks of both the standalone FPGA and ASIC, and the issues of passing data between them. One or more FPGA slices embedded into an ASIC provide the ability to tune the performance of the neural network on the fly, delivering the high data-transfer bandwidth required to make full use of customized engines.
Embedded FPGAs make it possible to achieve the best balance between throughput and reprogrammability and deliver the performance that real-world machine-learning systems require.
The ability to bring FPGA blocks on-chip also saves significant silicon area by:
1) Eliminating the large, power-hungry I/O associated with a standalone FPGA
2) Moving fixed functions to more efficient ASIC blocks
3) Converting repetitive functions to custom blocks
eFPGAs in machine learning
eFPGAs are a highly flexible solution that supports the data throughput required in high-performance machine-learning applications. Different architectures provide designers with the ability to mix and match eFPGA functions as required by the application. Core functions include logic based on four-input LUTs, small logic-oriented memories (LRAMs) for register files and similar uses, larger block RAMs (BRAMs), and configurable DSP blocks.
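For readers less familiar with programmable logic, a four-input LUT can be modeled as a 16-bit configuration word that stores one output bit for each of the 16 possible input combinations. The sketch below is a generic software model with hypothetical function names and bit ordering, not the configuration format of any specific eFPGA.

```python
def make_lut4(function):
    """Build the 16-bit configuration word for a 4-input LUT."""
    config = 0
    for index in range(16):
        a, b, c, d = [(index >> n) & 1 for n in range(4)]
        if function(a, b, c, d):
            config |= 1 << index      # store a 1 for this input combination
    return config


def lut4_eval(config, a, b, c, d):
    """Look up the stored output bit for inputs a, b, c, d (each 0 or 1)."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (config >> index) & 1


# The same hardware element becomes a different gate by changing 16 bits.
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
print(hex(xor4), hex(and4))  # 0x6996 0x8000
```

Reprogramming the fabric amounts to loading new configuration words and routing settings, which is what allows the logic to be reworked after the silicon has been manufactured.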
Core functions can also be augmented with custom blocks that provide more specialized features that are silicon-intensive in programmable logic, such as ternary content-addressable memories, ultrawide multiplexers and memory blocks optimized for pipelined accesses.
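As an illustration of what a ternary content-addressable memory does, the sketch below models each entry as a value plus a mask of the bits that must match; the remaining bits are "don't care". The entry layout, the function name and the example patterns are assumptions for illustration. Hardware compares the key against every entry at once, and that all-at-once comparison is what makes the structure costly to build from general-purpose programmable logic; the loop here only models the priority order of the result.

```python
def tcam_lookup(entries, key):
    """Return the index of the first entry that matches key, or None.

    Each entry is (value, care_mask); only bits set in care_mask are compared.
    """
    for index, (value, care_mask) in enumerate(entries):
        if ((key ^ value) & care_mask) == 0:
            return index
    return None


# Illustrative 8-bit table, most specific pattern first.
table = [
    (0b10100000, 0b11110000),  # matches 1010xxxx
    (0b10000000, 0b11000000),  # matches 10xxxxxx
    (0b00000000, 0b00000000),  # matches anything (default entry)
]

print(tcam_lookup(table, 0b10101111))  # 0
print(tcam_lookup(table, 0b10011111))  # 1
print(tcam_lookup(table, 0b01000000))  # 2
```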
Through the embeddable architecture, access to the programmable fabric is available to custom cores in the SoC without the energy and performance penalties of off-chip accesses. With no need for programmable I/O buffers around the FPGA fabric, overall die area within the solution is reduced. Moreover, the modular nature of the architecture makes it easy to port the technology to a wide variety of process technologies, down to the emerging 7 nm nodes.
The result of these features is an architecture that provides the best possible starting point for real-time AI acceleration for embedded systems that range from consumer appliances through to advanced robotics and autonomous vehicles.
Machine-learning techniques represent a new frontier for embedded systems. Real-time AI will augment a wide variety of applications, but it can only deliver on its promise if it can be performed in a cost-effective, power-efficient way. Existing solutions such as multi-core CPUs, GPGPUs and standalone FPGAs can be used to support advanced AI algorithms such as deep learning, but they are poorly positioned to handle the increased demands developers are placing on hardware as their machine-learning architectures evolve.
AI requires a careful balance of data and performance, memory latency, and throughput, which calls for an approach based on pulling as much of the functionality as possible into an ASIC or SoC. But that single-chip device needs plasticity to be able to handle the changes in structure that are inevitable in machine-learning projects. Adding eFPGA technology provides the mixture of flexibility and support for custom logic that the market requires.
Alok Sanghavi is a senior product marketing manager with Achronix Semiconductor Corp.