
NPU: Neural Processing Unit

Neural processing units (NPUs) are specialized computer processors designed to mimic the information processing mechanisms of the human brain. They are optimized for neural networks, deep learning and machine learning tasks, and for these AI workloads they deliver significant performance and efficiency gains over traditional CPUs and GPUs.




Differing from general-purpose central processing units (CPUs) or graphics processing units (GPUs), NPUs are tailored to accelerate AI tasks and workloads, such as calculating neural network layers composed of scalar, vector and tensor math.

NPUs are particularly well suited to matrix multiplications, convolutions and the other linear-algebra operations at the heart of neural networks. They achieve their speed and efficiency advantages by combining massive parallel processing with memory architectures tuned for these workloads.
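As a concrete (if simplified) picture of the math an NPU accelerates, the NumPy sketch below expresses one fully connected layer as a matrix multiplication, bias add and activation. The shapes and random values are purely illustrative.

```python
import numpy as np

# Illustrative sizes only: a batch of 4 inputs, each with 128 features,
# passed through a fully connected layer with 64 output neurons.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 128))   # input activations
w = rng.standard_normal((128, 64))  # synaptic weights
b = rng.standard_normal(64)         # biases

# One layer of a neural network: matrix multiply, bias add, ReLU activation.
# On an NPU, the matrix multiply is mapped onto arrays of multiply-accumulate
# (MAC) units that compute many of these products in parallel.
y = np.maximum(x @ w + b, 0.0)
print(y.shape)  # (4, 64)
```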

NPUs, often referred to as AI chips or AI accelerators, are typically used in heterogeneous computing architectures that integrate several kinds of processors (such as CPUs and GPUs). Large-scale data centers can use stand-alone NPUs attached directly to a system’s motherboard; however, most consumer devices, such as smartphones, other mobile devices and laptops, combine the NPU with other coprocessors on a single semiconductor microchip known as a system-on-chip (SoC).

By integrating a dedicated NPU, manufacturers can offer on-device generative AI apps that process AI workloads and machine learning algorithms in real time with relatively low power consumption and high throughput.


How NPUs work

Based on the neural networks of the brain, neural processing units (NPUs) work by simulating the behavior of human neurons and synapses at the circuit layer. This allows for the processing of deep learning instruction sets in which one instruction completes the processing of a set of virtual neurons.

NPUs, in contrast to conventional processors, are not designed for exact calculations. Rather, NPUs are designed to solve problems and can improve over time by learning from different inputs and data types. By taking advantage of machine learning, AI systems with NPUs can produce personalized solutions more quickly and with less manual programming.

One notable aspect of NPUs is their parallel processing capability, which speeds up AI workloads by offloading them from general-purpose cores that would otherwise have to juggle many jobs. An NPU includes dedicated modules for decompression, activation functions, 2D data operations, and multiplication and addition. The dedicated multiply-and-add module carries out matrix multiplication and addition, convolution, dot products and the other operations central to neural network processing.
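For a rough sense of what the multiply-and-add module computes, the sketch below writes a 2D convolution as explicit multiply-accumulate (MAC) loops; a real NPU performs huge numbers of these MACs in parallel in hardware rather than one at a time in software. The image and kernel sizes are arbitrary.

```python
import numpy as np

def conv2d_naive(image, kernel):
    """2D convolution written as explicit multiply-accumulate (MAC) loops."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output pixel is a dot product: multiply a patch of the
            # image by the kernel element-wise and accumulate the sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0            # simple averaging filter
print(conv2d_naive(image, kernel).shape)  # (4, 4)
```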




Where a conventional processor may need thousands of instructions to carry out this kind of neuron processing, an NPU may accomplish a comparable function with a single instruction. NPUs also merge computation and storage for greater operational efficiency through synaptic weights: adjustable values assigned to network nodes that indicate the likelihood of a "correct" or "desired" output and that can be modified, or "learned", over time.

Although NPU research is still ongoing, testing has shown that some NPUs can outperform a comparable GPU by more than 100 times while drawing the same amount of power.


Key features of NPUs

Neural Processing Units (NPUs) are designed to excel at low-latency, parallel computing tasks, making them ideal for intensive AI-driven applications like deep learning, speech and image recognition, natural language processing, video analysis, and object detection. Their architecture is optimized to handle massive amounts of data simultaneously, which is essential for achieving high performance and speed in these computationally heavy tasks. Here’s a look at some of their defining features:

  • Parallel Processing: NPUs excel at breaking down complex tasks into smaller, concurrent operations. This parallel processing capability allows them to perform multiple neural network computations at once, significantly speeding up tasks that require large data handling.
  • Low-Precision Arithmetic: To optimize energy efficiency and reduce computational load, NPUs often use 8-bit or lower precision operations. This lower precision is sufficient for many AI tasks, allowing NPUs to achieve high performance without excessive power use (see the quantization sketch after this list).
  • High-Bandwidth Memory: NPUs are typically equipped with high-bandwidth memory directly on the chip. This setup is crucial for handling the large datasets required for AI processing, ensuring that data flows smoothly and quickly between the memory and processing units.
  • Hardware Acceleration: Recent advancements in NPU design include hardware acceleration techniques like systolic array architectures and enhanced tensor processing. These features enable NPUs to process data-intensive tasks faster and more efficiently, giving them an edge in handling complex AI algorithms.
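
The low-precision point is easiest to see with a small quantization sketch. The code below uses a generic symmetric int8 scheme (an illustration, not any particular NPU's format): weights and activations are scaled into 8-bit integers, the dot product is accumulated in integer arithmetic, and the result is rescaled to floating point.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(256).astype(np.float32)   # float32 weights
x = rng.standard_normal(256).astype(np.float32)   # float32 activations

def quantize_int8(v):
    """Symmetric per-tensor quantization: map floats into [-127, 127]."""
    scale = np.max(np.abs(v)) / 127.0
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale

wq, w_scale = quantize_int8(w)
xq, x_scale = quantize_int8(x)

# Integer multiply-accumulate, widened to int32 so the sum does not overflow,
# then rescaled back to floating point.
acc = np.dot(wq.astype(np.int32), xq.astype(np.int32))
approx = acc * w_scale * x_scale

print(float(np.dot(w, x)), float(approx))  # close, but not identical
```

Accumulating in a wider integer type while multiplying 8-bit values is the usual pattern, because the products themselves quickly exceed what 8 bits can hold.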



Comparing Processors: CPU, GPU, and NPU
  • Central processing units (CPUs): The “brain” of the computer. CPUs typically allocate about 70% of their internal transistors to cache memory and form part of a computer’s control unit. They contain relatively few cores, use serial computing architectures for linear problem solving and are designed for precise logic-control operations.
  • Graphics processing units (GPUs): First developed to handle image and video processing, GPUs contain many more cores than CPUs and use most of their transistors to build many computational units, each of low computational complexity, enabling advanced parallel processing. Suitable for workloads requiring large-scale data processing, GPUs have found major additional uses in big data, backend server centers and blockchain applications.
  • Neural processing units (NPUs): Building on the parallelism of GPUs, NPUs use a computer architecture designed to simulate the neurons of the human brain to deliver highly efficient performance. NPUs use synaptic weights to integrate memory and computation, providing solutions that may be slightly less precise but arrive at very low latency. While CPUs are designed for precise, linear computing, NPUs are built for machine learning, giving them better multitasking, parallel processing and the ability to adjust and customize their operation over time without additional programming (a serial-versus-parallel sketch follows this list).
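
To make the serial-versus-parallel distinction tangible, the sketch below compares an element-by-element Python loop with a single vectorized NumPy operation that hands the whole array to an optimized kernel, loosely mirroring how a GPU or NPU consumes entire tensors per instruction. The exact timings will vary by machine; the example is illustrative only.

```python
import time
import numpy as np

a = np.random.default_rng(2).standard_normal(1_000_000)
b = np.random.default_rng(3).standard_normal(1_000_000)

# "CPU-style" scalar loop: one multiply per iteration, in program order.
start = time.perf_counter()
out = np.empty_like(a)
for i in range(a.size):
    out[i] = a[i] * b[i]
serial = time.perf_counter() - start

# "GPU/NPU-style" data-parallel call: the whole array in one operation.
start = time.perf_counter()
out_vec = a * b
vectorized = time.perf_counter() - start

print(f"serial: {serial:.3f}s  vectorized: {vectorized:.4f}s")
```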




Devices with Integrated NPUs

Today, Neural Processing Units (NPUs) are embedded in a wide range of devices to enhance AI processing capabilities and support advanced functionalities:
  • Smartphones: Many leading smartphone brands now feature NPUs for tasks like image recognition and augmented reality. Examples include Apple’s Neural Engine in iPhones, Samsung’s Exynos in Galaxy devices, Huawei’s Kirin in its flagship phones, and Google’s Tensor processor in Pixel models.
  • Laptops: NPUs are increasingly present in laptops for improved AI-driven features and faster data processing. Notable models include Apple’s M1 and M2 chips in MacBooks, Dell’s XPS series, HP’s Envy lineup, and Lenovo’s ThinkPad series.
  • Data Center Servers: To handle large-scale AI workloads, data centers integrate NPUs into servers. Google uses its TPU (Tensor Processing Unit), Amazon offers Inferentia, and Microsoft employs BrainWave to accelerate machine learning and deep learning tasks in the cloud.
  • Gaming Consoles: NPUs enhance gaming consoles by improving graphics rendering and enabling real-time adjustments to in-game environments. Examples include Sony’s PlayStation 5 and Microsoft’s Xbox Series X.
  • Autonomous Vehicles: NPUs in autonomous vehicles power real-time decision-making and sensor data processing. Leading examples include Tesla’s custom-built NPU and Waymo’s proprietary system for autonomous driving, both essential for safe navigation and object detection.

Samsung’s Exynos 9

Samsung Electronics launched the premium mobile application processor (AP) Exynos 9 (9820) in November 2018. With the addition of an AI-enabled NPU, the Exynos 9 (9820) has approximately 7 times the computing power of the previous model (the 9810).
This speed comes into its own with technologies like augmented reality (AR) and virtual reality (VR), which demand fast and accurate identification of people and objects to deliver a dynamic, engaging user experience. While AI operations were previously carried out through a server connection, the Exynos 9 (9820) allows them to run within the mobile device itself for improved security.
The NPU is essential for AI applications and is expected to find uses beyond the mobile world in Fourth Industrial Revolution fields such as self-driving automobiles. It will be exciting to see what the age of artificial intelligence (AI) holds for the future of NPUs.



AMD XDNA - AI Engine

AMD XDNA is a spatial dataflow NPU architecture consisting of a tiled array of AI Engine processors. Each AI Engine tile includes a vector processor, a scalar processor, and local data and program memories. Unlike traditional architectures that must repeatedly fetch data from caches (which consumes energy), AI Engine uses on-chip memories and custom dataflow to enable efficient, low-power computing for AI and signal processing.

Each AI Engine tile is built around a VLIW (Very Long Instruction Word), SIMD (Single Instruction, Multiple Data) vector processor optimized for machine learning and advanced signal processing applications. The AI Engine processor can run at over 1.3 GHz, enabling efficient, high-throughput, low-latency functions. Each tile also contains program memory and local data memory for weights, activations and coefficients, a RISC scalar processor, and several interconnect modes to handle different types of data communication.
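
The tiled dataflow idea can be sketched in ordinary software. The toy code below blocks a matrix multiplication so that each block stands in for the work one tile might perform from its local memory; it is a conceptual illustration only, not AMD's actual programming model or toolchain.

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Block a matmul so each (i, j) output tile could be handled by one
    element of a spatial array, accumulating partial products locally."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Each partial product touches only small blocks of A and B,
                # mirroring how a tile computes from its on-chip memory and
                # passes results onward instead of round-tripping a cache.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.random.default_rng(4).standard_normal((128, 96))
b = np.random.default_rng(5).standard_normal((96, 64))
assert np.allclose(tiled_matmul(a, b), a @ b)
```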

Next-Generation AMD XDNA 2 is built for Generative AI experiences in PCs, delivering exceptional compute performance, bandwidth, and energy efficiency. 



Intel's Meteor Lake

When Intel’s “Meteor Lake” processors launch, they will feature not only CPU cores spread across two on-chip tiles and an on-die GPU, but also the company’s first-ever Neural Processing Unit (NPU) devoted to AI workloads.

According to Intel, the NPU means generative AI programs such as Stable Diffusion, or today's chatbots working off of a locally hosted language model, can run natively on Meteor Lake laptops, without suffering from slow performance, poor quality, or crippling power demands. 

Like a GPU, the NPU presents itself as a PCIe device and takes commands from the host. Instead of building a custom command processor, Intel uses Movidius’s 32-bit LEON microcontrollers, which run a real-time operating system. One of these cores, called “LeonRT”, initializes the NPU and processes host commands. Further down, “LeonNN” dispatches work to the NCE’s compute resources, acting as a low-level hardware task scheduler. Both LEON cores use the SPARC instruction set and have their own caches. The NPU also uses Movidius’s SHAVE (Streaming Hybrid Architecture Vector Engine) DSPs, which sit alongside the MAC array and handle machine learning steps that cannot be mapped onto it.

Accelerator design is all about closely fitting hardware to particular tasks, and that extends to the NPU’s memory hierarchy. Each NCE has 2 MB of software-managed SRAM. Because these SRAMs are not caches, the NPU does not need tag or state arrays to track their contents; accesses pull data directly out of SRAM without tag comparisons or virtual-memory address translation. The lack of a cache, however, places a heavier burden on Intel’s compiler and software, which must explicitly move data into SRAM.
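
The practical effect of software-managed SRAM can be sketched as explicit staging: software must copy each chunk of a large buffer into a small, fixed-size scratchpad before operating on it, instead of letting a cache fetch data transparently. The buffer size and workload below are invented for illustration.

```python
import numpy as np

SCRATCHPAD_WORDS = 512  # stand-in for a small software-managed SRAM

def scaled_sum(data, scale):
    """Process a large array through a small scratchpad in explicit chunks."""
    scratchpad = np.empty(SCRATCHPAD_WORDS, dtype=data.dtype)
    total = 0.0
    for start in range(0, data.size, SCRATCHPAD_WORDS):
        chunk = data[start:start + SCRATCHPAD_WORDS]
        # Explicit "DMA": software decides what lives in local memory and when,
        # rather than a cache with tags and address translation deciding it.
        scratchpad[:chunk.size] = chunk
        total += float(np.sum(scratchpad[:chunk.size]) * scale)
    return total

data = np.random.default_rng(6).standard_normal(10_000).astype(np.float32)
print(scaled_sum(data, 2.0), float(np.sum(data) * 2.0))  # should match closely
```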



Conclusion

NPUs are specialized hardware optimized for accelerating machine learning tasks, especially deep learning. They enhance performance by efficiently processing large datasets and complex algorithms through parallel processing. NPUs are crucial for applications like image recognition, natural language processing, and real-time data analysis, driving advancements in AI technology while reducing power consumption. Overall, they play a key role in the future of AI hardware.
