'At-memory' inference engine raises NN performance

October 29, 2020 //By Peter Clarke
AI chip startup Untether AI (Toronto, Ontario) has announced its runAI200 inference processor based on its near-memory architecture, which it calls "at-memory."

At the heart of the at-memory compute architecture is a memory bank: 385Kbytes of SRAM coupled with a 2D array of 512 processing elements. With 511 banks per chip, each device offers 200Mbytes of memory and delivers up to 502TOPS in its "sport" mode.
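As a rough sanity check, the per-chip figures follow from the per-bank numbers quoted above (the rounding to "200Mbytes" is ours):

```python
# Back-of-envelope check of the runAI200 figures in the article.
BANK_SRAM_KB = 385      # SRAM per memory bank
PES_PER_BANK = 512      # processing elements per bank
BANKS_PER_CHIP = 511    # memory banks per device

total_sram_mb = BANK_SRAM_KB * BANKS_PER_CHIP / 1000  # ~196.7, rounded to "200 Mbytes"
total_pes = PES_PER_BANK * BANKS_PER_CHIP             # 261,632 processing elements

print(f"SRAM per chip: {total_sram_mb:.1f} MB")
print(f"Processing elements per chip: {total_pes:,}")
```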

Multiple function-specific buses support the movement of information in the north-south and east-west directions, but the emphasis of the architecture is on minimizing data movement as much as possible. A zero-detect function allows processing elements to be switched off, which can save as much as 50 percent of power consumption.
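Untether AI has not published how its zero-detect circuitry works internally; as a purely illustrative sketch of the general technique, a multiply-accumulate loop that detects zero operands and skips the work might look like this:

```python
def mac_with_zero_skip(weights, activations):
    """Multiply-accumulate that skips zero operands.

    Illustrative only: in hardware, zero-detect gates the processing
    element itself to save power; here the "skip" simply avoids the
    multiply and counts how often it fires.
    """
    acc = 0
    skipped = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:   # zero-detect: result contributes nothing
            skipped += 1
            continue
        acc += w * a
    return acc, skipped

acc, skipped = mac_with_zero_skip([1, 0, 3, 0], [4, 5, 0, 7])
```

Sparse neural-network weights and activations make such skips frequent, which is why the power saving can be substantial.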

Beachler, a veteran of FPGA company Altera, commented that the resulting array architecture has similarities to an FPGA. The chip also includes a custom 32-bit RISC processor tailored for AI workloads.

At the PCIe card level, this translates into more than 80,000 frames per second of ResNet-50 v1.5 throughput at batch size 1. Benchmarks show that this performance is three times that of its nearest rivals. For natural language processing, tsunAImi accelerator cards can process more than 12,000 queries per second (qps) of BERT-base, four times faster than any announced product.
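For context, and assuming the figures above are aggregate throughput across the card (an assumption of ours, not a latency claim by the company), the batch-1 rates translate into per-inference times as follows:

```python
# Throughput figures quoted in the article.
RESNET50_FPS = 80_000   # frames/s, ResNet-50 v1.5, batch=1
BERT_QPS = 12_000       # queries/s, BERT-base

# Aggregate time budget per inference, in microseconds.
resnet_us = 1e6 / RESNET50_FPS   # 12.5 us per frame
bert_us = 1e6 / BERT_QPS         # ~83.3 us per query

print(f"ResNet-50: {resnet_us:.1f} us/frame aggregate")
print(f"BERT-base: {bert_us:.1f} us/query aggregate")
```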

Key to achieving such performance is the software development kit, known as imAIgine.

The imAIgine SDK provides push-button quantization, optimization, physical allocation, and multi-chip partitioning. It also provides an extensive visualization toolkit, cycle-accurate simulator, and a runtime API.

The tsunAImi accelerator card is sampling now and will be commercially available in 1Q2021.

Related links and articles:

News articles:

AI startup appoints FPGA, embedded veteran as CEO

Mixed-signal designers form near-memory AI startup

Server processor startup raises $240 million

Groq enters production with A0 tensor processor

Cerebras Wafer Scale Engine: An Introduction

Intel drops Nervana after buying Habana