World's first 1,000-processor chip

By splitting programs across a large number of processor cores, the KiloCore chip designed at UC Davis can run at high clock speeds with high energy efficiency. Credit: Andy Fell/UC Davis

A microchip containing 1,000 independent programmable processors has been designed by a team at the University of California, Davis, Department of Electrical and Computer Engineering. The energy-efficient "KiloCore" chip has a maximum computation rate of 1.78 trillion instructions per second and contains 621 million transistors. The KiloCore was presented at the 2016 Symposium on VLSI Technology and Circuits in Honolulu on June 16.

"To the best of our knowledge, it is the world's first 1,000-processor chip and it is the highest clock-rate processor ever designed in a university," said Bevan Baas, professor of electrical and computer engineering, who led the team that designed the chip. While other multiple-processor chips have been created, none exceed about 300 processors, according to an analysis by Baas' team. Most were created for research purposes and few are sold commercially. The KiloCore chip was fabricated by IBM using its 32 nm CMOS technology.

Each processor core can run its own small program independently of the others, which is a fundamentally more flexible approach than the so-called Single-Instruction-Multiple-Data (SIMD) approaches used by processors such as GPUs. The idea is to break an application up into many small pieces, each of which can run in parallel on a different processor, enabling high throughput with lower energy use, Baas said.

Because each processor is independently clocked, it can shut itself down to further save energy when not needed, said graduate student Brent Bohnenstiehl, who developed the principal architecture. Cores operate at an average maximum clock frequency of 1.78 GHz, and they transfer data directly to each other rather than using a pooled memory area that can become a bottleneck for data.
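As a rough software analogy for the approach described above, the sketch below chains small independent programs that hand data directly to the next stage through point-to-point queues rather than through a shared data store. It is only an illustration: Python threads and `queue.Queue` stand in for cores and core-to-core links, and the names `stage`, `run_pipeline` and `STOP` are hypothetical, not part of the KiloCore tools.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage the input stream has ended


def stage(work, inbox, outbox):
    """Run one small program: read an item, apply `work`, pass the result on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # forward the shutdown signal to the next stage
            return
        outbox.put(work(item))


def run_pipeline(funcs, data):
    """Chain one thread per function using point-to-point queues.

    Assumption: threads model independent cores only loosely; this shows
    the dataflow structure, not real hardware parallelism.
    """
    links = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, links[i], links[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()

    # Feed the first link, then signal the end of the stream.
    for x in data:
        links[0].put(x)
    links[0].put(STOP)

    # Drain the last link until the shutdown signal arrives.
    results = []
    while True:
        item = links[-1].get()
        if item is STOP:
            break
        results.append(item)

    for w in workers:
        w.join()
    return results
```

For example, `run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])` passes each value through both stages in order. Each stage idles while its inbox is empty, which loosely mirrors a core that has nothing to do until data arrives.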

The chip is the most energy-efficient "many-core" processor ever reported, Baas said. For example, the 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts, low enough to be powered by a single AA battery. The KiloCore chip executes instructions more than 100 times more efficiently than a modern laptop processor.
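The figures quoted in the article reduce to simple arithmetic, sketched below. The helper names are hypothetical, and the one-instruction-per-cycle default is an assumption used only to check that the headline rate is consistent with 1,000 cores at 1.78 GHz; the article does not state it.

```python
def energy_per_instruction_pj(instr_per_sec, watts):
    """Energy per instruction in picojoules (1 W = 1 J/s)."""
    return watts / instr_per_sec * 1e12


def peak_rate(cores, clock_hz, instr_per_cycle=1):
    """Aggregate instructions per second.

    Assumption: each core retires `instr_per_cycle` instructions per clock
    cycle; the default of 1 is not taken from the article.
    """
    return cores * clock_hz * instr_per_cycle


# Figures from the article: 115 billion instructions/s at 0.7 W.
low_power_pj = energy_per_instruction_pj(115e9, 0.7)

# Figures from the article: 1,000 cores at an average maximum of 1.78 GHz.
headline_rate = peak_rate(1000, 1.78e9)
```

With these inputs, `low_power_pj` comes to roughly 6.1 picojoules per instruction, and `headline_rate` matches the quoted 1.78 trillion instructions per second. Note that the 0.7 W figure applies to the 115 billion instructions per second operating point, not to the peak rate.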

Applications already developed for the chip include wireless coding/decoding, video processing, encryption, and others involving large amounts of parallel data such as scientific data applications and datacenter record processing.

The team has completed a compiler and automatic program mapping tools for use in programming the chip.

Provided by UC Davis
Citation: World's first 1,000-processor chip (2016, June 17) retrieved 19 August 2019 from

User comments

Jun 17, 2016
"The KiloCore chip executes instructions more than 100 times more efficiently than a modern laptop processor."

But due to Amdahl's law, it won't perform 100 times better in most tasks.

And beyond Amdahl's law one has to take into account the overhead of communicating between the processing units, which increases as the number of CPUs mounts and eventually overtakes the amount of processing time available to the task itself, which results in lower performance the more units are assigned to any particular task. This happens at a surprisingly low number of CPU cores, around 10-20.

The types of problems that can use more than that turn out to be exactly the kind of "inflexible" SIMD operations the GPUs use because the task tends to be simpler operations over large data, not requiring full CPU cores but simply a big bunch of calculators and a fast memory bus.
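The two effects described in this comment can be put into numbers with a small sketch. `amdahl_speedup` is the standard Amdahl's law formula; `speedup_with_overhead` adds a toy communication cost that grows linearly with the number of cores. That linear cost model and the parameter values are assumptions chosen for illustration, not measurements of any chip.

```python
def amdahl_speedup(n, parallel_fraction):
    """Amdahl's law: speedup on n cores when only part of the work is parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)


def speedup_with_overhead(n, parallel_fraction, overhead_per_core):
    """Amdahl's law plus a toy communication cost.

    Assumption: every core beyond the first adds `overhead_per_core`
    (as a fraction of the single-core run time) in communication cost.
    """
    serial = 1.0 - parallel_fraction
    run_time = serial + parallel_fraction / n + overhead_per_core * (n - 1)
    return 1.0 / run_time


def best_core_count(parallel_fraction, overhead_per_core, max_cores=1000):
    """Core count that maximizes speedup under the toy overhead model."""
    return max(
        range(1, max_cores + 1),
        key=lambda n: speedup_with_overhead(n, parallel_fraction, overhead_per_core),
    )
```

With a 95% parallel task, `amdahl_speedup(1000, 0.95)` stays below 20 no matter how many cores are added. With an assumed overhead of 0.5% per extra core, `best_core_count(0.95, 0.005)` lands in the 10-20 range and speedup falls beyond that point. A different overhead value moves the peak, so the model shows the shape of the argument rather than a specific limit.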

Jun 17, 2016
Oh? Show me yours.

Jun 18, 2016
"Oh? Show me yours."

Well, mine has 1024 SIMD processors divided into 16 control blocks that can compute 1.76 trillion operations per second. Note operations rather than instructions, since a single operation can involve multiple instructions.

The main difference is in power consumption. GPUs use a lot more power to transfer the huge amounts of data they process at hundreds of gigabytes per second. More than 50% of the power consumption of a typical CPU comes from running the caches and data logistics.

That's also why the 0.7 Watts of power consumption is a bit misleading. You can run a "no operation" instruction billions of times a second and use practically no power, but when you start processing something the power consumption jumps up.

Jun 18, 2016
"But due to Amdahl's law..."

Amdahl's law only applies if you try to split one task to n cores.

But if you have 1000 tasks to run which are independent of one another (e.g. 1000 user search requests) then it doesn't apply. In that case you get a fully linear speedup (provided there are no other bottlenecks).

It also doesn't apply if the task isn't globally connected but just locally connected (e.g. only neighboring processors have to talk to one another - like in the case of a neural network or a finite element simulation with closed domains)

Jun 18, 2016
"Amdahl's law only applies if you try to split one task to n cores. "


"In that case you get a fully linear speedup (provided there are no other bottlenecks)"

Yep. That's what's called an "embarrassingly parallel" task. If there's no sharing of data between the cores, and the size of the problem isn't too large to congest the memory bandwidth to the chip, then you get linear speedup.

"like in the case of a neural network or a finite element simulation with closed domains"

Although you'll have trouble getting the simulation data in and out of the chip. Each core has a very limited amount of local memory, which limits the size of the problem (e.g. number of neurons/domains per core) and the amount of intermediate results you can keep before you have to dump the data out.

Jun 18, 2016
And when you do need to get the data in/out, you run into the data logistics bottleneck.

The reason why modern GPUs are paired with 8 GB of RAM even though the games only really handle <1 GB of data at a time is because they put multiple DRAM chips in parallel to obtain a wide memory bus that can transfer at the required speed. They basically throw more hardware at the problem, and as a result the graphics cards have more memory than they need, and the algorithms they run are extremely wasteful on memory use for the sake of getting around the DRAM timing limitations.

The side effect is high power consumption, up in the hundreds of watts.

It's kinda like how programmers with the earliest drum-type hard drives made sure to record the program data on the hard drive in such order that the physical arm that read the data would have the shortest path to the next bit that the program needed. That wasted most of the physical space on the drum, but made the program load and run fast.

Jun 18, 2016
The problem in logistics is called the Von Neumann bottleneck.


The solution to the problem is to provide more fast local memory to the CPUs to avoid transmitting data through the bottleneck, and trying to predict how the code branches to know what data to order from the main store before it is needed, but that ends up consuming more chip area and more transistors, which translates directly to more power consumption.

Jun 21, 2016

Turns out the processor uses about 40 Watts at full speed. However, such speed is only theoretical because the packaging only allows full power for the central 160 cores.
