Moore’s Law has ruled the semiconductor industry since 1965. If we use the whole wafer for one chip, did we break the law?
Background
This post will go deep into some semiconductor industry history and technology. In case you don’t live in my world, let’s cover the basics.
In 1965 Gordon Moore, then at Fairchild Semiconductor and later co-founder and CEO of Intel, said:
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.”
Gordon Moore – from Wikipedia
This prediction has held in the semiconductor industry ever since. Every few years there is an article about how Moore’s Law is dead. Someday a lucky author will be correct with their prediction, but so far, they’re all wrong.
Wafer scale integration is the idea of using an entire silicon wafer to make one massive device. A new AI chip company called Cerebras generated some buzz, riding the AI hype cycle, by building its AI chip using wafer scale integration. This is not a criticism of Cerebras, but an examination of wafer scale as a technology.
Here is a great overview of chip fabrication. I’ve used the numbers for wafer size and die per wafer from the video, so no proprietary information from past employers will be shared in this post. (Keep the lawyers happy.)
I have worked as a Product Engineer at Micron. We worked directly with the fab on DDR4 memory chips and the Automata processing-in-memory part, so I am familiar with fab technology. Fabs are expensive, and making semiconductors is difficult. Does it make sense (money) to do one part per wafer?
How many parts are we talking about?
A modern fab uses a 300 mm diameter wafer. Check out the video around minute 22 for wafer details. High school geometry tells us A = πr², so 3.14 × (300/2 mm)² ≈ 70,650 mm².
As a first approximation, how many chips go on one wafer? That depends on how big the chip is, of course. According to the video (around 8:02), it's 230 chips per wafer, so each chip is about 70,650 mm² / 230 ≈ 307 mm², or, taking a square root, about 17.5 mm on a side. Not perfect, but good enough for a first approximation.
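If you want to check the arithmetic, here is the same back-of-the-envelope math in Python, using only the wafer size and the 230 die-per-wafer figure from the video:

```python
import math

wafer_diameter_mm = 300   # standard modern wafer size
die_per_wafer = 230       # CPU example from the video

wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2   # A = pi * r^2
die_area = wafer_area / die_per_wafer                 # average area per die
die_edge = math.sqrt(die_area)                        # pretend the die is square

print(f"wafer area: {wafer_area:.0f} mm^2")   # ~70,686 mm^2 (70,650 if you round pi to 3.14)
print(f"die area:   {die_area:.0f} mm^2")     # ~307 mm^2
print(f"die edge:   {die_edge:.1f} mm")       # ~17.5 mm on a side
```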
This web site has pictures that compare an Intel i7 and two i9 parts, with die from roughly 9.2 mm × 16.7 mm up to 22.4 mm on the long side. Smaller than the die in the video, but probably using a smaller feature size on a newer process.
There is, of course, overhead and lost real estate. The images show how rectangular die fit onto a circular wafer, so some area is wasted around the edges. Area is also lost to the kerf, where the saw cuts. Don't worry, that space isn't idle: designers put test structures in the kerf to monitor the fab process. Those test devices and structures are destroyed when the wafer is cut into individual die.
The video said a DRAM part is smaller, so 952 parts per wafer. If we make a small ARM microcontroller, there could be as many as 2,500-3,500 parts per wafer.
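If you want to fold in the edge loss mentioned above, there is a common rule-of-thumb approximation for die per wafer (not from the video, just a standard first-order formula): DPW ≈ π(d/2)²/S − πd/√(2S), where d is the wafer diameter and S the die area. A rough sketch, with made-up die areas for the DRAM and microcontroller cases:

```python
import math

def die_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    """First-order die-per-wafer estimate: gross die minus an edge-loss term."""
    d = wafer_diameter_mm
    gross = math.pi * (d / 2) ** 2 / die_area_mm2
    edge_loss = math.pi * d / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)

# Illustrative die areas only -- not figures from the video or any datasheet.
print(die_per_wafer(307))   # big CPU die: ~190 usable die
print(die_per_wafer(70))    # DRAM-sized die: ~930 die
print(die_per_wafer(20))    # small microcontroller: ~3,400 die
```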
What do you mean scaling?
Moore’s Law is about scaling semiconductor fabrication. What do we mean when we talk about scale? If the smallest feature on a chip can be reduced with each new generation (a fab node, or just node, in industry jargon), then semiconductor companies make money.
How does scaling make money? Well, we have two choices when planning the roadmap for the next node and its new parts. Let’s assume (as in the YouTube video above) we are talking about a CPU as a single chip. I can make the same part on the new node, but each part will be smaller, so more parts fit on a single wafer.
Comparing the video to the pictures from the de8auer web site, a newer CPU is smaller, so more fit on a wafer. If the CPU in the video was 307 mm² and the ones pictured are 9.2 mm × 16.7 mm = 153.6 mm², it’s obvious, even from approximate numbers off random web sites, that they can get more parts per wafer on the new node. The quick math is 70,650 / (9.2 × 16.7) ≈ 459; let’s call it 450 because of the wasted edge area mentioned above.
The cost for a wafer is roughly constant: the cost to run a given wafer through the fab is about the same regardless of what is printed on it. (OK, it is not exactly, but it is close enough that we can pretend for a few paragraphs.) More chips per wafer means more chips to sell for the same fixed cost, so more money.
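Pretending the wafer cost really is fixed, the per-chip economics of a shrink look something like this. The wafer cost here is a made-up placeholder, not a real figure; the die counts are the rough ones from above:

```python
wafer_cost = 10_000          # placeholder dollars per processed wafer -- illustrative only

old_node_die = 230           # ~307 mm^2 CPU from the video
new_node_die = 450           # same design shrunk, from the rough math above

print(f"old node: ${wafer_cost / old_node_die:.2f} per chip")   # ~$43 each
print(f"new node: ${wafer_cost / new_node_die:.2f} per chip")   # ~$22 each
```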
The dilemma as a product planner is what to make at the new node. We could take the existing CPU and shrink it, scaling it down to get more parts per wafer. My CEO is happy because we make more money.
OR we could make an improved part and add stuff to the design, maybe a new AI thingamajig. The new process node means we can get the same number of parts per wafer with more transistors each, so more stuff to sell at a higher price. My CEO is happy because we make more money.
Intel had their famous tick/tock strategy. An existing design would be built on a new process node, a die shrink, meaning more parts from the same design. Like pancakes, the first one on a new process never comes out quite right. All of the problems, issues, and tuning for the new fab process get debugged and fixed with a known good design.
The now known-good process would then be used for a next-generation part, adding new stuff to improve speed or functionality. The new parts sell for more money. Best of both worlds, until the leap to the next process node stumbles, but that is Intel’s story, not the point of this blog.
And the yield?
The video mentions binning and defective chips. Every part on a wafer may have defects. With many parts per wafer, some fail completely. Each design includes redundant elements to repair the part, and some partially working parts are sold as different bins.
For example, a CPU’s cache memory or a DRAM part has more physical memory cells than the stated capacity of the part, usually about 20% more. A single failing memory cell can be repaired out of the memory array. If there are too many fails, the part gets binned as a smaller-cache part, or discarded if no repair can be made.
The ability to repair parts, or to throw away a defective die, directly impacts the fab yield. The yield for any part is a closely held secret; good yield means good profits. Being on a cutting-edge 7 nm, 5 nm, or even smaller node costs more, and there will be more failures due to the process until the fab has time to work out the issues and get the manufacturing flow repeatable.
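Real yield numbers are secret, but a simple Poisson defect model shows why die area and repair matter so much: the chance a die has zero random defects is roughly Y = e^(−D·A), where D is the defect density and A is the die area. A toy sketch with an invented defect density (not a real fab number):

```python
import math

def poisson_yield(die_area_mm2, defects_per_mm2, repairable_defects=0):
    """Probability a die has at most `repairable_defects` random defects (Poisson model)."""
    lam = defects_per_mm2 * die_area_mm2          # expected defects per die
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(repairable_defects + 1))

D = 0.001   # invented defect density: 1 defect per 1,000 mm^2 -- illustrative only

print(poisson_yield(307, D))                         # big CPU die, no repair: ~74%
print(poisson_yield(307, D, repairable_defects=2))   # same die, 2 repairable defects: ~99.6%
print(poisson_yield(70_000, D))                      # a whole wafer, no repair: essentially 0%
```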
So, going to TSMC as a fabless semiconductor company, you really don’t want to be on the bleeding-edge, smallest-possible-feature-size node. If you want to be there, it costs. It costs a lot, so you need a really good reason that makes economic sense.
What about wafer scale?
By definition the number of parts per wafer is 1.
Even if a die shrink cuts the area in half, that only gets you two parts per wafer, not four, and fitting rectangular die onto a round wafer makes even that awkward. Maybe after two node shrinks we could get four per wafer of a three-year-old design. The semiconductor industry moves fast, and demand for a three-year-old design may be limited.
The strategy could instead be to pack more stuff onto the one wafer with every process shrink, but a new design plus a new process is complicated.
To make a wafer scale part work at all, the design has to include redundant elements, because there will be manufacturing defects. Selecting the right trade-off for what should be repairable is difficult. There is overhead associated with the ability to repair something, and it is more than just an area penalty (remember, area = money). The repair logic has to be testable, has to be carried in the design whether it is enabled or not, and has to meet timing. Those costs are paid once at design time, and again at test time for every part.
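One toy way to think about that trade-off (this is my own sketch, not how Cerebras actually does it): tile the wafer with many small cores, accept that some will be defective, and ship the part as long as enough cores survive. The binomial math then looks like this, with invented core counts and yields:

```python
import math

def wafer_ships(total_cores, cores_needed, core_yield):
    """Probability that at least `cores_needed` of `total_cores` independent cores work."""
    return sum(math.comb(total_cores, k)
               * core_yield**k * (1 - core_yield)**(total_cores - k)
               for k in range(cores_needed, total_cores + 1))

# Toy numbers only: 1,000 small cores, each with 99% yield, 950 needed to sell the part.
print(wafer_ships(1000, 950, 0.99))    # ~1.0 -- redundancy rescues the wafer
print(wafer_ships(1000, 1000, 0.99))   # ~0.00004 -- needing every core to work is hopeless
```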
Finally, let’s talk about packaging. What sort of heat sink goes on a wafer scale part? Don’t get the screws too tight, or you’ll crack the silicon. I hope someone smarter than me figured out a good way to hold the part, bond out the pads, and deliver power across the entire wafer.
Wafers are processed in lots. A lot is usually 25 wafers going through the fab together. They are all in the same boat, travelling to the same machines and seeing the same issues and defects. When you look at failures in the fab, they often correlate with the lot. For a wafer scale part, one lot is only 25 pieces. A lot of memory chips would be 25 × 952 = 23,800 die, or for CPUs, 25 × 230 = 5,750 die. Doing lot failure statistics as a product engineer on 25 pieces at a time seems unimaginable.
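To make that concrete: the statistical noise on 25 samples versus tens of thousands is brutal when you try to estimate a fail rate for a lot. A rough sketch of the standard error of an observed fail rate (the 5% fail rate is an invented example):

```python
import math

def fail_rate_uncertainty(true_fail_rate, sample_size):
    """Standard error of an observed fail rate for a given sample size (binomial approximation)."""
    return math.sqrt(true_fail_rate * (1 - true_fail_rate) / sample_size)

p = 0.05   # pretend 5% of parts fail for some lot-related reason

print(fail_rate_uncertainty(p, 25))      # ~0.044 -- nearly as large as the fail rate itself
print(fail_rate_uncertainty(p, 23_800))  # ~0.0014 -- now you can actually see the problem
```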
Just to make things harder, the part has memory and processing together on one wafer. A good memory process is very different from a logic process; the steps required to build memory and CPUs are not the same. I assume they are using some dynamic memory. If it is static memory, there is again an area penalty.
Conclusion
I get what they are trying to do with the Wafer Scale Engine (WSE). Not going off-chip to connect to the rest of the system or to PCIe devices will be faster and use less power. It seems to make sense, but the business model does not appear to be making money by selling parts; they are building AI supercomputers and renting them out.
So for wafer scale, yikes! It does not make much sense. You’re working against Moore’s Law, and that makes it hard to make money.
I am writing this as someone who knows the semiconductor industry, but sees something that just does not make sense.
If you work for, or have worked for, Cerebras or another company planning to use wafer scale and can point out specific errors, I would love to hear from you: please comment. I would be happy to make corrections or follow up.