Undervolting & Overclocking
4GHz E-Core Deep Dive

‘Gracemont’ Efficient Cores @ 4GHz Latency Deep-Dive

Now we will take a look at the micro-architecture performance of the Efficient Cores when overclocked from 3.7GHz to 4GHz. The Performance Cores should perform roughly the same as before since I have not overclocked them, so I will not retest them at the moment. Later in the article I will benchmark the Performance Cores when I overclock the DDR5 RAM.

As a reminder, the results below use the DDR5 RAM at stock frequency and timings (DDR5-4800, CAS 40). All of the benchmarks so far in this article have used the default DDR5 frequency.

Looking at the data, we will start with the L3 Cache. Nearly all of the Efficient Cores show lower latencies, as expected. Four of the eight slowest cores (#17, #19, #21 & #23) all show lower latencies, which should allow better throughput. Core #23 shows the largest drop at 1.8ns, which is great. Lower is always better when it comes to latency. The average L3 Cache latency improved from 14ns to 13ns across all cores during my testing.

The L2 Cache latency has improved across the board by 4%. It sounds minor, but it can add up when data and bandwidth are spread across 2 clusters, or 8 E-Cores in total.

L1 Cache reports the same latency, which is fine since L1 is already the fastest cache level by design.
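As a rough sanity check on these latency figures, here is a minimal Python sketch that compares the reported improvements with what a 3.7GHz to 4GHz bump would give if latency scaled purely with the E-Core clock. That pure-scaling model is an assumption made for illustration only; the article does not establish it, and parts of the cache path may run on a different clock. The helper names are hypothetical.

```python
# Clocks and latencies quoted in the article.
STOCK_GHZ = 3.7
OC_GHZ = 4.0
L3_STOCK_NS = 14.0
L3_OC_NS = 13.0
L2_MEASURED_CHANGE_PCT = -4.0  # "improved across the board by 4%"


def clock_scaled_latency(latency_ns, old_ghz, new_ghz):
    """Latency if the cycle count is unchanged and only the clock rises.

    ASSUMPTION: pure clock scaling; this is not something the article measures.
    """
    return latency_ns * old_ghz / new_ghz


def percent_change(old, new):
    """Signed percentage change from old to new (negative means lower)."""
    return (new - old) / old * 100.0


l3_expected_ns = clock_scaled_latency(L3_STOCK_NS, STOCK_GHZ, OC_GHZ)
ideal_change_pct = percent_change(L3_STOCK_NS, l3_expected_ns)
l3_measured_change_pct = percent_change(L3_STOCK_NS, L3_OC_NS)

print(f"L3 expected from clock alone: {l3_expected_ns:.2f} ns")
print(f"Ideal latency change:         {ideal_change_pct:.1f} %")
print(f"L3 reported change:           {l3_measured_change_pct:.1f} %")
print(f"L2 reported change:           {L2_MEASURED_CHANGE_PCT:.1f} %")
```

Under that assumption, the reported L3 average (14ns to 13ns) sits close to what the clock bump alone would predict, while the 4% L2 improvement is smaller than the ideal figure.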

All 8 Efficient Cores @ 4GHz vs Stock (3.7GHz) Results

Now we will look at various workloads while running the E-Cores at 4.0GHz. The final results will show how much data can pass through both Clusters when they are working together. As a reminder, there are 2 Clusters and each Cluster contains 4 Efficient “Atom” Cores.

How much can a small 300MHz bump improve Efficient Core performance while dropping the CPU Package power by 33 watts (1.17v) up to 40 watts (1.11v)? The answer is an extra 11.5 GB/s. In my original article, using only the 8 E-Cores (3.7GHz), I averaged 175.55 GB/s across several workloads, and now I am averaging 187.09 GB/s with the E-Cores overclocked to 4GHz. Now we can start to understand how Cinebench showed an extra 523 points when overclocking the E-Cores to 4GHz and how y-cruncher was able to complete the 1 Billion Decimal benchmark quicker.
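To put the bandwidth gain next to the clock gain, the arithmetic can be sketched in a few lines of Python. The inputs are the averages quoted above; the "scaling efficiency" ratio is just an illustrative metric introduced here, not one the article uses.

```python
# Averages quoted in the article (GB/s) and the two E-Core clocks (GHz).
STOCK_BW_GBS = 175.55
OC_BW_GBS = 187.09
STOCK_GHZ = 3.7
OC_GHZ = 4.0


def percent_gain(old, new):
    """Percentage increase from old to new."""
    return (new - old) / old * 100.0


bw_gain_gbs = OC_BW_GBS - STOCK_BW_GBS
bw_gain_pct = percent_gain(STOCK_BW_GBS, OC_BW_GBS)
clock_gain_pct = percent_gain(STOCK_GHZ, OC_GHZ)

# Illustrative metric (an assumption of this sketch): how much of the
# clock increase shows up as extra average bandwidth.
scaling_efficiency = bw_gain_pct / clock_gain_pct

print(f"Bandwidth gain: {bw_gain_gbs:.2f} GB/s ({bw_gain_pct:.2f} %)")
print(f"Clock gain:     {clock_gain_pct:.2f} %")
print(f"Scaling ratio:  {scaling_efficiency:.2f}")
```

A ratio below 1.0 would mean the average bandwidth grew more slowly than the clock, which is what one might expect when part of the data path is not tied to the E-Core frequency.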

Efficient Clusters @ 4GHz vs Stock (3.7GHz) Results

Now I will follow the same methodology as in my first article and separate the E-Core Cluster results. The results below will show how well each Cluster (4 E-Cores) works independently on data without transferring data to the opposite Cluster or the Performance Cores.
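The article does not say how a workload was confined to one Cluster. One common way to do this is with a CPU affinity mask, and the sketch below only shows how such a mask is built. The logical CPU numbering is an assumption (E-Cores as logical CPUs 16 to 23, split 16-19 and 20-23), chosen because it matches the core numbers mentioned earlier (#17 to #23); check your own system's topology before relying on it.

```python
def affinity_mask(cpu_indices):
    """Build a bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpu_indices:
        if cpu < 0:
            raise ValueError("logical CPU index must be non-negative")
        mask |= 1 << cpu
    return mask


# ASSUMPTION: hypothetical layout, not confirmed by the article.
CLUSTER_1_CPUS = range(16, 20)  # logical CPUs 16, 17, 18, 19
CLUSTER_2_CPUS = range(20, 24)  # logical CPUs 20, 21, 22, 23

cluster_1_mask = affinity_mask(CLUSTER_1_CPUS)
cluster_2_mask = affinity_mask(CLUSTER_2_CPUS)

print(f"Cluster #1 mask: {cluster_1_mask:X}")  # F0000
print(f"Cluster #2 mask: {cluster_2_mask:X}")  # F00000
```

A hexadecimal mask in this form is what the Windows `start /affinity` command accepts, so a benchmark launched that way would stay on the chosen cores regardless of what the scheduler would otherwise do.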

Cluster #1 increased by 4 GB/s while Cluster #2 increased by 7 GB/s. Cluster #1 now totals 158.7 GB/s and Cluster #2 totals 158.92 GB/s, so they are more or less equal in performance. This means that if Intel’s Thread Director and Microsoft’s Windows 11 scheduler ensure that certain workloads stay within a Cluster, we could see up to 318 GB/s from the Efficient Cores. In my initial Alder Lake-S Review article we saw that both Efficient Clusters could theoretically allow up to 307 GB/s of bandwidth. That means the 300MHz increase allows an additional 11 GB/s while running at a lower voltage and wattage.
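The Cluster totals above can be cross-checked with a short Python sketch. The overclocked figures are taken directly from the text; the stock per-Cluster figures are back-calculated from the rounded 4 GB/s and 7 GB/s gains, so treat them as approximations rather than measured values.

```python
# Overclocked per-Cluster totals quoted in the article (GB/s).
oc_gbs = {"Cluster #1": 158.70, "Cluster #2": 158.92}

# Rounded gains quoted in the article (GB/s).
gain_gbs = {"Cluster #1": 4.0, "Cluster #2": 7.0}

# ASSUMPTION: stock values are derived here from rounded gains, not measured.
stock_gbs = {name: oc_gbs[name] - gain_gbs[name] for name in oc_gbs}

combined_oc = sum(oc_gbs.values())
combined_stock = sum(stock_gbs.values())

# Gap between the two Clusters before and after the overclock.
gap_stock = abs(stock_gbs["Cluster #1"] - stock_gbs["Cluster #2"])
gap_oc = abs(oc_gbs["Cluster #1"] - oc_gbs["Cluster #2"])

print(f"Combined @ 4GHz:  {combined_oc:.2f} GB/s")
print(f"Combined @ stock: {combined_stock:.2f} GB/s (approximate)")
print(f"Cluster gap: {gap_stock:.2f} GB/s at stock, {gap_oc:.2f} GB/s at 4GHz")
```

The combined figures round to the 318 GB/s and 307 GB/s quoted in the text, and the gap between the two Clusters shrinks noticeably, which matches the observation that they are now roughly equal.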

Maximum ‘Single’ Efficient Core @ 4GHz vs Stock (3.7GHz) Results

In my initial Alder Lake-S Review I ran a benchmark to show the absolute max performance for a single P-Core and E-Core. I will be performing the same tests in this article, but at the moment we are focusing on the Efficient Core (E-Core). Using the same requirements as in the previous article, I will select the best Efficient Core based on the lowest latency seen during my deep-dive into the micro-architecture.

Above you can see that a single Efficient Core can move approximately 3 extra gigabytes of data per second (3 GB/s). So we can expect all 8 Efficient Cores to get near or exceed this number in specific circumstances. In the original review I hit 283 GB/s for a single Efficient ‘Atom’ Core, but this time around with a 4GHz overclock I increased that number to 286 GB/s. The best case scenario was 433 GB/s with the 4GHz overclock, which is a 26 GB/s increase over my original stock (3.7GHz) result of only 407 GB/s.

All Efficient Cores @ 4GHz vs Stock (3.7GHz) Max Results

Now we will take a look at the absolute best case scenario for the Efficient Cores working together.

Going deeper into Alder Lake’s micro-architecture shows how quickly and how much data the cores can crunch. We see an increase of 365 GB/s, pushing the overclocked Efficient ‘Atom’ Cores up to 2.79 TB/s (terabytes per second). That is a respectable increase over the stock 2.42 TB/s, which used higher voltage in my original Alder Lake-S Review article.
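Since the two totals are quoted to two decimal places in TB/s while the gain is quoted in GB/s, a quick check shows the numbers agree once rounding is taken into account. This is a minimal sketch assuming decimal units (1 TB = 1000 GB); the article does not say which convention the benchmark uses, and the function name is hypothetical.

```python
def delta_bounds(old, new, decimals):
    """Range of possible true differences when both values were rounded
    to the given number of decimal places."""
    half_step = 0.5 * 10 ** -decimals
    lowest = (new - half_step) - (old + half_step)
    highest = (new + half_step) - (old - half_step)
    return lowest, highest


STOCK_TBS = 2.42
OC_TBS = 2.79
STATED_GAIN_GBS = 365.0

# ASSUMPTION: decimal units, 1 TB = 1000 GB.
stated_gain_tbs = STATED_GAIN_GBS / 1000.0

low, high = delta_bounds(STOCK_TBS, OC_TBS, decimals=2)
is_consistent = low <= stated_gain_tbs <= high
gain_pct = (OC_TBS - STOCK_TBS) / STOCK_TBS * 100.0

print(f"Possible gain range: {low:.2f} to {high:.2f} TB/s")
print(f"Stated gain {stated_gain_tbs:.3f} TB/s consistent: {is_consistent}")
print(f"Gain from rounded totals: {gain_pct:.1f} %")
```

The stated 365 GB/s falls inside the range implied by the rounded totals, so the small mismatch with a naive 2.79 minus 2.42 subtraction is just rounding.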

Efficient Cores @ 4GHz Conclusion

Wrapping it all up shows some very impressive results. I managed to make the Efficient Cores more “efficient” by lowering the voltage for the entire CPU (1.17v) and by overclocking the E-Cores to 4GHz. The temperature of the Efficient Cores (8 Atom Cores) averaged only 54°C. I believe there is more headroom in the E-Cores and I will explore this more in future articles when I receive the LGA1700 brackets for my CPU liquid cooler.

Cinebench R23 shows a 523 point increase with lower CPU temps and power usage. y-cruncher was able to crunch data 318 milliseconds quicker, and the latency dropped even lower at the L3 and L2 caches, which should naturally allow for higher bandwidth and better throughput. To support that theory I ran several low level benchmarks to take a closer look at the 'Gracemont' micro-architecture. The Efficient Cores performed much better than expected across all tests with only a 300MHz increase (Stock 3.7GHz vs Overclock 4.0GHz). The L3 Cache latency decreased from 14ns to 13ns with the 4GHz overclock while the L2 latency decreased by 4%. Remember that when it comes to latency, lower is always better. All of the results so far use the CPU at 1.17v with DDR5-4800 memory. On the next page we will overclock the DDR5 RAM and see how far I can push Alder Lake with lower voltages (1.11v).