The release of the Chinese DeepSeek R1 model caused a really big splash on the stock market, and this week the discussions around it continued. Meanwhile, new models and new Chinese GPUs made headlines. So in this video I want to focus on the impact of all of this on the GPU market and why it's in fact a huge opportunity which may not happen again. I will also break down some new, very interesting Chinese GPUs that are on the way right now, so if you want to know where the market is shifting, watch till the end. In December 2024 the Chinese AI company DeepSeek released their V3 base model, which was extremely efficient, but no one paid much attention to it. At the end of January they released the reasoning model R1, which they claimed achieved performance comparable to OpenAI's o1, and this release just exploded. The first reason was its hardware training costs: what many people concluded is that the best NVIDIA GPUs may not be needed to make big strides in AI, and I think it's very important to discuss what's going on here. First of all, High-Flyer is the hedge fund that founded DeepSeek, and High-Flyer used to be one of the biggest NVIDIA customers in the Chinese market, purchasing tens of thousands of A100 and H100 GPUs, so there is no way they are a threat to NVIDIA. But still, this $6 million training cost figure made all the headlines, because it's very low compared to OpenAI's estimated $100 million for a similar model, and then we all witnessed how the media, as always, messed up the whole story. If we now look at the bigger picture, DeepSeek reportedly has access to roughly 50,000 GPUs. Among them are older A100 GPUs, but mostly different adaptations of the H100 Hopper GPU for the Chinese market, the H800 and H20 versions. What's interesting is that if we look at the specs, the H800 almost matches the peak performance of the H100 in most performance metrics; the biggest difference is in the memory and NVLink bandwidth. In practice this means slower data movement between memory and the
processing cores, as well as between GPUs, as discussed before. By now the H800 is also not allowed, and only the H20 is allowed, and this is quite funny, because the H20 is in fact preferable in some respects, as it has more memory, and in 2024 NVIDIA sold roughly 1 million H20 GPUs to China. The next NVIDIA GPU to come to the Chinese market was to be the B20, a derivative of the B200 Blackwell GPU, but the exact specs are yet unknown. When I saw NVIDIA stock dropping, I was like: shopping time! Because long term, I think this DeepSeek drama will only increase NVIDIA's valuation and tighten export controls, and it will all come down to the fact that Chinese companies will have to move to domestic options, which are getting better and better. And now we are at the most interesting part: let's discuss which options they have and what is yet to come. In fact, DeepSeek R1's reduced requirements on the compute side opened the door to a lot of domestic hardware. And yes, before the restrictions took effect, NVIDIA's share of the Chinese market was roughly 90%, but over the last few years Chinese companies have been working on getting a share of this pie, including Huawei, Alibaba, Moore Threads, Biren, Tencent, Enflame, Hygon and many more. Among them, the most interesting story is Huawei. Their Ascend 910B GPU is the most powerful GPU designed and manufactured in China, and it's in very high demand now. If you look at the official specs, its peak performance at 8-bit precision is 512 TeraFLOPs, so theoretically it has higher FLOPs than the NVIDIA H20 GPU, and now the R1 model is being ramped up on Huawei Cloud, which is partially built out of Ascend 910B GPUs. Huawei is challenging NVIDIA with a new chip for artificial intelligence: according to the Wall Street Journal, Huawei has reportedly told potential clients that the chip is comparable to NVIDIA's H100. At the same time, the new Ascend 910C GPU is in development; they've already manufactured the first samples and plan to ramp up mass production already this
year. If you previously watched this video, you know that SMIC, the Chinese semiconductor manufacturing giant, is currently struggling with yield in its N+3 process, which is roughly at 20% now, and this number is far off from what is typically required to bring a product such as this GPU to mass production. If you want to know more details on this, subscribe to the channel now and watch that video right after this one. Now, talking of the 910C: the GPU is manufactured in SMIC's N+3 process node, which is roughly equivalent to TSMC's 6nm process, N6, and it's rumored to be a doubled-die design, meaning doubling the silicon of the 910B GPU. This is very interesting for many reasons, first of all because it follows the general industry trend of building larger GPUs, since larger chips can handle more data simultaneously. It resembles the idea behind the latest NVIDIA Blackwell GPU, where we have two large GPU dies which contain the core logic and are linked by a very fast interconnect bridge; through this bridge one die communicates with the other, and every die is surrounded by four memory stacks. To package something as complex as this, NVIDIA is using an advanced Chip-on-Wafer-on-Substrate technology, CoWoS-L packaging, available from TSMC. Manufacturing this doubled-die design with this complex packaging is very challenging, because you have to align many, many pins, and you may have heard about NVIDIA delaying the release and shipment of their Blackwell GPUs due to manufacturing and thermal challenges. Here the secret sauce is this special packaging and TSMC's huge experience, and they were eventually able to nail it down. This kind of advanced packaging, however, is not available at SMIC; in fact, SMIC does not support any of the advanced packaging technologies, including CoWoS. It will be interesting to see how SMIC is going to handle this. Or maybe it's done on a single piece of silicon, a doubled die on one monolithic chip, and then no doubt they're going to struggle with yield, as they are already struggling with yield even for smaller designs. Let me know what you think in the comments. In fact, Huawei GPUs, as well as those of many other Chinese domestic hardware companies we will discuss in a moment, all rely on SMIC's fabs, which first of all have pretty limited capacity, often prioritized for Huawei products, and also struggle with yields. Manufacturing yield is the percentage of chips which are successfully produced without defects and are usable in the final product. According to the last available reports from the end of last year, SMIC's yield in the N+2 process node was roughly 30%, and this is a really bad number, because it means 70% of the produced chips are defective and have to be scrapped, while the Ascend 910C GPU will be done in the N+3 process node, which means a potentially even lower yield. Another big challenge for China is memory: to be self-sufficient, they need to fabricate high-bandwidth memory domestically, and they have no HBM manufacturing at the moment, but ChangXin Memory Technologies and Huawei are trying to solve this. Probably the most interesting part of the Huawei story is that they're not just designing their own silicon and building their own EDA tools (Electronic Design Automation tools, which support engineers in designing those chips); they are now buying manufacturing equipment, securing wafer manufacturing and memory manufacturing, and basically trying to cover the entire supply chain. This will help them achieve self-sufficiency, reduce reliance on SMIC, its yields and its capacity, and also reduce dependency on foreign suppliers. But we all know this is challenging to achieve, because many critical technologies and tools still rely on foreign suppliers. Let me know your thoughts on this in the comments. Next we will discuss the rest of the Chinese domestic GPU market, what's coming, and also the tricks DeepSeek used.
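As a quick aside, those yield numbers can be made concrete with a toy model. This is purely my own sketch using the classic Poisson yield approximation, not SMIC data; only the roughly 30% N+2 yield figure comes from the reports mentioned above, and the die area is an arbitrary placeholder.

```python
import math

# Toy Poisson yield model: yield = exp(-defect_density * die_area).
# My own illustrative sketch, not SMIC data; only the ~30% N+2 yield
# figure comes from the reports mentioned above.
def poisson_yield(defect_density: float, area: float) -> float:
    return math.exp(-defect_density * area)

base_area = 8.0                            # hypothetical die area, arbitrary units
d0 = -math.log(0.30) / base_area           # defect density implied by a 30% yield

single = poisson_yield(d0, base_area)      # reproduces the 30% baseline
double = poisson_yield(d0, 2 * base_area)  # doubling the area squares the yield

print(f"single die: {single:.0%}, doubled die: {double:.0%}")
```

Under this model, doubling the die area squares the yield, so a 30% single-die yield drops to roughly 9% for a doubled die, which is why a monolithic two-die-sized chip would be so risky on an already struggling process.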
And why Mark Zuckerberg started it all. But before this: as you may know, I'm building my own startup now, so I'm traveling a lot, meeting investors and customers, and when I travel I use public Wi-Fi, which often lacks security controls, making it easy for anyone to access it and potentially steal your private data, including sensitive information like login credentials, banking details and personal messages. As we saw recently, someone can just hijack your session and access your accounts without needing any credentials, and this is scary. That's where Surfshark VPN has been really helpful for me: it encrypts all the information sent between your devices and the internet, making it significantly harder for bad actors to mess with your personal data. The best part: Surfshark comes with Antivirus and Surfshark Alert, which notifies you immediately if your data has been compromised. I recommend you try out Surfshark VPN; it's an easy and affordable way to strengthen your online security. Go to surfshark.com/intech for 4 extra months of Surfshark. Thank you, Surfshark, for sponsoring this episode. As you will see now, there is no shortage of NVIDIA competition in China, including many government-backed startups like Hygon, Moore Threads and Intellifusion. A very interesting player among them is Moore Threads; I've made quite some effort to invite them onto the channel, not successful yet, but stay tuned. Moore Threads is a Chinese startup that has been developing gaming and data center GPUs. Their latest GPU, the S4000, is designed for AI acceleration in data centers; its peak performance is 200 TeraFLOPs at 8-bit precision and 100 TeraFLOPs at 16-bit precision, so it's not super impressive when we compare it to NVIDIA or Huawei GPUs, but it might be just enough for a model with reduced compute requirements. By now they've already built multiple computing clusters with tens of thousands of their GPUs and used them for training, for example of the 70-billion-parameter Aquila2 model. Also, it supports training and fine-tuning of all the
mainstream models like Llama3 and Qwen from the Alibaba group, and it also already supports the distilled versions of the DeepSeek-R1 model. Now, how did DeepSeek manage to build a model which requires significantly less computing resources for both inference and training? In fact, they implemented several interesting tricks. The main tricks are reasoning and their clever implementation of the mixture of experts architecture, which allowed them to reduce GPU compute requirements by 1/3. The idea is that the model is divided into sub-networks, so-called experts, and each of them is trained for a particular task on a particular data set; for example, one expert focuses on syntax while another specializes in semantic meaning, just like our brain might work: in our brain, the frontal lobe is responsible for planning and decision-making, while the temporal lobe processes auditory information, and then we have the fusiform face area, which is great at recognizing faces. In a mixture of experts architecture, this would be an expert trained for facial recognition tasks. This mixture of experts is connected to a so-called gating network, which takes an input and decides which experts are the most relevant to activate for this particular task, and this is in fact how they managed to significantly reduce the computational requirements. This is the big difference to the Llama3 model, which does not implement this mixture of experts architecture: it's a 405-billion-parameter model, meaning for each token prediction it activates all 405 billion parameters. In comparison, DeepSeek-V3 has roughly 671 billion parameters, but for each token prediction they manage to activate only roughly 40 billion parameters. So just imagine: for each token prediction, for each forward pass, they activate 10 times fewer parameters, and this is where this huge saving in compute is coming from. This is very clever, but it's not entirely new; other AI labs have been implementing it as well.
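To make the mixture-of-experts idea more concrete, here is a minimal sketch. This is purely my own toy illustration, not DeepSeek's actual architecture: each "expert" stands in for a feed-forward sub-network, and a gating network scores the input so that only the top-k experts run, leaving most parameters inactive for each token.

```python
import math
import random

# Minimal mixture-of-experts sketch (my own toy illustration, not
# DeepSeek's code). Each "expert" stands in for a feed-forward
# sub-network; the gating network picks the top-k experts per input.
random.seed(0)
NUM_EXPERTS, TOP_K = 8, 2

experts = [lambda x, w=random.uniform(0.5, 1.5): w * x
           for _ in range(NUM_EXPERTS)]
gate = [random.uniform(-1.0, 1.0) for _ in range(NUM_EXPERTS)]

def moe_forward(x: float) -> float:
    scores = [g * x for g in gate]                       # gating network
    top = sorted(range(NUM_EXPERTS), key=scores.__getitem__)[-TOP_K:]
    weights = [math.exp(scores[i]) for i in top]         # softmax over top-k
    total = sum(weights)
    # Only TOP_K of NUM_EXPERTS experts are evaluated for this token.
    return sum(w / total * experts[i](x) for i, w in zip(top, weights))

print(moe_forward(1.0))
print(f"experts active per token: {TOP_K}/{NUM_EXPERTS}")
```

At scale the principle is the same: with 671 billion total but only about 40 billion active parameters, the gate routes each token through a small subset of the experts, so the per-token compute tracks the active parameters, not the total.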
DeepSeek was just the first to combine all the tricks and to implement the training of such a model on this architecture that efficiently. Another trick was training the model at 8-bit precision from the very beginning: when you use fewer bits per number in your calculations, you reduce training time, computing resources and memory usage. Again, not entirely new, many other labs have been doing it as well, but all these innovations coming together allowed them to drastically reduce GPU resources. With that, they managed to train a model which is comparable to OpenAI's o1 and on many benchmarks similar to Gemini 2.0 Flash, which was released just a week before but got far less attention, because here geopolitical aspects play a big role. The second thing which made DeepSeek so attractive is the open-source part. They released the open weights, which are essentially the output of the training process, and they are open and usable: you can download and modify them yourself, and the paper is very detailed, I will link it below. This immediately puts pressure on OpenAI, Anthropic, Google and other AI labs. What I find interesting here: Mark Zuckerberg, love him or hate him, in fact disrupted the industry. You remember, back in 2023 Llama1 was leaked, and we all know these sorts of leaks, right? Starting from Llama2, he officially open-sourced it, and he kept doing it ever since. What Meta did with Llama was indeed disruptive and shifted the industry, and after that, something like DeepSeek was just a matter of time. Meta's strategy is to make LLMs a complement to Meta's products: Mark basically decided to make them a commodity, and this is a very smart move. Meta stays fully focused on their own core product, keeping users on Instagram and Facebook as long as possible, and their products benefit from LLMs, while for OpenAI and Anthropic, LLMs are at the core of their main product and business. It's a business strategy whereby you make complements of your core business a commodity. It appears counterintuitive, but essentially reducing the price
for a complement typically increases demand for your core product. NVIDIA did exactly the same with CUDA and all the models and software around their core product, around GPUs, and this drives up the value of that core product. This whole story is actually also about Google and OpenAI, as they are clearly in a red ocean, competing on LLM-based products. Google released its Deep Research feature, an AI feature to conduct comprehensive research on complex topics, and weeks later OpenAI released deep research and called it the same thing, so they are competing head-on. The general trend is that LLMs are getting better and better and cheaper and cheaper, reducing the gap between the free and the paid product. DeepSeek was inevitable, and considering the Chinese "Six Tigers", there is more to come. So where are we heading with all of this? It seems like LLMs are actually becoming a commodity; let me know what you think in the comments. If we go back to NVIDIA: NVIDIA is under pressure, but not from DeepSeek; it's coming from their CUDA moat, because it's not clear how long it's going to last. If you're not familiar with CUDA: CUDA is an entire ecosystem that allows AI researchers to program GPU clusters less as a distributed system and more like one giant GPU. CUDA is NVIDIA's moat; it's a complement to their hardware which drives the value of that hardware higher, and we don't know how long this moat will last unless they reinvent themselves, like they're now trying to do with Cosmos. In fact, another trick the DeepSeek team used is that instead of relying only on NVIDIA's high-level CUDA framework to configure their GPUs, they dropped down to the lower-level, assembly-like language PTX (Parallel Thread Execution). With that they managed to improve data compression and decompression, and they implemented a bunch of other configuration tricks for inter-GPU communication, and this allowed them to further improve the overall efficiency of the training.
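The compression side of this is easy to illustrate. Below is a toy block-wise int8 quantization sketch, my own illustration of the general low-precision idea and not DeepSeek's actual PTX kernels: storing one fp32 scale per block plus 1-byte values instead of 4-byte floats shrinks the data that has to move between GPUs.

```python
import struct

# Toy block quantization (my own sketch of the general idea, not
# DeepSeek's kernels): store one fp32 scale + int8 values per block
# instead of full fp32 values, shrinking what moves between GPUs.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid div-by-zero
    return [round(v / scale) for v in values], scale

def dequantize_int8(quantized, scale):
    return [q * scale for q in quantized]

block = [0.5, -1.25, 0.03, 2.0]
q, scale = quantize_int8(block)
restored = dequantize_int8(q, scale)

fp32_bytes = len(block) * struct.calcsize("f")  # 4 bytes per value
int8_bytes = len(q) + struct.calcsize("f")      # 1 byte each + the scale
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(f"{fp32_bytes} bytes -> {int8_bytes} bytes, max error {max_err:.4f}")
```

For large blocks the overhead of the scale becomes negligible, so the payload approaches 4x smaller, at the cost of a small, bounded rounding error per value.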
Remember: DeepSeek was highly motivated to squeeze every bit of performance from the GPUs they have access to, because the scarcity of resources makes maximum GPU utilization a necessity. In any case, long term those premium hardware margins will have to come down, and compute will be getting cheaper and cheaper, and if you believe in this trend of throwing more and more compute at the problem, then long term the one who can innovate and get access to lots of cheap energy will win. When we look at China, the cost of energy is lower than in the US, about 8 cents versus 13 cents per kilowatt-hour, but looking at their energy mix, it's not looking good: still around 50% of the energy comes from burning coal and oil, which is very polluting. Looking at their plan, though, the long-term strategy is to switch to renewables. I think long term this is not about access to semiconductor manufacturing, EUV tools or talent; it's about access to cheap energy, and we will need tons of it. That's why companies like Meta are building natural gas plants, and the next obvious step is nuclear power plants, but here we have to keep in mind how long it takes to build one: like a decade. I'm sure this was just the first release that got so much attention, but there is more to come. There are many interesting players in the Chinese AI market at the moment; the race is dominated by Alibaba and ByteDance, and then there are the Six Tigers, which are considered to be the leading AI labs in China, and as competition heats up, we can expect more breakthroughs from these players, as well as strong responses from the US and EU based AI labs. Let's hope for the best outcome for the whole world. Now I'm looking forward to reading your comments, and if you watched this far, consider sharing this video with your friends, colleagues and on social media, and subscribe for more content like this to stay up to date with what's next in technology; it's free but makes me very happy. And a little update: I'm hiring a researcher onto my team, and the job description is in the description box
below. Have a look, and if you feel like you're a good fit, feel free to apply. Thank you, see you in the next episode, ciao!
2025-02-24 10:22