Build a LangGraph AI Agent to Transcribe & Summarize YouTube Videos

Let's build an AI agent in LangGraph using JavaScript. Agents can be really helpful for automating parts of your life, or for taking data from one source and turning it into something else.

In this video, we'll be building a YouTube transcription agent that's able to pull transcriptions from a YouTube video and summarize them on your screen. For this we'll be using models running locally with Ollama, Next.js to build a frontend app, and a YouTube transcription tool from wxflows. So let's dive into VS Code and see how it's built. In VS Code, I've already set up a new project. In here, I'm going to run a command to bootstrap a new Next.js application. For this, I'll be using the Create Next App CLI, and I'm going to make sure I use the latest version.

I also need to provide it with a name for my project, which will be langgraph-youtube-agent. Setting this up requires you to answer a few questions. For example, would you like to use TypeScript? It then asks about a few other defaults, including whether to install Tailwind, which is a nice library to help you write CSS. Depending on your needs, you might make different decisions when building your own project. Once this is done, it generates a new directory with all my files.

I'm going to move into this new langgraph-youtube-agent directory, where I can find all the boilerplate code for my Next.js application. In here you can find a file called page.tsx, and this is the main file that's rendered when someone visits my application. I'm going to delete all the code that's in there and replace it with my own boilerplate for this application. I also need to make sure I set this component up as a client-side component, which means I can use state management later on.

Within this div, which has some Tailwind class names attached to it, I can set up all the code I need to render my application. That starts with a header containing the name of the agent, the YouTube transcription agent, followed by an input bar to submit a video link.

Once I save this, I could already visit the application in the browser by running npm run dev, but first I'll also add a placeholder video to mock the application setup we'll have later on. So right below my input bar, I'm going to paste this final bit, which shows an input bar with a button to submit a video link, followed by an iframe for a YouTube video.

I need to make sure all these attributes have the correct format though, because React expects camelCased attribute names rather than the plain HTML ones. I need to update the referrer policy, the frame border, and the allow-full-screen option accordingly. So let me format this file and then save it.
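As a rough sketch, the page.tsx boilerplate described above could look something like the following; the Tailwind class names and the placeholder values are assumptions to illustrate the layout, not the exact code from the video:

```tsx
"use client";

export default function Home() {
  return (
    <div className="max-w-2xl mx-auto p-8">
      {/* Header with the name of the agent */}
      <h1 className="text-2xl font-bold mb-4">YouTube Transcription Agent</h1>

      {/* Input bar with a button to submit a video link */}
      <div className="flex gap-2 mb-6">
        <input
          type="text"
          placeholder="Paste a YouTube link..."
          className="flex-1 border rounded px-3 py-2"
        />
        <button className="bg-blue-600 text-white rounded px-4 py-2">
          Transcribe
        </button>
      </div>

      {/* Placeholder embedded video; note the camelCased React attribute names */}
      <iframe
        width="560"
        height="315"
        src="https://www.youtube.com/embed/VIDEO_ID"
        title="YouTube video player"
        frameBorder="0"
        allow="accelerometer; autoplay; clipboard-write; encrypted-media; picture-in-picture"
        referrerPolicy="strict-origin-when-cross-origin"
        allowFullScreen
      />
    </div>
  );
}
```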

In my terminal, I can start the Next.js application by running npm run dev. This opens a new page in my browser, which I can visit to see my application. In the browser, you can see we have a header, a bar to submit a video link, a button to actually submit the link, and then a place where we render the video, including an embedded iframe from YouTube. We're going to add a LangGraph agent that's able to fill this space dynamically by using both YouTube and a model running with Ollama. So let me go back to VS Code, where I'm first going to kill the Next.js process running in my terminal,

and check whether I have Ollama installed properly. With Ollama, you can run models locally on your machine. These are all open source models that you need to download to your machine first. If you run ollama run llama3.2 for the very first time, it will download the llama3.2 model before starting. Because these models are open source, you can run them wherever you want.

For example, they're also available in watsonx.ai. I can see I have llama3.2 installed, so I can close this process by pressing Ctrl+D. Let me clear my terminal so we can proceed by installing LangGraph. LangGraph is a superset of LangChain, meaning you need to install some LangChain libraries in order to use it.

So I'm going to be installing LangChain, LangGraph, and some other core libraries. Once these have been installed, I need to create a new file, which I'm going to call actions.ts. In this actions.ts file, I'm going to create my LangGraph agent.

I also need to make sure this file is set up to run server-side by putting the "use server" directive at the very top of the file. In here I can start to create my transcribe function, which will contain the LangGraph agent that retrieves YouTube details and shows them on the screen. I'm going to call this function transcribe. It takes one input, the video URL, which should be a string. In here, I also need to import a number of the libraries we just installed.

So let's break down which libraries we need. We need ChatOllama, which is the chat interface for Ollama models running on your machine. We need a function called createReactAgent, which is used to create the agent in LangGraph. We then need to import two libraries related to creating tools, and finally, as we're using TypeScript, we need some type definitions in here as well.
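A minimal sketch of those imports at the top of actions.ts might look like this; depending on your LangChain version, ChatOllama may live in @langchain/ollama or in @langchain/community:

```ts
"use server";

import { ChatOllama } from "@langchain/ollama";
import { createReactAgent } from "@langchain/langgraph/prebuilt";
import { tool } from "@langchain/core/tools";
import { z } from "zod";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";
```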

With these in place, we can set up the agent inside the transcribe function. You can see I'm again using the chat interface for Ollama. I'm setting the model to llama3.2 and the temperature to zero, and I'm also forcing the large language model to return JSON. This is going to be important later on, when we look at the system prompt.

In the system prompt we force the LLM to return something in JSON format. If you look at our request, it needs a couple of messages: one is the system prompt, the other is a human message, which is usually your question. But as we're not building a chat app, we use a predefined question, and the video URL is the dynamic part.

In the system prompt, you can see that the LLM should retrieve the video ID for a given YouTube URL, so we rely on the LLM to dissect the video ID from the URL and return the output in a JSON structure that includes the video ID. Finally, we return this result, so whenever you call the transcribe function, you get this JSON object back.
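Putting those pieces together, the first version of the transcribe function could be sketched roughly like this; the exact wording of the system prompt is an assumption based on the description above:

```ts
export async function transcribe(videoUrl: string) {
  // Chat interface for a model running locally in Ollama
  const model = new ChatOllama({
    model: "llama3.2",
    temperature: 0,
    format: "json", // force the model to answer in JSON
  });

  // No tools yet; we'll add them later on
  const agent = createReactAgent({ llm: model, tools: [] });

  const result = await agent.invoke({
    messages: [
      new SystemMessage(
        "Retrieve the video ID for the given YouTube URL and return it as JSON " +
          'in the shape { "videoId": string }.'
      ),
      new HumanMessage(videoUrl),
    ],
  });

  // The agent returns a list of messages; the last one holds the JSON answer as a string
  return result.messages[result.messages.length - 1].content;
}
```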

Let's save the actions.ts file and then connect it all via the page.tsx component. At the top of this component, we need to set a few state variables. We're going to create two. One state variable makes the input bar a controlled component, meaning that whenever you type in the input bar, it updates the state with the latest value of the video URL. For this, I create a variable I call videoUrl, together with a function to update it, setVideoUrl.

This will be used in the onChange handler of our input bar. For this, I need the useState hook from React, which can be imported at the very top, and I'm going to set an empty string as the default state. Then I'll also create a state variable for the video, so whenever we retrieve a video using the agent, we store it in local state.

This means it can be used across the component. So I have const video, setVideo, equals useState, and this time the default will be empty.
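As a small sketch of the two state hooks, assuming they sit at the top of the Home component in page.tsx:

```tsx
import { useState } from "react";

// Inside the Home component:
// Controlled value for the input bar
const [videoUrl, setVideoUrl] = useState("");
// The retrieved video details; a type-safe definition is added later on
const [video, setVideo] = useState<any>();
```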

I will create a type-safe definition for this state variable later on. After setting this up, I can hook it up to the input box. But first I'm going to create a function that will actually call the transcribe function and use the agent for transcriptions. This transcribeVideo function needs the transcribe function that I have in actions.ts, and it then needs to parse the result.

Because even though we forced the LLM to return JSON, whatever LangChain or LangGraph returns to me is always a string, so we need to parse this string to get the actual JSON. This JSON is put in state using the setVideo function. If I scroll down, I can hook up my input box to use videoUrl as its value.

So we have videoUrl in here, and then we need to set an onChange handler and hook up the event to store whatever you type in that input box in state. So I call setVideoUrl and pass it e.target.value. Once I save this, I only need to make sure my transcribeVideo function is connected to the button. After doing this, I should be able to actually submit the request to the large language model connected to the agent.
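A rough sketch of the transcribeVideo helper and how the input and button are wired up, assuming the transcribe server action exported from actions.ts:

```tsx
import { transcribe } from "./actions";

// Inside the Home component:
async function transcribeVideo() {
  // The agent replies with a string, so parse it into an actual JSON object
  const result = await transcribe(videoUrl);
  setVideo(JSON.parse(result as string));
}

// In the JSX, make the input a controlled component and hook up the button:
<>
  <input
    value={videoUrl}
    onChange={(e) => setVideoUrl(e.target.value)}
    placeholder="Paste a YouTube link..."
  />
  <button onClick={transcribeVideo}>Transcribe</button>
</>
```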

The agent isn't retrieving the full transcription yet; it only makes sure we get the video ID. Later on, we'll add a tool to actually retrieve the transcription. Let me scroll down a bit to the iframe section, wrap it in curly brackets, and check for the presence of the video state before we render it. This means we can hook up the dynamic video ID in the source. So instead of the hard-coded source we have right here, we're going to create a new source, which is a template literal. It still contains most of the original value, because we need to keep the embed URL.

But now the video ID won't be hard-coded; instead we take the dynamic value available in state, video.videoId, and delete the hard-coded one.
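The conditional render with the dynamic video ID could look roughly like this inside the existing JSX:

```tsx
{/* Only render the embed once the agent has returned a video */}
{video && (
  <iframe
    width="560"
    height="315"
    // Template literal so the video ID comes from state instead of being hard coded
    src={`https://www.youtube.com/embed/${video.videoId}`}
    title="YouTube video player"
    referrerPolicy="strict-origin-when-cross-origin"
    allowFullScreen
  />
)}
```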

Once I save this, we should be able to start the application and view the first part of it in the browser. I'm going to run npm run dev, and in my browser I refresh the page and enter a video link. I probably need to set up a loading indicator so I know something is happening when I press the button, but we can see that the video is pulled in correctly and the video ID is passed to the YouTube iframe. If we go back to actions.ts, we need to do a bit more. You saw we already imported some tool libraries, so we're going to use these to create a tool that retrieves information from a YouTube video page.

For this we'll be using Playwright. Playwright is a library that opens a website programmatically and retrieves details from it. So I'm going to stop the command running in my terminal and run npm install playwright, which installs the Playwright library from npm. After installing it, we of course need to import it in our actions.ts file.

I can import the library at the very top, and then we can start defining our YouTube function. First I'm going to write the tool definition, because a tool in LangChain or LangGraph (LangGraph being the agentic flavor of LangChain) needs both a tool execution function and a tool definition. So if I have a tool which I call getYoutubeDetails, I'll use the tool function from LangChain to create it.

I'm going to make the execution function async later on, because it needs to be, but the rest is fine to be synchronous. Then I'm going to add my tool definition in here. Once I clean this up, you can see I've added the tool definition for a tool called getYoutubeDetails, described as a tool to get the title and description of a YouTube video. Its input is a video ID: the ID we have the LLM dissect from a given YouTube URL.

The actual callback function that's executed whenever the LLM proposes to call getYoutubeDetails is something we hook up here. In here we need to look at the input variables. We have an async function,

and in it we should be calling Playwright. So let me put this bit of code in here, make sure we take the input argument, and clean it up a bit. What Playwright is doing here is launching a Chromium browser.

Chromium is the open source project behind Chrome. The very first time you run this, you might need to run a command such as npx playwright install to download the Chromium browser for your project. Playwright then opens the given YouTube page in the browser, takes different locators, and stores them as objects.

First it looks for the h1 element that holds the title of the video, then it looks for the description, which sits in a div somewhere on the page, and then of course it closes the browser, because it doesn't need to stay open. So I've created this getYoutubeDetails tool, which has both the callback function and the tool definition. Let me save this and make sure we pass getYoutubeDetails to the model, which is hooked up in LangGraph to form our agent.
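Here's a rough sketch of what that tool could look like in actions.ts; the YouTube page locators are assumptions and may need adjusting if YouTube changes its markup:

```ts
import { chromium } from "playwright";

const getYoutubeDetails = tool(
  async ({ videoId }: { videoId: string }) => {
    // Launch a (headless) Chromium browser and open the video page
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto(`https://www.youtube.com/watch?v=${videoId}`);

    // Grab the title from the h1 and the description from its container
    // (these selectors are assumptions and may need tweaking)
    const title = await page.locator("h1.ytd-watch-metadata").textContent();
    const description = await page
      .locator("#description-inline-expander")
      .textContent();

    // Close the browser; it doesn't need to stay open
    await browser.close();

    return JSON.stringify({ title, description });
  },
  {
    name: "get_youtube_details",
    description: "Tool to get the title and description of a YouTube video",
    schema: z.object({
      videoId: z.string().describe("The ID of the YouTube video"),
    }),
  }
);

// Pass the tool to the agent so the LLM can propose calling it
const agent = createReactAgent({ llm: model, tools: [getYoutubeDetails] });
```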

Before we actually try it out, we need to update our system message, because there are now additional details to retrieve: the title and the description of the video. We want both of these to be present in the object that's sent to the front-end app, so I'll update the prompt a little as well. "Use any tool at your disposal if needed" is still valid, and we probably want to tell it not to return any data unless all fields are populated.
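The updated system prompt could be something along these lines; the exact wording here is an assumption:

```ts
const systemMessage = new SystemMessage(
  "Retrieve the video ID, title and description for the given YouTube URL. " +
    "Use any tool at your disposal if needed. " +
    "Don't return any data unless all fields are populated. " +
    'Return JSON in the shape { "videoId": string, "title": string, "description": string }.'
);
```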

By creating the tool and updating the system prompt, we should be able to try this out in the browser. For this I'm going to run npm run dev, which makes our application available in the browser again. Let me copy this YouTube URL and refresh the page, because that way we're certain we have a fresh history.

I'm going to put the URL in right here, and then let's wait for what the LLM and the agent generate for us. You can see it still retrieves the video, but we don't have the title and description yet, because we haven't connected them in our front-end app. So let's go back to the code and open page.tsx. In here we can replace "retrieved video" with the actual video title, and replace the lorem ipsum text with the video description.

We're getting some TypeScript errors here, so let's make this application type-safe by creating a type at the very top. Let's create a type called Video, which has a videoId that's a string, a title that's also a string, and finally a description, again represented as a string. This type should be used by our local state right here, so whatever is set as the video state actually matches the Video type definition, and we should do the same whenever we get the JSON back from the large language model: whatever is parsed should end up being of type Video. This resolves some of the TypeScript errors we saw at the bottom of the screen.
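A sketch of the type and the two places it's used:

```tsx
type Video = {
  videoId: string;
  title: string;
  description: string;
};

// Typed local state
const [video, setVideo] = useState<Video>();

// Typed parse of the agent's response inside transcribeVideo
setVideo(JSON.parse(result as string) as Video);
```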

We still get an error for the description, and this is why I like TypeScript: we forgot to put an "i" there. Now it should be all good. If we visit the browser, we see the application with the video title and the video description being pulled dynamically from the YouTube video page. You can see we have the title here, "Building AI apps with large language models", which matches the embedded video title, and in the description you can see the view count and when it was posted, together with the rest of the video description.

This is a great start, but we want to do something more, because I told you at the beginning we're building an agent that's able to transcribe YouTube videos, and for this we need to import a community tool from wxflows. So let me go back to VS Code, where I killed the process running the app, and I'll create a new directory which I'll call wxflows. We need to move into this directory, from which we can use the wxflows CLI to start importing community tools. These are tools that can be pulled from GitHub.

First, I need to make sure I have the CLI installed correctly; you can find the installation instructions in the wxflows documentation on GitHub. We can run the command wxflows --version, and it should print a version in your terminal. Once you've verified it's installed correctly, we can set up our project by running wxflows init. It's going to ask us for an endpoint name, because all the tools you create here are represented as endpoints.

I always like to use the name of my project as the endpoint name; that way I don't get confused later on. You can see there's now a new configuration file in the wxflows directory. We can proceed by importing the YouTube transcription tool, and for this I'm going to run the next command, which takes a tool for YouTube that's available on GitHub and puts it in our project. A couple of files are now created, including tools.graphql.

In here you can see we have a new tool called youtube_transcript. It has a description, "retrieve the transcript for a given video ID", and some formatting requirements for the video URL. I don't need to change any of this, but I do need to deploy it. As mentioned, all tools are represented as endpoints, so by running wxflows deploy you deploy this to an endpoint.

And this endpoint is what we connect to from our LangGraph agent: it's the endpoint you need for the SDK, together with your API key. So we're going to move back into the main project directory, and in here we're going to create a .env file. In this .env file we need to set the endpoint and also the API key.

These are two details that you really need; without them you won't be able to execute the YouTube transcription tool. For the endpoint, I can copy and paste the endpoint that was printed in my terminal after running wxflows deploy. For the API key, I need to run the command wxflows whoami --apikey, and this returns the API key right there in your terminal. Make sure to save the .env file and then close it. We also need to install the wxflows SDK, which is used to connect to the endpoint and retrieve the tools.

For this I need to run npm install @wxflows/sdk@beta. I make sure I install the beta version of the SDK, as it's under active development. Once it's installed, I can hook it up to my LangGraph agent in actions.ts. At the very top I need to import wxflows, and specifically the LangChain-integrated version.

I can scroll down a bit. I won't be needing my getYoutubeDetails tool anymore, but I might use it later on, because you can still use different tools side by side. In here I'm going to create a tool client, and the tool client is able to retrieve and execute tools that are available on wxflows. I have my endpoint and API key coming from the .env file, and then I retrieve the tools by looking at the lcTools property available on the tool client.

The tools I retrieved here need to be connected to my LangGraph agent like this. You might get some errors along the way, especially here, because we're trying to put an array inside an array, which is never a good idea. So let me update this and save it.
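Here's a sketch of how that wiring in actions.ts could look, assuming the SDK's LangChain integration and environment variables named WXFLOWS_ENDPOINT and WXFLOWS_APIKEY:

```ts
import wxflows from "@wxflows/sdk/langchain";

export async function transcribe(videoUrl: string) {
  // Connect to the deployed wxflows endpoint and retrieve its tools
  const toolClient = new wxflows({
    endpoint: process.env.WXFLOWS_ENDPOINT,
    apikey: process.env.WXFLOWS_APIKEY,
  });
  const tools = await toolClient.lcTools;

  const model = new ChatOllama({
    model: "llama3.2",
    temperature: 0,
    format: "json",
  });

  // Pass the retrieved tools directly; don't wrap the array in another array
  const agent = createReactAgent({ llm: model, tools });

  // ...invoke the agent with the system and human messages as before
}
```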

Before we actually try it out in the browser, we want to make one more small change and update the system message, because the agent now has a different tool, so we can give it a very specific description of it. Next to retrieving the title and description for a given video, we also want it to retrieve the transcript using the tool we just provided. We want the LLM to use all the tools that are available, and we give it some examples of how to use the YouTube transcript tool. And then, something quite interesting: we're going to use the transcript to generate the description. You might remember that earlier on we retrieved the description from the YouTube video page; this time we'll generate it with the LLM, based on the captions provided by our transcript tool. Let's also add the captions here, because we want to see them in our JSON output as well. So this is the video captions field; let me save this. I think we're all done in the actions.ts file, so we can make some changes in our page.tsx.

In here we now also need to retrieve the captions. So let's add captions to the type definition as a string. We're still parsing the result, and the captions should be part of it, and once we scroll down a bit more, we probably want to show the captions on the screen as well. So not only do we use the captions to generate a description, we'll also display them right here on the page.
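The additions on the page could look roughly like this; the class name is just an assumption for styling:

```tsx
// Add captions to the type definition
type Video = {
  videoId: string;
  title: string;
  description: string;
  captions: string;
};

// In the JSX, show the captions below the description:
// {video && <p className="whitespace-pre-wrap">{video.captions}</p>}
```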

Let me clean this up a bit (why am I getting an error here?) and then format the page. Now if I run the application again using npm run dev, I should be able to open the browser and see the captions for my video next to the title and description.

So I'm going back to my browser, copying this URL, and making sure to refresh the page. I paste the link, and then let's see what the agent generates for us using the large language model and the available tools. As you can see, we now have a title for the video, the embedded video using the video ID, and then the description, which was generated by looking at the transcript. I do see the transcript is a bit cut off.

A couple of things could be happening here: maybe the number of tokens we give to the LLM isn't sufficient to return the entire transcript, or maybe it's just a styling thing. If you want to update parameters such as the maximum number of new tokens, you can set them in our actions.ts file. If you scroll up a bit, you'll see where we construct the ChatOllama instance; you can give it parameters like this.

You can also set things like the maximum number of new tokens or the maximum number of retries; there are a lot of different options here. If you want to know more about setting this up, have a look at the LangChain documentation.
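For example, the ChatOllama configuration could be extended like this; numPredict is the Ollama-style name for the maximum number of new tokens in the LangChain integration, but double-check the parameter names against the LangChain docs for your version:

```ts
const model = new ChatOllama({
  model: "llama3.2",
  temperature: 0,
  format: "json",
  numPredict: 4096, // roughly "max new tokens"
  maxRetries: 2,
});
```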

And that's how easy it is to create your own LangGraph agent using JavaScript. In this video we used Next.js to build a front-end application, we used models running locally with Ollama and connected them to LangGraph, and finally we used a YouTube transcription tool from wxflows to transcribe videos from YouTube.

If you want to know more about building this application make sure to have a look at the link in the video description.
