Prompt Engineering for Generative AI • James Phoenix, Mike Taylor & Phil Winder • GOTO 2025

Hello. Welcome to another episode of GOTO Book Club. My name's Phil Winder, and I'm the CEO of Winder AI. Since 2013, Winder AI has been developing AI applications. And if you need AI engineers for your next project, then give me a call. I'm here today with James Phoenix and Mike Taylor. They are the authors of a brand new-ish book, end of 2024, "Prompt Engineering for Generative AI." I'm just laying it here, proof. Today, I'm hoping to basically delve through the book and ask lots of questions, follow-up questions that I had based upon all of the interesting insight that was provided in the book. Mike, would you

like to go first just to introduce yourself, tell us about what you do, and what made you interested in prompt engineering, to begin with. Sure, Phil. Thanks for having us here. I also love that you use the word delve, by the way, because that's one of the words that we had to search for in the book and make sure it wasn't in there too often, because it's what ChatGPT uses quite a lot. I got into prompt engineering in 2020 because I was actually just leaving my first company. I founded a marketing agency, grew it to 50 people, and then I was looking for something to do during the COVID lockdowns, and got access to GPT-3. So, the rest is history. I managed to start replacing all my work with AI and then eventually started to work in AI full-time. We could talk a bit more about that,

but that was my journey. Perfect. James, over to you.  Nice to meet everyone. I'm James Phoenix.  I'm a software engineer, indie hacker,   and predominantly been using GPT and  OpenAI for a good couple of years now.   I really enjoy using it for coding. So, I'm  really, really into the whole Cursor stuff  

and sort of, you know, replacing any part  of my workflow. I do a range of different   projects for clients. I've got one client at the  moment who I'm helping build some sort of rank   tracker for LLMs and, yeah, classification  pipelines and that kind of stuff. So, yeah.  Interesting. Well, thank you again for the book.  I found it really, really interesting. I think  

I'd like to start off by asking a few general  questions about the use of LLMs and generative AI,   in general, just to bring everybody up to  speed if they're not fully aware of what it is. I guess let's get the big one over  and done with. How would you define   prompt engineering? What is prompt engineering? Obviously, everyone thinks about it differently.  

Some people think of it as just kind of adding  some magic word on the end of the prompt that   tricks the AI into doing a good job, and that's  what it started as, I think, in a lot of cases.   But we take a broader view of prompt engineering  being this rigorous process of figuring out, you   know, through testing, through evaluation, what  combination of inputs lead to the best outputs   and the most reliable outputs as well because  I would say, you know, you don't need prompt   engineering that much when you're just using  ChatGPT as a consumer. But if you're building an   AI application, it's going to be very important.  There's a great source of improvement in accuracy  

just by getting the context right, by rewriting  the prompt in a certain way, it will give you much   better results. So, it's that kind of process for  really applying the scientific method as much as   possible with these LLMs to test and learn,  "Okay, when I write it this way with this   context, I get this type of response." So, it's a lot about rigor there. It's   engineering rigor that you're trying to  apply to a slightly non-deterministic   system, depending on how you configure it. Exactly. Yeah. And I come from a marketing   background. You know, growth marketing has a  great culture of A-B testing and that kind of  

came naturally to me. And then, James Phoenix, I think you could talk a bit more about this, but coming from a data science background as well, I think that's another rich profile for people getting into AI because it comes naturally that you start to really think about what are the inputs, let's look at the raw data, let's look at patterns, let's see what opportunities there are to improve performance. Absolutely. James, what do you think? I think definitely, and then there are also some other dimensions that you'll often work along whilst you're working in this kind of environment. Things like using more powerful models, switching the models quite a lot, and as well as that playing around with the parameters like temperature and log probabilities. It's one of these things where prompt engineering is definitely part of the stack. But, you know, as an AI engineer or someone working in that space, you'll often use a mixture of techniques alongside prompt engineering. So, you know,

changing the models, you might do fine-tuning. Like, there's a variety of different techniques that you're kind of trying to apply to maximize that output from a non-deterministic model. I guess one of the challenges I always have with my AI projects, data science projects, in general, is the problem definition, defining the thing that you're trying to solve. And so, does that still apply in prompt engineering? How important is the goal? I think you mentioned the word goal there, Mike. It's like 80% of the work. I find quite often when

I work with clients, they don't have any formal process for evaluation, and that becomes most of the hard work. Because once you have a way to measure whether this is a good response or a bad response that doesn't require the CEO of the company to come and check each response manually, once you have a programmatic evaluation method that you can run after every response, that's when it really blooms. You know, it really opens up the amount of things you can do in prompt engineering. You can A-B test, especially if you're not a domain expert in this area. So,

say you're working with a team of lawyers,  like, James, one of your projects, it's a real   bottleneck if you have to go back to the lawyers  every time to check whether the response is good   or not. You need to build up a test set, different  test cases, you need to know the right answer in   those cases, and then you need to set up some  evaluation metric that you can use to optimize.   Once you have that, you can do A-B testing.  You can build more interesting architectures   with retries and things like that. So, that tends  to be the main key. And if you don't have that,   you can't really do prompt engineering. Just to add onto this, I think the other thing  

that's really interesting is, for example, it's much harder to evaluate content generation versus a classification pipeline because you can easily see when a classification has gone wrong. So, depending upon what you're trying to do as the goal, if your goal is to produce, you know, social media posts, the evaluation isn't necessarily as important there. I mean, you could make the argument that brand text and guidelines are quite important, but there's less of an effort on that side. Whilst if you've got very wrong classifications, that's going to have a lot more of an impact on downstream data and, you know, the applications that are consuming that data. There's a risk of how well the evaluations have to work and also how easy it is to tell whether it's good or bad. If it's classification, a binary classification,

it's very easy to see if it's wrong. It's a lot  less easy to see whether it's wrong when you   have a human evaluating this piece of marketing  material versus that part of marketing material   because it is a lot more vague. There's a  lot more nuance to language than, you know,   binary classification as well. Those are all  the types of things you also run into as well. Cracking. I guess before we dive any deeper,  are there any misconceptions or anything that   we'd need to sort of clear up before we dig in? One of them is just that prompt engineering isn't   an art as much as it is a science. I would  say that prompting is an art and, you know,   there's a creative element to coming up with new  ideas for prompts. But I wouldn't call it prompt  

engineering unless you're doing testing and,  you know, trying to work at scale. And that's   something I struggle with as well as someone who  has a Udemy course with James, and we wrote the   book as well. Those are targeted more towards a  technical audience, but initially, a lot of the   audience was non-technical. We get a huge demand  from people who don't know how to code, who want   to read the book or, you know, want to get better  at prompting. And it's just kind of two different   things, right? Like, there's only so much testing  you can do manually without, like, you know, being   able to code and being able to run it 1,000 times  or 10,000 times and see how often it breaks. So,   that's something that we still don't have,  I think, a good handle on in the industry,   whether...you need to know how to code to be a  prompt engineer because when I deliver training,  

quite often the non-technical people say, "This is  too technical." And then the technical people say,   "This is too simplistic." And so, what I try  to do is I call it prompt engineering if it's   for a technical audience, and then I just call it  prompting if it's not. But not everyone has that   same definition in their head. So, that's the  thing I would love for more people to clear up. You're almost saying that it's actually more  like programming than it is writing. So,  

do you consider prompt engineering to be a new style of programming? It feels that way. I really like Andrej Karpathy's take on it, that it's basically a new kind of abstraction on top of machine learning, which was an abstraction on top of programming initially, right? Rather than building a special-purpose computer the size of a room to do a specific counting task. That was the first programming, right? It was actually literally building the computer. Then we had general-purpose computers where you just have to write the program and run it. Then you have machine learning where you just need to give it the data and it will

write the program. We have these pre-trained  models. Now, we just need to kind of write in   plain English what you want and then find some  way of evaluating it. It does feel like a new   way of programming. James, I mean, I don't know if  you want to talk a bit more about your work with   Cursor. You're very deep into using... I think the one thing...  I don't know how much code you're actually writing  anymore or if you're going to see yourself more   as an engineering manager of AI agents. For sure. Cursor is definitely like writing   probably 60% to 70% of my code. I will say that  when I'm learning something new, I actually  

manually type it out by hand. I think that's  a really good way to still learn. Otherwise,   you're sort of blindly copying and pasting. But,  yeah, definitely what I'm starting to find is even   in engineering workflows, you can specifically  have prompts to just take that piece of work. So,  

the good one is, if you're in Composer for the whole day in Cursor, you can have a separate prompt or a separate notepad that will generate a progress report in a specifically standardized way. Like, what were the key learnings today, the key blockers, the next steps? You just say, "Generate a progress report." And because Cursor has now got an agent mode, it can also run Linux commands to get the right date and time and format and that kind of stuff. Or you can do things like have a prompt to generate a git commit message based on git conventional commits, such as prefixing with fix or feat or chore.
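As a rough sketch of that kind of reusable commit-message prompt, here is a minimal Python version; the prompt wording, the model choice, and the use of git diff --staged are illustrative assumptions rather than anything specific from the conversation, and the OpenAI client usage assumes the current Python SDK.

```python
# A small script that asks an LLM for a Conventional Commits message
# based on whatever is currently staged in git.
import subprocess
from openai import OpenAI

client = OpenAI()

COMMIT_PROMPT = """You write git commit messages following Conventional Commits.
Use a prefix such as feat, fix, or chore, then a short imperative summary.
Base the message only on the staged diff below.

{diff}"""

def suggest_commit_message() -> str:
    # Collect the staged changes so the model has concrete context to work from.
    diff = subprocess.run(
        ["git", "diff", "--staged"], capture_output=True, text=True
    ).stdout
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": COMMIT_PROMPT.format(diff=diff)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(suggest_commit_message())
```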

things like have a prompt to generate a git commit  message based on git conventional commits, such as   a prefixing with fix or feed or chore. You can kind of automate these smaller routines   or subroutines that you're doing as a programmer,  as well as the code that you're also generating.   I definitely find that, yeah, LLMs are speeding us  all up. I think going back to what Mike was saying   about, you know, you've got this non-technical  versus technical audience, I think with the   technical audience, it's very much more focused  around scientific rigor and experimentation when   it comes to AI engineering and prompt engineering.  Then I think for the non-technical audience,   what I always recommend is just enriching the  context as much as possible because they're   not going to sit down and run O1 Pro 100 times or  O1 100 times because it's just the latency is too   large. But if they get the right context  in there, then it's probably going to be   good enough if they're just using one response  or two responses. So, I think, yeah, putting an  

emphasis on the richness of context is really  great for a non-technical audience, for sure. Interesting. Let's move on to the book. One  of the big headline topics that you kind of   go back to throughout the book is these five  principles of prompting. I guess let's have   a little bit of a background. So it's a nice  catchy title. I can see why you've done it.   But I guess why did you come up with them in  the first place? Like, why not six or four?  Good question. It came from self-preservation  essentially. When GPT-4 came out, we had a bit   of a panic, I think, because a lot of the prompt  engineering techniques that we used to use GPT-3   just weren't really necessary anymore. You  didn't have to threaten it to return JSON. They  

started to follow instructions much better. All  frontier models are pretty good at following   instructions. You don't need to kind of  hack the prompt as much as you used to. But,   you know, looking ahead, I thought, "Okay,  we're just about to write a book. You know,   we're creating a course as well. We don't want to  have to update those two assets too often. And so,  

let's think really deeply about what would  form the core of these principles. What things,   you know, were we doing with GPT-3 that we're  still doing today with GPT-4? And looking ahead,   like, what do we think will still  be useful with GPT-5 and so on?"  That was really just an attempt to make sure we  didn't have to do version 2 of the book three   weeks after it was released in Britain. We brought  it down, and we tried to condense the principles   as much as possible. Then actually, we were  pretty happy when OpenAI came out with their  

principles. They have a prompt engineering guide  now. It came out after we'd already written most   of the book. I quickly checked it, and I was like,  "That principle maps to this principle." It felt   good. But I feel like pretty much anyone who works  with these tools will arrive at a similar set of   principles. And there's nothing magic to it.  Anyone who's very experienced will get these  

straight away and will recognize them. But   starting with those five principles in the   first chapter just really helps people ramp up  if they're not as familiar, if they don't have   as much experience in prompt engineering yet. Yes. James, what do you think about the five?   Like, have you got any ones  that you particularly like?  I think division of labor is something that I  use quite a lot because when you're combining   and composing multiple prompt chains, you are  essentially breaking down a larger problem into a   series of sub-problems that basically once solved  will solve the larger problem. And I find that   works incredibly well. I think, you know, that it  also works quite well with imperative programming  

languages. You don't really want to do everything  up until this point because you don't have certain   data. Maybe you have to go to the database at  that point. And just from the nature of things,   you'll find that when you're doing prompt  chaining and creating these chains, they   naturally don't all sit together anyway within  the code and within the data pipeline flow for   backend applications. So, yeah, definitely  division of labor, I think that's a really  

good one. It's probably my favorite, for sure. I think it might help actually if we just give   a little example there, James. So, if division  of labor is your favorite, could you just walk   through how the average person might do that for  a simple task like writing an email or something?   How would you approach that in terms of dividing? So, if we're thinking about writing an email,   the first thing you might want to do is gather  some relevant context about that person or that   company. The other thing you might want to do is  gather context about previous emails that have   been sent because there might be several email  threads that are happening, which actually would   be really beneficial to know about. So, those  are the kinds of things you would do as upstream   chains. Then the second thing you might do is  then generate the email. Then the third thing  

is you might have a human approval loop or a human approval stage. And then the fourth thing is you might get another LLM to critique the email and look for any spelling mistakes or discrepancies or things that are missing. And then you would have a final step to say, "Send the email." So, rather than just say, "Let's create an email from this email," there's a step one, which is gathering the right relevant context. So, that might be pulling in some additional information, or maybe you've got a database full of the people that you're talking with, or it could just be also gathering relevant emails from your Gmail or the API. The second stage is

generate the email, and then you've got maybe some  human approval step or a human in the loop step,   and then you've got a critique step, and then send  the email. So, rather than trying to do everything   in one step, you're kind of breaking that down  into about four steps. And the reason why is that   then you can have a deterministic way of doing  that while still using LLMs within that workflow. 
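As a rough illustration of the chained workflow James walks through, here is a minimal Python sketch. The helper functions fetch_crm_notes, fetch_recent_threads, and send_email are hypothetical stand-ins for real integrations, and the OpenAI client usage and model name are assumptions rather than anything prescribed in the conversation.

```python
# A minimal division-of-labor chain: gather context, draft, human approval,
# critique, then a deterministic send step.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def fetch_crm_notes(recipient: str) -> str:
    return "Placeholder CRM notes"        # stand-in for a real database lookup

def fetch_recent_threads(recipient: str) -> str:
    return "Placeholder thread history"   # stand-in for a Gmail/API call

def send_email(recipient: str, body: str) -> None:
    print(f"Sending to {recipient}:\n{body}")  # stand-in for a real email API

def draft_and_send(recipient: str, goal: str) -> None:
    # Step 1: upstream chains gather relevant context.
    context = fetch_crm_notes(recipient) + "\n" + fetch_recent_threads(recipient)

    # Step 2: generate the email from the goal plus the gathered context.
    draft = ask(f"Write an email to {recipient}.\nGoal: {goal}\nContext:\n{context}")

    # Step 3: human-in-the-loop approval before anything leaves the system.
    if input(f"Send this?\n\n{draft}\n\n[y/N] ").lower() != "y":
        return

    # Step 4: a second LLM pass critiques and fixes spelling or missing details.
    final = ask(f"Fix any spelling mistakes or omissions, keep the meaning:\n{draft}")

    # Step 5: deterministic send step.
    send_email(recipient, final)
```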

Yeah. If you find you're having trouble with one of the steps, you can then continue to break that down further. So, one of the things I found quite good for creative tasks, like the actual generation of the email, is to split that into, "Okay, write the hook first, and then based on this hook, write the email." And just by splitting that into two smaller tasks, I find that you end up getting much more creative responses. You can test whether this model is just not good at writing a hook, like a way to draw people in, or whether this model is just not good at taking a hook and writing it out. You might find that you use two different models for

those two things. Like, you might have to use a much better model for the creative task than you need for the actual email writing task itself. The other thing to note here is that how cognitively capable the models are determines when you need to split tasks into multiple sub-tasks. So, when you had GPT-3, it was quite advantageous to have more division of

labor across smaller and easier tasks. Now that you've got o1 and o1 Pro, you can essentially just one-shot entire scripts or one-shot entire processes. Obviously, that retrieval step might be done separately. But if we're looking at something like editing a code file, you could maybe do all of that in one go. So, your division of labor now goes up a level: "Okay, now I can solve a larger problem and do division of labor with o1 to then maybe analyze, I don't know, 10 or 15 files and refactor all those files in different ways." So, you can still kind of use that principle even with a larger model. You're just basically using that to make o1 solve things that

it can't inherently solve in a one-shot process, if that makes sense, because every model has some sort of top-end capability. You basically have to discover when o1 consistently fails at doing some type of task. And that's when you would then start breaking that down into a more deterministic workflow and using division of labor to then make o1 or o1 Pro do something that it can't do without division of labor, if that makes sense. That makes sense. Then two questions sort of pop into my mind. The first is, it was really interesting to note that despite the focus on prompt engineering and working with language models, in general, at least half of the steps that you described there, James, were getting data, ingesting data from other sources in order to use it in the context of generation. And it's almost like you're still writing software effectively. You're still plugging things together, you know, to generate

the final thing. So, I found that interesting. But the second point is, you mentioned there that you're constantly trying to find the capabilities of the model that you're using. Have you had any experience of maintaining a solution over the lifespan of multiple different models? Like, what happens when OpenAI deprecate, you know, GPT-4 or something? Does that mean all of your stuff is suddenly going to break because suddenly the capabilities of the model have changed? I think, in general, things get better. There are some nuances to that where, you know, like, if you're comparing, like, the new reasoning models versus the chat models, then they're completely different paradigms. But in general, classifications probably get more

accurate. The output of the text feels kind of more human. That being said, you know, there are scenarios where you can get regressions because the newer model doesn't work with the old prompt in the same way. And we actually did experience that for one of our products called Vexpower when we had an automation script that would basically generate a large part of the course from listening to a video transcript and ingesting the course materials and sort of helping with the FAQ and the exercise generation. We found that when we moved from, I think, GPT-3.5 to GPT-4, and GPT-4 Turbo, the original prompt just basically broke, and I had to go in and re-architect and just basically go in and... And so, yes, you can get changes. I think that is also happening less and less. And the reason for that is a lot of developers are now using something called structured output parsing, which is a natively supported output format from OpenAI, and Anthropic also offer something similar, which basically allows you to define Python Pydantic models or Zod models for data validation. And the models that you can use, so, for example, GPT-4o Mini, have been fine-tuned to

do JSON decoding and generation of JSON in a very deterministic way, so that when it does produce JSON, it doesn't run into validation errors when you're parsing that string into a JSON type. We are finding that the more you structure outputs, the less is potentially going to change in terms of the data that's coming out. So, the structured outputs API is definitely worth looking at. It's something that I use actively on most of my projects. And that's been a massive improvement versus what we had to do two years ago, where you had to specifically put in the prompt, "This is the kind of JSON structure that I want," and sometimes it wouldn't conform to that.
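As a rough sketch of the structured-output approach James describes, here is a minimal example assuming the OpenAI Python SDK's Pydantic-based parse helper (currently under the beta namespace at the time of writing); the SupportTicket schema and model choice are invented for illustration.

```python
# Structured outputs: the Pydantic model is converted into a JSON schema,
# and the SDK returns an already-validated object instead of a raw string.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class SupportTicket(BaseModel):
    category: str   # e.g. "billing", "bug", "feature_request"
    urgency: int    # 1 (low) to 5 (high)
    summary: str

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # a model fine-tuned for deterministic JSON generation
    messages=[
        {"role": "system", "content": "Classify the customer message."},
        {"role": "user", "content": "I was charged twice this month, please help!"},
    ],
    response_format=SupportTicket,  # the SDK derives the JSON schema from the model
)

ticket = completion.choices[0].message.parsed  # validated as a SupportTicket
print(ticket.category, ticket.urgency)
```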

With the new capabilities as well, you find that even if your old code doesn't break, it's just a really inefficient way to do things because the older models tend to cost more and the new models tend to be more...they have more capabilities. So, I had a recent example where I had a client. They were doing video transcription, and they were using a service for the transcription of the audio. And then they're taking that transcript and then doing some kind of entity extraction from that transcript. We found with Google Gemini, which takes audio as a native input, it doesn't need a transcription model on top. It doesn't need Whisper, which is what they were

using before. Now, you can just dump the whole  thing into Gemini, and they don't need to chunk   it up into different sections first. They didn't  need to transcribe it first. Gemini just does   it all in one shot. Because they have, I think  it's like a million or two million token input,   you can put the whole call transcript in  there...oh, actually the whole call audio   in there and then just get back out entities. So,  it just massively simplified the whole pipeline  

once we got that working. It just took a lot of  effort to optimize the prompt, but now it's at the   same sort of 75% accuracy, something like that,  that the previous system was, but at a much lower   cost. I think it's about 60% lower cost. Much lower maintenance burden as well. I think we sort of almost highlighted a little  bit there about talking about task decomposition   and specifying JSON output structure formats and  things, about specificity, about being specific   about what you want the model to do. But there was  a really interesting quote from your book that I   found interesting. "If your prompt is overly  specific, there might not be enough samples in   the training data to generate a response that's  consistent with all your criteria." So, that's   saying that you can go so deep sometimes that you  actually end up in a space within the model that   doesn't have any previous examples. Therefore what  you get out might not be appropriate. I guess that  

might be more of a problem for image models,  possibly, than it is for text because I would   guess there's maybe more gaps in the training  data for image models than there are for texts.  There's a really good example I can give  of this. I've actually got an example.   I was doing basically an interview for a job and  sort of a placement there. And they created a   custom data structure to basically map that kind  of backend system to the React front end. And you   can't use prompt engineering if you have a custom  data structure because these models don't have   any data in the pre-training data about that type  of custom data structure. So, the only thing you  

can really do at that point is few-shot learning with their own custom data structures or building a fine-tuned model, which takes a lot more time. When you're working with coding, if you're creating kind of a custom way of doing some type of CRUD operation and you're not using a standard backend, so you're maybe using a JSON file as your backend, and you have your own custom data structure, and you have your own kind of ways of manipulating that JSON in an API layer that gets exposed to the front end, then all of that is very difficult to actually work with in terms of figuring out how does that work? How can we automate parts of that? How can we know which API endpoints to call, because the data structure is specifically custom? Now, if you start using Postgres and you give it the Postgres tables or, you know, maybe you're using MongoDB, there's enough data in the pre-training to know what types of operations you can do on any type of database, in general. That's where, you know, there is a trade-off: if you do start doing things in a different way that is off the standard path, then there's probably less likely to be data in the pre-training for it. Another example of this is if a package

changes tomorrow, there's probably no pre-training  data in that foundational model about that package   at this point in time. Therefore, if you try and  generate code about that newer package, you're   going to get older package code that's generated.  So, you know, you've got to be very careful about   specifically those types of problems. I've seen the same thing. Specifically   on image models, which is the part of the book  that I handled most of, you used to get to these   really sparse areas in latent space where there  just...you know, there might be a lot of pictures   of Ronald McDonald, and there might be a lot of  pictures of astronauts, but there's not that many   pictures of Ronald McDonald as an astronaut,  right? And then the more variables you layer   on top of that, the more likely the model  is to get confused and not know how much of   Ronald McDonald to put in there versus how much  of an astronaut to put in there because there's   some kind of inherent conflict. And that could  lead to really creative results in some cases,  

but also quite often it leads to poor results when  you step too far out of the training material.  The solution there is, you know, you can train  new weights for the model. You can use DreamBooth,   which is a fine-tuning technique, or LoRA, and  give it examples. But then when you're training   the model, fine-tuning the model, then you're  in some ways kind of constraining the creativity   of the model as well. So, this is always a  trade-off because if you think about the more  

examples you give it, and this is true of text generation as well, the more you steer it in one direction, then the less freedom it has to come up with something that you didn't think of. Usually, to avoid getting stuck in this trap, I'll try prompt engineering first. I'll try and see what it comes up with just natively. Then if it's too out of bounds, then I'll start to layer on few-shot examples or, you know, step towards fine-tuning. But I don't jump straight to few-shot and fine-tuning for that reason. I want to give it an opportunity to surprise me and give me something that I wouldn't have thought of, so that then I can adapt my vision accordingly.
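As a minimal sketch of what layering few-shot examples looks like in practice, for the kind of custom data structure James mentions that the model has never seen in pre-training: the schema, the example pairs, and the model choice here are invented for illustration, and the client usage assumes the OpenAI Python SDK.

```python
# Few-shot prompting: a couple of worked examples teach the model a custom
# output format it cannot know from pre-training data alone.
from openai import OpenAI

client = OpenAI()

few_shot = [
    # Each user/assistant pair shows how plain English maps onto the custom structure.
    {"role": "user", "content": "Add a text field called email to the signup form"},
    {"role": "assistant",
     "content": '{"op": "add_field", "form": "signup", "field": {"name": "email", "type": "text"}}'},
    {"role": "user", "content": "Remove the phone field from the checkout form"},
    {"role": "assistant",
     "content": '{"op": "remove_field", "form": "checkout", "field": {"name": "phone"}}'},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "Translate instructions into our custom form-schema operations."},
        *few_shot,
        {"role": "user", "content": "Add a dropdown called country to the signup form"},
    ],
)
print(response.choices[0].message.content)
```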

How do you actually go about finding that middle ground, that perfect balance between specificity and creativity? Is it an iterative process based upon the problem that you're trying to solve, or are there tricks of the trade? For image generation, in particular, it's much more iterative. I think image models are starting to get smarter, starting to get better at adherence to the prompt and character consistency as well, especially some of the video models where character consistency is really important, because you can't have your character change their face halfway through the video. It doesn't work that way. So, that is something, I think, that's a very strong active development space.

And the models will just get better natively at  it. But until it works better out of the box,   it is very iterative right now. And it  just takes a lot of trial and error. The next sort of bunch of questions I had were  all sort of starting to dig into the use of RAG,   really. Actually, I think that was really  interesting. So, the quote that I pulled out was,   "The real unlock is realizing that every part  of the system can be broken down to a series of   iterative steps." And we talked about that, and  we touched upon it earlier with your comment,   James, where you were talking about task  decomposition and how the newer models are   actually able to do that themselves. But what  struck me, I think, with one of your examples,  

you mentioned in there that using the words, literally the phrase "step by step," forces it into this mode of planning, effectively. It made me wonder what other words are hidden within the models that force certain behaviors? And why are they there? I think maybe it's just the way that the model learned from the pre-training data, that when it has that, then the next bit of information that it sees or produces is more based on a reasoning chain. Thinking step by step, or let's explore this, or let's think through all the steps, or let's use a chain of thought.
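As a tiny illustration of that kind of trigger phrase, here is the same question asked plainly and then with a step-by-step instruction appended; the model choice and the exact wording are illustrative, since any phrasing that asks for intermediate reasoning tends to have a similar effect on chat models.

```python
# Comparing a plain prompt with one that asks the model to reason step by step.
from openai import OpenAI

client = OpenAI()
question = "A train leaves at 14:20 and arrives at 16:05. How long is the journey?"

for prompt in (question, question + "\nLet's think step by step."):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content, "\n---")
```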

So, all of those phrasings were early ways to get some type of reasoning out of the model and then use that reasoning to hopefully produce a better answer. It's interesting because a lot of reasoning models do this now. So, o1 Pro will do reasoning for you. I do think chain of thought is becoming a less-used technique. It's still used in chat models. But specifically for reasoning models, they're already doing that. And they're doing a

turn-based approach where they go through a series of steps, and they'll use the reasoning tokens within that single step, which then go into the input of the next step. I would actually say that we're using those types of things less. I think the principles that are kind of standard are things like giving direction, for example, you know, getting that relevant context in is still very important. Specifying format, again, is something we're using quite a lot with structured outputs. And few-shot learning is still really important. But I would say that chain-of-thought reasoning is probably going away, at least in reasoning models, for sure. The new one I've been seeing a lot of, and soon to be baked into most models, is more like self-evaluation or backtracking. I don't know if you've seen this when using Claude, where it

will start writing out the answer, and it'll say, "Oh, no, sorry, I made a mistake there," and then it will change its direction. So, it's kind of evaluating itself. So, rather than a chain of thought, which is like planning ahead, then you have this kind of other concept of like after it started to generate, does it backtrack or change direction? But I would say, generally, the reason why they're there in the training set is just that this is what people do, right? When I was running my marketing agency, I would tell people to plan what they're going to do for a presentation before they create the presentation. I would tell them to write an outline for a blog post before they wrote the blog post. And that's because thinking step by step through a problem really helps people access high-level thinking and make sure they don't forget anything, make sure that they are doing things in the right way to achieve a goal. And if step-by-step reasoning helps humans, well, LLMs are kind of like a human brain simulator, if you think about it. They're

trying to predict what a human would say in that situation. It makes sense that the techniques that work for people will just work for LLMs as well. That's what they've seen in the training data. And when people think step by step, they tend to end up producing better results. So, it's kind of a way to steer them towards those better results. It's just specifically the use of that word that I find a bit weird. It's not a word that I'd used in the past, or that you'd have

read in the past. And then all of a  sudden, it's a word that everybody's   using because it steers the model so much. Because it's the word that they specifically   used in one of the scientific papers. Exactly. Then everyone bakes that into   their libraries and blog posts and prompts. And  then you might learn from that second order or  

third order by reading our book or whatever. But, yeah, you don't have to specifically use that phrase. Actually, it's a big misconception. I think a lot of people think that... They think, "Oh, you have to use the phrase, Let's think step by step." But actually, any combination of words for getting it to think through the problem first is going to work. I think there's also some emotional prompting that still works with o1 Pro, for example. So,

if you tell o1 Pro, "I'm not really bothered about this answer," it will give you something pretty small. But if you say, "Oh, I have to get this right because my livelihood depends on this," o1 Pro is going to think for much longer, and it will give you a much longer output as well. So, even emotional prompting still works. We don't have that "think step by step" anymore. But emotional prompting definitely still works with o1 Pro. Interesting. I don't think that was one of your recipes in the book there, emotional prompting. Your next book should be a self-help book. I

think it'll be useful there. Therapy with LLMs.   Absolutely. There were quite a few  other strategies that were mentioned.   I think the one that jumped out to  me is the query planning prompting,   attempting to address multiple user intents. Could  you talk a little bit more about query planning? 

Query planning is basically where you have a user  query, and then rather than sending that straight   to an LLM, you can decompose that down into a  series of intents, and then you can work out   the order of those intents and then execute those.  So, that can be useful to specifically figure out   what to do at what point. There's also something  else that you should be aware of called routing,   which is another way of doing query planning,  where you can basically take a user query,   and you have a bunch of destinations. Let's  say you've got three destinations, A, B,   and C. You can take a user query, and you can  specifically find what type of route should  

that user query be forwarded to. Let's say you've  got three functions, one's generate a summary,   one's write an email, and one's look for some  order information, if the user query is like,   "I really want to summarize this information,"  the LLM router will basically take that query   and decide, "Okay, it needs to go to this generate  summary route," and it will just forward over the   user query. So, that's another approach. Query planning is basically where you're   trying to decompose the query, figure  out all the individually related intents,   the nested intents and dependencies there, and  execute those. And then you've got routing,  

which I've just described. There's also in the  OpenAI now, natively, they do support something   called parallel tool execution, or parallel  function calling, which basically means you   can have a query that has mixed intents. So, if  we say we have a function that can send an email,   if I have a user query that says, for example,  "I want to send an email to John and Jane," and   you've had like, you know, "And I want to say this  to John, and I want to say this to Jane," OpenAI's   function calling or their tool calling natively  now supports the ability to recognize multiple   intents and then execute those in parallel. Query planning is very useful. Routing is useful.   And we now also have native support for being able  to handle mixed intent using, you know, standard   packages with parallel function calling. So,  yeah, there's a range of things. I think routing   is a very interesting one. You can obviously use  that for forwarding requests. And then we've got  

parallel function calling, which is great for handling those kinds of multiple intents within the same user query. I would say, though, that if you had maybe 5 or 10 intents in a query, that's when, you know, manually breaking that down and figuring out those things is probably going to be more useful than sending it straight to an agent and relying on the agent to specifically find, you know, what tool calls to generate and what tool calls to execute on your backend, because, yeah, basically, there are more things that can go wrong and you have less control. That's why you've got a lot of these frameworks like LangGraph, which are basically trying to, you know, not always generate every single tool in the kind of recursive while loop. They are

kind of breaking these things out into DAGs, so directed acyclic graphs, where you have a series of functions kind of hop along and then it goes back into the agentic loop. And you can do that natively in Python. And the way you should do that is, let's say you've got a tool call: rather than just returning it straight back to the agent, you could call three other Python functions inside that tool's Python function. You've got step one, step two, step three. So, you can kind of already just write your own kind of LangGraph approaches to this problem where you don't always want the agent to be responsible for generating the flow of information. Therefore, you know, tool A can be called, but tool A also does three things inside of that tool. So,

there's a variety of different things that you should be looking at. Routing is important, obviously, tool calling is important, but, yeah, you don't always have to rely on the agent to decide exactly what tools need to be generated and what ones need to be called.
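As a rough sketch of the routing pattern James describes, here is a minimal LLM router that picks one of three destinations and forwards the query to the matching handler; the handler functions are hypothetical stand-ins, and the SDK usage and model choice are assumptions.

```python
# A simple LLM router: classify the query into one of three routes,
# then hand the original query to whichever handler matched.
from openai import OpenAI

client = OpenAI()

def generate_summary(query: str) -> str:
    return f"[summary handler] {query}"   # stand-in for a real summarization chain

def write_email(query: str) -> str:
    return f"[email handler] {query}"     # stand-in for a real email chain

def lookup_order(query: str) -> str:
    return f"[order handler] {query}"     # stand-in for a real database lookup

ROUTES = {
    "generate_summary": generate_summary,
    "write_email": write_email,
    "lookup_order": lookup_order,
}

def route(query: str) -> str:
    # Ask the model to name exactly one route; fall back to summary if it strays.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify the user query as one of: generate_summary, "
                        "write_email, lookup_order. Reply with the label only."},
            {"role": "user", "content": query},
        ],
    )
    label = response.choices[0].message.content.strip()
    handler = ROUTES.get(label, generate_summary)
    return handler(query)

print(route("I really want to summarize this information"))
```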

Okay. Makes sense. Thanks. James, you mentioned the word agent, which I think is a heavily overloaded term at the moment. It's like the word AI. I kind of struggle to use it because it's so broad. But one thing that struck me when you started talking about agents in your book was the parallels to reinforcement learning. So, as you know, I wrote a book on reinforcement learning, and that's all about the idea of learning through trial and error, having actions in an environment, feeding those signals back to the agent, and using a reward signal to be able to decide whether it did well. The parallel here is that it almost sounds like the agents need a similar kind of structure. It's almost like thinking of the prompt as the reward definition. That's what decides whether it's rewarding or not. The user context is kind of defining the actions, and then the environment and the various tools that can be used by the agent. So, I guess,

let me step back a little bit. How do agents  differ from standard prompt based approaches?  With a prompt-based approach, you're basically  having a Python function or some type of code   that will, you know, call an LLM, it'll produce  an output, and you're sort of putting that into   your existing software architecture, your sort  of software where you've got an AWS Lambda, this   little bit. Now, you kind of do that as a fuzzy  function and non-deterministic step. And then   you're kind of embedding that into your existing  workflows. And how that differs from an agent is   an agent is basically a different architecture,  which relies very much on...it has a higher amount   of control. It has a higher amount of autonomy.  And rather than basically you as the programmer  

imperatively determining what's going to happen when this code is executed at runtime, you're basically relying on the agent's prompt, its tools and tool definitions, and the context it has in the prompt to decide what to do given the series of messages that have already happened before. Then once those messages are executed, they will generate some type of tool calls, or maybe not; they'll maybe just reply. And then after that, you have to have some type of stopping criteria. So, generally, the easy way of writing it is something like: while there are still tool calls to do, keep going with the agent; when we've hit a stop reason of "finish", or there are no tool calls left, exit the agent's while loop. That being said, you can also create your own objectives. So, if the agent does stop prematurely,

you can also add an additional message saying, "No, we haven't stopped yet. You haven't hit this programmatic or non-deterministic goal." So, imagine you're trying to get 100 leads, and you've got this agent that's using a Google search tool and maybe a web page reading tool. You can also say, you know, if it's finished early, because there are no tool calls left, you can add an additional message in there and tell it, "No, you actually need to keep going." The thing with agents is they have these tools. The tools give them the ability to execute. You know, the more tools that it has, the more useful the agent becomes because it can bundle those tools and execute those tools in a variety of different ways. So, it is essentially building

a computation graph at runtime as it executes  these tools. Now, the problem with them is,   obviously, you're giving a lot more autonomy to  the agent. And if one of those steps is wrong,   you're going to get compound error across all the  rest of the steps. So, having reduced error is   really, really important. And if it's got too many  tools, that can be a problem, or if the user query   isn't specific enough or can't be solved by the  given tool set, that's also a problem. You've also   got problems with if one of the tools fails,  what does the agent do in that scenario? So,   maybe it couldn't connect to the database, it  couldn't get the output. What does the agent do? 

There's lots of different problems that happen,  both from the DevOps side, from IO issues, from,   you know, the user issues in terms of the  user query. And a lot of people are trying   to figure out how to innovate in that space. So,  adding humans in the loop steps is one approach.   People are building robust execution workflows to  make sure that steps don't fail so that the agent   always succeeds. If there's an IO error, it will  retry for that step. At its heart, an agent has  

tools. It's basically in some type of while-true loop. It's doing all of those tool calls until it thinks that it's finished with the result.
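A minimal sketch of that while-loop style of agent, assuming the OpenAI Python SDK's tool-calling interface; the single google_search tool, the turn cap, and the model choice are illustrative rather than anything specific from the conversation.

```python
# The agent loop: keep calling the model while it keeps asking for tools,
# execute each tool call, feed the results back, and stop once the finish
# reason is no longer "tool_calls".
import json
from openai import OpenAI

client = OpenAI()

def google_search(query: str) -> str:
    return f"Top results for: {query}"  # stand-in for a real search integration

TOOLS = [{
    "type": "function",
    "function": {
        "name": "google_search",
        "description": "Search the web. Use when you need fresh information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def run_agent(goal: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):  # hard cap so a confused agent cannot loop forever
        response = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=TOOLS
        )
        message = response.choices[0].message
        messages.append(message)
        if response.choices[0].finish_reason != "tool_calls":
            return message.content  # the agent thinks it is finished
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            result = google_search(**args)  # dispatch; only one tool in this sketch
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
    return "Stopped after hitting the turn limit."

print(run_agent("Find three recent articles about prompt engineering."))
```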

Mike, just following on from that then, how much of a role does prompting still have in agentic systems then? Does it make it harder in some sense because you've got disparate agents working independently? It's a good question. And I would say the system prompt for the agent still needs a lot of work because you want to protect against those edge cases. So, in some respects, it's more important because, you know, when the agent runs, it's going to do a lot of steps itself. And therefore, that system prompt needs to be much better than it would be if it was just doing one step because, like James said, those errors compound. If it chooses the wrong steps, then that's a bad thing.

So, all of the traditional prompt engineering  techniques still apply. It's just that they   have higher leverage because now the agent is  going out on its own with your instructions.   You better make sure your instructions are good. In some respects, it also takes away from prompt   engineering because the big difference actually  between reinforcement learning and agents,   just to make that link, is that with reinforcement  learning, there's some objective truth, typically,   right? You can calculate it quickly. Whereas  with agents, they have to decide whether   they've done a good job or not based on your  goal. It's a much fuzzier reward mechanism   and self-determined in some ways. So, I think  that's why we're not seeing that many true agents  

in production because the current crop of models  just aren't as reliable in terms of deciding   whether they've done a good job and also in  deciding what to do based on their observations.  As we get better models, then those loops will get  tighter, and they'll make less mistakes. But it's   a bit of a trap in that if the AI can't do that  task very well, it's also probably not going to   be very good at judging whether that task has been  done well, to some degree. So, I think that's the   major difference. That's why when people talk  about agents from a marketing perspective,  

I know you hear Salesforce or Microsoft talk about  agents, they're not really talking about agents.   They're talking about quite a deterministic  chain of prompts still, and maybe some retrieval   with RAG. It's very rare that you have a true  agent in production these days, although that   should change quite a lot in the next year. Just to add on to that, like, the other thing   for the prompt engineering side is, obviously, you  can write the tool definitions more explicitly. If  

the tool definitions are very detailed and tell it exactly when to use this tool and when not to use it, all that information is really valuable to the agent because it will pick different tools based on those tool descriptions and the arguments in those tool descriptions, etc., which is generally a JSON schema specification. So, that also has an impact. You can also do hybrids, by the way. So, you can have the agent call a search function, and that search function could use something like Q-learning internally, where it has a Google search, a tabu search, a variety of different searches. So, it's also possible to have a mixture of an LLM-based agent with a tool call that will use different types of searches based on an updated Q-learning table.
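As a rough illustration of writing a tool definition explicitly, here is an invented example where the description spells out when to use the tool and when not to, and every argument is described, since this JSON schema is exactly what the agent sees when picking tools; the tool itself and its fields are assumptions for the sake of the sketch.

```python
# An explicit tool definition in the JSON schema format used for function calling.
order_lookup_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": (
            "Look up the status of an existing order by its ID. "
            "Use this when the user asks where their order is or whether it has shipped. "
            "Do NOT use this for refunds, cancellations, or placing new orders."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order reference, e.g. 'ORD-12345', exactly as the user gave it.",
                },
                "include_items": {
                    "type": "boolean",
                    "description": "Set true only if the user asks what was in the order.",
                },
            },
            "required": ["order_id"],
        },
    },
}
```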

I think, obviously, that's where most of the innovation is going on at the moment. It'll be interesting to see where that goes. But I think we've got to the end of our time slot. So, I think the only thing that remains is to thank you both for joining me today. It's been really interesting. To everybody else, the book again, "Prompt Engineering for Generative AI," I'll plug

it for you. There you go. Really great book.  Read it almost like in a couple of nights,   really. Just kept going and kept going. It's a  really fun book, really interesting, and I really   enjoyed it. So, I definitely recommend it. Any  final closing words? Anything you want to plug?  I appreciate you shouting out the book and  also reading it so quickly as well. Yeah,  

I'm glad it's digestible. One thing I would  like to end with is just to remind people not   to get too anxious about how fast everything is  moving. Because if you're following everything   on Twitter or Reddit and you're seeing  new things come out every single week,   it can get anxiety-inducing thinking, "How am  I going to keep up with all this?" But what I   found is that after a couple of years of being  in the mix, very few things actually change.   Even though the models are getting better,  they're getting better at a predictable rate.   Costs are coming down at a predictable rate. So,  it's just a case of zooming out a little bit,  

thinking ahead and going, "Okay, well, if I start working on this project now, where are things going to be in six months?" I would say that even though things are moving fast, if you're in a specific niche or a specific domain, you cannot drive yourself crazy trying to keep up with everything. Ultimately, if something really big happens, other people will tell you about it. That would be my advice for coping, is just kind of pick a niche, learn everything about that niche, and then don't worry too much about all the other craziness that's happening. Nice. I think my plug will be if you haven't checked out Cursor yet, definitely give it a go. I think the other thing as well is this idea of bottom-up coding. You can give a very large goal to Claude and then break that down into lots of different smaller tasks, which you work through in chat or Composer. But then there's also this top-down approach

to coding where you give o1 Pro 10 files and you say, "Generate me three or five new files," which is a completely different paradigm. And when do you use either of these? Claude Sonnet has very low latency, but it has more regressions and more hallucinations, while o1 Pro has incredibly high latency, but very high accuracy. There's something to be said for sometimes picking Claude and sometimes picking o1 Pro. There are also scenarios where sometimes you'll use Claude and it will generate code, and you'll kind of get stuck in a loop and it can't really figure it out. You jump to o1 Pro, go make a cup of coffee or a tea, and then you come back and it's figured it out on the first go. So, have a think about when you should be using these kinds of reasoning

models for your development work versus when you should be using a lighter chat model to be doing quicker edits or doing that bottom-up approach to coding rather than the top-down. Have a think about that and obviously let me know. Yeah. Okay, thank you both. See you later. All right. Thanks, Phil Winder. All right. Thanks. Bye.
