Implementing AI Technologies into Developer Tools with Tsavo Knott


Let's get into it. We won't be holding everyone up too much. We've got a good talk today. Turns out we build developer tools, but we also want to talk about how we build them. We are kind of a product that was born two to three years ago, and everything we do is basically based in AI: small models to do specific tasks, large models to do conversational interactions, vision processing models, all that stuff.

We're going to show today how we've built a world-class product that ships across macOS, Linux, and Windows to tens of thousands of daily and monthly active users. That's what a next-gen AI product looks like. So, we're going to get into it.

Real quick, quick introductions. My name is Tsavo, Tsavo Knott. I am a technical co-founder here at Pieces and also the CEO, which has not been super fun as of late. We just closed our Series A, so the past three months were not much code and a lot of Excel and PowerPoint. I'm glad to be here doing another tech talk. We're really excited about what we're building. Sam is here with us as well. Sam?

Yep, I'm the Chief AI Officer. I'm based out of Barcelona. So, we really span the globe, right? And with that comes a lot of challenges, and it informs what we do with our workflow.

Awesome. Well, let's get into it. Okay, so the main thing we're going to get into first is what we're building and why. Then we're going to talk about how we've layered in machine learning in a very thoughtful way into every experience and aspect of our product. These days, as people hear about generative AI, vision processing, and all these capabilities, it's super important to think about how they compose into a product experience that makes sense. That's what we're going to show you. We're also building developer tools,

which means every single person here can use  our product. We'll take the feedback from it,   and we want to get your thoughts. We'll  have some time at the end for questions. It's going to be a bit of a technical  talk. Sam's going to get into some of   the challenges and nuances of bringing AI into  production, be it small machine learning models   or large language models at scale. We were  one of the first companies in the world to  

ship Mistral and Llama 2 across macOS, Linux, and Windows with hardware acceleration. Sam's an absolute beast when it comes to that. We'll show you some really cool stuff that's coming up with phase three and phase four of where the product is going. Anything to add there, Sam?

No, that's about it.

Great. Let's start out with what we're building and why. Maybe some of you were in the main

stage earlier and got a little teaser. Long story short: as a developer, I've been writing software since I was 16, over a decade now, 11 years. I feel a lot of pain in the form of chaos, especially as I navigate large projects, be it in the browser where I'm researching and problem-solving, or across a ton of projects where I'm looking at this Rust project, this Dart project, this C++ project. I'm jumping around a bunch.

Our team went from six people to 12 people to 26 people very quickly, jumping around on cross-functional collaboration channels. For me, it was already painful a year ago, and it feels like it's getting more painful. Raise your hand here if

you've used GitHub Copilot or some type of autocomplete tool to write code. Well, guess what? Every single person now can go from the couple of languages they used to write, three or four, maybe five if you're an absolute beast, to six, seven, eight, nine, ten. Your ability to move faster and write more code in half the time in different languages is going up.

But what happens when you increase your ability to write a couple of languages in a couple of different environments to now five or six? Well, that's five times the amount of documentation, five times the amount of PRs you're able to do, or the cross-functional teams you can work on. We're seeing this advent where the volume and velocity of the materials you're now interacting with, the dots to be connected, are fundamentally hinting at how the developer role is changing in this era of generative AI. For us, we want to build a system that helps you maintain that surface area as a developer across those workflow pillars: the browser, the IDE, and the collaborative environment.

So, this chaos, how can we distill it down and make it a little bit more helpful, a little bit more streamlined for you throughout your work-in-progress journey?

I'll kind of power through some of these other slides. Maybe it's hard to see, but how many people's desktops look like this at the end of a workday: lots of browser tabs, lots of IDE tabs, maybe Gchat open or Teams or Slack? That's what chaos feels like. Then you close it all down because you want to go home and drink some tea. Then you wake up the next day and you're like, "Oh,

it's all gone. Where do I get back into flow?"  That's what we're talking about, developer chaos. In reality, it starts to feel like there's  so much to capture, so much to document,   so much to share, and there's not enough  time. How do we do that for you in a proactive   way? How can we make these experiences  where it's like, "I found this config,   it's super important. I changed this  implementation but didn't want to lose   the old one," or "Here's a really great bit  of documentation from a conversation I had,   just fire and forget and save that somewhere," and  have that somewhere be contextualized and fully   understand where it came from, what it was related  to, who you were talking to about that material,   and then maybe give it back to you at  the right time given the right context? What we're talking about here is the  work-in-progress journey. Most of the tabs   you interact with a day, most of the projects you  have, most of the messages inside Gchat or Slack,   all of that is lost. You maybe capture 10%  in your mind, and that's why as a developer,   our job is to half remember something and  look up the rest because we simply cannot   capture all these important context clues  and these dots. We're kind of like in this  

era now where that stuff doesn't have  to be lost. We have machine learning   and capabilities that can capture this stuff  for us. That's what we're going to get into. Just to put it simply, we're talking about an  AI cohesion layer, a layer that sits between   the browser, the IDE, and the collaborative space  and says, "This is related to that, and that is   related to XYZ." We're going to capture that and  save that for you, and then give it back to you   given that same context at the right time. Phase  one was really simple. It was a place for humans   to save. This is our first principle, fundamental  truth. People need to save things. As a developer,  

you don't realize you should have saved  something until you realize you didn't save it,   and that's really painful if your job is to  half remember something to look up the rest. We want to make a really great place for people  to save things, and that was phase one, roughly   two years ago. This is kind of like an AI-powered  micro repository. I'll do a quick live demo here,   where when you save something, all that stuff you  didn't want to do but should have done to organize   it—tags, titles, related links, capturing where it  came from, who it's related to—all of that we want   to do with AI. That's what we started out doing  two years ago. I'll just do a quick example here.  

Sam, maybe if you want to hold that for me; Sam's about to get into the tech on this as well. By the way, I'm not even lying, look at the tabs I have open right now, just like 50 tabs: docs, GitHub, the whole nine yards, got my email, everything like that. But I wanted it to feel natural, right? I'm not up here with just a couple of tabs; I'm in your workflow every single day, same deal as a technical co-founder. Let me go ahead and show you this. Maybe you're on, you know, open source, right? You're looking at an image picker implementation. So, I'll kind of show you what this looked like before. You're here, you're checking it out, maybe I'll use this, maybe I won't, I don't really know, but I'm going through the doc and I'm like, "Okay, I want to capture this."

So, this is what it looked like before.  Maybe you go to your notes, I don't know,   maybe use text files or something way  more primitive, you paste it in there,   right? It's not capturing the link it came  from, you're not titling it, tagging it,   it's going to be really hard to find this later,  whether it's Google Docs, Notion, you name it. A couple of years ago, we said, "Can we build  a product that has AI capabilities to do a   little bit more magic for you?" We're going to  grab that same code snippet, and by the way,   I have very limited WiFi, so I'm actually  going to do this in on-device mode. Here we go,  

we're going to do this completely on-device, and I'll go ahead and... so what I just did there was switch the entire processing mode to on-device, to give you an idea of what this edge ML looks like. That same code snippet that's on my clipboard, I'm just going to go ahead and paste that into Pieces, and you're going to notice a couple of things. Maybe it's a little dark, so I'll switch to light mode. Classic. Any Macs in the crowd? We've got some buttons.

was talking about, it's an image picker camera  delegate class for Dart. If you flip it over,   you're going to see there's a whole bunch  more enrichment as well. We gave this thing a   description, we're giving it tags and suggested  searches, and we're also going out to the   internet and aggregating related documentation  links. When you save something, we're pulling  

the outside world in to make sure that stuff is  indexed and contextualized. So I can say, "Hey,   this image picker is highly related to this  bit of doc right here on the animated class." That's the power of machine learning. I just took  a paste from what it looked like three or four   years ago into my notes app, and I pasted it into  Pieces. Instantly, with no internet connection,   it enriched it. It said, "Here's all the  details, the metadata that you're going to   use for organization, for search, for filtering,  and sharing later on." It also took it a step  

further and said, "Here are the things that it is  associated with—the people, the links, the docs,   etc." We're going to get into that a little  bit. That's just super simple paste. There's   a whole lot more to Pieces that we'll talk a bit  more about shortly, but let's keep it rolling. Sam, can you tell us a little bit about how that  happened, what was going on there? Take us away.

One of the big challenges we faced was that this happened incrementally. We started off by classifying code versus text, then classifying the language, then tagging, then adding a description, a title, all the rest of it. Once we had all these models, effectively what we had were seven different architectures, each reliant on its own training pipeline, each needing separate testing, optimization, distribution, and all the rest of it.

One of the challenges there was, once you have all these models, how do you deploy them across these different operating systems, especially when you're training in Python but not deploying them in Python if they're on-device? We were really lucky. We sort of hit this problem just as the whole ONNX framework started stabilizing and being really useful. Eventually, we plumped for that.

Of course, this had a lot of challenges. You're training seven different models,

each requiring different data, which is a huge overload on cleaning and preparation. It doesn't scale well for local deployment. We were sort of creaking at five models by the time we got to seven. Things were taking up a lot of RAM, and it was quite intense. It's also super slow and expensive to train and retrain these models, with each one working on a separate pipeline. We're using CNNs, LSTMs; there's not much pre-training happening there.
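As a rough sketch of that train-in-Python, deploy-elsewhere flow, here is what exporting one of these small classifiers to ONNX and sanity-checking it with ONNX Runtime could look like. The toy architecture, file name, and shapes are assumptions for illustration, not the actual Pieces models.

```python
import torch
import torch.nn as nn

# Toy stand-in for one of the small classifiers (e.g. code vs. text);
# the real CNN/LSTM architectures and vocabularies are not shown in the talk.
class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=8192, dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, token_ids):
        return self.fc(self.embed(token_ids).mean(dim=1))

model = TinyClassifier().eval()
example = torch.randint(0, 8192, (1, 128))

# Export once from the Python training environment...
torch.onnx.export(
    model, example, "code_vs_text.onnx",
    input_names=["token_ids"], output_names=["logits"],
    dynamic_axes={"token_ids": {0: "batch", 1: "seq"}},
)

# ...then any ONNX Runtime binding (C++, C#, Dart via FFI, etc.) can load it.
# The Python binding is used here only to sanity-check the exported graph.
import onnxruntime as ort
session = ort.InferenceSession("code_vs_text.onnx")
logits = session.run(["logits"], {"token_ids": example.numpy()})[0]
print(logits.shape)  # (1, 2)
```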

One of the solutions around this came with  GPT-3.5. Suddenly, there's one model that   can do all of these tasks. Great solution,  done. You can wrap GPT, wrap a cloud model,   get all this information, fantastic. The only  problem is the LLMs are too slow. What you  

saw there was seven models executing like that. You can't do that with cloud calls, even just from the network cost. It's not going to happen.

What we saw when we integrated LLMs was that this whole experience became super slow. Also, these cloud models are fantastic general learners, but they're really not there for coding, for your normal coding tasks. I was reading a paper a couple of weeks ago using GPT for

Python coding tasks. You had an issue, you gave it the bits of code it needed to change to solve the issue, and GPT-4 managed to solve 1.4% of those tasks.

So, two challenges: it's slow, and it's not really there for working with code.

I would add one more: security, privacy, and performance. We have DoD users and banking users and things like that. They can't send their code over the wire and back. There was a huge thing around air-gapped systems that were able to do this on-device.

Exactly. We were looking for ways to take this sort of experience you have with the large language models, which is very nice, generating tags, generating descriptions, getting really nice stuff back, and put it locally, serve it on-device. One of the techniques that really helped was the low-rank adaptation paper that came out from Microsoft, which solved a lot of our problems. Who here has heard of LoRA before? There we go. I'll run

through what it does and why that really works for us when it comes to serving these models.

With low-rank adaptation, you take a general learner, a large cloud model, and use that cloud model to generate data for your specific task. It doesn't have to be a lot of data. You then take a small model, something that executes very quickly, and you tune a very specific part of the weights in that model. We were using transformers, a T5, an excellent foundation model. Specifically, we were using a T5 that had been trained extensively on code, so it was good at generating code and very bad at generating descriptions. We found

that if we took this general learner and used it to generate maybe 1,000 examples, we were then able to take this T5 trained on code and distill that behavior into that model.

The way you do that is you focus on a particular part of the architecture, the attention layer. There are lots of analogies for what an attention layer does, but it's basically a weighted mean pooling. What you do is parameterize that and learn these LoRA weight parameters, which are a low-rank transformation added to the attention layer. At inference time, you run inference as normal. When you get to the end of the attention layer,

you sum these LoRA weights with the attention  layer, pass it forward, and what you get   is a highly performant task-specific  learner from a more generalized one. This had several valuable features for us.  The first was no additional latency. There   are a number of ways you can take the knowledge  from a large learner and distill it to a small   one. This one was particularly nice because we  don't have to pay any extra cost for inference.   That was number one. Second, training is  much faster. You can get nice results with   maybe 1,000 training examples, very nice  results with 5,000-10,000, you're solid.
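Here is a minimal, self-contained sketch of the idea being described: a frozen base projection (standing in for an attention weight matrix) plus small low-rank adapters per task, summed into the output at inference time, with one adapter swapped in per task on a single shared foundation model. This is generic PyTorch for illustration; it is not Pieces' T5 implementation, and in practice the adapter can also be merged into the base matrix so there is no extra matmul at all.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank delta for one frozen weight matrix (a sketch of the LoRA idea)."""
    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)   # trained per task
        self.B = nn.Linear(rank, d_out, bias=False)  # trained per task
        nn.init.zeros_(self.B.weight)                # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.B(self.A(x)) * self.scale

class LoRALinear(nn.Module):
    """Frozen base projection plus a swappable set of task-specific adapters."""
    def __init__(self, d_in, d_out, tasks):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.requires_grad_(False)              # foundation weights stay frozen
        self.adapters = nn.ModuleDict({t: LoRAAdapter(d_in, d_out) for t in tasks})
        self.active = tasks[0]

    def forward(self, x):
        # The low-rank path adds only two small matmuls, so there is
        # effectively no additional latency at inference time.
        return self.base(x) + self.adapters[self.active](x)

layer = LoRALinear(512, 512, tasks=["title", "description", "tags"])
x = torch.randn(1, 16, 512)

layer.active = "title"   # one foundation model, many specialized behaviors
title_out = layer(x)
layer.active = "tags"
tags_out = layer(x)
print(title_out.shape, tags_out.shape)
```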

What that meant is we went from training  seven different models with different data   pipelines on hundreds of thousands of examples,  sometimes millions, to reducing that to 5 or 10K,   which is great for speed of iteration. It also  means our engineers can train these models on   their Macs. We're talking maybe six or seven  minutes to train all of these locally, which means   we can move a lot faster, deploy a lot more, and  retrain and update these models almost on the fly. The other really interesting thing is this works  really well with other methods. You can use LoRA.   If you're not getting quite what you want, you can  use prefix fine-tuning models. Most importantly,   we went from serving seven separate  models, which impacted the size of our app,   the memory usage, and all the rest of it, to  serving one foundational model, a small T5,   coming in at 190 megabytes, and then just  adding these LoRA modules on the way. This  

meant we could serve seven different specialized tasks on the back of one foundation model with these modules. It meant we could move a lot faster. It was a huge turning point for us.

Phase two. Phase one was okay: I can save small things to Pieces. I can paste some code in there, a screenshot in there, and it's going to give me everything I need. When Sam's talking about those LoRA modules, those embedding modules, we're talking about modules for titles, descriptions, annotations, related links, tags, etc. We have this entire pipeline

that's been optimized. To sum it up, the LoRA breakthrough was pretty profound for us.

Then we moved on to another challenge: how do we generate code, completely air-gapped or in the cloud, using the context of your operating system or a current code base or things like that? How many people out there are hearing about gen AI or AI features from your product teams or management? Who's getting pushed to do that? You're going to face challenges like which embedding provider to use, which orchestration layer to integrate, which vector database to use. These were all challenges we evaluated early on. We ended up at our approach, where we want our users to bring their own large language models. If you work at Microsoft, you can use Azure or OpenAI services. If you work at Google, you can use Gemini. We'll do a quick demo of this.

From the initial idea that humans need a place to save, we moved to phase two, which was: how do I conversationally generate or interact with all the materials I have saved in Pieces, and then use the things saved in Pieces to ground the model and give me the relevant context? When I go in and ask it a question about a database, I don't have to say what database, what language, or anything about that. It just understands my workstream context. It even understands questions like who

to talk to on my team about this because Pieces  is grounded in that horizontal workflow context. I'll do a quick demo of this and then we'll  keep it trucking along. Let's take a look at   what it means to have a conversational  co-pilot on device across macOS, Linux,   and Windows. We're going to be doing this in  blended processing only. I am offline right now,   probably a bit far from the building. One  thing I wanted to mention super quick before   I get into this is Pieces understands when  you're saving things. What I was talking   about is the additional context: when you're  interacting with things, your workflow activity,   what you were searching for, what you were  sharing with whom, etc. That's the when,  

which is an important temporal embedding  space for the large language model. Over time, you begin to aggregate all your  materials. All of these materials are used to   ground both large language models and smaller  models on-device. These are the things I have  

already, but let me go over to co-pilot chat. Something interesting here is that you can basically think of this as a ChatGPT for developers, grounded in your workflow context. I can come in here and it gives me suggested prompts based on the materials I have. I'll do a little bit of a setup. I'll use an on-device model, as I mentioned. You can obviously use your OpenAI models, Google models, whatever you like. I'll use an on-device model; I'm going to use the Phi-2 model

from Microsoft. I'll activate that one. Let's ask  a question. This is how you know it's a live demo.   "How to disable a naming convention in ESLint?"  Who's had to do that before, by the way? Any   pain? Okay, well, you're lucky. Let's see what the  co-pilot brings up here. I'm running on a MacBook   Air here, an M2. We were really lucky to intersect  the hardware space starting to support these   on-device models. That's looking pretty good. It  looks like we got a naming style with the rules.  

Ignore naming style check. Great. When in doubt, if you're having linting problems, just ignore it.

Besides that, that's a generative co-pilot: grounded, air-gapped, private, secure, and performant. We think that this world of proliferated smaller models working in harmony with larger models is going to be the future, integrated across your entire workflow. Everything you saw here in the desktop app is available in VS Code, JupyterLab, Obsidian, and all of our other plugins. Definitely worth a try.

Let's get back to the tech. I just wanted to show you how this stuff actually looks, all the crazy stuff that Sam says. How do we actually productionize that?

Sam, you want to take us away?

For sure. What you saw there, a couple of things are fundamental. First, you need to design a system that's isomorphic across cloud models and local models, which is super challenging. You need something you can plug into any model, because the entire model space is moving so fast. If you don't

do that, you're going to be locked in and left  behind. Second, you need to take full advantage   of the system. You've got to use any native  tooling, any hardware acceleration. Finally,   you need a system where you can pipe in context  from anywhere. A lot of AI products out there,   you're stuck within one ecosystem or application.  We wanted to break that. We're a tool between   tools. We want to enable you to add context  to your AI interactions from anywhere. These were our design principles for the co-pilot.
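A minimal sketch of what an isomorphic interface like that could look like: one generation contract that the rest of the app depends on, with cloud and local backends hidden behind it and context piped in from anywhere. The class and method names here are hypothetical, not Pieces' actual API.

```python
from abc import ABC, abstractmethod

class CopilotModel(ABC):
    """One interface regardless of where the model runs, so the rest of the
    app never cares whether it's talking to a cloud API or a local runtime."""

    @abstractmethod
    def generate(self, prompt: str, context: list) -> str:
        ...

class CloudModel(CopilotModel):
    def generate(self, prompt, context):
        # A real backend would call a provider SDK here (OpenAI, Gemini, ...).
        return f"[cloud] {prompt} | grounded on {len(context)} context items"

class LocalModel(CopilotModel):
    def generate(self, prompt, context):
        # A real backend would call a hardware-accelerated local runtime
        # (e.g. an MLC/TVM-compiled model) through native bindings.
        return f"[local] {prompt} | grounded on {len(context)} context items"

def answer(model: CopilotModel, question: str, workflow_context: list) -> str:
    # Context can be piped in from anywhere: browser, IDE, chat, saved snippets.
    return model.generate(question, workflow_context)

print(answer(LocalModel(), "How do I disable a naming rule in ESLint?",
             ["eslintrc from the current repo", "snippet saved yesterday"]))
```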

How did we do this? It was an emerging ecosystem when we got into this. We looked at ONNX Runtime, which was great but didn't have the tooling to run a vast selection of models. We settled on TVM because it abstracted away a lot of the hardware optimizations. Specifically, we used MLC. Has anyone used MLC or TVM? This is your go-to if you want to deploy a local model.

MLC was great because it's open source. A lot of the work when you're compiling a

machine learning model is designing specific  optimizations for certain operations across   a wide range of hardware. That's almost  impossible with a four-person ML team.   Leveraging open source was fantastic. This  is all in C++. We had to get it into Dart.   Excellent tooling in Dart to do that. We used  FFI to handle the integration into the app.

You've got this thing. Large language models, if  you want them to perform well at a specific task,   you have to put a lot of effort into your  prompt. That's just a fact. You need a way to   slam context into that prompt as well. If you're  prompt engineering for a specific cloud model,   that's not going to do well across  specific local models or even other   cloud providers. We built a system that  allowed us to dynamically prompt these  

models based not only on what the model was  but where the information was coming from. We wanted to integrate multimodal data. A lot  of your information is in text, but I watched a   lot of videos when learning to code. My team  insists on sending me screenshots of code,   which is super annoying, but there you go. We  wanted to take away the pain and allow you to   add video and images to your conversations  with these models. We did that using fast  
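As a sketch of that dynamic prompting idea, a small template registry keyed by model family and context source might look like this. The template strings and keys are illustrative assumptions, not the prompts Pieces actually uses.

```python
# Hypothetical template registry: the right prompt shape depends both on the
# model family being targeted and on where the grounding context came from.
TEMPLATES = {
    ("llama", "code"): "[INST] Use this code:\n{context}\n\n{question} [/INST]",
    ("llama", "chat"): "[INST] Conversation excerpt:\n{context}\n\n{question} [/INST]",
    ("gpt", "code"):   "You are a coding assistant.\nCode:\n{context}\n\nQuestion: {question}",
    ("gpt", "chat"):   "You are a coding assistant.\nChat history:\n{context}\n\nQuestion: {question}",
}

def build_prompt(model_family: str, source: str, context: str, question: str) -> str:
    # Fall back to a plain layout for unknown model/source combinations.
    template = TEMPLATES.get((model_family, source), "{context}\n\n{question}")
    return template.format(context=context, question=question)

print(build_prompt("llama", "code", "module.exports = { rules: {} }",
                   "Why does this lint rule fire?"))
```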

OCR. Initially, we used a fine-tuned version of  Tesseract. Does that ring any bells? Super nice   open-source OCR system. We pushed that as far  as it could go for us. Now we use native OCR,   which at the time seemed like a  great move away from Tesseract,   a clunky system developed over two decades.  Using native OCR had its own challenges.
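For the screenshot-to-text step, a generic version using the open-source Tesseract engine through pytesseract looks roughly like this. The file name is hypothetical, and the production path described above used a fine-tuned Tesseract and later platform-native OCR rather than the stock engine.

```python
from PIL import Image
import pytesseract

def extract_text_from_screenshot(path: str) -> str:
    image = Image.open(path).convert("L")      # grayscale helps OCR on UI screenshots
    text = pytesseract.image_to_string(image)  # requires a local tesseract install
    # Light cleanup: drop empty lines so the result can be fed in as context.
    return "\n".join(line for line in text.splitlines() if line.strip())

# print(extract_text_from_screenshot("teammate_screenshot.png"))  # hypothetical file
```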

Real quick on the Tesseract: who here uses Linux? Okay, we'll have to figure that out.

What we're talking about next is the evolution of our product. Phase one was humans can save to Pieces, and Pieces will enrich that, index that, make it searchable, shareable, and all the like. Phase two is you move into the generate, iterate, curate loop, using conversational co-pilots to understand, explain, and write code grounded in the things you have saved. Now we are moving into phase three and phase four. We've just closed our Series A, but let's show what all this is about. You guys are some of the first in the world to check out this crazy stuff we're about to release, before it's even on TechCrunch or whatever.

We want to build systems that are truly helpful, systems where you don't have to put in all the energy to use them. You don't have to memorize what this button does and so on. We want systems that are fast, performant, and private, that feel like you're working with a teammate. We know you're going to need this, you're about to open this project, talk to this person, or open this documentation link. Can we, based on your current workflow context, give you the right things at the right time?

The other thing is I don't want to build a 100-person integrations team. There are so many tools that developers use every single day. Building and managing those hardline

integration pipelines is a really difficult thing to do: standardizing those data models, piping them into the system, and so on. We've spent a lot of time in the past year and a half on some in-house vision processing that runs completely on-device, is extremely complementary with the triangulation and the large language models that process the outputs, and allows us to see your workstream as you do: understand who you're talking to, the links you're visiting, what you're doing in your IDE, what's actually changing. This is the same approach that Tesla took with their self-driving stack. Those cars run on-device, see the world as you do, and aren't designed to hard-code every rule for every country for every driving scenario. We wanted to scale very quickly across all the tools, both in an air-gapped environment

or connected to the cloud. From there, can we truly capture and triangulate what's important?

That's hilarious, because this slide is straight from our Series A pitch deck. It says "for designers and knowledge workers"; we're talking about developers today. Sam, do you want to add anything here?

I would say something super important when we're deep in your workflow is that we need to stop you from surfacing any secrets or sensitive information. A big component of this is filtering and secrets removal, and giving you the ability to run it completely air-gapped on-device.
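A toy sketch of the kind of filtering Sam is describing: a few pattern-based redactions run before anything is stored or surfaced. A real implementation would use far more patterns plus entropy checks; these regexes are illustrative only.

```python
import re

# Hypothetical filter patterns, not Pieces' actual rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key id shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]+?-----END [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|token|secret)\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}"),
]

def redact_secrets(text: str) -> str:
    # Run before any captured text is indexed, stored, or sent to a model.
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(redact_secrets('api_key = "sk_live_abcdefghijklmnop1234"'))
```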

That's right. Just like the human brain, you take in so much more than you actually process. The job is to filter out 90% of the information and capture, triangulate, and resurface the relevant 10%. That's what Sam's talking about. There's a lot of machine learning and a lot of high-performance models involved in the filtration process, but it's not just machine learning.

I don't know about you, but I'm fed up with interacting with AIs through a text chat. I think there's so much more room for innovation around how we interact with these models beyond just assuming that's the best way. Part of this is really exploring that.

Yeah, I would say the joke I make to investors is we went from a couple of keywords in a Google search bar, and then opening a link and scrolling, to now full-blown narratives in ChatGPT, trying to prompt and reprompt and dig it out. It feels like it's become a bit more verbose

and laborious to use these conversational co-pilots. We think the way they're going to be integrated into your workflow is going to be different. They're going to be more heads-up display or autocomplete-esque systems. That's what we're going to show you. Sam, do you want to hit on anything here?

I think we've done it.

That's good. Great. Let's do a quick demo here. The first one, which is going to be pretty awesome, hopefully we can see some of this. Basically, I'm in my IDE and Sam,

you could probably talk through some of this as well. This is your demo.

For sure. Let me remember. This is my workflow. I'm just moving between windows. This is how you knew I was offline during the demo. There we go. I'm just doing a bit of simulation here. This is a super early proof of concept from a while back. Basically, we're moving between screens. To do this previously, we'd need four integrations: one for the IDE, one for chat, one for Notion, and one for GitHub.

Let's see if I can switch to the hotspot real  quick. There we go. T3 for the win. By the way,   that's why you need on-device machine  learning. If we can't load a video,   how do you think we can do all the processing  in real-time for you? It's not going to keep   up with your workflow. That would be an  absolute nightmare. All right, here we go. What happened here is we've skipped around the  workflow, capturing the information from the   active windows. We're using your activity  in your workflow as an indicator of what's  

important. We're not capturing everything,  just what you're looking at. We're adding   it in as context to part of the chat. This  enables me to ask, "Could you tell me a little   bit about Dom's PR?" and it's all there.  This highlights one of our key principles:  

AI is interesting, and these large language models are very convincing, but without the actual context, the outputs are useless. When you give them the context, you can get these magical experiences where you can talk about your team and all the rest of it. We decided to take that and push it further.

I'll just add real quick. The way this will

actually manifest is that, in reality, you can go to your co-pilot and say, "Give me workflow context from the last hour or the last six hours." Instead of adding a folder or adding a document, you're just saying, "I want to use the last three hours of my workstream to ground the model." Then you can ask it, "What was Dom's PR? What was going on? What are the specific changes?"

In this video, you can see this real-time output, real-time aggregation, or real-time triangulation as Sam is just moving through the browser, in Gchat, looking at research papers, looking at documentation. Over on the left, it outputs the links, who you're talking to, the relevant documents, filtering out a bunch of stuff. That will also be used to ground the model. You can say, for the last hour, "What was I doing?"
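Mechanically, "ground the model on the last few hours of workstream" could look something like the sketch below: filter captured activity events by a time window and prepend them to the question. The event structure and the example event are hypothetical, not Pieces' actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ActivityEvent:
    timestamp: datetime
    source: str   # "browser", "ide", "chat", ...
    summary: str  # filtered, secret-free text captured from that window

def workflow_context(events, hours: float) -> str:
    cutoff = datetime.now() - timedelta(hours=hours)
    recent = [e for e in events if e.timestamp >= cutoff]
    return "\n".join(f"[{e.source}] {e.summary}" for e in recent)

def grounded_prompt(events, question: str, hours: float = 3) -> str:
    # The time window replaces manually attaching folders or documents.
    return (
        "Workflow context from the last few hours:\n"
        f"{workflow_context(events, hours)}\n\n"
        f"Question: {question}"
    )

events = [ActivityEvent(datetime.now() - timedelta(minutes=20), "chat",
                        "Dom opened a PR refactoring the image picker delegate")]  # hypothetical
print(grounded_prompt(events, "What was Dom's PR about?"))
```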

We think those conversational co-pilots, as I mentioned, are a bit laborious. We want to give you a heads-up display given your current context. Wherever you are, if you're in the browser looking at a webpack config, can we give you relevant tags that you can click on and expand? Can we give you snippets you've already used related to webpack? Can we tell you with high confidence you're about to go look at this npm package or check out this internal doc? Can we give you a co-pilot conversation you've had about this webpack config previously? Can we tell you to go talk to Sam or someone on your team who's worked on webpack? Also, can we point you in the direction of an introductory webpack project?

I didn't have to do anything here. Pieces is running in real-time on your operating system, giving you what you need when you need it based on your current context in the browser, in the IDE, in collaborative environments. I don't have to type out a sentence. I don't have to go search for anything. I switch my context, and the feed changes. These are the experiences we think are going to be profound, not only for developers but for a lot of different knowledge workers at large: having the right thing at the right time. You look at Instagram, YouTube, TikTok,

Facebook, every feed out there in the world.  It's really good at giving you interesting   content. Why can't we apply that to your  workstream to help you move faster each day? That's a little about the workstream feed that'll  launch later this quarter. We've got some crazy   enterprise capabilities that'll launch later this  year. I think we are at a wrap here. We'd love to   have you come talk to us. We've got a booth with  no words, but it's got a really cool graphic. You  

can't miss it. It's black and white. Sam and I will be around to answer questions. Thank you.

Did you say it was watching you scroll through a browser window?

Yes, that's right. In real-time, on-device, no internet.

So it would see the URL that you might hover over, seeing what you're reading also?

That's right. In a really smart and performant way, by the way. It's a lot of processing, so we do a lot of active cropping.

I wanted to know if you could talk more about how you decided to subdivide the tasks from a larger model to smaller models. Running these models locally, did these smaller models provide more leeway for the tasks?

You want to talk about this?

Well, it was one smaller model. That's the key thing. We went from seven to one. That's a huge

saving in terms of memory and all the rest of it. In terms of what tasks we decided to focus on, that was really informed by designing a snippet-saving app. All of this metadata we generate is very much keyed into you being able to find the resources you've already used, and into things that are useful for collaboration. That's why we hit on links,

related people, descriptions, all the rest of it. Those tasks almost designed themselves.

There are constraints. If you want to generate a description, there's a limit to how much memory you can be using at that time, and to whether you can generate that description in less than 100 milliseconds. We look at the problem and say, "What's the probability we can generate a two-paragraph description in less than 100 milliseconds?" We work backwards to solve for those tokens-per-second outputs. We work backwards to solve for GPU, CPU, memory footprint, whether it's cold-started, turned off and on to free up that memory. There are all types

of challenges involved. I think generating a description is the largest, but generating tags, titles, those are all sub-problems of descriptions. We executed that pretty well.
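Working backwards from a latency budget to a required decode rate is simple arithmetic; with made-up illustrative numbers (not Pieces' actual targets) it looks like this:

```python
def required_tokens_per_second(budget_ms: float, overhead_ms: float, output_tokens: int) -> float:
    """Decode throughput needed to emit `output_tokens` within the budget,
    after subtracting fixed prefill/startup overhead."""
    decode_budget_s = (budget_ms - overhead_ms) / 1000.0
    return output_tokens / decode_budget_s

# e.g. a short title (~12 tokens) inside a 100 ms budget with 40 ms of
# prefill/overhead needs roughly 200 tokens/second of decode throughput.
print(round(required_tokens_per_second(budget_ms=100, overhead_ms=40, output_tokens=12)))
```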

Other questions?

When are we going to be able to use AI for no-keyboard, no-mouse interactions to do our tasks?

We're waiting on Neuralink or Apple Vision Pro, not really sure which one's going to hit first. If I was thinking at the speed of OpenAI's output, those tokens per second, I'd probably be pretty bothered. We need to get that way up. We need to make it efficient. It's quite expensive right now to process all the data as we're just throwing it up at these large language models. It's kind of like the human brain is a quantum computer walking around and operating on 13.5 watts. We need to hit that level first. I can't give you the answer, but I know we're well on our way.

I love my keyboard. I don't think I'd ever want to be in a situation where I just talk to my computer.

We're getting there, at least heads-up displays.

You showed how you literally watch what's being done. Any way of going

the other way around in the future, where instead of just being in a tiny window, it draws on top of the apps to show you what buttons to click and what the definitions are?

I think it'll be really tricky, but if we fail at building developer tools, at least we could open-source the system so robots could order at McDonald's. We know that button's important to click, we're going to do that one. We're looking at the URL, the tab name, what's scrolling, and so on. Vision processing is going to be important. Our eyes are pretty important for us, and these systems need to be able to do that. If we could build a great foundation on that to overlay recommendations of actions to take, that would be cool. But we'll get there. That's maybe Series F or something.

One more question.

That's exactly right. Our models strictly understand technical language. If you asked it a question about poetry, it would not give you a good output. That's how we're

able to detect if you're working on technical  language or natural language. Our models are   biased towards code. We've trained them on a lot  of code. When you distill a model down, you have   to be very specific in what type of environment  it's going to perform well in. When you take it  

outside of a technical language environment into a heavy natural language environment, the performance is capturable but not helpful. If you ask the co-pilot about the bank statement, it might do all right, but we don't care about that. We filter all that out.

Just don't tell your employer you're using your work laptop for 25% personal stuff. We

get around this with a lot of clustering.  We came to co-pilots after we'd done a lot   of work in embedding clustering and search.  We leverage that work. At the end of the day,   you can run this all on your device. If  you're using that device half and half   for personal and work, feel free to use our  co-pilots half and half for personal and work.
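As a toy illustration of that embedding-clustering idea, grouping snippet embeddings so related materials surface together; the embeddings here are random stand-ins for real ones, and the cluster count is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))  # stand-in for e.g. 384-dim snippet embeddings

kmeans = KMeans(n_clusters=8, random_state=0).fit(embeddings)
labels = kmeans.labels_

# Materials sharing a cluster can be surfaced together (related snippets, links,
# conversations), and whole clusters can be filtered before reaching a co-pilot.
print(np.bincount(labels))
```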

One more.

I'm trying to think of this in the context of my everyday workflow. Let's say I've been using Pieces for three months. It knows exactly what I've been doing. It knows my Kubernetes clusters and all that. Do you have something on your roadmap where it interacts with Kubernetes, where I say, "Hey Pieces, remember that API that I built that has this CI/CD workflow in GitHub and is deployed to this cluster. Do that again." Is that something?

We're really interested in human-in-the-loop  systems. A lot of work is being done around   these autonomous agents. We see a lot of  interplay between co-pilots. When we get   into the operational execution side of things,  we'll slow roll that a bit before we say,   "We know all the docs, all the people,  go out and do it for you." We're quite   far away from that and we'll let  GitHub take the lead there. For us,   the stuff you interact with as a developer,  that's at least 30-40 tabs worth, a couple   of conversations, and definitely a Jira  or GitHub issue ticket. All those dots we  

want to connect for you. When it comes to  executing, hopefully, you're still the one doing that. Otherwise, if it's fully autonomous,   we might have to change the way we  write code. We'll deal with that later.

Cool, I think that's it. Awesome, thank you guys.
