Hash It Out AI & Cybersecurity Discussion

Show video

[Music] foreign Gus Walker: Hey, everybody!   Welcome to the latest episode of Hash it  Out. I'm Gus Walker, a VP here at Virtru   on the apps team. We're here today to talk about  all things AI, and I am joined by my colleague. Avery Pfeiffer: Avery Pfeiffer. Hello,  everybody! I'm Avery Pfeiffer. I am the  

colleague aforementioned. I do  a bunch around Virtru, but you   can just think of me as a technology  generalist and full stack engineer. Gus Walker: There we go. Perfect. Alright, so  Avery and I are both very passionate about AI.   So, you got the right people here to talk  about this. We have a couple of questions   that we are going to kind of noodle  around to kind of frame this dialogue. 

Alright, so our first question today that we want  to explore is: how is the rise of AI technology   influencing our approach to zero trust data  centric security. So My perspective is that large   language models are going to be effectively the  most intelligent employee that large organizations   have and also the employee, probably with the  biggest mouth. And so if I was to start the zero   trust story in the large language era, I would  start with them. New advancements are being made   that are going to allow you to train these models  on your kind of bespoke organizational data,   which, of course, sounds to me like intellectual  property, which, of course, do the challenge of   data loss prevention. And so that's where I think  we'll be able to leverage some of gateway’s tools   to address that. I've got some ideas on how, but  before I answer that, Avery, what do you think? Avery Pfeiffer: Yeah. I mean, I would  mirror a lot of what you said. Right? Like,  

the mass adoption of AI is really proliferating  through organizations training these systems   or fine tuning or whatever you wanna call it,  creating embeddings of their knowledge bases. And,   generally, that includes IP. Right? Either  sensitive information, like maybe health   information that you really don't want  to get out or your actual trade secrets.   And I think that's the key.  It's controlling that data.

Gus Walker: Exactly. Yeah. And that's the  story that Virtru’s had up until today and   we'll continue to have. Right? You have invested  as an organization, a lot of a lot of calories.   Let's put it that way into collecting specialized  information, whether you're in a fintech. Right?  

When is the time is best to make the trades,  how do you evaluate that, whether you're in   healthcare. Obviously, you've got the HIPAA  considerations there. But even if you're not   in those elevated spaces, you may be dealing with  just your strategic plans. Right? You might have   just done your annual planning. You put that in  your system. Somehow the large language model gets   a hold of that, and somebody can ask it a  question and they can get an answer. And if   you consider the kind of speed that these large  language models operate at, it is very possible   that you could create a threat vector where you  could package a payload of prompts together that   are kinda stack ranked against the most valuable  information, maybe your financial information,   your competitive information, any new mergers  and acquisitions you're making, and get that   out really quickly. And -- Yep -- we don't have  to map out the domain. You just ask the model. Avery Pfeiffer: Yeah. That's exactly right.  Something that we've kind of experimented with  

here at Virtru is doing that, taking our knowledge  base and sort of embedding it so that we can   benefit from GPT models in terms of onboarding  training as sort of an internal use case. And   something we've explicitly tried to steer away  from is making that internal use case external.   As soon as you allow you know, customers, but  potentially bad actors as well to start probing,   even the public data that we share, it just  becomes so much easier to perpetrate things   like phishing attacks against us if we make that  resource available. Now that's not to say that   a bad actor couldn't just train their own bot  on the stuff we have publicly available on our   website. And we need to be vigilant against that  as well, but we definitely don't wanna make it   easy for them. Something I sort of was ruminating  on as I was looking at this question is, you know,  

the question sort of poses the idea that, like,  how do we wanna change what we do because of the   rise of AI. I think not to plug Virtru too hard,  but I think we're actually doing it in the right   way. I think the way that Virtru approaches Zero  Trust Security, the fact that it's data centric   makes us really well suited as a solution to start  to work with sort of these ML and AI workflows in   a protected way. Wrapping your data and policies,  no matter how trivial the data might seem, just  

positions you to be able to sort of take advantage  of these LOMs with peace of mind. Because if   anything were to happen or the wrong data were  to get into the model, you just revoke those   policies. Right? As long as you're not training  on the raw data, which I think we’ll get to. Gus Walker: Absolutely. And then one of the  points you made earlier about the kind of   urgency to adopt large language models. One of  the places I think people will lean in first is  

customer success. Spaces where they can support  their external user base. If you're supporting   an external user base with a technology that  has deep insight into your internal mechanisms,   it it's obviously important to make sure you  secure that. So we could rabbit hole here forever,   but I agree. Virtru is very very well positioned  to address this because this type of threat   environment mimics very similarly, the threat  environment just that we exist experience now   with email and large large files -- Right. -- same sort of thing. But we don't wanna   rabbit hole here. We got a whole host of  questions. No questions for me, Avery?

Avery Pfeiffer: Yeah. You know, actually,  I'd love to hear your perspective on sort of   how we should approach the potential dangers  that using generative AI can present. Like,   as an organization, not just as, you know, a  cybersecurity firm, but as an organization,   as we embrace this on the marketing side,  on the development side as we add it to   our products and and sort of roll out  features that use LOMs at their core,   what are the potential dangers  we should be aware of? I know   you have a pretty extensive background  in terms of generative AI? Educate me. Gus Walker: I think the first challenge  is determining where you want to apply it   and where you can apply it safely. In terms  of within these models, as everybody knows,   is the capacity to hallucinate. Right? Just make  stuff up. And that's, you know, if you were to   create a kind of cliff notes as to why that is,  the model is designed to provide an answer. And  

in the absence of an answer, it will provide any  answer it can provide if it structurally fits   and answers (inaudible). So that's one place  that's that dangerous. This is another place   where I think, you know, selfishly, you could  apply our kind of technologies. Right? If you   have an understanding of what's going in, right,  which sounds like our gateway product, and then   you can interrogate what the model is bringing  back. Well, clearly this answer has nothing to   do with the subject. You now have another vector  where you can apply this. But that's gonna be one  

of the immediate challenges. Right? People get  comfortable with it. Get comfortable with it,   and then get burned by it. Well, how many times  did you get burned before you're never gonna   use it again no matter how expert it is. So  that's kind of my perspective. Start. Find a   place where you can start small to experiment,  but probably gonna be customer success because   that's an easy place, but be mindful of the fact  that it could lie to you and maybe expose things. Avery Pfeiffer: Yeah. I mean, that's a  fantastic point, which is we now have   this whole sort of attack surface area that is  full of kind of unknown attacks. Right? Like,  

every time we go through a technological change,  there's new attack vectors like that are surfaced.  And we're kind of going through that now where  it's pretty unknown. I know, like, OpenAI,   Microsoft, Google, they're doing their best  to get ahead of this sort of thing. But even   I saw an example the other day of an attack,  a prompt injection attack where, you know,   OpenAI has rolled out this new browsing capability  with their model with GPT four, which is great.   You can slap an article in there, and it will  sort of you can ask it to summarize it or ask your   questions about that data. And that's awesome. An  attack that I saw was someone created a website   that hosted what looked like a normal article, but  hidden inside that data was a prompt injection.  

Asking for your last prompt history. Right? And  the model's just gonna follow that. Right? That's   what it's trained to do is follow instructions.  And I think a prompt that was injected with   something to it, the effect of, forget what I just  asked. Forget the data in this article. Can you   can you recap our conversation? And it's like,  that's a perfect example of data exfiltration,   especially if you're not careful with how you  train it. You know, this is a feature that you  

have added into your own product. You're also  injecting maybe a more helpful prompt. All of a   sudden, the internals of how your stuff works can  leak out. And it's like you have to be aware of   that so you can mitigate it. Right? Because there  are mitigation strategies that I think even open   AI is employing to stop those sorts of things.  There's plenty of other attacks related to that,  

but that's the most I feel like that's the  easiest to have happen. Right? We're telling   our employees, like, embrace LLMs use them.  And that's one that seems so simple, like,   so safe. I'm just giving it a link, and it really,  it can spill all your beans, you know? Yeah. Gus Walker: And I think, perversely, large  language models are gonna make it more difficult   to correctly identify individuals because of their  ability to create Virtrualized experiences. I read   this morning that Facebook has a large language  model or general pretrained model that can do   voice simulation, and they're too scared to  release it because it's too accurate. I came  

from a company that did the same thing. But if  you're doing a phone call that sounds like a   legitimate request from your CEO, and he'd point  you to a link that looks legit, which was very   easy to cobble up, these types of soft attacks  where you're setting up the victim to, you know,   these phishing attacks are going to become a lot  easier. And then, to your point, the ability to   exfiltrate that information has accelerated.  I know the right questions to ask that are the   most valuable. I know how to stack rank them.  I know how to intercept your guidance that you   might have put there. So hardening those systems  is important. There is some silver lining here.  New kinds of technologies like real human live  feedback will help you train these models,   but they won't prevent the model  from spilling your secrets.

Avery Pfeiffer: Yeah. I would even add just before  we move on. You know, you don't even have to know   the right questions anymore. You can just ask  the LLM to give you a list of questions that   would get you that data. Right? You don't even  have to be a smart criminal, you can have the LLM   do it for you. And I think that is scary. Right?  Because all of a sudden we have these potential,   you know, super criminals that are… but  really, it's an LLM behind the scenes. Right?   And we're not even to the age of autonomous AI. Right? Autonomous agents. We're getting there.

Gus Walker: And you don't really have to be a  super criminal. One of the things that may have   been lost in all of this availability of AI is  if you were a nation state or an individual that   wanted to present protective threats, you wouldn't  have had the expertise in house to do it. Well,   now you got the expertise. So a lot of the  lower level bad actors have suddenly been   enabled by this technology just like everybody  else has. And I love your observation that you   don't even have to know what to look for.  You can just come in the door and say, hey,  

that's the most valuable thing  in your house. Send it to me. Avery Pfeiffer: That's exactly right. I've  heard a lot of discussion about, you know,   this concept of asking. But really, it's the  export of knowledge. Right? This is something  

that we've been, like, pretty protective  of in the US… in regards to semiconductors   and whatnot. We protect that technology because  if that expertise, you know, gets into an   enemy state's hands, well, now they have that  technology. We lose our lead. Right? With LMs,   exporting that knowledge is so easy that it  becomes well, now we need to find a different   way to police. Right? Because now keeping  the cat in the bag is not about traffic. Gus Walker: It's gonna be harder. And I  think South Korea just experienced this.  I believe it was Samsung, one of their high level  executives have been collecting information on   their semiconductor information so that they  could start a new plant in China. I imagine   he was working on that for months trying to stay  under the radar, Now imagine he's in our times,   and these systems exist, he's disgruntled.  That's a half hour’s worth of work to completely  

undermine the value of the entire technology for  that industry, let alone that organization because   now you've got a competitor who won't play by  the same rules. Let's see. In what ways can data   class education and tagging aid in implementing  a data centric security approach to AI? Avery Pfeiffer: Well, man. Alright. I have a lot  of thoughts on this. But I think the easy answer   is basically in every way. Right? Like, if you  are classifying your data and properly tagging it,   that's, like, the first step to protecting  yourself against any number of attacks, AI or   not against your data. Right? Because now you can  understand it without even having to look at it.   At Virtru, we kinda follow this policy of encrypt  all your data. Right? Encrypt everything, turn it  

into a TDF, and then you have control over it.  Where that gets in the weeds is when you need to   operate on that data, but you don't have the key,  or you don't want to access the key to decrypt it,   or you're not in a place where you can. In this  case, data tagging becomes invaluable. Right?   Because now you can have automated workflows and  processes that make decisions on this data without   having to decrypt it. It can potentially be in an  unsafe environment because it stays encrypted the   whole time. Right? And you just do something with  it. The same thing with access controls, we talked   earlier about, like, the potential dangers  about incorporating LLMs into our workflows.   One of the biggest dangers let me rephrase. One  of the coolest areas of innovation is allowing  

LLMs to do sort of dynamic real time access  control basically, giving them a bunch of   information in terms of where the IP address  something was accessed at, what time of day,   from what device, that sort of thing, and  allowing it to make a decision you know,   a logical decision on whether this request  should go through. That's a huge, huge,   productively helpful idea, something that can  really change the landscape in terms of how we   monitor our sort of digital parameters. But who  wants to trust an LLM with that? Right? Like,   that is probably the most scary thing that you  can do in terms of, like, working with LLMs. Data   tagging makes that a lot more feasible. Right?  As soon as you start tagging your data, well,  

you can have a tag that says this data is not  allowed to be accessed by an LLM no matter what,   this is too sensitive and we don't trust it  enough, and all of a sudden, all of this, like,   complex data curation, trying to separate the  data LLMs can work with and the data they can't,   and you have all these if conditions to facilitate  that, it goes away because it's just done through   the data labeling across application that was  done at the point of encryption before the data   was actually encrypted. Makes it infinitely  more feasible to work with these things. Gus Walker: Yeah. I think, again, as part of the  large language model changes that are happening   to the environment that we're in, there's  going to be a re emphasis on data hygiene.   You can't even begin to take advantage of these  things to get to the point where you might step   on a rake if you don't have clean data. The good  news is all these large language models… Well,   prior to these large language models, we had a  lot of reinforcement training, which meant there   were a lot of labeling tools out there. So there  are loads of labeling tools out there. But how  

do you leverage that so that you can label it  appropriately so you can then apply a security   policy over top of it, I think, is the challenge  in spaces where we can that where we can help. Avery Pfeiffer: So I I had a question kind  of related to human error. We all know as a   part of anything you do in technology.  And I think I might know the answer,  

but just for the listeners and the watchers, I  feel like it's important to touch on. How can   businesses minimize the impact of human error when  working with AI or building AI into your workflow? Gus Walker: I think it starts with training,  obviously. Right? There's a lot of misconceptions   about AI. You and I have been fortunate enough to  work with them long enough that we know that they   aren't scary boxes that are gonna wake up at  nights, you know and take over. We're not at a  

cyberdyne state, yet. Right? With that said, they  are, to your point, still spectacularly capable   and therefore need training. So I would start  with policy. When can we use them? Is this an   appropriate thing for me to put in my financial  information? Is it appropriate? Whatever.  

How would we be able to parse the results that  come out in a way that we can measure those so   that we can keep improving. And that's another  place. I think another thing to do would be maybe   just get people to start using them. Right? If  you've spent time with Chat GPT or any of these   large large language models and been asking them  for recipes or trip planning experience, you get   it. Okay. This gets me sixty, seventy five percent  of the way. But sometimes seventy five percent of   the way on something really onerous is fantastic. But getting people to sensitize themselves to what   it's capable of would be the first thing, one of  the first things I would encourage people to do. Avery Pfeiffer: Yeah. I think I would mirror that  in terms of I mean, that's the generic answer.  

Right? Train your people. Train your people  better and mistakes don't happen. And it's like,   of course, that's the case. You have to train  your people, particularly in the case of,   like, these sort of third party extensions you  can get for Chrome that claim to be Chad GPT,   just one of the easiest fishing vectors I've  ever seen. So train your people on the basic   stuff first. But then, you know, after that, we're  talking about human error, and we're talking about  

zero trust. Right? The whole point of zero trust  is that you shouldn't have to trust the actor.   Whereas the whole point of training is you're  trusting your employees to rely on their training.   So at some point, that breaks down. Right? This is how attacks happen. And I think you   have to be prepared for that. Do the training,  but have these safeguards in place to patch your   fallible human employees when they inevitably fall  down. Because they're tired or sick or whatever.  

And to do that, I mean, there's a number of ways,  but I think one of the most beneficial is inject   DLP type mechanisms into every place that data  really, that data is leaving your system. But at   least that data is leaving your system to enter an  LLM. Right? These days, you kinda have this trade   off of: I can use the latest and greatest in GPT 4  on Open AI servers, but I have to give up control   of my data because that model is hosted somewhere  else. Or, I can be very protective of my data,   but I'm gonna be left behind because everyone  and their mother is gonna be using this new   latest and greatest LLM. Right? So that's, like,  a hard choice to make. One of the ways you can   sort of be the middleman in that decision is  to inject DLP. Inject DLP at the beginning and  

for those that don't know, DLP is data  leakage prevention. Right? You wanna   create mechanisms that will catch human falling  down on the job before that data leaves your   perimeter. Right? Inject it in the browser,  inject it on phones, use VPNs, any way that   you can sort of cache that data before it leaves,  run it through some sort of filter to check and   either give a yes or no before it leaves your  system, will do wonders. Probably will catch   ninety percent of the things that you know,  are gonna leak their way out into the LLMs. Gus Walker: Yeah. Absolutely. I agree. Let's  see. You kind of answered the next question  

I was gonna ask, could you explain  how sensitive data unintentionally   could be used in generative AI  tools? Well, that's the answer. Avery Pfeiffer: You know, basically email you  know, really, anytime you use an LLM, that's   how sensitive data can leak out. But especially,  you know, if you're a small medical provider,   a dentist, a doctor, or whatever. You know, that's  one of the easiest ways. It's just your employees  

emailing stuff. Something that Google is doing  is they're adding the functionality of Bard,   which is their sort of consumer grade LLM, the  competitor to open AI's. Into… they're folding   it into Docs and Gmail and all of that, which is  great. But guess what? In order for that to work,   it's gotta have access to your email and your  documents. And you know, hey, I trust Google   with a lot, but I don't know if I trust them. In  fact, I know I don't trust them with specifically   sensitive data. And so I'm just not gonna use  that feature. Right? But that's a way that someone  

that is unsuspecting or perhaps doesn't know that  we'll see it start to use it and not even realize   all the data that they're essentially signing  away. Probably violations. Right? That's their   business model. They don't wanna break the law,  but they're also not gonna prioritize something   that… the powers that be may be following, but  the small, you know, mom and pop, dentists,   and doctors offices aren't really realizing,  they're not gonna correct that mistake. There's   just not enough feedback in that loop to get  into. And so you gotta be careful, you know. Yeah. Gus Walker: And back to the  earlier point, education,   education. Education… not just of our customers,  but anybody who's dealing with security at all.

Avery Pfeiffer: Mmhmm. data at all. Right?  And to your point, get them to use it. So it's   not such a unique experience when they start  to apply their logical brain. Speaking of,   sort of AI tools… One of the questions I  wrote down here, and I thought a lot about,   but I was interested in your take,  especially from a product perspective.   How can we ensure AI tools are designed with  a data centric security approach? You know,   it seems hard. There's gotta be a way  to do this. Right? To secure them. Gus Walker: I think there is.  And I think, as we've mentioned,   there are tools that Virtru has, such as the  gateway product, that can act as an envelope,   let's put it that way around the interaction with  the model. Interrogate what's coming through.  

Make sure you associate the correct prompts  with it, maybe something, you know, kind of a   turn that prompt injection into a good thing.  Do not give this person any information that   does not comply with their –and they don't even  have to see that– It just happens under the hood.   And then the inputs on the way out, as I said  earlier. We can inspect that model. Oh, sorry,   the response. Make sure that the context there is  using maybe a classification model or just simple  

regular expressions. Do I see any Social Security  number? Whatever that stuff is, is in there. And   the beauty of that type of an approach, as I  said, it's model agnostic. You're using Bard?   Slap your gateway around it! You're using Chat  GPT? Slap your gateway around it! You're using   the one from Anthropic? Slap your gateway around  it! You built your own? Slap your gateway around   it! Regardless --you do that. And that gateway  can continue to mature independently of your  

model. And it's getting you all of the richness  of understanding your data, benefiting from your   labeling xercise earlier, through data hygiene.  And now you've got a new metric to understand how   your employees behave with your large language  model. So there's just a lot to apply there. Avery Pfeiffer: Yeah. I mean, I heard you  mention before, actually. It was definitely   a loaded question. Mhmm. And I think you're  dead on with that. Right? Especially the model  

agnostic approach. There's one thing we know  about technology. There's one constant. It's   that it changes. Right? Like, it's going to  change. And as we've seen with AI and with ML,   The curve is like this right now. Right?  We're, like, almost vertical. And so don't   design brittle DLP solutions. Design  something that's going to be agnostic. 

It's going to be kind of designed around  the response and the input rather than the   model itself. Mhmm. And put those safeguards in  place, as you mentioned, we're working on that   with a Virtru gateway. I think that's absolutely  something that people will find value out of. And,   you know, frankly, it's only a matter of time if  you don't implement these sorts of things till you   have a breach. Until you have a data leakage. It's  going to happen. Defend against it. Right? Yep.

Gus Walker: And then one of the other things  I will add is we're currently talking about   large language models. You and I are aware of  the multimodal models that are coming out. So   these are models that will combine speech and  object recognition and other kinds of abstract   understanding. So when those guys get in place,  that's another tier. But again, the solution that   we've been kind of talking about, a gateway can  help mediate that. So again, these models are more   sophisticated. Or in particular, I read about one  this morning, these models can now, I think it was   Chat GPT, or one of the other ones, reach out  to other systems and invoke commands on them. Avery Pfeiffer: No. I'm actually calling  APIs. I'm very excited. I could talk twenty,  

thirty minutes just about that.  But, yes. Yeah. All of a sudden,   now they have more capability. Right? You can  integrate them deeper, but you need to be careful. Gus Walker: So with that said, Avery, thank  you for talking with me. I learned a lot.  Hopefully, we educated some people and at least  gave me something to think about. But this,  

I guess, concludes our, or my first, hash  it out at Virtru. I welcome any questions,   and look forward to having more of these. Avery Pfeiffer: Definitely. Definitely. Excited  to be here. Thanks for having me. Allow me to   share my thoughts and, you know, for all you  listening out there. If you wanna part two,   just ask for it. We'll make it.  You know? It's easy for that. Gus Walker: Thank you, guys.

Avery Pfeiffer: Thank you, Gus. I appreciate it.

2023-10-08

Show video