CS50x 2024 - Cybersecurity
[MUSIC PLAYING] DAVID MALAN: All right, one last time. This is CS50, and we realize this has been a bit of a fire hose over the past-- thank you. [APPLAUSE] Thank you. We realize this has been a bit of a fire hose.
Indeed, recall that we began the class in week 0, months ago with this here MIT hack, wherein a fire hose was connected to a fire hydrant, in turn connected to a water fountain. And it really spoke to just how much information we predicted would be sort of flowing at you over the past few months. If you are feeling all these weeks later that it never actually got easy, and with pset 1 to pset 2, pset 3 on to pset 9, you never quite felt like you got your footing, realize that it's kind of by design because every time you did get your footing, our goal was to ratchet things up a little bit more so that you feel like you're still getting something out of that final week. And indeed, that final week is now behind us.
All that remains ahead of us is the final project. And what we thought we'd do today is recap a little bit of where we began and where you hopefully now are. Take a look at the world of cybersecurity, because it's a scary place out there, but hopefully you're all the more equipped now with a mental model and vocabulary to evaluate threats in the real world, and as educated people, make decisions, be it in industry, be it in government, be it in your own personal or professional lives. And we hope ultimately, too, that you've walked away with a very practical skill, including how to program in C, how to program in Python, how to program in SQL, how to program in JavaScript in the context, for instance, of even more HTML, CSS, and the like.
But most importantly, we hope that you've really walked away with an understanding of how to program. Like, you're not going to have CS50 by your side or even the duck by your side forever. You're going to have really, that foundation that hopefully you'll walk out of here today having accumulated over the past few months. And even though the world's languages are going to change, new technologies are going to exist tomorrow, hopefully, you'll find that a lot of the foundations over the past several months really do stay with you and allow you to bootstrap to a new understanding, even if you never take another CS course again. Ultimately, we claim that this was all about solving problems.
And hopefully, we've kind of cleaned up your thinking a little bit, given you more tools in your toolkit to think and evaluate and solve problems more methodically, not only in code, but just algorithmically as well. And keep this in mind too. If you're still feeling like, oh, I never really quite got my footing, think back to how hard Mario might have felt some three months ago.
But what ultimately matters in this course is indeed, not so much where you end up relative to your classmates, but where you end up relative to yourself when you began. So here we are, and consider that there delta. And if you don't believe me, like, literally go back this weekend or sometime soon, try implementing Mario in C. And I do dare say it's going to come a little more readily to you. Even if you need to Google something, ask the duck something, ask ChatGPT something just to remember some stupid syntactic detail, the ideas hopefully are with you now for some time.
So that there hack is actually fully documented here in MIT. Our friends down the road have a tradition of doing such things every year. One year, one of my favorites was they turned the dome of MIT into a recreation of R2-D2. So there's a rich history of going to great lengths to prank each other, or even us here Harvard folks akin to the Harvard Yale video we took a look at last time. And this duck has really become a defining characteristic of late of CS50, so much so that last year, the CS50 Hackathon, we invited the duck along. It posed, as it is here, for photographs with your classmates past.
And then around like, 4:00 AM, it disappeared, and the duck went missing. And we were about to head off to IHOP, our friends from Yale. Your former classmates had just kind of packed up and started driving back to New Haven.
And I'm ashamed to say our first thought was that Yale took it. And we texted our TA friends on the shuttle buses, 4:30 AM asking, hey, did you take our duck because we kind of need it next week for the CS50 fair? And I'm ashamed to say that we thought so, but it was not in fact, them. It was this guy instead, down the road. Because a few hours later after I think, no sleep on much of our part, we got the equivalent of a ransom email. "Hi, David, it's your friend, bbd.
I hope you're well and not too worried after I left so abruptly yesterday night after such a successful Hackathon and semester so far. I just needed to unwind a bit and take a trip to new places and fresh air. Don't worry though, I will return safe, sound, healthy, home once I am more relaxed. As of right now, I'm just spending some few days with our tech friends up Massachusetts Avenue. They gave me a hand on moving tonight. For some reason, I could never find my feet, and they've been amazing hosts.
I will see you soon and I will miss you and Harvard specially our students. Sincerely yours, CS50 bbd." So almost a perfect hack.
They didn't quite get the DDB detail quite right. But after this, they proceeded to make a scavenger hunt of sorts of clues here. This here is Hundredville. And so in Hundredville, they handed out flyers to students at MIT, inviting folks to write a Python program to solve a mystery.
"The CS50 duck has been stolen. The town of Hundredville has been called on you to solve the mystery of the-- authorities believe that the thief stole the duck and then shortly thereafter took a walk out of town. Your goal is to identify who the thief is, what school the thief escaped to, and who the thief's accomplice is who helped them escape. This took place on December 2, 2022, and took place at the CS50 Hackathon."
In the days to come, we proceeded to receive a series of ransom postcards as the duck traveled, not only to MIT to Professor John Guttag's 6.100B class, which is a rough equivalent of CS50 down the road. Pictured there is our CS50 duck with some tape on its torso. But then the duck took, apparently, a ride, either in actuality or with Photoshop, not only there, took a tour of the Charles River in front of Harvard, the Charles in front of Boston. It went all the way over to Yale. We then received this postcard from Princeton all the way over from Stanford.
Duck took a flight according to this photo here, and then saw a bit of the world as well. So eventually, we received a follow-up email saying, "Hi, David. I intend to arrive for the fair between 8:37 AM and 9:47 AM. It would be easier for my MIT hacker friends to bring me to the right location if there's someone waiting there with a sign that says 'Duck'." I'm not sure if we actually stood there with a sign holding duck, but it turns out they came actually earlier in the morning to escape detection altogether. The duck found its home and everyone lived happily ever after.
And here the duck is again today. But our props to our friends down the road at MIT for returning the duck safely and for going to such crazy lengths to put us in the annals of MIT's Hacks Gallery. In fact, in exchange for this, we sent them a little package.
And without telling you what it is, you can read more about this here hack that's now been immortalized on hacks.mit.edu at this URL here. So maybe round of applause for our friends down the road for having pulled that off a year ago. [APPLAUSE] So before we dive into some of today's material, I wanted to give you a sense of what lies ahead as well. So this year's CS50 Hackathon is an annual tradition, whereby students here at Harvard and our friends from Yale who will take buses in the other direction to join us in about a week's time for an epic all-nighter, starting roughly at 7:00 PM ending roughly at 7:00 AM will be punctuated by multiple meals, first meal-- first dinner around 9:00 PM, second dinner around 1:00 AM.
And those of you who still have the energy and are still awake around 5:00 AM, we'll hop in a shuttle bus and head down to IHOP, the larger one down the road, not the one in the square, and have a little bit of breakfast together. The evening typically begins a little bit like this with a lot of energy, the focus of which is entirely on final projects. The staff will be present, but the intent is not to be 12 hours of office hours. Indeed, the staff will be working on their own projects or psets, final projects, and the like, but to guide you toward and point you in the direction of solutions to new problems you have.
And we do think that the duck, and in turn, AI, CS50.ai and other tools you'll now be able to use, including the actual ChatGPT, the actual GitHub Copilot, or other AI tools which are now reasonable to use at this point in the semester as you off board from CS50 and enter the real world, should be an opportunity for you to take your newfound knowledge of software out for a spin and build something of your very own, something that even maybe the TFs and myself have never dabbled in before, but with all of this now software support by your side.
This here is our very own CS50 shuttles that will take us then to IHOP. And then a week after that is the epic CS50 fair, which will be an opportunity to showcase what it is you'll pull off over the next few weeks to students, faculty, and staff across campus. More details to come, but you'll bring over your laptop or phone to a large space on campus. We'll invite all of your friends, even family if they're around.
And the goal will be simply to have chats like this and present your final project to passersby. There'll be a bit of an incentive model, whereby anyone who chats you up about their project, you can give a little sticker to. And that will enter them into a raffle for fabulous prizes to grease the wheels of conversations as well.
And you'll see faculty from across campus join us as well. But ultimately, you walk out of that event with this here CS50 shirt, one like it, so you too, can proudly proclaim that you indeed took CS50. So all that and more to come, resting on finally, those final projects. But how to get there. So here is some general advice that's not necessarily going to be applicable to all final projects. But as we exit CS50 and enter the real world, here are some tips on what you might read, what you might download, sort of starting points in answer to the FAQ, what now? So for instance, if you would like to begin to experience on your own Mac or PC more of the programming environment that we provided to you, sort of turnkey style in the cloud using cs50.dev,
you can actually install command line tools on your own laptop, desktop, or the like. For instance, Apple has their own. Windows has their own. So you can open a terminal window on your own computer and execute much of the same commands that you've been doing in Linux this whole term. Learning Git, so Git is version control software.
And it's very, very popular in industry. And it's a mechanism for saving multiple versions of your files. Now, this is something you might be familiar with if still, even using file names in the real world, like on your Mac or PC-- maybe this is resume version 1, resume version 2, resume Monday night version, resume Tuesday, or whatever the case may be. If you're using Google documents, this happens automatically nowadays.
But with code, it can happen automatically, but also more methodically using this here tool. And Git is a very popular tool for collaborating with others as well. And you've actually been secretly using it underneath the hood for a lot of CS50's tools. But we've abstracted away some of the details.
But Brian, via this video and any number of other references, can peel back that abstraction and show you how to use it more manually. You don't need to use cs50.dev anymore but you are welcome to. You can instead install VS Code onto your own Mac or PC. If you go to this first URL here, it's a free download. It's actually open source.
So you can even poke around and see how it, itself is built. And at CS50's own documentation, we have some tips for making it look like CS50's environment even if longer term, you want to cut the cord entirely. What can you now do? Well, many of you for your final projects will typically tackle websites, sort of building on the ideas of problem set 9, CS50 finance and the like, or just generally something dynamic.
But if you instead want to host a portfolio, like just your resume, just projects you've worked on and the like, static websites can be hosted for free via various services. A popular one is this URL here, called GitHub Pages. There's another service that offers a free tier called Netlify that can allow you to host your own projects statically for free. But when it comes to more dynamic hosting, you have many more options.
And these are just some of the most popular. The first three are some of the biggest cloud providers nowadays, whether it's Amazon or Microsoft Azure or Google services. If you go to this fourth URL here, this is GitHub's education pack, they essentially broker with lots of different companies to give students, specifically, discounts on or free access to a lot of tools. So you might want to sign up for that while you're eligible. And then lastly, here are two other popular third-party, but not free services, but that are very commonly used when you want to host actual web applications.
So maybe it's Flask, maybe it's something else, but something that involves some input and output. Questions meanwhile-- so there's just lots of communities. If you want to keep an eye on what's happening in tech, these are just some of the popular options. And undoubtedly, if you have some techie friends, they'll have suggestions as well.
But you might find some of these destinations of interest. Of course increasingly, you will just ask questions of software itself, AI, whether it's ChatGPT, GitHub Copilot, or the like. And then classes, we're clearly a little biased here with what's on the screen. So these aren't college classes per se, but freely available OpenCourseWare courses that CS50's team has put together over time. And in a nutshell as you can infer from the suffix of each of these URLs, if you want to learn more about Python, CS50 has got a free, open online class for that, or SQL, thanks to Carter, web and AI stuff, thanks to Brian, a games class, thanks to Colton, cybersecurity, which will extend where we leave off today. And then if you're more interested, not so much in coding and going more deeply into software, but want to take a step higher level and focus more on intersections of computer science with business or law or technology, those too are freely available, if you're looking for something to do over January, over the summer, or just to dabble over time.
And there's innumerable other free resources from other folks on the internet as well certainly too. All right, so a few invitations and thank yous. So one, after today, after we dive into and out of cybersecurity, please do stay in touch via any of CS50's online communities.
As we start to recruit next year's team for teaching fellows, teaching assistants, course assistants, we'll be in touch via email for those opportunities as well. And now some thanks for the group before we then dive into here today's topic. So one, allow me to thank our hosts here for giving us access to such a wonderful, privileged space to just hold classes in, the whole team for Memorial Hall.
Our thanks too, to ESS, which is the team that makes everything sound so good in spaces like this with music, mics, and the like, our friends, of course, Wesley down the road at Changsho, where we went most every other Friday this semester. If you've never actually been, or if you're hearing this online, please join our friends at Changsho on Mass Ave down the road any time you might like. And then especially, CS50's team-- there's quite a few humans operating cameras in the room, both here and way in back, as well as online. My thanks. [APPLAUSE] Thank you to them for making this look and sound so good.
And what you don't see is when I do actually screw up, even if we don't fix it in real time, they very kindly help us go back in time, fix things, so that your successors have hopefully, an even improved version as well. And then as well, CS50's own Sophie Anderson, who is the daughter of one of CS50's teaching fellows who lives all the way over in New Zealand, who has wonderfully brought the CS50 duck to life in this animated form. Thanks to Sophie, this duck is now everywhere, including most recently, on some T-shirts too.
But of course, we have this massive support structure in the form of the team. This is some of our past team members, but who wonderfully via Zoom you'll recall in week seven, showed us how TCP/IP works by passing those envelopes up, down, left, and right. I commented at the time, as a disclaimer, that it actually took us quite a bit of effort to do that. And so I thought I would share as a representative thanks of our whole teaching team, whether it's Carter and Julia and Ozan and Cody and all of CS50's team members in Cambridge and New Haven, I thought I'd give you a look behind the scenes at how things go that you don't necessarily see.
So let me switch over here and hit play. [VIDEO PLAYBACK] [INAUDIBLE] [INAUDIBLE] Buffering. OK.
Josh? Nice. Helen? Oh. [CHUCKLING] [INAUDIBLE] Moni-- no, oh, wait.
That was amazing, Josh. Sophie. Amazing. That was perfect. Moni.
[LAUGHTER] I think I-- [INTERPOSING VOICES] - Over to you, [INAUDIBLE]. Guy. That was amazing. Thank you all.
- So good. [END PLAYBACK] DAVID MALAN: All right, these outtakes aside, my thanks to the whole teaching team for making this whole class possible. [APPLAUSE] So cybersecurity, this refers to the process of keeping secure our systems, our data, our accounts, and more. And it's something that's going to be increasingly important, as it already is, just because of the sheer omnipresence of technology on our desks, on our laps, in our pockets, and beyond. So exactly what is it? And how can we, as students of computer science over the past many weeks, think about things a little more methodically, a little more carefully, and maybe even put some numbers to the intuition that I think a lot of you probably have when it comes to deciding, is something secure or is it not? So first of all, what does it mean for something to be secure? How might you as citizens of the world now answer that question? What does it mean to be secure? AUDIENCE: Resistant to attack. DAVID MALAN: OK, so resistant to attack, I like that formulation.
Other thoughts on what it means to be secure? What does it mean? Yeah. AUDIENCE: You control who has access to it. DAVID MALAN: Yeah, so you control who has access to something. And there's these techniques known as authentication, like logging in, authorization, deciding whether or not that person, once authenticated, should have access to things.
And, of course, you and I are very commonly in the habit of using fairly primitive mechanisms still. Although, we'll touch today on some technologies that we'll see all the more of in the weeks and months and years to come. But you and I are pretty much in the habit of relying on passwords for most everything still today. And so we thought we'd begin with exactly this topic to consider just how secure or insecure is this mechanism and why and see if we can't evaluate it a little more methodically so that we can make more than intuitive arguments, but quantitative compelling arguments as well.
So unfortunately we humans are not so good at choosing passwords. And every year, accounts are hacked into. Maybe yours, maybe your friends, maybe your family members have experienced this already. And this unfortunately happens to so many people online.
But, fortunately, there are security researchers in the world that take a look at attacks once they have happened, particularly when data from attacks, databases, are posted online or on the so-called dark web or the like and downloaded by others for malicious purposes, they can also conversely provide us with some insights as to the behavior of us humans that might give us some insights as to when and why things are getting attacked successfully. So as of last year, here, for instance, according to one measure are the top 10 most popular, a.k.a. worst passwords-- at least according to the data that security researchers have been able to glean-- by attacks that have already happened. So the number one password as of last year, according to systems compromised, was 123456. The second most, admin. The third most, 12345678.
And thereafter, 123456789, 1234, 12345, password, 123, Aa123456, and then 1234567890. So you can actually infer-- sort of goofy as some of these are-- you can actually infer certain policies from these, right? The fact that we're taking such little effort to choose our password seems to correlate really with probably, what's the minimum length of a password required for systems? And you can see that at worst, some systems require only three-digit passwords. And maybe they might require six or eight or nine or even 10. But you can kind of infer corporate policies from these passwords alone. If you keep going through the list, there's some funnier ones even down the list that are nonetheless enlightening. So, for instance, lower on the list is Iloveyou, no spaces.
Sort of adorable, maybe it's meaningful to you. But if you can think of it, so can an adversary, so can some hacker, so much so that it's this popular on these lists. Qwertyuiop, it's not quite English, but it's derivative of English keyboards. Anyone? Yeah, so this is, if you look at a US English keyboard, it's just the top row of keys if you just hit them all together left to right to choose your, therefore, password.
And then this one, "p@ssw0rd," which has an at sign for the A and a zero for the O, which I'm guessing some of you do similar tricks. But this is the thing too, if you think like you're being clever, well, there's a lot of adversaries out there who are just as good at being clever. So even heuristics like this that in the past, to be fair, you might have been taught to do because it confuses adversaries' or hackers' attempts, unfortunately, if you know to do it, so does the adversary. And so your accounts aren't necessarily any more secure as a result. So what are some of our takeaways from this? Well, one, if you have these lists of passwords, all too possible are, for instance, dictionary attacks. Like we literally have published on the internet-- and there's a citation in the slides if you're curious-- lists of these most popular passwords in the world.
So what's a smart adversary going to do when trying to get into your account? They're not necessarily going to try all possible passwords or try your birthday or things like that. They're just going to start with this top 10 list, this top 100 list. And odds are, statistically, in a room this big, they're probably going to get into at least one person's account.
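The dictionary attack described here can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: `dictionary_attack` and `make_check` are hypothetical names, and `make_check` stands in for whatever login form or device the adversary is guessing against. The password list is the top 10 read out above.

```python
# The ten passwords read out in the lecture, most common first.
TOP_PASSWORDS = [
    "123456", "admin", "12345678", "123456789", "1234",
    "12345", "password", "123", "Aa123456", "1234567890",
]


def dictionary_attack(check, candidates=TOP_PASSWORDS):
    """Try each candidate in order; return (password, attempts), or (None, attempts)."""
    attempts = 0
    for candidate in candidates:
        attempts += 1
        if check(candidate):
            return candidate, attempts
    return None, attempts


def make_check(secret):
    """Hypothetical stand-in for a login form that accepts exactly one password."""
    return lambda guess: guess == secret


print(dictionary_attack(make_check("password")))  # ('password', 7)
print(dictionary_attack(make_check("xK#9!vQ2")))  # (None, 10)
```

The point of the sketch is the ordering: the adversary spends the first guesses on the passwords most people actually choose, so a common password falls within a handful of attempts.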
But let's consider maybe a little more academically what we can do about this. And let's start with something simple like the simplest, the most omnipresent device we might all have now is some kind of mobile device like a phone. Generally speaking, Apple and Google and others are requiring of us that we at least have a passcode or at least you're prompted to set it up even if you therefore opt out of it.
But most of us probably have a passcode, be it numeric or alphabetic or something else. So what might we take away from that? Well, suppose that you do the bare minimum. And the default for years has generally been having at least four digits in your passcode. Well, what does that mean? Well, how secure is that? How quickly might it be hacked? And, in fact, Carter, would you mind joining me up here? Perhaps we can actually decide together how best to proceed here.
If you want to flip over to your other screen there, we're going to ask everyone to go to-- I'll pull it up here-- this URL here if you haven't already. And this is going to pull up a polling website that's going to allow you in a moment to answer some multiple choice questions. This is the same URL as earlier if you already logged in. And in just a moment, we're going to ask you a question. And I think, can we show the question before we do this? Here's the first question from Carter here.
How long might it take to crack-- that is, figure out-- a four-digit passcode on someone's phone, for instance? How long might it take to crack a four-digit passcode? Why don't we go ahead and flip over to see who is typing in what. And we'll see what the scores are already. All right, and it looks like most of you think a few seconds. Some of you think a few minutes, a few hours, a few days. So I'd say most of you are about to be very unpleasantly surprised. In fact, the winner here is indeed going to be a few seconds, but perhaps even faster than that.
So, in fact, let me go ahead and do this. Thank you to Carter. Let me flip over and let me introduce you to, unfortunately, what's a very real world problem known as a brute force attack. As the word kind of conjures, if you think to-- back to yesteryear when there was some kind of battering ram trying to brute force their way into a castle door, it just meant trying to hammer the heck out of a system. A castle, in that case, to get into the destination. Digitally though, this might mean being a little more clever.
We all know how to write code in a bunch of different languages now. You could maybe open up a text editor, write a Python program to try all possible four-digit codes from 0000 to 9999 in order to figure out exactly, how long does it actually take? So let's first consider this. Let me ask the next question. How many four-digit passcodes are there? Carter, if you wouldn't mind joining me and maybe just staying up with me here to run our second question at this same URL. How many four-digit passcodes are there in the world? On your phone or laptop, you should now see the second question. And the answers include 4, 40, 9,999, 10,000, or it's OK to be unsure.
Let's go ahead and flip over to the results. And it looks like most of you think 10,000. And, indeed, that is the case.
Because if I kind of led you with 0000 to 9999, that's 10,000 possibilities. So that is, in fact, a lot. But most of you thought it'd take maybe a few seconds to actually brute force your way into that.
Let's consider how we might measure how long that actually takes. So thank you. So in the world of a four-digit passcode-- and they are, indeed, digits, decimal digits from 0 to 9-- another way to think about it is there's 10 possibilities for the first digit, 10 for the next, 10 for the third, and 10 for the fourth.
So that really gives us 10 times itself four times or 10,000 in total. But how long does that actually take? Well, let me go ahead and do this. I'm going to go ahead and open up on my Mac here, not even-- not even Codespaces or cs50.dev today. I'm going to open up VS Code itself. So before class, I went ahead and installed VS Code on my own Mac here. It looks almost the same as Codespaces, though the windows might look a little different and the menus as well.
And I've gone ahead here and begun a file called crack.py. To crack something means to break into it, to figure out in this case what the passcode actually is. Well, how might I write some code to try all 10,000 possible passcodes? And, heck, even though this isn't quite going to be like hacking into my actual phone, I bet I could find a USB or a lightning cable, connect the two devices, and maybe send all of these passcodes to my device trying to brute force my way in. And that's indeed how a hacker might go about doing this if the manufacturer doesn't protect against that.
So here's some code. Let me go ahead and do this. From string, import digits. This isn't strictly necessary. But in Python, there is a string library from which you can get all of the decimal digits just so I don't have to manually type out 0 through 9. But that's just a minor optimization.
But there's another library called itertools, tools related to iteration, doing things in like a looping fashion, where I can import a cross product function, a function that's going to allow me to combine like all numbers with all numbers again and again and again for the length of the passcode. Now I can do a simple Python for loop like this. For each passcode in the cross product of those 10 digits repeated four times.
In other words, this is just a programmatic Pythonic way to implement the idea of combining all 10 digits with itself four times in a loop in this fashion. And just so we can visualize this, let's just go ahead and print out the passcode. But if I did have a lightning cable or a USB cable, I wouldn't print it. I would maybe send it through the cable to the device to try to get through the passcode screen. So we can revisit now the question of how long might it take to get into this device.
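The transcript does not show the file itself, so the following is a reconstruction of crack.py from the spoken description, assuming the "cross product function" is `itertools.product`. It prints each candidate rather than sending it to a device.

```python
from itertools import product
from string import digits

# product(digits, repeat=4) yields tuples such as ('0', '0', '0', '0'),
# in order, all the way up to ('9', '9', '9', '9').
for passcode in product(digits, repeat=4):
    print(passcode)
```

Each candidate comes out as a tuple of four one-character strings; `"".join(passcode)` would turn it into a string like `0000` if that were needed.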
Well, let's just try this. Python of crack.py. And assume, again, it's connected via cable.
So we'll see how long this program takes to run and break into this here phone. Done. So that's all it took for 10,000 iterations. And this is on a Mac that's not even the fastest one out there. You could imagine doing this even faster. So that's actually not necessarily all the best for our security.
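To put a number on "done", one option is to time the loop directly. This is a sketch, not something run in the lecture: `count_passcodes` is a hypothetical helper, printing is left out so that only the enumeration is measured, and the elapsed time will vary from machine to machine.

```python
from itertools import product
from string import digits
from time import perf_counter


def count_passcodes(alphabet, length):
    """Walk every passcode of the given length and return how many there were."""
    total = 0
    for _ in product(alphabet, repeat=length):
        total += 1
    return total


start = perf_counter()
total = count_passcodes(digits, 4)
elapsed = perf_counter() - start
print(f"{total} passcodes enumerated in {elapsed:.4f} seconds")
```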
So what could we do instead of 10 digits? Well, most of you have probably upgraded a lot of your passwords to maybe being alphabetical instead. So what if I instead were to ask the question-- and Carter, if you want to rejoin me here in a second-- what if I instead were to consider maybe four-letter passcodes? So now we have A through Z four times. And maybe we'll throw into the mix uppercase and-- well, let's just keep it four letters. Let's just go ahead and do maybe uppercase and lowercase, so 52 possibilities. This is going to give us 52 times 52 times 52 times 52. And anyone want to ballpark the math here, how many possible four-letter passcodes are there, roughly? 7 million, yeah, so roughly 7 million, which is way bigger than 10,000.
So, oh, I spoiled this, didn't I? Can you flip over? So how many four-letter passcodes are there? It seems that most of you, 93% of you, in fact, got the answer right. Those of you who are changing your answer-- there we go, no, definitely not that. So, anyhow, I screwed up. Order of operations matters in computing and, indeed, including lectures.
So 7 million, so the segue I wanted to make is, OK, how long does that actually take to implement in code? Well, let me just tweak our code here a little bit. Let me go ahead and go back into the VS Code on my Mac in which I had the same code as before. So let me shrink my terminal window, go back to the code from which I began.
And let's just actually make a simple change. Let me go ahead and simply change digits to something called ascii_letters. And this too is just a time saving technique. So I don't have to type out A through Z and uppercase and lowercase like 52 total times. And so I'm going to change digits to ascii_letters.
And we'll get a quantitative sense of how long this takes. So Python of crack.py, here's how long it takes to go through 7 million possibilities. All right, clearly slower because we haven't seen the end of the list yet. And you can see we're going through all of the lowercase letters here. We're about to hit Z. But now we're going through the uppercase letters.
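The change amounts to swapping one constant from Python's `string` module. A minimal sketch follows, assuming the same `itertools.product` loop as before; the `passcodes` generator is a hypothetical wrapper, and only the first few candidates are printed so the example finishes immediately instead of printing all 52 to the 4th lines.

```python
from itertools import islice, product
from string import ascii_letters


def passcodes(alphabet, length):
    """Lazily yield every passcode of the given length as a string."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)


# ascii_letters is 'abc...xyzABC...XYZ': 52 characters, lowercase first,
# so lowercase candidates appear before uppercase ones.
for passcode in islice(passcodes(ascii_letters, 4), 5):
    print(passcode)
```

Removing the `islice` call and looping over `passcodes(ascii_letters, 4)` directly would reproduce the full run described above.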
So it looks like the answer this time is going to be a few seconds, indeed. But definitely less than a minute would seem, at least on this particular computer. So odds are if I'm the adversary and I've plugged this phone into someone's device-- maybe I'm not here in a lecture, but in Starbucks or an airport or anywhere where I have physical opportunity to grab that device and plug a cable in-- it's not going to take long to hack into that device either. So what might be better than just digits and letters from the real world? So add in some punctuation, which like almost every website requires that we do. Well, if we want to add punctuation into the mix, if I can get this segue correct so that we can now ask Carter one last time, how many four-character passcodes are possible where a character is an uppercase or lowercase letter or a decimal digit or a punctuation symbol? If you go to your device now, you'll see-- if we want to flip over to the screen-- these possibilities. There's a million, maybe, a billion, a trillion, a quadrillion, or a quintillion when it comes to a-- oh, wrong question.
Wow, we're new here, OK. OK, we're going to escalate things here. How many eight-character passcodes are possible? We're going to make things more secure, even though I said four. We're now making it more secure to eight.
All right, you want to flip over to the chart? All right, so it looks like most of you are now erring on the side of quintillion or quadrillion. 1% of you still said million, even though there's definitely more than there were a moment ago. But that's OK. So quadrillion-- quintillion is still winning. And I think if we go and reveal this, the math you should be doing is 94 to the 4th power.
Because there's 26 plus 26 plus 10 plus some more digits, some punctuation digits in there as well. So it's actually, oh, this is the other example, isn't it? This is embarrassing. All right, we had a good run in the past nine weeks instead. All right, so if you were curious as to how many four-character passwords are possible, it's 78 million.
But that's not the question at hand. The question at hand was, how many eight character passcodes are there? And in this case, the math you would be doing is 94 to the 8th power, which is a really big number. And, in fact, it's this number here, which is roughly 6 quadrillion possibilities.
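The arithmetic here can be checked in a few lines. This is only a sketch: it assumes the 94 characters are exactly Python's `string.ascii_letters`, `string.digits`, and `string.punctuation` (52 + 10 + 32).

```python
from string import ascii_letters, digits, punctuation

# 26 lowercase + 26 uppercase + 10 digits + 32 punctuation symbols
alphabet = ascii_letters + digits + punctuation

four = len(alphabet) ** 4   # four-character passcodes
eight = len(alphabet) ** 8  # eight-character passcodes

if __name__ == "__main__":
    print(len(alphabet))
    print(f"{four:,}")
    print(f"{eight:,}")
```

The four-character count comes out to about 78 million and the eight-character count to roughly 6 quadrillion, in line with the figures quoted above.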
Now, I could go about actually doing this in code here. So let me actually, for a final flourish, let me open up VS Code one last time here. And in VS Code, I'm going to go ahead and shrink my terminal window, go back into the code, and I'm going to import not just ASCII letters, not just digits, but punctuation as well, which is going to give me like 32 punctuation symbols from a typical US English keyboard. And I'm going to go ahead and just concatenate them all together in one big list by using the plus operator in Python to plus in both digits and punctuation. And I'm going to change the 4 to an 8.
So this now, what, four actual lines of code, is all it takes for an adversary to whip up some code, find a cable as step two, and hack into a phone that even has eight-character passcodes. Let me enlarge my terminal window here, run for a final time Python of crack.py. And this I'll actually leave running for some time. Because you can get already sort of a palpable feel of how much slower it is-- because these characters clearly haven't moved-- how long it's going to take. We might actually do-- need to do a bit more math.
Because doing just four-digit passcodes was super fast. Doing four-letter passcodes was slower, but still under a minute. We'll see maybe in time how long this actually runs for. But this clearly seems to be better, at least for some definition of better.
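That "bit more math" might be sketched as follows. The guessing rate below is purely an assumption for illustration (one million guesses per second); the lecture does not state a measured rate, and the function name `seconds_to_exhaust` is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Assumed rate, not a measurement from lecture
GUESSES_PER_SECOND = 1_000_000


def seconds_to_exhaust(alphabet_size, length, guesses_per_second):
    """Worst-case time to try every passcode of the given length."""
    return alphabet_size ** length / guesses_per_second


four_digits = seconds_to_exhaust(10, 4, GUESSES_PER_SECOND)
four_letters = seconds_to_exhaust(52, 4, GUESSES_PER_SECOND)
eight_chars = seconds_to_exhaust(94, 8, GUESSES_PER_SECOND)
eight_chars_years = eight_chars / SECONDS_PER_YEAR

if __name__ == "__main__":
    print(f"4 digits:  {four_digits:.2f} seconds")
    print(f"4 letters: {four_letters:.2f} seconds")
    print(f"8 chars:   {eight_chars_years:.0f} years")
```

Under that assumed rate, the first two cases finish in seconds while the eight-character case stretches to well over a century, which is why the terminal appears not to move.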
But it should hopefully not be that easy to hack into a system. What does your own device probably do to defend against that brute force attack? Yeah. AUDIENCE: Gives you a limited number of tries. DAVID MALAN: Yeah, so it gives you a limited number of tries. So odds are, at least once in your life, you've somehow locked yourself out of a device, typically after typing your passcode more than 10 times or 10 attempts or maybe it's your sibling's or your roommate's phone that you realize this is a feature of iPhones and Android devices as well.
But here's a screenshot of what an iPhone might do if you do try to input the wrong passcode maybe 10 or so times. Notice that it's really telling you to try again in one minute. So this isn't fundamentally changing what the adversary can do. The adversary can absolutely use those same four lines of code with a cable and try to hack into your device.
But what has this just done? It's significantly increased the cost to the adversary, where the cost might be measured in sheer amount of time-- like seconds, minutes, hours, days, or beyond. Maybe it's increased the cost in the sense of risk. Why? Because if this were like a movie incarnation of this and the adversary has just plugged into the phone and is kind of creepily looking around until you come back, it's going to take way too long for them to safely get away with that, assuming your passcode is not 123456 and it's somewhere in the middle of that massive search space. So this just kind of fundamentally raises the bar to the adversary. And that's one of the biggest takeaways of cybersecurity in general.
It's completely naive to think in terms of absolute security or to even say a sentence like "my website is secure" or even "my home is physically secure." Why? Well, for a couple of reasons, like, one, an adversary with enough time, energy, motivation, or resources can surely get into most any system and can surely get into most any home. But the other thing to consider, unfortunately, is that if we're the good people in this story and the adversaries are the bad people, you and I rather have to be perfect. In the physical world, we have to lock every door, every window. Because if we mess up just one spot, the adversary can get in. And so there's sort of this imbalance.
The adversary just has to find the window that's ajar to get into your physical home. The adversary just needs to find one user who's got a really bad password to somehow get into that system. And so cybersecurity is hard. And so what we'll see today really are techniques that can let you create a gauntlet of defenses-- so not just one, but maybe two, maybe three. And even if the adversary gets in, another tenet of cybersecurity is at least, let's have mechanisms in place that detect the adversary, some kind of monitoring, automatic emails. You can increasingly see this already in the real world.
If you log into your Instagram account from a different city or state suddenly because maybe you're traveling, you will-- if you've opted into settings like these-- often get a notification or an email saying, hey, you seem to have logged in from Palo Alto rather than Cambridge. Is this, in fact, you? So even though we might not be able to keep the adversary out, let's at least minimize the window of opportunity or damage by letting humans like us know that something's been compromised. Of course, there is a downside here. And this is another theme of cybersecurity.
Every time you improve something, you've got to pay a price. There's going to be a tradeoff. And we've seen this with time and space and money and other such resources when it comes to designing systems already.
What's the downside of this mechanism? Why is this perhaps a bad thing or what's the downside to you, the good person in the story? Yeah. AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, if you've just forgotten your passcode, it's going to be more difficult for you to log in. Or maybe you just really need to get into your phone now and you don't really want to wait a minute. And if you, worse, if you keep trying, sometimes it'll change to two minutes, five minutes, one hour.
It'll increase exponentially. Why? Because Apple and Google figure that they don't necessarily know what the right cutoff is. Maybe it's 10, maybe it's fewer, maybe it's more. But at some point, it is much more likely that this is a hacker trying to get in than it is you forgetting your passcode. But in the corporate world, it can be even worse.
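A lockout schedule of this kind might be sketched as below. The numbers are assumptions for illustration only: 10 free attempts, a one-minute base delay, and pure doubling after that. Real devices use their own schedules, and the function name `lockout_seconds` is hypothetical.

```python
def lockout_seconds(failed_attempts, free_attempts=10, base_delay=60):
    """Return how long to make the user wait after this many failures.

    No delay until free_attempts failures have occurred; after that the
    delay starts at base_delay seconds and doubles with each further failure.
    """
    if failed_attempts < free_attempts:
        return 0
    return base_delay * 2 ** (failed_attempts - free_attempts)


if __name__ == "__main__":
    for attempts in (9, 10, 11, 12, 20):
        print(attempts, lockout_seconds(attempts))
```

The point of the doubling is that a brute-force loop can no longer try thousands of guesses per second; each additional guess costs the adversary more wall-clock time than the last.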
There's a feature that lets phones essentially self-destruct whereby rather than just making you wait a minute, it will wipe the device, more dramatically. The presumption being that, no, no, no, no, no, if this is a corporate phone, let's lock it down further so that if it is an adversary, the data is gone after 10 failed attempts. But there's other mechanisms as well. In addition to logging into phones via passcodes, there's also websites like Gmail, for instance. And it's very common, therefore, to log in to websites like these. And odds are, statistically, a lot of you are in the habit of reusing passwords.
Like, no, don't nod if you are. We have cameras everywhere. But maybe you're in the habit of reusing it.
Why? Because it's hard to remember really big long cryptic passwords. So mathematically, there's surely an advantage there. Why? Because it just makes it so much harder, more time-consuming, more risky for an adversary to get in. But the other tradeoff is like, my God, I just can't even remember most of my passwords as a result unless I reuse the one good password I thought of and memorized already or maybe I write it down on a post-it note on my monitor, as all too often happens in corporate workplaces.
Or maybe you're being clever and in your top right drawer, you've got a printout of all of your accounts. Well, if you do, like ha-ha, so do a lot of other people. Or maybe it's a little more secure than that, but there are sociological side effects of these technological policies that really until recent years were maybe underappreciated. The academics, the IT administrators were mandating policies that you and I as human users were not necessarily behaving properly in the face of. So nowadays, there are things called password managers. And a password manager is just a piece of software on Macs, on PCs, on phones that manage your passwords for you.
What this means specifically is when you go to a website for the very first time, you, the human, don't need to choose your password anymore. You instead click a button or use some keyboard shortcut. And the software generates a really long cryptic password for you that's not even eight characters.
It might be 16 or 32 characters, can be even bigger than that, but with lots of randomness. Definitely not going to be on that top 10 or that top 100 list. The software thereafter remembers that password for you and even your username, whether it's your email address or something else. And it saves it onto your Mac or your phone or your PC's disk or hard drive. The next time you visit that same website, what you can do is via menu or, better yet, a keyboard shortcut, log into the website without even remembering or even knowing your password.
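The generation step a password manager performs could look roughly like this. It is a sketch using Python's `secrets` module; the function name `generate_password`, the default length of 32, and the choice of alphabet are assumptions, not a description of any particular product.

```python
import secrets
from string import ascii_letters, digits, punctuation

# Same 94-character alphabet discussed earlier
ALPHABET = ascii_letters + digits + punctuation


def generate_password(length=32):
    """Return a random password drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_password())
    print(generate_password(16))
```

Using `secrets` rather than `random` matters here, since `random` is not designed for security-sensitive values.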
I mean, to this day, I'll tell you, I don't even know anymore 99% of my own passwords. Rather, I rely on software like this to do the heavy lifting for me. But there's an obvious downside here, which might be what if you're doing this? Yeah. AUDIENCE: [INAUDIBLE] DAVID MALAN: Right, so what if they find out the one password that's protecting this software? Because unstated by me up until now is that this password manager itself has a primary password that protects all of those other eggs in the one basket, so to speak. And my one primary password for my own password manager, it is really long and hard to guess.
And the odds that anyone's going to guess are just so low that I'm comfortable with that being the one really difficult thing that I've committed to my memory. But the problem is if someone does figure it out nonetheless somehow or, worse, I forget what it is. Now, I've not lost access to one account, but all of my accounts. Now, that might be too high of a price to pay. But, again, if you're in the habit of choosing easy passwords like being on that top 10 list, reusing passwords, it's probably a net positive to incur this single risk versus the many risks you're incurring across the board with all of these other sites. As for what you can use, increasingly our operating systems come with support for this, be it in the Apple world, Google, Microsoft world, or the like.
There's third party software you can pay for and download. But even then, I would beware. And I would ask friends whose opinion you trust or do some googling for reviews and the like. All too often in the software world have password managers been determined to be buggy themselves.
I mean, you've seen in weeks of CS50 how easy it is to introduce bugs. And even the best of programmers still introduce bugs to software. So you're also trusting that the companies making this password management software are really good at it. And that's not always the case. So beware there too.
But we'll also focus today on some of the fundamentals that these companies can be using to better protect your data as well. But there's another mechanism, which odds are you're in the habit of using. Two-factor authentication, like most of us probably have to use this for some of your accounts-- your Harvard account, your Yale account, maybe your bank accounts, or the like. So what is two-factor authentication in a nutshell? Yeah.
AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, you get a second factor that you have to provide to the website or application to prove that it's you like a text to your phone or maybe it's an actual application that gets push notifications or the like. Maybe in the corporate world, it's actually a tiny little device with a screen on it that's on your keychain or the like. Maybe it's actually a USB dongle that you have to plug into your work laptop. In short, it's some second factor. And by factor, I mean something technical.
It's not just a second password, which would be one factor. It's a second fundamentally different factor. So generally speaking, in the world of two-factor authentication, or 2FA (MFA, multi-factor authentication, is the generalization), you have not just a password, which is something you know; the second factor is usually something you have-- whether it's your phone or that application or the keychain. It might also be biometrics like your fingerprints, your retinas, or something else physically about you. But it's something that significantly decreases the probability that some adversary is going to get into that account. Why? Because right now, if you've only got a username and password, your adversaries are literally every human in the world with an internet connection, arguably.
But as soon as you introduce 2FA, now it's only people on campus or, more narrowly, only the people in Starbucks at that moment who might physically have access to your person and your second factor, in this case. More technically, what those technologies do is they send you a one-time passcode, which is further secure because once it's used, there's hopefully some database that remembers that it has been used and cannot be used again. So an adversary can't like sniff the airwaves and replay that passcode the next time. They, indeed, expire, which adds some additional defense.
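The two properties just described, single use and expiry, can be sketched in a few lines. Everything here is hypothetical: the class name `OneTimeCodes`, the 30-second lifetime, and the in-memory dictionary standing in for the server's database. "Remembering that a code was used" is modeled simply by deleting it on the first attempt.

```python
import secrets
import time


class OneTimeCodes:
    """Issue six-digit codes that expire and can be checked only once."""

    def __init__(self, lifetime_seconds=30):
        self.lifetime = lifetime_seconds
        self.pending = {}  # username -> (code, expiry timestamp)

    def issue(self, user, now=None):
        now = time.time() if now is None else now
        code = f"{secrets.randbelow(10**6):06d}"
        self.pending[user] = (code, now + self.lifetime)
        return code

    def verify(self, user, code, now=None):
        now = time.time() if now is None else now
        # Remove the code on any attempt, so it can never be replayed
        entry = self.pending.pop(user, None)
        if entry is None:
            return False
        expected, expires = entry
        return now <= expires and secrets.compare_digest(code, expected)
```

A short usage example: `otp = OneTimeCodes()`, then `code = otp.issue("alice")`; the first `otp.verify("alice", code)` succeeds if it arrives in time, and any repeat of the same code fails.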
And you might type it into a phone or maybe a web app that looks a little something like this. So passwords thus far, some defenses, therefore, any questions on this here mechanism? No? All right, well, let's consider this. Odds are, with some frequency, you forget these passwords, especially if you're not using a password manager. And so you go to Gmail and you actually have to click a link like this, Forgot Password. And then it typically emails you to initiate a process of resetting that password. But if you can recall, has anyone ever clicked a link like that and then got an email with your password in the email? Maybe if you ever see this in the wild, that is to say in the real world, that is horrible, horrible design.
Why? Because well-designed websites, not unlike CS50 Finance, which had a users table, should not be storing username-- rather, should not be storing passwords in the clear, as it actually is. It should somehow be obfuscated so that even if your database from CS50 Finance or Google's database is hacked and compromised and sold on the web, it should not be as simple as doing like select star from Account semicolon to see what your actual passwords are. And the mechanism that well-designed websites use is actually a primitive back from like week 5 when we talked about hashing and hash tables. This time, we're using it for slightly different purposes. So in the world of passwords, on the server side, there's often a database or maybe, more simply, a text file somewhere on the server that just associates usernames with passwords. So to keep things simple, if there's at least two users like Alice and Bob, Alice's password is maybe apple.
Bob's password is maybe banana, just to keep the mnemonics kind of simple. If though that were the case on the server and that server is compromised, whoever the hacker is now has access to every username and every password, which in and of itself might not be a huge deal because maybe the server administrators can just disable all of the accounts, make everyone change their password, and move on. But there's also this attack known as password stuffing (more commonly called credential stuffing), which is a weirdly technical term, which means when you compromise one database, you know what? Take advantage of the naivety of a lot of us users. Try the compromised apple password, the banana password not on the compromised website, but other websites that you and I might have access to, the presumption being that some of us in this room are using the same passwords in multiple places. So it's bad if your password is compromised on one server because, by transitivity, so can all of your other accounts be compromised.
So in the world of hashing, this was the picture we drew some time ago, we can apply this same logic whereby, mathematically, a hash function is like some function F and the input is X and the output or the range is F of X. That was sort of the fancy way of describing mathematically hashing as a process weeks ago. But here, at a simpler level, the input to this process is going to be your actual password. The output is going to be a hash value, which in week 5 was something simple generally like a number-- 1 or 2 or 3 based on the first letter. That's not going to be quite as naive an approach as we take in the password world.
It's going to look a little more cryptic. So apple weeks ago might have just been 1, banana might have been 2. But now let me propose that in the world of real world system design, what the database people should actually store is not apple, but rather this cryptic value. And you can think of this as sort of random, but it's not random. Because it is the result of an algorithm, some mathematical function that someone implemented and smart people evaluated and said, yes, this seems to be secure, secure in the sense that this hash function is meant to be one way.
So this is not encryption, a la Caesar Cipher from weeks ago whereby you could just add 1 to encrypt and subtract 1 to decrypt. This is one way in the sense that given this value, it should be pretty much impossible mathematically to reverse the process and figure out that the user's password was originally apple. Meanwhile banana, back in week 5 for simplicity, for hashing into a table, we might have had a simple output of 2, since B is the second letter of the English alphabet. But now the hash value of banana, thanks to a fancier mathematical function, is actually going to be something more cryptic like this.
And so what the server really does is store not apple and banana, but rather those two seemingly cryptic values. And then when the human, be it Alice or Bob, logs in to a web form with their actual username and password, like Alice, apple, Bob, banana, the website no longer even knows that Alice's password is apple and that Bob's is banana. But that's OK. Because so long as the server uses the same code as it was using when these folks registered for accounts, Alice can type in apple, hit Enter, send it via HTTP to the server. The server can run that same hash function on A-P-P-L-E. And if the value matches, it can conclude with high probability, yes, this is in fact, the original Alice or this, in fact, is the original Bob.
So the server never saves the password, but it does use the same hash function to compare those same hash values again and again whenever these folks log in again and again. So, in reality, here's a simple one-way hash for both Alice's and Bob's passwords in the real world. It's even longer, this is to say, than what I used as shorter examples a moment ago.
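The register-then-compare flow might be sketched like this. SHA-256 from `hashlib` is used only as a stand-in, since the lecture does not name the hash function; real systems generally prefer deliberately slow, purpose-built password hashing functions. The `users` dictionary and the function names are hypothetical.

```python
import hashlib
import hmac


def hash_password(password):
    """One-way hash of a password (SHA-256 as an illustrative stand-in)."""
    return hashlib.sha256(password.encode()).hexdigest()


# What the server stores at registration: usernames and hashes only
users = {
    "alice": hash_password("apple"),
    "bob": hash_password("banana"),
}


def check_login(username, password):
    """Hash the submitted password and compare it to the stored hash."""
    stored = users.get(username)
    if stored is None:
        return False
    return hmac.compare_digest(stored, hash_password(password))


if __name__ == "__main__":
    print(users["alice"])
    print(check_login("alice", "apple"))
    print(check_login("alice", "banana"))
```

Note that nothing in `users` contains apple or banana; the server can still confirm a login because the same input always produces the same hash.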
But there is a corner case here. Suppose that an adversary is smart and has some free time and isn't necessarily interested in getting into someone's account right now, but wants to do a bit of prework to decrease the future cost of getting into someone's account. There is a technical term known as a rainbow table, which is essentially like a dictionary in the Python sense or the SQL sense, whereby in advance an adversary could just try hashing all of the fruits of the world or, really, all of the English words of the world or, rather, all possible four-digit, four-character, eight-character passcodes in advance and just store them in two columns-- the password, like 0000 or apple or banana, and then just store in advance the hash values.
So the adversary could effectively reverse engineer the hash by just looking at a hash, comparing it against its massive database of hashes, and figuring out what password originally corresponded to that. Why then is this still relatively safe? Rainbow tables are concerning. But they don't defeat passwords altogether. Why might that be? Yeah. AUDIENCE: [INAUDIBLE] DAVID MALAN: OK, so the adversary might not know exactly what hash function the company is using. Generally speaking, you would not want to necessarily keep that private.
That would be considered security through obscurity. And all it takes is like one bad actor to tell the adversary what hash function is being used. And then that would put your security more at risk.
So generally in the security world, openness when it comes to the algorithms and processes is generally considered best practice. And the reality is, there's a few popular hash functions out there that any company should be using. And so it's not really keeping a secret anyway.
But other thoughts? Why is this rainbow table not such a concern? AUDIENCE: It takes a lot longer for the [INAUDIBLE]. DAVID MALAN: It takes a lot longer for the adversary to access that information because this table could get long. And even more along those lines-- anyone want to push a little harder? This doesn't necessarily put all of our passwords at risk. It easily puts our four-digit passcodes at risk.
Why? Because this table, this dictionary would have, what, 10,000 rows? And we've seen that you can search that kind of like that or even regenerate all of the possible values. But once you get to eight-character passcodes, I said it was 6 quadrillion possibilities. That's a crazy big dictionary in Python or crazy big list of some sort in Python. That's just way more RAM or memory than a typical adversary is going to have. Now, maybe if it's a particularly resourced adversary like a government, a state more generally, maybe they do have supercomputers that can fit that much information.
But, fine, then use a 16-character passcode and make it an unpronounceable long search space that's way bigger than 6 quadrillion. So it's a threat, but only if you're on that horrible top 10 list or top 100 or short passcode list that we've discussed thus far. So here's though a related threat that's just worth knowing about.
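The precomputed table described above can be sketched for the four-digit case, where it is tiny. This is the plain two-column lookup from the discussion (strictly speaking, a true rainbow table is a more compact structure that trades lookup time for space). SHA-256 is again only a stand-in for whatever hash function a site actually uses.

```python
import hashlib
from itertools import product
from string import digits


def hash_password(password):
    return hashlib.sha256(password.encode()).hexdigest()


# Precompute hash -> passcode for all 10,000 four-digit passcodes
table = {}
for combo in product(digits, repeat=4):
    passcode = "".join(combo)
    table[hash_password(passcode)] = passcode

# Later, a hash leaked from a compromised database is just a lookup away
leaked = hash_password("1234")
recovered = table.get(leaked)

if __name__ == "__main__":
    print(len(table))
    print(recovered)
```

The same dictionary for every eight-character passcode would need on the order of quadrillions of entries, which is where the memory argument comes from.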
What's problematic here? If we introduce two more users, Carol and Charlie, and just for the semantics of it, whose password happened to be cherry. What if they both happened to have the same password and this database is compromised? Some hacker gets in. And just to be clear, we wouldn't be storing apple, banana, cherry, cherry. We'd still be storing, according to this story, these hashes. But why is this still concerning? AUDIENCE: [INAUDIBLE] DAVID MALAN: Exactly. If you figure out just one of them, now you've got the other.
And this is, in some sense, just leaking information, right? I don't know, maybe at a glance, what I could do with this information. But if Carol and Charlie have the same password, you know what? I bet they have the same password on other systems as well. You're leaking information that just does no good for anyone.
So how can we avoid that? Well, we probably don't want to force Carol or Charlie to change their password, especially when they're registering. You definitely don't want to say, sorry, someone's already using that password, you can't use it as well. Because that too would leak information. But there's this technique in computing known as salting whereby we can do this instead.
If cherry in this scheme hashes to a value like this, you know what? Let's go ahead and sprinkle a little bit of salt into the process. And it's sort of a metaphorical salt whereby this hash function now takes two inputs, not just the password, but some other value known as a salt. And the salt can be generally something super short like two characters even, or something longer.
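A minimal sketch of salting, under several assumptions: the salt is simply prepended to the password before hashing, each user gets a fresh random salt, the salt is stored alongside the hash, and SHA-256 stands in for the real hash function. The names `register` and `check` are hypothetical.

```python
import hashlib
import secrets


def hash_password(password, salt=""):
    """Hash the salt and password together (simple concatenation)."""
    return hashlib.sha256((salt + password).encode()).hexdigest()


def register(password):
    """Return the (salt, hash) pair a server might store for a new user."""
    salt = secrets.token_hex(8)  # random per-user salt
    return salt, hash_password(password, salt)


def check(password, record):
    """Re-hash the submitted password with the stored salt and compare."""
    salt, stored = record
    return secrets.compare_digest(stored, hash_password(password, salt))


# Without a salt, identical passwords give identical hashes
unsalted_carol = hash_password("cherry")
unsalted_charlie = hash_password("cherry")

# With per-user salts, the stored hashes differ
carol = register("cherry")
charlie = register("cherry")

if __name__ == "__main__":
    print(unsalted_carol == unsalted_charlie)
    print(carol[1] == charlie[1])
    print(check("cherry", carol), check("cherry", charlie))
```

Both users can still log in with cherry, but someone reading the database can no longer tell that they chose the same password.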
And the idea is that this salt, much like a recipe, should sort of perturb the output a little bit, make it taste a little bit d
2024-01-04 20:46