Scaling Security Threat Detection with Apache Spark and Databricks
Hello and welcome. I hope everyone's having a good Spark Summit so far. I'm Josh Gildner with Apple's Detection Engineering team. In this session we'll be taking a look at threat detection using Spark and Databricks, with a specific focus on some of the neat tricks we have developed to overcome the challenges of scale. So let's start with who my team is and what we do.

At a high level, Apple's Detection Engineering team consumes the telemetry that's emitted by systems across our corporate infrastructure: everything from host-level log data to network sockets to application events. Our job is to sift through the data for patterns which could indicate malicious activity, like malware or hackers on the network. It can be pretty challenging because we're looking for people who are trying to hide their behavior and blend in with normal usage in the data, but that's what makes it fun.

To do this we use many different technologies, some off-the-shelf and some custom developed. These are the ones that we'll be focusing in on a little bit more in this talk.

Before we dive into details, it's important to level set on some of the terminology we'll use. We'll start by looking at what we call a detection, the most basic unit of code and the building block of our monitoring program. A detection is a piece of business logic that inputs some data, applies an arbitrary number of transformations, pattern matching, or statistics, and then spits out a security alert. We encapsulate these as Scala classes, and generally there's a one class to one notebook to one Databricks job relationship. Each of these can do pretty vastly different things, but they all generally conform to the same paradigm with the same components.

After one of these jobs has found something potentially malicious, the alert is fed into an orchestration system. We have a team of security analysts continuously
triaging the alerts to figure out if they're a real security issue or just something that looks suspicious but ultimately poses no risk. After the analyst has finished analyzing, the orchestration system has hooks into internal systems that can be used to respond to and contain a security issue, things like locking accounts or kicking machines off the network. In certain cases the orchestration system may already have enough information to contain an issue without a human being involved, and we'll take a look at that a bit more later.

Now that we have an understanding of the general flow of those detections, let's take a look at some of the problems that came up as we continued writing more and more of these detection jobs.

The first was an unjustifiable degree of overhead in deploying new detections. Most of the code for basic I/O and common transformations was exactly the same, just reimplemented across many notebooks. And each of these jobs needed a corresponding test, with people manually preserving sample files and writing ScalaTests. This meant that we were unsustainably coming up with cool new ideas faster than we could turn them out.

Other problems started to surface only after we had already deployed a bunch of these jobs. Changing the behavior of all or a subset of the detections required a massive refactor across all these different notebooks, even if the feature you wanted was comparatively minor. Each job needs some degree of ongoing maintenance and performance tuning, which can get pretty tedious with all these disparate configs and logic.

There are some common patterns people tend to use in these detections, things like enrichments or statistical baselines that compare the novelty of current activity against historical data. What we saw was that without any primitives at their disposal, everyone had their own special way of solving the same problems, and when
we needed to update or fix them, we would have to figure out each implementation.

To really take our detections to the next level, we needed to formalize how they're written and make better use of our limited human capacity. So we started a journey to find a solution to these. We talked with a lot of industry partners and looked at some third-party products, but we were really unsatisfied because we couldn't find anything that suited our needs. So we spent a few months and arrived at a major breakthrough: an entirely new SDK for security detection called Detection Kit. This framework dramatically reduces the complexity of writing new detections and helps us react faster by improving the alert quality and automating the investigations. There are way too many features to dive into here, but I'll highlight a few of them.

Let's start with inputs, since they're the first step in any detection. The datasets loaded by these jobs need to change situationally.
In production they'll want a Hive table name, and in the context of a functional test they'll want the path of a preserved sample file in DBFS. This means that the options of spark.read have to be externalized and passed in from outside the class through config. But we took it a step further, building a switch between spark.read and spark.readStream with an externally provided partition filter. It allows us to change from a streaming to a batch detection on demand without changing the class itself.

In most cases a detection acts like a bucket containing different pieces of logic which all look for different aspects of the same general high-level attack. In terms of the code structure, we like to say that there are multiple alerts in a detection, where each alert is described as a Spark DataFrame resulting from some transform on the input. The detection class creates and exposes a key-value map of all the alert names to their respective DataFrames, which can be consumed either by other detections or by the post-processing methods that are applied before an event is sent for the analyst to review.

Much like we externalize control of detection input, emitters do the same thing but for output sinks. In production we send alerts to AWS Kinesis, but other situations may call for writing to disk or an in-memory table. An emitter wraps all of these output parameters, like a target Kinesis queue, file paths, or stream options. Much like inputs, emitters are also provided in config, so you can easily change the output behavior of the detection without modifying the class.

There are a lot of small details required by these jobs that are pretty annoying to provide, like the streaming checkpoint paths or the scheduler pools. We found that most of them can be inferred by convention using parameters that are already in the config object, with
optional overrides only if people want to specify them manually. This minimizes the number of required fields and simplifies our config structure.

A detection can contain multiple related but independent alerts, and most of the time you'd want them to be configured in the same way, things like using the same emitter or inferring the parameters by the same convention. By default they'll inherit the configuration of the parent detection, but they remain individually configurable should you want granular control of them.

We can think of a detection as a sequence of DataFrame transformations.
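As a rough picture of that idea, here is a minimal sketch in plain Python. It is not the team's Scala code: lists of dictionaries stand in for Spark DataFrames, and every function name and field below is a hypothetical chosen for illustration.

```python
from functools import reduce

# Each transform takes a list of rows and returns a new list of rows.
# Plain dicts stand in for DataFrame rows; all names here are illustrative.

def only_partition(day):
    """Build a filter that limits how much input is processed."""
    def transform(rows):
        return [r for r in rows if r["day"] == day]
    return transform

def shells_spawned_by_java(rows):
    """Example core logic: keep shell processes whose parent is java."""
    return [r for r in rows if r["parent"] == "java" and r["process"] in {"sh", "bash"}]

def tag_severity(rows):
    """Example enrichment applied to the suspicious rows before output."""
    return [{**r, "severity": "high"} for r in rows]

def run_detection(rows, transforms):
    """Apply the transforms in order, feeding each output into the next."""
    return reduce(lambda data, transform: transform(data), transforms, rows)
```

Because the stages are just an ordered list of functions, swapping the filter or adding another enrichment changes the list that is passed in, not the detection logic itself.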
The first transform in the sequence, or what we like to call the preprocessor, is a method that's supplied in config to be applied before the core detection logic. This is an ideal place to inject certain types of operations, like partition filters that will determine how much of the input data will be processed, or the pushdown predicate on a stream to tell it where to start.

After all the data has been processed and we've yielded a set of suspicious events, there may be some additional enrichments to apply right before they're emitted, many of which are common amongst multiple jobs. To make this easier, we've broken out some common transformations into reusable modules, which are defined in a mutable Scala list and applied sequentially at the end of the detection.

Finally, there may be certain types of operations which are not natively supported by the Structured Streaming API. For those we can specify a transform function to be applied within a foreachBatch, and we'll talk about a specific example of this in more detail later.

Detections are inherently imperfect. They're designed to look for anomalies in datasets that are sometimes chock-full of these weird little edge cases and idiosyncrasies that can change over time and are really difficult to predict, even if you have all the historical data in the world. Actual humans review these events, so the signal-to-noise ratio is really important, and we continuously tweak the detections using these analyst suggestions. But doing them one by one takes a lot of time, and while they're being worked on, we don't have any cycles to write new detections.

When analysts triage alerts, they label them as either a false positive, if it found something that wasn't a real security issue, or a true positive, if it's actually malicious. This feedback is memorialized in a Delta table, where we've incorporated it back into the detection pipeline. Rather
than us manually tweaking each of these detections, the system learns from the analysts' consensus and automatically adjusts the future behavior. In this way the bulk of this day-to-day tuning is self-service, and we saw about a 70% reduction in the total alert volume when we deployed this, which was pretty nice.

In some cases that are difficult for that automation to figure out, we may still have to do some degree of manual tuning. Each round of this tuning adds an exclusion to the detection logic until inevitably it looks like the top right-hand side there. You end up with these really ugly addendums that are almost as big as the detection logic itself: not this thing, not that thing, over and over and over again. To solve this we built a new mechanism which breaks out a list of SQL expressions that are applied against all of our output in the foreachBatch transform at the very end. It allows us to be as complex as we want in the exclusion logic while still keeping the detection classes pretty clean. And because they're integrated into our functional tests, we can ensure that overly selective or malformed SQL expressions are caught before they impact our production jobs.

Although the events matching these expressions are excluded from what analysts see, we don't just want to drop them on the floor. All of the excluded events are written into their own table, so we can continuously monitor the number of records that are excluded by each expression individually.

Let's take a look at what happens during alert triage. Typically there isn't enough information inside the alert on its own to render a verdict, and they want to run some ad hoc queries in a notebook to gather some substantiating data. They'll need to take some parameters from the alert payload, create a notebook and fill it out, and then sit around and wait for these queries to finish. But the information they're looking for doesn't
really change all that much from investigation to investigation: things like what happened immediately before and after the alert, and what does this machine or this account typically do.
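To make those two recurring questions concrete, here is a small sketch in plain Python over an in-memory list of events. In practice these would be queries against tables; the field names (`host`, `time`, `process`) and the five-minute default window are assumptions made up for this example.

```python
from collections import Counter
from datetime import datetime, timedelta

def events_around(events, host, alert_time, window=timedelta(minutes=5)):
    """Return events on `host` within `window` before or after `alert_time`, oldest first."""
    nearby = [
        e for e in events
        if e["host"] == host and abs(e["time"] - alert_time) <= window
    ]
    return sorted(nearby, key=lambda e: e["time"])

def typical_activity(events, host, top_n=3):
    """Return the most frequently seen processes on `host` as (process, count) pairs."""
    counts = Counter(e["process"] for e in events if e["host"] == host)
    return counts.most_common(top_n)
```

Both functions take only values that are already present in an alert payload (a host and a time), which is what makes this kind of query a natural candidate for templating.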
To help with this we've used the Databricks Workspace API to templatize and automate the execution of investigation notebooks. Each detection has a corresponding template notebook in a specific directory, which the orchestration system clones and populates with information from the alert. And because many of the queries are the same between similar categories of detection, we can modularize the templates and reuse components with the %run magic command in notebooks.

This functionality arms our analysts with the information they need to make faster decisions, and it also abstracts the complexity of so many things that people wouldn't usually do manually. On all the detections surrounding a suspicious process execution, some custom D3 will render an interactive process tree that'll let you trace the PID lineage and its relationships. In this example there was an interactive reverse shell following exploitation of a Java web server process, something that would have been pretty difficult to figure out if you were just looking at records in a table.

We've also taken some queries that would be executed in a template notebook and have the orchestration system run them over ODBC. Rather than a human interpreting the results in a notebook, the machine can evaluate the current activity that's happening against the historical baseline it collects and render a verdict. If it's confident enough in the verdict, it can automatically contain the issue without a human even being involved.

So you'd end up with these detections that test well but don't necessarily work in the real world. And since fully structured binary formats like Delta don't play all that well with revision control, we need to use semi-structured formats like JSON and worry about codifying the schemas into our test suite.

So we wrote a test generation library that wraps ScalaTest, and we can run them right inside notebooks. Using
only the information in the config object, it'll create an assertion for each alert and infer the paths of the hits and samples by convention. So the only thing you need to worry about as the author of a test is writing your samples and hits to the right place. With this we've seen a dramatic 85 percent reduction in the amount of code it takes to write a basic test for a detection.

But we took it a step further and integrated the notebook-based tests into our CI. It'll clone notebooks from the PR into a specific directory and execute them on a real cluster that returns a JSON object summarizing the test results via dbutils. Our CI system gathers these results and will pass or fail the build based on what happened remotely in the Databricks instance. It also frees us up to write detections that do things which wouldn't normally work in an IDE, like %run imports of other notebook components.

Once these tests pass and the production notebook lands in Databricks, we still had to manually create and configure jobs that would execute the notebook tasks. Each one might need a specific cron expression or want to run on a certain cluster, and all these details were typed in and maintained by hand. At a certain point we had way too many of these detection jobs to continue managing them with the Databricks UI, and no real way of making bulk changes like moving a bunch of jobs between clusters.

But Databricks recently announced a really nifty beta feature called Stacks. It's a dream come true for anyone who's looking to build a job CI system, because it removes most of the complexity from maintaining inventory and state across the different Databricks APIs. You can package a job and all of the resources it needs, like notebooks or DBFS files, into a tiny little config stanza, and it gets deployed all at once as a package.

So
we built a fully featured job CI on top of the Stack CLI. All of our notebooks, files, and job parameters can now exist in Git, with CI doing the heavy lifting for job deployment. We lint the Stacks config objects and then pass them into the CLI, which creates or overwrites any of the resources as necessary. But for jobs specifically, there were some important finishing touches which aren't currently covered by Stacks. We augmented the CLI with a couple of helper scripts that'll kickstart newly created jobs or, if a job or its underlying resources have changed, restart them to accept the new config.
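The talk does not show how those helper scripts decide what to do, so the following is only one plausible way to sketch the decision in plain Python: fingerprint each job's config and compare it with what was recorded at the last deployment. The function names, the fingerprinting approach, and the action labels are all assumptions, and nothing here calls the Databricks CLI or APIs.

```python
import hashlib
import json

def fingerprint(job_config):
    """Hash a job config in a key-order-independent way."""
    canonical = json.dumps(job_config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def plan_actions(deployed, desired):
    """Decide what to do with each desired job.

    deployed: job name -> fingerprint recorded at the last deployment
    desired:  job name -> config about to be deployed
    """
    actions = {}
    for name, config in desired.items():
        if name not in deployed:
            actions[name] = "start"      # newly created job: kick it off
        elif deployed[name] != fingerprint(config):
            actions[name] = "restart"    # config changed: restart to pick it up
        else:
            actions[name] = "none"       # unchanged: leave it running
    return actions
```

A real version would also need to fold the underlying resources (notebooks, files) into the fingerprint, since the talk says a change to those should trigger a restart as well.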
Job CI was the last piece of the puzzle for us in automating every piece of the deployment, testing, and management of the detections from end to end.

So far we've been able to do some pretty neat things with the job CI that would have been a real pain to do with the UI, things like moving a bunch of jobs between clusters all at once, or having it restart all the jobs when we change a shared component. And it was an ideal place for us to enroll the newly created jobs into our metrics platform, so we get monitoring and alarms on these streams by default. The only caveat is that writing Stacks JSON is pretty tedious, so we wrote a CLI utility to generate them with a questionnaire.

For the last thing we'll discuss, I want to take a step back from the detections and focus on some of the problems we've had in the triage space.

If you look at the alert trends over time, you'll find that most of the events day-to-day aren't completely novel. It's very likely that something, maybe not the exact thing but something similar, already happened in the past and was already triaged, investigated, and closed out. Particularly for analyst teams that operate 24/7 across multiple shifts, it's really difficult for any one person to remember all of this historical context, and it results in this waste of resources, cyclically reinvestigating the same thing over and over again.

The solution to this problem is research. If people read through all of the various tickets, wikis, and investigation notebooks, we'd curtail a lot of the wasted cycles. But these datasets exist in many different places, each with their own search syntax and interface, and it's unreasonable to expect analysts to have either read every document ever written or do these searches during alert triage, when time is at a real premium.

Because the data exists in disparate places, we can't effectively mine it for insights. Over the course of many years, we
can handle multiple incidents that are seemingly unrelated but actually share some common features that are too obscure for any human to see. Mapping the relationships between alerts or incidents requires not just intimate familiarity with what's happened before, but also that the connection between them be so glaringly obvious that someone happens to notice it.

So we built a solution to these problems that we call Doc Search. To start, we centralized and normalized all of the incident-related data. The text payloads of email correspondence,
tickets, investigations, and wikis are all dumped into a Delta table, on top of which we built a document recommendation mechanism. Having a single place to search through all this knowledge was transformative on its own, but more importantly it provided us with the means to programmatically leverage all the things we've ever learned to better inform what we do in the future.

So what does this look like from an analyst's perspective? We have code that runs in all of the investigation template notebooks that'll take the alert payload and suggest potentially related documents in a pretty displayHTML table. It includes some pretty useful features, like the verdicts and analyst comments on the past alerts, a matching term list where the document and the alert intersect, and clickable links into the system that it originally appeared in. For email correspondence, and this is pretty neat, when you click the link it'll open up Mail.app on macOS to the specific thread, where you can do some further reading.

So let's take a bird's-eye view of how this works. We'll go into more detail about each one of these steps here in a little bit. The template notebook will tokenize and extract specific entity types from the alert payload. It runs those entities through an enrichment routine to ensure that every possible representation is covered before looking for occurrences in all these documents.

Since we're searching through unstructured blobs of text, and the Spark SQL like and contains operations get pretty expensive, we use a more optimal concurrent string search algorithm called Aho-Corasick that prevents performance degradation as the term count increases.

Depending on the input terms, there could be a ton of tangentially related documents we don't really care about. So the hits are run through a scoring algorithm to compute their relevance, and only the most useful results are going to be given to the analyst.

To
understand why we need to bother with any tokenization, let's take a look at the structure of a typical alert. They're a set of key-value pairs, and for the purposes of finding related documents, not all of them are created equal. Some of these strings, like dates or HTTP methods or ports, are going to be found across thousands of different documents, and almost none of them are going to be related to each other. There are also some in here that are too selective, like a timestamp that would only appear in this specific alert and no others. But there's a subset of tokens that describe different aspects of an entity, like the machine or accounts that are involved in the alert, and these are the ones that are valuable for correlation.
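As an illustration of pulling only the entity-like tokens out of an alert's key-value pairs, here is a minimal Python sketch. The patterns, the `corp.example.com` hostname suffix, and the function name are assumptions invented for this example, not the team's actual expressions.

```python
import re

# Illustrative patterns only; a real suite would cover more entity types
# and validate matches more strictly (for example, octet ranges in IPs).
ENTITY_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "mac": re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE),
    "hostname": re.compile(r"\b[\w-]+\.corp\.example\.com\b", re.IGNORECASE),
}

def extract_entities(alert):
    """Scan every value in the alert and collect matches per entity type.

    Values such as HTTP methods, ports, and timestamps match none of the
    patterns, so they never become search terms.
    """
    found = {kind: set() for kind in ENTITY_PATTERNS}
    for value in alert.values():
        text = str(value)
        for kind, pattern in ENTITY_PATTERNS.items():
            found[kind].update(match.lower() for match in pattern.findall(text))
    return found
```

Scanning values rather than relying on key names means an entity buried in a free-text field is still picked up, at the cost of needing patterns specific enough to avoid false matches.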
So we use a suite of regex expressions to extract the common entity types we care about. But that's not enough on its own. If you think about a physical machine, like someone's MacBook Pro, there are many different identifiers that could describe it. It has a serial number, some MAC addresses, a hostname, and potentially many different IP addresses over time if the system uses dynamic addressing like DHCP. Some documents might contain one or a couple of these, but never all of them all in one place. So you'll have documents all kind of referring to the same machine, but each one uses a different attribute to describe it, which makes tying them together pretty difficult. To address this, we run the extracted entities through multiple enrichments, making sure that we're searching for the superset of those identifiers, and we'll find all the related documents regardless of which piece they contain.

After entity extraction, enrichment, and searching on the documents, the results are fed into a suggestion algorithm that will compute a term-wise relevance score. For each matching entity, we'll look at the number of documents it appears in, over how long a period of time, and its distribution across the different types of documents. The terms are ordered by an average rank percentile of those features, such that the less common a term is across documents, the more valuable it'll be as an indicator of relevance between them. Documents containing this subset of valuable terms are going to be presented in order of how many they contain, with documents that have multiple hits going to the top of the list.

And with that, that's all the content I have, so we'll open the floor to Q&A.
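For reference, the term-scoring step described above can be sketched in plain Python as follows. The talk gives only the outline, so the details are assumptions: the percentile definition, treating a lower value as better for all three features, the `keep` cutoff, and the shape of the `hits` tuples are all invented for illustration.

```python
from collections import defaultdict

def rank_percentiles(values):
    """Map each term to the fraction of terms whose value is <= its own."""
    n = len(values)
    return {
        term: sum(1 for other in values.values() if other <= value) / n
        for term, value in values.items()
    }

def score_terms(hits):
    """Average rank percentile per term; lower means rarer and more valuable.

    hits: iterable of (term, doc_id, doc_type, day_number) tuples.
    """
    docs, types, days = defaultdict(set), defaultdict(set), defaultdict(set)
    for term, doc_id, doc_type, day in hits:
        docs[term].add(doc_id)
        types[term].add(doc_type)
        days[term].add(day)
    features = [
        {t: len(d) for t, d in docs.items()},          # number of documents
        {t: max(d) - min(d) for t, d in days.items()}, # time span of appearances
        {t: len(s) for t, s in types.items()},         # spread across document types
    ]
    percentiles = [rank_percentiles(f) for f in features]
    return {t: sum(p[t] for p in percentiles) / len(percentiles) for t in docs}

def rank_documents(hits, keep=2):
    """Order documents by how many of the `keep` most valuable terms they contain."""
    scores = score_terms(hits)
    valuable = set(sorted(scores, key=scores.get)[:keep])
    per_doc = defaultdict(set)
    for term, doc_id, _doc_type, _day in hits:
        if term in valuable:
            per_doc[doc_id].add(term)
    return sorted(per_doc, key=lambda d: (-len(per_doc[d]), d))
```

In this sketch a term seen in one ticket on one day scores lower (better) than an IP address seen across tickets, wikis, and emails over a year, so documents sharing the rarer term float to the top.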