Azure Datafest - New Technology for Fast Big Data Analytics (Part 1): Intro to Azure Data Explorer

Hello everyone and good afternoon. Welcome to Azure Datafest; this is the third session of our event. I usually have some trouble changing slides, but I'll do that shortly to show you the calendar of sessions. OK, here it is. As you may remember, for those of you who have been with us since the beginning of this event, this is the third session, and we are continuing our exploration of Azure Data Explorer, or Kusto, as it is also known. Last time

we did an introduction to what Azure Data Explorer does, and today we are building on that foundation to deep dive into the features and capabilities that Azure Data Explorer exposes for building highly concurrent and performant applications. OK. We have also extended our calendar of events beyond October into November, to talk not only about all the services that Azure provides for big data analytics, analytics on real-time streaming data, and machine learning, but also to explain and deep dive into the infrastructure and services you can put in place to productionize your deployments and secure them. We have introduced additional sessions that target exactly these topics, and we will run these events all the way to the second of December, where we will go deeper into the topic of security: how to secure your infrastructure and your visualizations. So, data services, yes, but also how to deploy and run them in a production environment in a secure way. Moving to the next slide.

There's a bit of lag. Before we get started: as you know, at Microsoft we see sustainability and humanity's response to climate change as one of the greatest challenges of our lifetime, and we believe it is not too late to plan for a cleaner and greener future. As part of our commitment to sustainability, Microsoft Singapore will be planting a tree for every person attending this event. Moving on to the next slide, a few housekeeping rules: the session is being recorded,

for those of you who want to replay it afterwards. You can ask questions throughout the session. I will be your moderator and will liaise with our presenter, whom I will introduce shortly. Because you will be on mute, please post your questions in the Q&A panel; I will be reading and responding to them. We will also reserve about ten minutes at the end of the session for a Q&A, so we can elaborate on key questions or questions that require a lengthy explanation.

Alright, so without further ado, let me introduce Avner Aharoni, Principal Program Manager on Azure Data Explorer, who today will help us deep dive into some of the key capabilities of the Data Explorer service. Thank you, Avner, for being with us today. Thank you, Andrea. Hi everyone, nice to be here.

So I'm going to give a talk about building high-concurrency applications on Azure Data Explorer. In this talk I'll start by introducing the use case, then talk about the applicable features that Azure Data Explorer, Kusto, provides to fulfill this scenario, and then touch a bit on the UI stack: how to build the user interface and what different user interfaces are available for such an application. OK, so let's start with the use case.

Azure Data Explorer is a big data platform that can ingest terabytes and petabytes of data and present it to users in a very fast, very performant manner. We see a lot of our users building large-scale dashboard and alerting solutions: data flows into the system all the time, many users touch many dashboards, alerts run continuously on top of the data, and there is the usual telemetry-monitoring scenario. The spec I'm going to use to exemplify the scenario comes from an actual use case we were looking at a few months ago. In this spec, we wanted to build a solution that would accommodate more than 1,000 users, over data collected by Telegraf agents: around 34 million metrics per hour, at 10-second granularity, kept for 15 months, feeding more than 2,000 dashboards, including alerts that run all the time trying to detect anomalies and the like.

The acceptance testing we did for this application was to have 100 concurrent users hitting those dashboards. We created seven different dashboards, with three different time spans, auto-refresh, and so on, and each dashboard had at least six tiles with multiple time series and week-over-week calculations. Everything has to load fast: the full dashboard within 5 seconds. And everything has to be fresh: data should be visible less than two minutes after it was created. So as you can see, this is a super demanding system to build.

For that, we invested a lot in creating new features and perfecting the ones we already had. So let's talk about the different features and capabilities that ADX has. The first feature is materialized views; then we have the leader-follower pattern; and I'm also going to talk about the data partitioning policy, query consistency, and the query results cache. I'll drill into each of those features next and tell you how they can help build a solution that accommodates this traffic. Let's start with materialized views. Materialized views address three main use cases.

The first is downsampling: the ability to have a view that is always computed in the background, maintaining a downsampled version of the data, meaning aggregates of the data in different buckets than the ones you received it in. For example, if you're receiving telemetry every 10 seconds but you want to keep the data at a granularity of one day, you can summarize it into one-day buckets; then, when anybody queries the data, they don't have to go through the full data set, they can just look at the aggregates.
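As a minimal sketch, assuming a hypothetical Telemetry table with a Timestamp column, a MetricId, and a numeric Value, the downsampling aggregation could look like this:

    // Aggregate 10-second raw telemetry into one-day buckets
    Telemetry
    | summarize AvgValue = avg(Value), MaxValue = max(Value) by MetricId, bin(Timestamp, 1d)

The same query body can sit inside a materialized view, so the aggregation is maintained continuously in the background instead of being recomputed per query.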

The second case is keeping the last entity by update time. This is the case where a sensor sends its state all the time and you are only interested in the last view of that sensor: what is the current temperature, how far is the bus from me, things like that. For this, Kusto has a very nice operator called summarize, and a nice function,

I should say, called arg_max. arg_max lets you say: based on this column, get me the last row, with all the values in that row, by any dimension: by ID, by tenant, by sensor, whatever.
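A sketch of that last-known-value pattern, assuming a hypothetical Sensors table keyed by SensorId:

    // For each sensor, return the complete latest row by Timestamp
    Sensors
    | summarize arg_max(Timestamp, *) by SensorId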

And the third scenario is deduplication. Unfortunately, from time to time we see systems that send data with duplicates in it, and our users come to us and say: we want to deduplicate the data first, because we really care about not having duplicates, and then do other calculations. For that you can run summarize take_any(*) by the different keys that you're interested in. This

calculation is very heavy if you do it during the query itself, so you can do it in a materialized view instead. So those are the three scenarios. When you create a materialized view, it does cost you computation in the background, but it gives you a great performance improvement and it's always fresh: you always get the whole data set, aggregated, deduplicated, or reduced to the last value, whichever scenario you want, always up to date. It lets you reduce the cost of the cluster, and it's a fully managed solution.
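To make the deduplication scenario concrete, a materialized view along those lines might look like this sketch; the table and key names are hypothetical:

    // Keep one arbitrary row per EventId; take_any(*) is the dedup pattern named above
    .create materialized-view DedupedEvents on table Events
    {
        Events
        | summarize take_any(*) by EventId
    }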

It comes with a transactional guarantee, meaning you don't need to do any heavy lifting: you just define the view, and Azure Data Explorer, Kusto, does the work for you in the background. Before we move to the next section, I'm going to show a little demo; let me just find the right slide. I think it is this one. OK.

Here I took a table; this demo runs, by the way, on our production cluster. It is essentially like a sensor table: every few minutes it surveys all the different clusters in the Kusto service, and for each one it writes a little record saying what the cluster name is, when it was last updated, what SKU it runs, and what its machine count is. Let's run this query.

And see the result: this is what we have in the table. In this case I just filtered for my own cluster, and I'm getting a thousand rows for this cluster, telling me that at all these times the cluster had a specific SKU and a specific machine count. Now, if I want to see just the latest view of the cluster, I can say arg_max by the timestamp. This gives me the last record for this specific cluster.

And I can see the state of the cluster. OK, so now we have a nice view here that lets me focus on the query statistics; let's look at the performance of this thing. When I run it, I can see that it took half a second, very fast, but it used 13 seconds of CPU: the cluster spent 13 seconds of CPU time to compute it. Now let's look at the materialized view that I created for this. This materialized view is essentially the same query: it reads from the original clusters table and

does summarize arg_max(LastUpdated, *) by the source, the cluster; so for each cluster it maintains the latest record.
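A hypothetical reconstruction of that view definition; the table and column names are guesses from the narration:

    // Continuously maintain the latest record per cluster
    .create materialized-view ClustersLatest on table Clusters
    {
        Clusters
        | summarize arg_max(LastUpdated, *) by ClusterName
    }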

Now let's run the same query on the materialized view, again focusing just on the statistics, and see how long it takes. It ran about three times faster, and in terms of CPU it used around 100 times less: from 13 seconds of CPU down to 156 milliseconds. That's the type of performance improvement you should expect; this is a super powerful feature. OK, let's continue with the deck and talk about the next feature: the leader and the follower. In Azure Data Explorer, when you send data, when you ingest it, it goes to a cluster. You can ingest data into this cluster.

You can read from it; essentially, the cluster is a read-write cluster. But in many cases you want a cluster that only serves reads. Why? Maybe I should go through the reasons on the next slide, but essentially what the follower-leader pattern allows is to create many read-only clusters, which we call followers, that follow the read-write cluster. You can follow a database, or only specific tables, from any other cluster. OK. So when do you use it? You use it to share data between different organizations and teams.

Let's say that one organization is responsible for getting the data and ingesting it, and maybe has its own workload. And some other team, say the machine learning team, wants to run machine learning algorithms on that data. Obviously, if they run them on the main read-write cluster, it will interfere with the first team's work: it will consume CPU and reduce the performance of that cluster. So they can simply create a follower cluster, and now they don't impact any of the resources of the other cluster. Many departments and teams can use the same data without interfering with each other.

The other thing it allows you to do is split the cost between the different teams. If one team is using the data and you want to ensure they pay for it, they can come with their own cluster, essentially, and pay for that cluster; the other team doesn't have to bear the cost of that workload. And the other reason, which was super important for our use case, is to separate ingestion from queries. In our use case, if you want a really performant, high-concurrency solution that serves dashboards, you don't want the ingestion workload to interfere with that: you don't want a spike in ingestion to make all the dashboards work slower.

So what we did in our solution was to create a leader cluster that does the ingestion and a follower cluster that serves the dashboards. They don't interfere with each other, and we get predictable performance on both clusters. As for the features this pattern provides: like I said, you can share at the table level.

You can change the permissions, so you can have one set of permissions on the leader cluster and another set on the follower cluster. A cluster can be both a leader and a follower: you can have a cluster with one database that is read-write and another database that is following yet another cluster. As I said, a single database can be followed by many followers, and a cluster can follow databases from many other clusters. So it's a super flexible and useful technology; really incredible technology. OK, one thing to note: the follower and the leader have to be in the same region. The reason is that they're both looking at the same storage accounts for the data: they don't share the same compute, but they share the same storage.

If you tried to have the cluster read from a storage account in a different region, performance would suffer; as a result, we disabled that ability, and you can only have the leader and the follower in the same region. And

the other thing to note is that there is data latency. The leader is always up to date, since it's ingesting the data; the follower synchronizes with the storage account, which takes a bit of time, a few minutes, so you should expect some latency on the follower. There is a way to get around this latency: one of the things we did in our implementation was to have a view, essentially a function, that unions the follower and the leader, taking only the latest data, say the last 10 minutes, from the leader, and the rest from the follower. That's how we circumvented this limitation of the follower.

Some other people don't mind being a couple of minutes behind the leader; maybe they're doing machine learning, like I said, maybe some other scenario. But in any case, you can overcome it by creating a union between the leader and the follower.
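A sketch of that union function as it might be defined on the follower; the cluster, database, and table names are hypothetical, and the 10-minute cutoff matches the narration:

    .create-or-alter function TelemetryFresh() {
        let Cutoff = ago(10m);
        union
            // the freshest slice straight from the leader
            (cluster('myleader.westeurope').database('TelemetryDb').Telemetry
            | where Timestamp >= Cutoff),
            // everything older from the local follower copy
            (Telemetry
            | where Timestamp < Cutoff)
    }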

Another important thing to note here is the setting for prefetching of extents. By default, the follower can answer queries and show data that has arrived even before it caches that data: as soon as it sees new metadata from the leader, it sees the new extents and lets you run queries on them, but the data might not yet be in the nodes' cache, which obviously leads to cache misses that can cause performance hiccups or degradation. You can set the prefetch-extents setting to true, meaning that until the data is fully cached on the follower, it will not let you query that data; that gives you very predictable, very good performance. For our scenario we set it to true: we always wanted the data fully cached and fully warmed up before users could query it. OK.
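That setting is a one-line command on the follower cluster; the database name here is hypothetical:

    .alter follower database TelemetryDb prefetch-extents = true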

Now let's talk a bit about data partitioning. Data partitioning is also a super important capability in databases, and in Kusto we made it extremely simple for you. Essentially, you define it once, as a policy, and then you get the benefit of partitioning: when a query runs, the engine knows how to find the right partition. The partitioning policy in Kusto is really advanced in that you don't see it in the model.

In other databases, the partitioning concept is exposed to the user, to the developer: they need to tell the engine they're using a specific partition, they need to filter by partitions, they need to do some work in order to use it. In Kusto, it's all hidden from you.

You don't need to worry about it: once you set it up, it's completely hidden. The system knows how to take advantage of the partitions, but from the user's perspective it's all transparent; it's a really nice and simple model. There are essentially two things you can say in the partitioning policy.

You can choose one string column and one datetime column, and for those columns the system will ingest the data and then, in the background, repartition the extents, our data shards, by those columns' values, based on the definition of the partitioning. All the queries that run on top of that will then take advantage of this partitioning. One of the main advantages is what we call the pre-filter: when we run a query, we look at the metadata first to find the values you are filtering on.

For example, say you have 34 million different metrics and every query asks for one single metric, and say you have 128 partitions. Instead of searching for that specific metric across all the different extents, the engine will search only one out of every 128 extents, and it knows this ahead of time: it knows exactly which extents hold that information and pre-filters the extents by the partition values. The other advantage is that it lets the engine move less data between the nodes of the cluster.

That means it performs the calculation in a more local manner, which obviously improves performance. When should you use it? First, the warning, usually saved for the end: it costs CPU on the cluster, because the cluster is constantly partitioning the new data in the background as it arrives.

This is a demanding workload, essentially: it takes every extent as it arrives and breaks it into 128 extents, or however many partitions you specified; the default is 128.

It breaks extents apart, merges them, and does other background operations on them, so it obviously costs CPU. You want to use it when you're really going to benefit from it: when query performance will be so greatly improved that it's worthwhile spending the CPU in the background. So the first thing to check is that the majority of the queries filter on the partition column. In the example that I gave you,

if you have 34 million metrics and every dashboard query filters for one specific metric or a few metrics, then obviously partitioning on the metric value is going to make a huge difference. In our experiments with that solution, it was at least 10 times faster, with 10 times less CPU. So you really want to do it when the partitioning column is used in either filtering or aggregation. The second reason to use partitioning, on a datetime column, is when the data arrives out of order.

Sometimes we see that because a client clock is skewed: sometimes we get data from the future, sometimes we get data from the very distant past, essentially corrupted data, and sometimes there is a lag in the data for other reasons. We want to make sure that when you query a time range, you query it correctly, without these stray values mixed in with other values, because that essentially destroys the time index, the time filter index. So you can set the partitioning on the datetime column too, and that ensures all the extents are homogeneous with respect to their datetime column. Let's look at another demo, this time for partitioning. In this example I'm looking at a data set that has four billion records. I have a logs table that is not partitioned, and a partitioned logs table where I applied partitioning on the source column.

Before we start the demo, let's look for a second at the partitioning policy and see what it is. It partitions on two columns. One is the source column, the one I'm filtering on here; it's a string column, using hash partitioning with 128 partitions. The second partition key is on a timestamp column.

I'm not using that one in the query; it just partitions the data by day, so each day gets its own partition. I'm also setting the effective datetime in the past, so the data I already had gets partitioned as well.
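A sketch of a partitioning policy along the lines described in the demo; the table and column names are guesses, while the structure follows the documented policy schema:

    .alter table LogsPartitioned policy partitioning ```
    {
      "PartitionKeys": [
        {
          "ColumnName": "Source",
          "Kind": "Hash",
          "Properties": {
            "Function": "XxHash64",
            "MaxPartitionCount": 128,
            "PartitionAssignmentMode": "Uniform"
          }
        },
        {
          "ColumnName": "Timestamp",
          "Kind": "UniformRange",
          "Properties": {
            "Reference": "1970-01-01T00:00:00",
            "RangeSize": "1.00:00:00",
            "OverrideCreationTime": false
          }
        }
      ],
      "EffectiveDateTime": "2021-01-01T00:00:00"
    }```

The EffectiveDateTime in the past is what makes the historical data get repartitioned as well.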

So I applied the partitioning to this partitioned logs table, and I have the other table that is not partitioned. Let's run the query on the non-partitioned table and focus immediately on the statistics, so we don't need to worry about the results; here we see the query time and the total CPU time.

Now let's click here and focus on the statistics. Avner, sorry to interrupt you: I just received some feedback that the font may be appearing a little bit small on screen. I don't know whether you have the possibility to zoom; otherwise it's still readable. Yes, I can zoom, definitely. Thank you.

Yes, this is Kusto Explorer; everything is easy in it. Great. So you can see here that the query time is five or six times faster, and the CPU time also shows a five-fold difference. So the partitioning improved the CPU time and the query time by five times; it really depends on the scenario, like I said, and on how the other indexes come into play. But as a rule of thumb, if you have a workload where most queries filter on a specific string column, an identifier or something like that, you do want partitioning on that column, and it can give you a very significant boost. One more interesting thing, which I'm just quickly going to show you, is in the statistics.

There is another tab here called query completion information. If you click on the execution time, it gives you more information about what happened. Here you can see the scanned extents: when the table was not partitioned, it had 145 extents, and it scanned 140 of them; of the four billion records, it scanned all four billion. When I partitioned it, the number of extents grew, because more partitions means more extents; but of the 467 extents, it scanned only nine. As a result, instead of four billion records, it scanned only 143 million. That's why I'm getting such a performance boost. And the fun part is that I didn't need to do anything to my query.

I don't need to think about it: once somebody sets this policy up, everything just works, as you would expect. OK, let's continue. Feel free to interrupt me with any questions; I'm happy to answer as we discuss the different features. As of now, there are no questions;

I think people are hooked on what you are presenting. OK, good, thank you. Sounds good. The next feature is weak versus strong query consistency.

So, what are these? By default, Azure Data Explorer, Kusto, runs with strong consistency. Strong consistency means that all incoming queries go to the admin node: there is always one node in the cluster that is the admin, and it holds the source of truth for the latest metadata. Because it has the latest metadata, when a query comes in, it can find all the latest data that has arrived at the cluster. The admin node takes care of the query plan, distributes the query, and ensures that the query sees the latest data. So in strong consistency mode, the data is always fresh; every query gets the latest snapshot,

and the admin node handles all the queries. We have another mode, called weak consistency. In weak consistency,

all the nodes of the cluster can essentially participate in this query handling, this query management. When you indicate on your query that you want it handled with weak consistency, the gateway of Azure Data Explorer will send it to whichever node it chooses. Usually, instead of one admin, it uses four, but it's up to the internal logic to decide.

It will send the query to a different node, and that node might not have the latest metadata; it might be lagging by 20 or 30 seconds, something like that. But it reduces the load on the admin node, because now the queries are distributed across the different nodes of the cluster, which lets the cluster scale and splits the workload evenly across the nodes. I have to say, we see quite a few cases where people just use the default in very demanding workloads and then try to scale the cluster.

The cluster doesn't behave the way they want, because the admin node is super busy. So what do they do? They add more nodes to the cluster, and that doesn't help at all, because the admin node is still busy. Then we tell them: hey, use the weak consistency mode. So far that has worked really well for most of our customers, though obviously we don't turn it on by default, because we don't want to surprise anyone. Once we advise them to use it, the load on the admin node immediately drops, the whole cluster can scale, and the customer can scale it as much as they want to get the performance they need.

OK, it's a super important feature. You actually use it per query: each query comes with a flag that says whether it should run with strong or weak consistency. So you can decide that some of your queries are strong and some are weak. It's not a setting on the cluster; it's a setting on the query side.
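A minimal sketch of requesting weak consistency from the query text via a set statement; the table is hypothetical, and queryconsistency is the client request property being set:

    set queryconsistency = weakconsistency;
    Telemetry
    | where Timestamp > ago(1h)
    | summarize count() by bin(Timestamp, 1m)

The same property can also be set per request from client SDKs or connectors.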

OK, and now to the last feature: the query results cache. The query results cache is a feature similar to Redis. A few years ago we saw a few of our customers using Redis as the way to cache query results, and they had to manage both Redis and the ADX database. Obviously that was a lot of work, and we said to ourselves: maybe we can do it better, maybe we can provide server-side caching that lets them simply tell ADX, with a simple flag: hey, we want you to cache the results. And we did it in a design I really like: you specify it per query, so you can decide for each query whether you want it to use this cache or not, and you also give the expiration time. Essentially, for each query you say: I want this query's results to be cached for this long.

Once that time elapses, the query will run again: it will simply ignore the cache, since the entry is invalidated. It's all managed by the Kusto cluster, and it's super simple to use, like I said. What is the logic behind it? First of all, the cache lookup is based on identical query strings.

If you type even one extra space in your query, it is considered a different query, and the cache will not be consulted for it; so the key is effectively the query string. The query also has to go to the same database and has to have the same client request properties, whatever they are: for example, the weak consistency we just talked about. If one query comes with strong and one comes with weak, they will not hit the same cache entry. This is pretty much it; it's a super simple feature. So let's just look at it.

We have another demo for this. OK, so what we see here... I'm going to increase the font. Again, I'm going to show you a demo on our production cluster, so hopefully all is well with it. It's a cluster of 500 nodes and it ingests a lot of data. What I want to do is chart some behavior of some clusters.

Let's start by looking at what's going on on this cluster: I'm just checking how many records arrived at the cluster over one hour, starting a couple of hours back; I'm just picking a random time frame. You can see that we got 51 billion records in this one hour. And then the

calculation I want to run is to chart some behavior of the cluster. In this case, I'm going to look at the ten biggest clusters, the ones that sent us the most telemetry, and chart their behavior over this hour, binned by one minute. So I'm going to run this query. It's a heavy query; I specifically chose a query that takes some time to complete, because that makes the cache easier to see.
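A hypothetical reconstruction of that charting query; the table and column names are guesses from the narration:

    let StartTime = ago(2h);                        // an arbitrary hour, as in the demo
    let TopClusters =
        Telemetry
        | where Timestamp between (StartTime .. 1h)
        | summarize Total = count() by ClusterId
        | top 10 by Total
        | project ClusterId;
    Telemetry
    | where Timestamp between (StartTime .. 1h)
    | where ClusterId in (TopClusters)              // only the ten noisiest clusters
    | summarize count() by bin(Timestamp, 1m), ClusterId
    | render timechart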

The query took around nine seconds, and I got the result: ten different clusters. I hashed their names so you don't see anything identifying, and you can see how they behave over time. Super simple stuff. OK, now I'm removing the visualization so you can see the raw result, and let's set a query results cache of 15 seconds on this query.
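Enabling the cache is a single set statement in front of the query; a sketch, reusing the hypothetical names from above:

    set query_results_cache_max_age = time(15s);    // serve cached results up to 15s old
    Telemetry
    | where Timestamp between (ago(2h) .. 1h)
    | summarize count() by bin(Timestamp, 1m), ClusterId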

OK? I'm going to run this. This is the first time it runs, so it's going to take another six seconds or so, and I have the same query, with the same cache setting, running here as well. You can see that this second query started but returned immediately, together with the first. Why? First of all, because we had the cache: we set it up for 15 seconds. But also, the cluster was smart enough that while it was executing the query, when exactly the same query arrived, it queued it, essentially putting it in a special queue, and when the first query finished it returned results to both clients. So this is it; a very nice thing. OK, but let's see how it works over time. First of all,

I run it again after the 15 seconds have elapsed, so now it executes again, because the cache entry was invalidated. OK, we'll wait for that to come back. It came back. Now, run it again.

And it tells me that it ran four seconds ago and that it's serving the query from the cache; it's all immediate. If I run it again, it still gets a hit: it was 12 seconds ago... 13, 14, 15. Now it's invalidated, and the query runs again with fresh data.

OK? So this is how the feature works, and with this we are pretty much done with the features that make ADX, we feel, the best platform for building such an application. Now let's talk a bit about the UX stack, and how people build these solutions.

ADX, Kusto, has connectors to three main visualization platforms. One is Grafana, and Grafana is actually the one where the most was invested in building these features into the connector. In Grafana, for example, we found that with just a simple setting, just by clicking a checkbox, you can tell it to derive the duration of the query results cache from the time window of the dashboard query itself.

So if the dashboard, for example, is showing the last day, then the query results cache will be 10 minutes, because you don't need to see the data refresh every second if you're looking at a whole day. What happens over a whole day within 10 minutes? Nothing major, right? Only the last point might have changed.

So in Grafana, the calculation of this caching window is automatic. Similarly, because the query results cache matches on identical query strings, and the default in many dashboards is to use exact timestamps (for example, if you choose to see the last day, it uses the exact timestamp of now versus the exact timestamp a day ago), Grafana will adjust the query string, binning the time range essentially, to ensure that the query strings match and you really benefit from the query results cache the way you intended. So the query results cache duration and the query string are well coordinated in the Grafana plugin, to ensure that it always works and always gives you the best performance and the best utilization of the query results cache.

You can obviously do all of that yourself, in your own dashboards, in your own solution, but in Grafana it's all built in. ADX dashboards is another dashboarding tool that we have, and you can use all these features, like weak consistency in queries and the query results cache, from ADX dashboards, although we don't have them as global settings yet. Power BI is another tool people use to show users the different visualizations. For this workload, Power BI has its own caching: you

can use import mode, which essentially takes the whole relevant data set and caches it on the Power BI side, giving users great performance. And then you can build your own UX and obviously use and enjoy all these features. OK, so that's pretty much it from me; let me just say a few words about next steps. What you are welcome to do is read: we have a document that outlines this presentation, so please check it out. There are also a couple of very interesting blogs about partitioning and follower clusters, written by Yoni, the developer of these features.

And I always like to reference my favorite blog ever created about Kusto. It's called Just Another Kusto Hacker, and it's about an internal competition we ran at Microsoft around hacking and writing fun queries in Kusto. For those of you who are interested in having your mind blown by what people can do with the platform, feel free to check it out and enjoy it. OK, so I think we're ready for the Q&A,

if there is any. Thanks a lot, Avner, a very interesting presentation and demo, very relevant to what many of our customers and partners are looking for. While we wait for more questions, there is one related to Azure Data Explorer in general rather than to the features we presented, which I thought of raising now because it deserves some elaboration: how do you compare Azure Data Explorer with other market players, in particular Elasticsearch and BigQuery? How would you articulate the use cases in which Data Explorer performs best and makes a good fit versus these two platforms? OK, thanks; this is a question we like, so let me talk a bit about it. First I should say that, historically, Kusto and Azure Data Explorer are latecomers to the game of big data analytics platforms, and as a result they had the advantage of looking at what the competitors had done and then innovating, even disruptively, around it.

But we did compete, and we are competing, with those two extensively and heavily. About Elasticsearch, I can say that the Elasticsearch philosophy is different from the ADX one. Elasticsearch is an indexing engine for text, on which you can do analytics, which is what you get with Kibana and so on: searching for different strings and looking at facets. You can also do very sophisticated searches: catalogue search, fuzzy searches, semantic searches, and things like that. But for telemetry, machine telemetry and log telemetry, the kind we deal with, the fact that Kusto is a relational engine with the text index embedded into it gives a huge performance improvement over Elasticsearch.

So this is what we usually see when we compare Azure Data Explorer, Kusto, to Elasticsearch: the first thing we see is much better performance, and it's easy to understand why. Elasticsearch is a text index, while Kusto is a relational engine that gives you analytics on top of the data, with the text index embedded. Obviously it's not a full text index: it doesn't provide fuzzy searches.

It doesn't know, for example, that if you search for "Mr." it should also find the word "Mister";

it would not know that these are the same thing. If you want that functionality, you go to Elasticsearch; this is where they excel. But in Kusto, when you search for a string, an error message or something like that, it will find that string fast, maybe even faster, and it will definitely allow you to do analytics on top of it: to count it, to aggregate by it, to calculate percentiles, and things like that, much faster and at a much lower cost.
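A sketch of that search-plus-analytics pattern in KQL; the table and column names are hypothetical:

    // Find a string via the built-in text index, then aggregate on the matches
    Logs
    | where Message has "error"
    | summarize Errors = count(), p95 = percentile(DurationMs, 95) by bin(Timestamp, 1h)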

With BigQuery, it's the other direction. BigQuery is a relational engine built on a technology Google developed called Dremel, and in some ways it's quite close to the philosophy, the design, of Kusto. But BigQuery, first of all, did not go into text analytics, I think, at least from what I've seen.

They do not go into text analytics to the extent that Kusto does, with very good text indexes embedded into the data. Maybe they are doing that by now, but as far as I can see, not yet. The other thing is the operators, the functions, the case insensitivity, and so on, that are built into Kusto by default; it just started this way. In BigQuery, you really have to work a lot to get the same results yourself: to write regular expressions

and things like that to get those text analytics done. From what we've seen, and everybody should try it on their own, we've seen better performance on Kusto; we feel, we think at least, that our engine is faster than BigQuery.

Thanks, Avner, for taking us through that, and for your considerations; they will definitely help our customers make decisions and evaluate our services in the right light. I do not see any more questions, so before we close this session, I'm going to take control of the slides from you, Avner, and remind the participants of the next few sessions that we have in our calendar.

Let me see... yep. OK, a little bit too fast; let's see if it works now. Yep, thank you. So, in the initial three sessions, we have done quite an extensive overview of the fully managed

services for batch-based and real-time analytics available today on Azure, and we have had a deep dive into Azure Data Explorer, which is one of the innovative technologies we have and a unique differentiator for us. For the next few sessions: in the immediate next week we will be talking about data governance, and we will provide an introduction to Azure Purview.

Then, after Ignite, which is our main event in the month of November, we will come back with sessions on machine learning: training machine learning models with a no-code or low-code approach. After that, we will move on to the infrastructure services that need to be taken into consideration, for example networking, or repeatable patterns through landing zones for your deployments. These services are needed to bring your analytics system into production.

And finally, we will provide a deep dive specifically on security. On the next slide: let us know your feedback for this session. We have a survey, and here is the QR code for it. Your feedback is very important, and it will help us shape these sessions moving forward and tune them according to your interests.

Let me see if there are some final questions. OK, nice: it's not a question but feedback, saying that this series and today's session were great and very insightful. So, thank you, Avner, for being with us today and explaining all these capabilities of Azure Data Explorer, and thank you

everyone for attending this session. See you next week.
