AWS Summit ANZ 2022 - Demystify data streaming on AWS DATA7


Hello everyone. Thank you for joining us today. Today we will talk about how AWS can help you get real-time, actionable insights from data. My name is Sayem, and I am an Analytics SA at AWS. Joining me in this session is my colleague Paul Villena, an Analytics Acceleration Lab SA.

Here’s our plan for the session today. I will talk about the core data streaming concepts, some use cases, and architecture patterns for designing a data streaming solution. And finally, Paul has a wonderful demo showcasing the newly introduced Redshift streaming ingestion feature. So let's get started.

The way we derive insights from data is continuously changing. The exact moment when humans realised data has value is debatable. In the 90s we saw a rise in relational database adoption, initially mainly to store structured data. Then came the rise of the internet and social media, and we saw high volumes of data coming from many sources, like electronic devices or log data coming from applications.

And we observed that the more data we can capture, the more valuable insights we can derive. So we started using purpose-built tools and technologies, like the data warehouse and the data lake, to get complete answers by combining structured, semi-structured, and unstructured data. Then, with the explosion of data in recent times, we realised that to continue creating value from data, we must derive insights in real time. We observed that failing to act in real time can translate to real losses. So we also started adopting data streaming technologies alongside data warehouses, data lakes, and relational databases. As you can see, the way we derive insights from data is an ever-changing pattern.

So I said failing to act in real time translates to real losses. But what does real-time actually mean? Does real-time mean seconds? According to research, the fastest rate at which our brain can process visual stimuli is about 20 to 30 milliseconds. When I started learning about real-time data streaming, my perception was that real-time had to mean milliseconds or seconds. But the more I learnt and engaged with real-world use cases, I eventually realised my understanding was wrong. So how fast is real-time? Well, it depends. It depends on your use case.

For example, when you are building a microservice-based application, fast communication between microservices is key, and service-to-service communication usually needs to happen within milliseconds. So here, real-time means milliseconds. If you think about getting server logs in real time, or IoT device monitoring, or change data capture, the expectation is that we are OK to receive that data in seconds.

On the other hand, when you move your data to a central location like a data lake or data warehouse for batch analytics, the expectation is usually to get that data in minutes, and in many cases hours. So you see, the definition of real-time varies on a case-by-case basis. And this is key to understand when you are planning to build a real-time platform, so you can choose the right tool for the right job instead of spending major amounts of time assessing the fastest tool. Let's talk about some other core concepts in a real-time data platform. These terms are fairly new in the data processing world, mainly for those of us who have been dealing heavily with RDBMSs for data processing.

Think about the source. For any real-time data pipeline, that’s the first thing to identify. For example, the source could be devices or applications that produce real-time data at high velocity. Then you will plan how to capture that data. Think about writing data to a database.

In the data streaming world, we call that a producer. So you have written that data to a system, right? Now you will read that data. The component that reads data from your streaming storage is what we call a consumer. Once you read or consume the data, you usually take some action. The most common action is sending the data to a destination like a data lake or a database.

And that destination is what we call the sink. One last concept I want to discuss today is data retention. Usually, you will not store data in the streaming engine for a long time; you will consume it as soon as it arrives.

So you specify a length of time, the data retention period. That’s how long your data remains accessible in the system. I hope these concepts help. With that, let's now discuss some of the common real-time data streaming use cases. Over the last couple of years, working closely with different customers, I have seen a common pattern across industries where companies use real-time streaming technologies to solve their business problems. So in the next few slides, I will go deeper and share my experience across four use cases.

These are log monitoring, connected device monitoring, near-real-time analytics with a data warehouse, and platform modernisation. Let's start with log monitoring. The main purpose of server log or application log monitoring is to maximise the uptime of the system, which leads to an improved customer experience. For that, you need to capture the logs as fast as possible and move them to a location where you can monitor the health of your entire system from a centralised place, for example by creating a dashboard. This also helps you generate real-time alerts in case of any failure so you can react immediately.

The next use case I want to discuss is connected device monitoring. Tracking time-series data on device connectivity and activity helps you detect anomalies in real time and react quickly to changing conditions and emerging situations. For example, Optus, one of the largest wireless carriers in Australia, has built a real-time network monitoring platform on AWS to monitor home and broadband connectivity, so they can understand the customer experience in real time and better serve their customers with a current and accurate view of their services.

Platform modernisation is another area where companies are heavily focused these days, to retire legacy systems, drive growth, and maximise the value of their existing investments. For example, building microservice-based applications with a distributed messaging system. Another way companies modernise their platforms is by capturing data changes in real time without impacting the core platform, and using those changes to build new features. The final use case I want to talk about today is near-real-time analytics with a data warehouse.

Previously, we used to run batch jobs for hours or days to move data to the data warehouse and run analytics. Things have changed; now we are evolving our analytics platforms from batch to near-real-time by adopting streaming data engines. With near-real-time data analytics, you start consuming data in your data warehouse nearly as soon as it arrives. This approach helps you join diverse sources of data quickly and easily, and get more complete results for your business.

So we have talked about four common data streaming use cases. And when you are building a streaming data architecture, there are five common components you will work with: source, streaming ingestion, streaming storage, stream processing, and destination. In the next few slides I will talk about the tools, frameworks, and AWS services that can help you build that pipeline. First, let's start with streaming ingestion.

For streaming ingestion, the tools and libraries you will use depend on the streaming storage. The two most common and popular streaming storage options are Kinesis Data Streams and Apache Kafka on Amazon MSK. To ingest data into Kinesis Data Streams, you can use the AWS SDK, the Kinesis Producer Library, or even install an agent on your server.
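To make that concrete, here is a minimal producer sketch using the AWS SDK for Python (boto3); the stream name and record shape are assumptions for illustration, not a prescribed schema.

```python
# Minimal sketch of a Kinesis Data Streams producer using the AWS SDK for Python (boto3).
# The stream name "my-telemetry-stream" and the record fields are illustrative assumptions.
import json
import boto3

kinesis = boto3.client("kinesis")

def send_event(event: dict, partition_key: str) -> None:
    # PutRecord writes a single record; the partition key determines the target shard.
    kinesis.put_record(
        StreamName="my-telemetry-stream",   # assumed stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )

send_event({"device_id": "d-001", "cpu": 72.5}, partition_key="d-001")
```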

If you are using Apache Kafka on Amazon MSK, you can use Kafka Streams or Kafka Connect. Many other AWS services also integrate easily with the AWS streaming data services, like AWS IoT, Amazon CloudWatch, Amazon DynamoDB, AWS Database Migration Service, and Amazon Redshift. AWS provides two different managed offerings for streaming data storage. The first one is Kinesis Data Streams. It can continuously capture and store terabytes of data per hour from hundreds of thousands of sources. It is serverless, scales automatically, and provides high availability and data durability out of the box.

And the second one, Amazon MSK, is a fully managed service for open-source Apache Kafka. You can easily migrate from a self-managed Kafka environment to Amazon MSK, perform version upgrades easily, and deploy your cluster across two or three AWS Availability Zones. Both Kinesis Data Streams and Amazon MSK have deep integration with many AWS services like AWS Database Migration Service, AWS Lambda, Kinesis Data Analytics, and AWS Glue.

So if you are building a cloud-native data streaming platform, use Kinesis Data Streams to build it quickly and easily. If you are running Apache Kafka in a self-managed environment like EC2 or on premises, Amazon MSK is a fantastic option for migrating your Kafka workload so you avoid worrying about infrastructure management. Once your data has landed in streaming storage like Kinesis Data Streams or Amazon MSK, the next step is processing that data.

There are many options available for processing streaming data. In this session, I will talk about five different options. Let's start with Kinesis Data Analytics. For complex stateful stream processing, you can use Kinesis Data Analytics powered by Apache Flink and write your code using standard SQL, Python, Scala, or Java. Kinesis Data Analytics will manage and scale your Apache Flink environment for you.

The service also provides a notebook environment, Kinesis Data Analytics Studio, for developing, debugging, and visualising data, so you can quickly inspect your streaming data, test your code, and then easily deploy it as an Apache Flink application on Kinesis Data Analytics. Next is Kinesis Data Firehose, which provides managed connectors to send your data to different destinations. You can use the API to put data into it directly, and it also supports Kinesis Data Streams as a source.

The service has many built-in sink connectors, for example Amazon OpenSearch Service, Amazon S3, HTTP endpoints, Splunk, Sumo Logic, and many other destinations. You can also transform the data in flight with AWS Lambda.
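As a rough sketch of that Lambda transformation step: Firehose invokes the function with a batch of base64-encoded records and expects each one back with a record ID, a result status, and the transformed data. The derived field added below is purely illustrative.

```python
# Hedged sketch of a Kinesis Data Firehose transformation Lambda.
# Each incoming record carries base64-encoded data and must be returned with a
# recordId, a result (Ok / Dropped / ProcessingFailed), and re-encoded data.
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: add a derived field before delivery to the sink.
        payload["is_error"] = payload.get("status_code", 200) >= 500

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(payload) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```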

So, if you have complex data processing requirements with low latency and high throughput, use Kinesis Data Analytics for Apache Flink. If you want to send your data to a different destination without writing code or managing infrastructure, and your latency requirement is in minutes, use Kinesis Data Firehose. The next stream processing service I want to talk about is AWS Glue. AWS Glue is a serverless data integration service. With AWS Glue, you can run the Apache Spark structured streaming engine, transform your data, and ingest it into your data lake.

It supports stream processing with Kinesis Data Streams and with Apache Kafka, either self-managed or on Amazon MSK. And recently, we released the Redshift streaming ingestion feature. At the time of this recording, the feature is in public preview and supports Kinesis Data Streams as a source; Apache Kafka will be added as a source as well. With this direct integration with streaming services, you don’t need to build or maintain a separate pipeline to send your data to your data warehouse.

The last stream processing service I want to discuss today is AWS Lambda. AWS Lambda is one of the most popular serverless, event-driven compute services. You write your business logic in your preferred language, like Python, .NET, or Java, and let the service scale compute resources elastically. AWS Lambda supports stream processing with Apache Kafka and Kinesis Data Streams.
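A minimal sketch of what that looks like with a Kinesis event source mapping, assuming JSON payloads; records arrive in batches with base64-encoded data.

```python
# Minimal sketch of an AWS Lambda handler attached to a Kinesis Data Streams
# event source mapping. The print stands in for real business logic.
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(payload)  # transform, enrich, or route the record here
```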

So if you are using AWS Glue as your ETL tool, or you are already using Apache Spark, AWS Glue is the easiest way to process streaming data from Apache Kafka and Kinesis Data Streams. If your data warehouse is on Amazon Redshift and you want to run analytics combining streaming and historical data with SQL, Amazon Redshift streaming ingestion is for you. If you want to do simple transformations on real-time streaming data and build a microservice-based, event-driven architecture, AWS Lambda is the simplest tool you can think of.

So as you can see, AWS provides many different options for real-time data streaming. In the next couple of slides I will talk about real-time data streaming architecture patterns built with AWS services. I was working with a large SaaS solution provider. They have built a robust microservice-based architecture, and it generates lots of application monitoring data.

For example, disk space usage, memory consumption, user interactions, and so on. The platform was generating around 2 GB of telemetry data per minute. The goal of capturing all that telemetry data was to improve the customer experience.

But they were struggling to capture such a massive amount of data efficiently and take action immediately. So to tackle that, they started sending the telemetry data to Kinesis Data Streams in real time. Then they run complex stateful data processing using Apache Flink on Kinesis Data Analytics, for example identifying the average CPU usage for a component over the last 15 minutes. If it breaches the threshold, they can immediately generate an alert using AWS Lambda and Amazon SNS, so the engineering team can act on it right away. They also compute aggregated metrics using Kinesis Data Analytics and send those to Amazon OpenSearch Service, so they avoid running large aggregations on their search cluster and their search queries run faster.
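As an illustration of the alerting step (not the customer's actual code), here is a hedged sketch of a Lambda function that receives an aggregated metric, such as one emitted by the Flink application, and publishes an SNS notification when a threshold is breached. The topic ARN, metric fields, and threshold are all assumptions.

```python
# Illustrative alerting Lambda: publish to Amazon SNS when an aggregated metric
# crosses a threshold. Topic ARN, field names, and threshold are assumptions.
import json
import boto3

sns = boto3.client("sns")
ALERT_TOPIC_ARN = "arn:aws:sns:ap-southeast-2:123456789012:ops-alerts"  # assumed
CPU_THRESHOLD = 80.0  # assumed threshold (%)

def handler(event, context):
    metric = event  # e.g. {"component": "checkout", "avg_cpu_15m": 91.2}
    if metric.get("avg_cpu_15m", 0) > CPU_THRESHOLD:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject=f"High CPU on {metric.get('component')}",
            Message=json.dumps(metric),
        )
```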

Finally, they use Kinesis Data Firehose to capture the raw telemetry data and send it to Amazon OpenSearch Service to perform complex searches on that massive amount of telemetry data. As a result, root cause identification is much faster and resolution times are shorter, which overall improved the customer experience on their platform. The next story I want to share is about working with a large telco. They provide mobile communication and broadband services to millions of customers. One of the challenges they were facing with such a large install base was getting an accurate, real-time view of the quality of service and the issues experienced by their customers. Their existing batch processing system was struggling to keep up with the massive amount of modem data.

So to handle that situation, they built a new platform where they ingest modem data in real time into Kinesis Data Streams and use AWS Lambda to store the current device status in Amazon DynamoDB, so they can view the latest status for a particular device from an application or by calling an API. They also run metric aggregations using Kinesis Data Analytics and send that data to Amazon OpenSearch Service to build location-based dashboards with OpenSearch Dashboards, so they can track which locations are impacted by a network disruption and immediately engage their engineering team to take proactive action.
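A hedged sketch of that "latest device status" pattern, assuming a DynamoDB table keyed by device_id; the table and attribute names are illustrative, not the customer's actual schema.

```python
# Lambda reading modem events from Kinesis and upserting the most recent status
# per device into DynamoDB. Overwriting the item keeps only the latest status.
import base64
import json
import boto3

table = boto3.resource("dynamodb").Table("device_status")  # assumed table name

def handler(event, context):
    for record in event["Records"]:
        modem_event = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "device_id": modem_event["device_id"],     # assumed field names
            "status": modem_event["status"],
            "updated_at": modem_event["timestamp"],
        })
```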

With this new solution, the platform is now capable of onboarding more modem devices without worrying about infrastructure capacity planning. It can detect issues and generate appropriate alerts immediately, and finally, it has significantly improved the customer experience on their broadband network. Here’s my next story, working with an energy service provider. The customer uses Amazon RDS to store their end-customer information, billing information, and electricity and gas usage information. They ran a weekly batch job to collect all that transactional data from RDS and send it to their data lake for reporting, forecasting, and even sending billing information to customers.

They also built a trigger-based system on RDS to send emails and SMS to their customers about their electricity usage, monthly billing information, and so on. This triggering system and the batch processing were slowing down their core transactional system. Also, running a weekly batch job for analytics was not giving accurate, up-to-date results for the forecasting they wanted to do on their customers' gas and electricity usage. Initially they were thinking of rebuilding their entire platform to avoid all those issues, but their core application is complex in design and has been running in production for many years. Rebuilding the entire platform would take years and cost millions.

So they took a new approach. Instead of running batch jobs on their core transactional database, they started capturing data changes, which we call CDC, by running Kafka Connect on MSK Connect and sending that data to a Kafka topic. They then use AWS Lambda to read data from Amazon MSK and generate emails and SMS via Amazon SNS to their customers, for example sending monthly billing information or notifying them when their electricity or gas usage is higher than usual. They also use MSK Connect to send all transactional data to their S3 data lake, so they can run forecasting immediately and accurately. With this new approach, they avoided the huge investment of rebuilding their entire platform and the time it would have taken.
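To illustrate the shape of that MSK-to-Lambda step (again, not the customer's actual code): an MSK event source mapping delivers records grouped by topic-partition, with base64-encoded values. The CDC field names and the notification helper below are assumptions.

```python
# Hedged sketch of a Lambda function triggered by an Amazon MSK event source mapping.
# Each CDC record is decoded and checked to decide whether to notify the customer.
import base64
import json

def handler(event, context):
    for topic_partition, records in event["records"].items():
        for record in records:
            change = json.loads(base64.b64decode(record["value"]))
            if change.get("usage_kwh", 0) > change.get("usual_usage_kwh", 0):
                notify_customer(change)  # e.g. publish via Amazon SNS, as in the earlier sketch

def notify_customer(change: dict) -> None:
    # Placeholder for the email/SMS notification; customer_id is an assumed field.
    print(f"Would notify customer {change.get('customer_id')} about high usage")
```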

They have also been able to reduce the batch-job pressure on their core transactional system, and their batch analytics on the S3 data lake now give a much more up-to-date and accurate view of their business. The last example I want to discuss today is a new way of doing near-real-time data analytics with a data warehouse like Amazon Redshift. Currently, if you want to combine real-time data with the data available in your data warehouse, there is a pipeline you have to build, and usually that pipeline's latency is in minutes. Now, with the new Redshift streaming ingestion feature, you don’t need to maintain a separate pipeline to send your streaming data to Amazon Redshift. You connect to the Kinesis data stream by creating a materialised view in Amazon Redshift and let the engine populate your data warehouse automatically.

This way, you can achieve very low-latency, near-real-time analytics with your data warehouse engine, process streaming data easily with SQL, and visualise it with tools like Amazon Managed Grafana or Amazon QuickSight. So, having talked about this new approach for near-real-time analytics with a data warehouse, how about we show you a demo? For that, I am now handing over to Paul to show us a demo of the new Redshift streaming ingestion feature. Hi everyone, my name is Paul Villena and I am here to demonstrate how we can enable near-real-time analytics using Amazon Kinesis Data Streams, Amazon Redshift, and Amazon Managed Grafana. Supply chain disruption is top of mind for a lot of us right now, so I have created a Grafana dashboard for a logistics company to provide augmented intelligence and situational awareness for their operations team.

Here we can see key performance indicators that include revenue, costs, and other delivery metrics. I can easily change the granularity of the information displayed here using this drop-down list. Here, we have a detailed view of the stream of consignments they are receiving. This dashboard is also able to surface the real-time inferences of a machine learning model that predicts the likelihood of a consignment being delayed. All of these insights are based on transactions that happened as recently as 10 seconds ago, and this can really help them respond to disruptions. For example, I can see here a slightly increasing trend in the number of consignments scored with a high likelihood of being delayed.

We can see that it is concentrated in the state of NSW and that it coincides with some vehicles going into unscheduled maintenance. With this information, we can respond accordingly by rebalancing our fleet based on demand or revisiting our maintenance schedule. So here is how we built this. I have a notebook that runs Python code simulating the customer and consignment events, using the Python Faker library to generate dummy data. This section of the code pushes the events to Kinesis Data Streams using the Python SDK.
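A simplified sketch of what such a generator can look like, assuming a hypothetical stream name and event schema rather than the demo's exact code:

```python
# Dummy consignment events generated with Faker and pushed to Kinesis Data Streams
# with boto3. Stream name and event fields are illustrative assumptions.
import json
import random
import time

import boto3
from faker import Faker

fake = Faker()
kinesis = boto3.client("kinesis")
STREAM_NAME = "consignment-stream"  # assumed

def make_consignment() -> dict:
    return {
        "consignment_id": fake.uuid4(),
        "customer_name": fake.name(),
        "state": random.choice(["NSW", "VIC", "QLD", "WA", "SA"]),
        "revenue": round(random.uniform(50, 500), 2),
        "event_time": fake.iso8601(),
    }

# Control loop: keep pushing events so the dashboard stays near real time.
while True:
    event = make_consignment()
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["consignment_id"],
    )
    time.sleep(0.1)
```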

Here, we have a control loop that triggers the events, and we can see the stream of data being generated. We have configured two Kinesis data streams for this demo, both using the on-demand capacity mode, which is capable of serving hundreds of megabytes of write and read throughput per second without any capacity planning. Now, I am going to log in to Redshift and establish the integration between Redshift and Kinesis Data Streams. This is done by simply creating an external schema. The next step is to define how we want to store our streaming dataset as native Redshift objects, and for this we are going to use a materialised view.

In this example, the transactions are stored in JSON format, and this can be ingested as is into Redshift and stored using the SUPER data type, as shown in this example. Now, let us refresh this materialised view; this is where the actual data ingestion happens. Data gets loaded directly from the Kinesis data stream into Redshift without having to stage it first in Amazon S3. Now, let us run a query against this new materialised view.
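Here is a hedged sketch of the Redshift side of those steps, run through the Redshift Data API so the example stays in Python. The cluster identifier, database, IAM role, stream, and object names are assumptions, and the exact streaming-ingestion SQL may differ slightly between the public preview shown here and later Redshift versions.

```python
# External schema over Kinesis, a materialised view storing the payload as SUPER,
# and a refresh that performs the actual ingestion, executed via the Redshift Data API.
import boto3

redshift_data = boto3.client("redshift-data")

STATEMENTS = [
    # 1. Map the Kinesis data stream into Redshift via an external schema.
    """
    CREATE EXTERNAL SCHEMA kds
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';
    """,
    # 2. Define a materialised view over the stream; the JSON payload is kept as SUPER.
    """
    CREATE MATERIALIZED VIEW consignment_stream AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(FROM_VARBYTE(kinesis_data, 'utf-8')) AS payload
    FROM kds."consignment-stream";
    """,
    # 3. Refreshing the view ingests data straight from the stream, with no staging in S3.
    "REFRESH MATERIALIZED VIEW consignment_stream;",
]

for sql in STATEMENTS:
    redshift_data.execute_statement(
        ClusterIdentifier="demo-cluster",   # assumed provisioned cluster
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
```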

All of the succeeding refreshes only pick up incremental data, and you can trigger them as often as required. Here, I have another query that breaks down the number of customers across the different states. It shows how easy it is to unpack the SUPER data type and extract the state attribute. Now let's copy this SQL statement; we're going to use it to create a new visualisation in our Grafana dashboard. I am going to create a new panel, select Redshift as the data source, and paste the SQL statement.
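And a sketch of that state-breakdown query against the materialised view defined above: PartiQL-style dot notation navigates the SUPER payload, and the attribute names are assumptions carried over from the earlier producer sketch.

```python
# Unpacking the SUPER column with dot notation and casting to VARCHAR for grouping.
import boto3

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="demo-cluster",   # same assumed cluster as above
    Database="dev",
    DbUser="awsuser",
    Sql="""
        SELECT payload.state::varchar AS state,
               COUNT(DISTINCT payload.customer_name::varchar) AS customers
        FROM consignment_stream
        GROUP BY 1
        ORDER BY 2 DESC;
    """,
)
```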

I can select table as my visualisation option and click apply. So that is it for our quick demo. We have shown how easy it is to ingest streaming data directly from Kinesis Data Streams into Redshift using the new Redshift streaming ingestion feature. And now, I will hand it back to Sayem to discuss other recent feature releases.

Thank you, Paul. I want to close the session today by sharing three major features we have recently released for data streaming. The first is Amazon MSK Serverless, which makes it easy for you to run Apache Kafka without having to manage and scale cluster capacity. Then there is Kinesis Data Streams on-demand, a flexible new billing option capable of serving thousands of requests per second without capacity planning. And finally, as you have seen in the demo, the Amazon Redshift streaming ingestion feature, which makes stream processing with your data warehouse faster and easier than ever.

In this session, we talked about core streaming data concepts, discussed real-world use cases and sample architectures using AWS streaming data services, and showed a demo of the new Redshift streaming ingestion feature. I want to close the session by sharing a thought. As you have seen, AWS provides many different options for real-time data streaming.

Like many of you, I personally love to explore trending and emerging technologies. Sometimes we tend to use a technology because it is trending, it is exciting to learn, it is new. And by the way, it is not wrong to explore or learn new things, but sometimes we immediately jump into implementing them in our organisation without knowing what problem we are trying to solve.

We even create long-term organisational strategies in this ever-changing world around a particular trending technology. This is one of the most common mistakes in choosing a technology. If we don’t know why we are choosing a popular tool or technology, in most cases our implementation will fail. It’s a waste of time, money, and resources for an organisation. So, identify your use case so you can clearly measure the impact of your initiative on the business, and choose the right tool for the right job.

To learn more about streaming data services and their use cases, here are a few documents and workshop resources I would highly recommend checking out. And to continue your cloud journey, please use these training resources. With that, I hope the session was useful.

I would love to hear about the next awesome idea you are going to try with the AWS streaming data services. Please provide your feedback; it is very important to us. Thanks for watching.
