Teach Yourself Spark Streaming in Five Lines of Code
Here you are! You've heard about it and want to know more, but you don't want to spend a week going through complicated tutorials, or even the documentation.
This is a minimalistic, concise way of learning the concepts, with minimal boilerplate code and as few secondary concepts as possible.
What is Spark Streaming?
Spark Structured Streaming allows you to perform ETL on your data using plain old SQL, and to obtain query results in near real time.
Your data can come from a variety of sources, such as CSV, JSON, or Parquet files. For this example, we will be using JSON.
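The input files themselves are not reproduced in this post, but working backwards from the output shown later, people1.json plausibly contains one JSON record per line, along these lines (the field names come from the query below; the ages are inferred from the printed birth years):
{"name": "Maxim", "age": 22}
{"name": "Aleksander", "age": 23}
{"name": "Paul", "age": 44}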
Streaming Queries
The central concept in Spark Streaming is the streaming query. Streaming queries are Spark SQL queries that are executed periodically on a table that can dynamically grow and shrink as new elements get added to the stream and old elements get consumed. That's all.
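To make "plain old SQL" concrete, here is a small optional sketch, not part of the original code: the streaming DataFrame rawRecords, defined in the Code Example below, can be registered as a view and queried with literal SQL, and the result is itself a streaming DataFrame.
// register the stream as a SQL view and query it with literal SQL
rawRecords.createOrReplaceTempView("people")
val viaSql = spark.sql("SELECT name, 2017 - age AS birthyear FROM people")
// viaSql is a streaming DataFrame; it is started with writeStream, as shown below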
Code Example
The complete code can be found here. But in this tutorial we go straight to the point. First, we read the data:
// read the first file to get the schema
val firstFile = spark.read.json(s"$fullPath")

// create the stream over all JSON files in the folder
val rawRecords = spark.readStream
  .schema(firstFile.schema)
  .json(s"$resourcesPath/*.json")
The next thing we do is hook in our transformation:
import spark.implicits._ // enables the $"column" syntax
val ageEvents = rawRecords
  // derive the birth year: 2017 minus the age
  .select($"name", -$"age" + 2017 as "birthyear")
import org.apache.spark.sql.streaming.ProcessingTime

val streamingQuery = ageEvents
  .writeStream
  // check for new files every 2s
  .trigger(ProcessingTime("2 seconds"))
  // write the results to the console
  .format("console")
  .start()

// wait up to 2 minutes before exiting
streamingQuery.awaitTermination(120000)
So far nothing has happened: the stream does not actually start until the call to start(). Finally, the awaitTermination() call prevents the program from finishing, giving us 2 minutes to play with it.
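If you want to inspect the query while it runs, StreamingQuery exposes a few standard methods. A quick optional sketch, not part of the original five lines:
// current state: is the query active, waiting for data, etc.
println(streamingQuery.status)
// metrics for the most recent trigger (null until the first batch completes)
println(streamingQuery.lastProgress)
// stop the query early instead of waiting out the 2 minutes
// streamingQuery.stop()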
As soon as this starts running, we get the following output in the console:
+----------+---------+
| name|birthyear|
+----------+---------+
| Maxim| 1995|
|Aleksander| 1994|
| Paul| 1973|
+----------+---------+
This is just the transformation applied to our first JSON file. To generate some additional elements in the stream, we add our second JSON file to that same folder.
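Again the file itself is not reproduced in the post, but judging from the birth years printed below, people2.json plausibly contains:
{"name": "Juan", "age": 22}
{"name": "Giulio", "age": 23}
{"name": "Rudolph", "age": 44}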
~/spark_stream$ cp src/main/resources/people2.json target/scala-2.11/classes/
Within the next 2-second trigger, a new batch appears:
+-------+---------+
| name|birthyear|
+-------+---------+
| Juan| 1995|
| Giulio| 1994|
|Rudolph| 1973|
+-------+---------+
Notice that the data from the first file did not appear again: each trigger processes only the newly arrived data, which is exactly the 'streaming' character of the computation. Simple, right? Enjoy!