Google Professional Data Engineer – Pub/Sub for Streaming Part 1
July 31, 2023

1. Pub/Sub

Let’s talk about Pub/Sub, which is an enterprise messaging transport system available on the Google Cloud Platform. As always, here is a question that I’d like you to think about: when a data set is unbounded, what constitutes the source of truth? If you have a bounded data set like a file or a database, the source of truth is all of the data in that file or database. But if our data set is unbounded, like a stream, what then is the source of truth? Let’s keep this question in mind as we talk about Pub/Sub. Pub/Sub is one of the services available on the Google Cloud Platform, and it can be defined as messaging middleware: an enterprise-level message transport system which takes data from many senders and relays it to many receivers asynchronously.

This helps to decouple senders and receivers in a very robust way. The name Pub/Sub comes from the terms publisher and subscriber, and indeed these terms are central to the architecture of Pub/Sub. Publisher apps create and send messages to a distinct entity known as a topic. There are also subscriber apps. Subscriber apps subscribe to topics and then receive all messages which are published to those topics. Each subscription can be thought of as a message queue or a message stream which takes messages to a subscriber. A message itself consists of data as well as some attributes. Those attributes can be set by the publisher.

Each message is, of course, published to a topic. The message attributes are key-value pairs that the publisher chooses to include with the message. So at a high level, that’s what’s going on in Pub/Sub. Publisher apps create and send messages to a topic. Subscriber apps subscribe to a topic. They will then receive all messages on that topic. Once a message has been acknowledged by a subscriber, it will be deleted from the corresponding queue. Now, we’ve discussed the importance of the queue being persistent, and that’s why messages need to stay in the message store until they’ve been both delivered and acknowledged. There is one queue per subscription.

In effect, for each subscription, the queue of the message transport is the source of truth. Subscribers can be either push or pull subscribers. Push subscribers use webhook endpoints in order to get the data delivered to them. Pull subscribers fetch the data for themselves by connecting to an HTTPS endpoint. Pub/Sub has a number of interesting use cases. A lot of these are listed in the Google documentation, and it makes sense to understand them. Consider, for instance, the case of balancing workloads in network clusters. Here we can have a large queue of tasks distributed over a set of workers, such as Google Compute Engine instances. Or take something like an order processing application.

Here, asynchronous workflows are required because the orders need to be processed by workers which are separated or buffered in some way from those orders. Or consider something like sign-up notifications. Maybe you’ve noticed that when you sign in to your Gmail account, you get updates in a bunch of email accounts which are connected to that Gmail account as backups. What’s happening here is that there is a service which accepts the user sign-ups and then distributes notifications to a set of downstream devices or accounts which wish to hear about those sign-ups. Another use case can be the refreshing of distributed caches.

So let’s say an application pushes invalidation events. These events update the IDs of the objects that have changed in any of the distributed caches. Another use case might be logging to multiple systems. Let’s say you have a Google Compute Engine instance which is writing logs to a monitoring system like Stackdriver, to a database, to a data warehouse like BigQuery, and so on. Streaming data from a large sensor network is another prototypical use case for Pub/Sub. You might have sensors which stream data about temperature or fuel or whatever else it is. These readings are processed and then analyzed by backend servers which are hosted in the cloud.

Let’s now trace a message through its entire life cycle; that will give us a good idea of the architecture of Pub/Sub. It all begins when a publisher creates a topic in the Pub/Sub service and sends a message to it. The message body is called the payload. The message could also include attributes that have been set by the publisher. Those messages will then be persisted in a message store until they are delivered to the subscribers and acknowledgments are received. This happens via a subscription. A subscription is what receives messages. So once the Pub/Sub service has persisted a message, it will forward that message from the topic to all of its subscriptions individually. Recall that there are two types of subscriptions, push and pull. In a push subscription, Pub/Sub will send the message to the subscriber.

In a pull subscription, the subscriber will fetch the data by accessing an HTTPS endpoint. One way or the other, the message finds its way to the subscriber. The subscriber gets the pending messages from the subscription and then acknowledges each one to the Pub/Sub service. Once a message has been acknowledged by the subscriber, it will be removed from that subscription’s message queue. This architecture revolves around a couple of planes. There is a data plane, which involves moving messages between publishers and subscribers, and a control plane, which involves the assignment of publishers and subscribers to different servers on the data plane. The servers in the data plane are called forwarders, and those in the control plane are called routers.
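The life cycle traced above can be sketched as a small in-memory model. This is only a toy written for illustration, not how the Pub/Sub service is implemented: the class names Topic, Subscription, and Message, and the topic and subscription names in the demo, are made up here. It shows three ideas from the discussion: one queue per subscription, fan-out from a topic to every subscription, and removal from a queue only on acknowledgment.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Message:
    message_id: str
    data: bytes                      # the payload
    attributes: dict = field(default_factory=dict)


class Subscription:
    """Toy subscription: one queue, messages stay until acknowledged."""

    def __init__(self, name):
        self.name = name
        self._pending = {}           # message_id -> Message

    def _deliver(self, message):
        self._pending[message.message_id] = message

    def pull(self, max_messages=10):
        # Pulling removes nothing: an unacknowledged message is handed
        # out again on the next pull (at-least-once delivery).
        return list(self._pending.values())[:max_messages]

    def acknowledge(self, message_id):
        # Only an acknowledgment deletes the message from this queue.
        self._pending.pop(message_id, None)


class Topic:
    """Toy topic: forwards each published message to every subscription."""

    def __init__(self, name):
        self.name = name
        self._subscriptions = []
        self._ids = count(1)

    def subscribe(self, name):
        subscription = Subscription(name)
        self._subscriptions.append(subscription)
        return subscription

    def publish(self, data, **attributes):
        message = Message(str(next(self._ids)), data, attributes)
        for subscription in self._subscriptions:
            subscription._deliver(message)
        return message.message_id


if __name__ == "__main__":
    topic = Topic("orders")                      # hypothetical names
    billing = topic.subscribe("billing")
    shipping = topic.subscribe("shipping")
    message_id = topic.publish(b"order 42", source="web")

    print(len(billing.pull()))    # the message is pending for billing
    billing.acknowledge(message_id)
    print(len(billing.pull()))    # gone from billing's queue
    print(len(shipping.pull()))   # still pending for shipping
```

Acknowledging on one subscription leaves the other subscription’s queue untouched, which is the sense in which each subscription has its own queue.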

A publisher is any application that can make HTTPS requests to googleapis.com. This includes App Engine apps, Compute Engine instances, apps running on any third-party network, in fact any mobile or desktop app, or even a browser. Subscribers can be either push or pull subscribers. Pull subscribers are exactly like publishers: any app that can make HTTPS requests to googleapis.com. Push subscribers are a little different. These must be webhook endpoints which can accept POST requests over HTTPS. The POST requests are required because, after all, the data is going to be sent to the subscribers by the Pub/Sub service. Let’s now tie this conversation about Pub/Sub back to our discussion of a stream-first architecture.

We had discussed how in traditional systems, which rely on files or databases for batch processing, there is a clear source of truth. The data is whatever data resides in the file or in the database. There is very little ambiguity. But as soon as we transition to a stream-first architecture, the data might come in from a stream or from a traditional system. It will then need to be aggregated and decoupled from the senders and the receivers using some kind of middleware. That middleware will in turn pipe it into a stream processing application. And that middleware is going to collect all of the data from all of these disparate sources into a message transport layer.

That message transport, or that queue, has a bunch of important properties. And indeed, Pub/Sub is exactly this. Pub/Sub plays the role of message transport. It lines up all of the events, all of the data items, nicely and sends them into a data processing layer. This is the stream processing application, which could be in Dataflow or BigQuery or anything else. In this use case, the stream is the source of truth because, as we saw, data is persisted in the stream until it has been acknowledged by the receivers, or the subscribers. To answer the question that we had posed at the start of this video: when a data set is unbounded, the source of truth is usually just the stream, or the message transport layer which buffers the senders from the receivers.

2. Lab: Working With Pub/Sub On The Command Line

At the end of this lecture, you should know the answer to this question. Let’s say you set up a brand new subscriber for a topic. Where can this subscriber access all the old messages which were published on that topic? In this lecture, we’ll work with Google’s Pub/Sub, a fully managed messaging service that allows you to send and receive messages in near real time. To get started, click on Pub/Sub in your side navigation bar. Pub/Sub will show you that there are topics and subscriptions. Let’s say you start off by clicking on Topics. It will show you all topics that exist in your current project. If you’re setting up Pub/Sub for the very first time, this should be completely empty.

You can click on Subscriptions to see if you have any subscriptions. Again, this is empty. Let’s switch over to Cloud Shell and use the gcloud command-line tool to set up our first topic and publish messages to it. Before we start working on Pub/Sub, run gcloud init. This will reinitialize the gcloud configuration in your Cloud Shell instance. We want an up-to-date setup for all the demos from here on in, so let’s ensure we get it by running init. Init will ask you a bunch of questions. The responses to them should be pretty straightforward. Yes, we want to re-initialize this configuration with new settings. I want to continue using the account that I’ve been using so far.

I want to continue using my test project; that is option two here in this list. I chose yes to configure Google Compute Engine. I want us-central1-a to be my project default zone; that is option 18. Then I let the initialization process run through. Let’s install the beta components as well with gcloud components install beta. Here they are all up to date. The first order of business is to create a topic to which we send messages. Use the gcloud beta pubsub topics create command for this. The name of the topic is My Topic. Think of topics in Pub/Sub as the pipe that carries the messages from publisher to subscriber. Any publisher will publish messages to a topic.

Using the command line, we can say gcloud beta pubsub topics publish, then specify the name of the topic, My Topic, and the message, which is hello. A message has been published to My Topic, but has anyone received it? Subscribers receive messages on individual topics. Let’s set up a subscriber for My Topic with gcloud beta pubsub subscriptions create, passing --topic with the name of the topic, which is My Topic here. The name of our subscription is My Subscription. Now that we have a subscriber, let’s see if we can receive the hello message that was just sent to My Topic. On the command line, you can invoke gcloud beta pubsub subscriptions pull and specify auto-acknowledgment.
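Since the commands above are only spoken, here they are written out. This small Python sketch just prints the four command lines so they can be copied into Cloud Shell; it does not call gcloud itself. The exact spellings my-topic and my-subscription are assumptions (the transcript only gives the spoken names), and the --message and --auto-ack flags are the forms used by recent gcloud releases, which may differ slightly from the beta release used in the recording.

```python
import shlex

# Assumed spellings of the names spoken in the lab.
TOPIC = "my-topic"
SUBSCRIPTION = "my-subscription"

# The four steps of the lab, in the order they were run.
LAB_COMMANDS = [
    # 1. Create the topic.
    ["gcloud", "beta", "pubsub", "topics", "create", TOPIC],
    # 2. Publish a message to it (no subscription exists yet).
    ["gcloud", "beta", "pubsub", "topics", "publish", TOPIC,
     "--message", "hello"],
    # 3. Create a subscription attached to the topic.
    ["gcloud", "beta", "pubsub", "subscriptions", "create", SUBSCRIPTION,
     "--topic", TOPIC],
    # 4. Pull pending messages and acknowledge them automatically.
    ["gcloud", "beta", "pubsub", "subscriptions", "pull", SUBSCRIPTION,
     "--auto-ack"],
]


def render(commands):
    """Turn each argument list into a shell-safe command line."""
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    for line in render(LAB_COMMANDS):
        print(line)
```

In current gcloud releases the same command group is also available without the beta prefix, as gcloud pubsub.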

Pub/Sub follows at-least-once delivery semantics, which means every subscription will receive each message at least once. Subscribers need to either acknowledge a message or simply let it drop, in which case Pub/Sub will repeatedly try to deliver it again. When you execute this command, which pulls messages for My Subscription, you’ll find that no messages were received by this subscriber. A subscriber to a topic in Pub/Sub cannot access messages that were sent before it subscribed to that topic. Now that we have a subscriber, let’s go ahead and publish another message to My Topic.

This time we’ll say, hello, how are you? When My Subscription, which is a subscription to My Topic, makes a pull request for messages, it will receive this message. It receives the data, which is the content of the message, the message ID, and any attributes that were associated with that message. Our message had no attributes, which is why that field is empty. Let’s go ahead and publish a whole bunch of messages on My Topic. We’ll publish three separate messages: hello one, hello two, and hello three. Pub/Sub does not guarantee that these messages will be received in order by its subscribers. For example, let’s say My Subscription now pulls messages. Notice that it first receives hello two. This is not predictable. It could have received hello one or hello three first as well.

When the subscriber requests another message, it receives hello three, and finally it receives hello one. This order is arbitrary. It does not depend on the order in which the messages were posted to the topic. Pub/Sub is a highly available and scalable messaging service. The trade-off for this is that it does not provide strict ordering of messages. The order in which messages are received by subscribers is not guaranteed. And as for the question that we asked at the very beginning of this lecture: if messages have been posted to a topic before a subscriber has subscribed to that topic, they cannot be accessed by that subscriber. There is no place in Pub/Sub that a subscriber can go to in order to get old messages. Messages posted to a topic with no subscribers are lost forever.
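The answer to the lab question can be illustrated with a very small toy model, written here only for illustration. It assumes the behavior seen in the lab: a message is copied only into the queues of subscriptions that exist at the moment of publishing, and nothing is kept on the topic itself. The class and method names are invented for this sketch, and it does not try to model the arbitrary delivery order.

```python
class Topic:
    """Toy topic: messages go only to subscriptions that already exist."""

    def __init__(self):
        self.queues = {}   # subscription name -> list of pending messages

    def create_subscription(self, name):
        # A new subscription always starts with an empty queue.
        self.queues[name] = []

    def publish(self, data):
        # With no subscriptions, this loop does nothing and the
        # message is simply not stored anywhere.
        for queue in self.queues.values():
            queue.append(data)

    def pull(self, name):
        # Return and clear everything pending, like a pull with auto-ack.
        messages, self.queues[name] = self.queues[name], []
        return messages


if __name__ == "__main__":
    topic = Topic()
    topic.publish("hello")                      # published before subscribing
    topic.create_subscription("my-subscription")
    print(topic.pull("my-subscription"))        # the early message is not here
    topic.publish("hello, how are you?")
    print(topic.pull("my-subscription"))        # the later message is
```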

3. Lab: Working With Pub/Sub Using The Web Console

Let’s say that you’re a publisher publishing messages to a particular topic. You want to send some additional details with each message, such as what the source of that message is, or maybe some additional timestamp information. How would you do this using Pub/Sub? In this demo, we’ll see how we can work with Pub/Sub using the web console as well as the command line; we’ll mix and match these. On the Pub/Sub page in Google Cloud’s web console, you can click on Create a topic to create a new topic to which you want to publish messages. Let’s choose a topic name here. Let’s say greetings. You want to send a whole bunch of greetings to this topic. Click on Create and you’ll see a new topic on the Topics dashboard.

If you hit Refresh, you’ll see the old topic that we created there as well. I hadn’t deleted My Topic, so it is still here. At a quick glance, you can see how many subscriptions each topic has. Greetings is a new topic, and it has zero subscriptions. My Topic has one subscription, the My Subscription that we set up earlier. If you notice, there are three dots next to each topic. This is a menu that allows you to perform a bunch of actions on that topic. If you click on it, you can see that you can create a new subscription for the topic, publish a message, delete that topic, or manage permissions for that topic. I’m going to choose New subscription, and that will take me to the Subscriptions dashboard.

On this page, I can specify a new subscription for the Greetings topic. I’m going to call my subscription Greetings sub. I then need to make a choice as to how my subscription will receive messages that are published to the Greetings topic. In the pull delivery method, which is what we’ve seen so far, the subscriber makes requests to Pub/Sub in order to receive messages on a topic. In the push delivery method, Pub/Sub makes HTTPS requests to an endpoint URL that you specify. At that endpoint URL, you’ll have a function which receives the message, processes it further, and acknowledges it. The acknowledgment deadline is a per-subscription deadline by which a subscriber must acknowledge a message once it has received it.
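As a rough sketch of the function that sits behind a push endpoint, the following handler takes the body of the POST request that Pub/Sub sends and returns the HTTP status to reply with. It assumes the JSON envelope used for push delivery, with a base64-encoded data field and an attributes map inside a message object. The function name handle_push, the process callback, and the sample project name are all made up for this example, and the wiring into an actual web server is left out.

```python
import base64
import json


def handle_push(body, process):
    """Return the HTTP status a push endpoint would send back.

    A success status (204 here) acknowledges the message. Any error
    status leaves it unacknowledged, so Pub/Sub will deliver it again.
    """
    try:
        envelope = json.loads(body)
        message = envelope["message"]
        data = base64.b64decode(message.get("data", ""))
        attributes = message.get("attributes", {})
    except (ValueError, KeyError, TypeError, AttributeError):
        return 400   # the body was not a well-formed push envelope
    try:
        process(data, attributes)
    except Exception:
        return 500   # processing failed, so do not acknowledge
    return 204


if __name__ == "__main__":
    # A hand-built sample request body; the values are illustrative only.
    sample = json.dumps({
        "message": {
            "data": base64.b64encode(b"Good morning").decode("ascii"),
            "attributes": {"source": "June"},
            "messageId": "1",
        },
        "subscription": "projects/my-project/subscriptions/greetings-sub",
    }).encode("utf-8")

    status = handle_push(sample, lambda data, attrs: print(data, attrs))
    print(status)
```

Keeping the handler as a plain function of the request body makes it easy to plug into whichever web framework serves the endpoint.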

If a subscriber hasn’t acknowledged a message within the acknowledgment deadline, Pub/Sub will treat the message as undelivered and will keep retrying its delivery. The retries stop once the message has been acknowledged or its retention period has run out. Go ahead and create the subscription, and when you go to the Topics page, you’ll see that Greetings has a new subscription. Its subscription count is now one. Use the menu to publish a new message to the Greetings topic. This will open up a nice web UI where you can type in your message. If you need to send additional information to your subscriber along with the message, you can use the attributes to specify key-value pairs that travel along with the message. For example, different publishers might want to identify the source of their messages.
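What the web UI does here can also be expressed as a plain HTTPS request. The sketch below only builds the URL and JSON body for the Pub/Sub REST publish call, with the payload base64-encoded and the attributes sent as string key-value pairs; it does not send anything, and authentication is left out entirely. The function name and the project ID my-project are assumptions made for the example.

```python
import base64
import json


def build_publish_request(project, topic, data, attributes=None):
    """Build the URL and JSON body for publishing one message."""
    url = (
        "https://pubsub.googleapis.com/v1/"
        f"projects/{project}/topics/{topic}:publish"
    )
    # The payload travels as base64 text inside the JSON body.
    message = {"data": base64.b64encode(data).decode("ascii")}
    if attributes:
        # Attributes are string keys mapped to string values.
        message["attributes"] = dict(attributes)
    return url, json.dumps({"messages": [message]})


if __name__ == "__main__":
    url, body = build_publish_request(
        "my-project",              # hypothetical project ID
        "greetings",
        b"Good morning",
        {"source": "June"},
    )
    print(url)
    print(body)
```

On the command line, the same key-value pairs can be passed with the --attribute flag of gcloud pubsub topics publish.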

We published a message to our Greetings topic. Let’s receive this message using our subscription, but using the command line rather than the UI. Call subscriptions pull on Greetings sub, and there you see it: the Good Morning message that we published. In the attributes, you can see that the source is June. Let’s say we no longer need this topic. There are no messages that we want to send there. We can use the web console to go ahead and delete this topic. A topic can be deleted even if it has subscriptions. The topic Greetings had one subscription, and if you switch over to the Subscriptions page, you’ll find that Greetings sub now says Deleted Topic.

The UI indicates to you that this subscription is for a deleted topic and may not receive any messages. You can use the three-dot menu to delete subscriptions as well. Just click on Delete and this subscription has been deleted. Let’s clean house and delete the subscription for My Topic too. If you go to the Topics page now, you’ll see that My Topic has zero subscriptions. If you want, you can clean up this topic as well. Just hit Delete. We won’t be using it anymore. And as for the question which we asked at the beginning of this lecture: you can use the Attributes field with every message to send key-value pairs which provide additional detail about your message. For example, you can use that to indicate the source of the message and so on.
