SignalR Core Part 1/3: Design Considerations

Disclaimer: SignalR Core is still in early stages and changing rapidly. All information in this post is subject to change.

A few months ago, we started working on the new version of SignalR that will be part of the ASP.NET Core framework. Originally we just wanted to port the existing code and iterate on it. However, we soon realized that doing so would prevent us from enabling new scenarios we wanted to support. The original version of SignalR was designed around long polling. Back in the day, support for websockets was not as common as it is today: many web browsers did not support it, .NET Framework 4 did not support it, and it was not (and still isn’t) supported natively on Windows 7 and Windows 2008 R2. A JSON based protocol was baked in and could not be replaced, which ruled out using other (e.g. binary) formats. Starting the connection was heavy and complicated: it required sending 3 HTTP requests whose responses had to be correlated with messages sent over the newly created transport (you can find a detailed description of the protocol in “SignalR on the wire – an informal description of the SignalR protocol”, a post I wrote on this very subject). This basically meant that a dedicated client was required to talk to a SignalR server.

In the old design the server was centered around MessageBus: all messages and actions had to go through the message bus. This made the code very complex and error prone, especially in scale-out scenarios where all the servers were required to have the same data. The state (e.g. cursors/message ids, groups tokens etc.) was kept on the client, which would then send it back to the server when needed (e.g. when reconnecting). The need to keep the state up-to-date significantly increased the size of the messages exchanged between the server and the client in most non-trivial scenarios.

In the new version of SignalR we wanted to remove some of the limitations of the old version. First, we decided to no longer use long polling as the model transport. Rather, we started with the premise that a full duplex channel is available. While this might sound a lot like websockets, we think it will be possible to take it further in the future and support other protocols like TCP/IP. Note that this does not mean that the long polling and server sent events transports are going away, only that we would not drag better transports down to the standards of worse transports. For example, websockets supported binary format but long polling (until XMLHttpRequest2) and server sent events didn’t, so in the old version of SignalR there was no support for binary messages. In the new version we’d rather base64 encode messages if needed and let users use what websockets offers. Second, we did not want to bake in any specific protocol or message format. Sure, for hub invocations we will still need to be able to get the name of the hub method and the arguments, but we will no longer care how this is represented on the wire. This opens the way to using custom (including binary) formats. Third, establishing the connection should be lightweight, and connection negotiation can be skipped for persistent duplex transports (like websockets). If a transport is not persistent or uses separate channels for sending and receiving data, connection negotiation is required: it creates a connection id which will be used to identify all the requests sent by a given client. However, if there are no multiple requests because the transport is full duplex and persistent (as in the case of websockets), the connection id is not needed: once the connection is established in the first request, it is used to transfer the data in both directions. In practice, this means that you can connect to a SignalR server without a SignalR client, just by using bare websockets.
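To make the second point a bit more concrete, here is a small Python sketch of what "not caring about the wire format" could look like. It is only an illustration, not SignalR code: the class names, the JSON layout and the toy length-prefixed binary layout are all made up for this example. The hub layer sees only a method name and arguments, and a binary payload is base64 encoded when it has to travel over a text-only transport.

```python
import base64
import json
import struct
from typing import Any, Callable, Dict, List, Tuple

Invocation = Tuple[str, List[Any]]  # (hub method name, arguments)


class JsonHubProtocol:
    """Text format: one JSON object per invocation (hypothetical layout)."""

    def encode(self, method: str, args: List[Any]) -> bytes:
        return json.dumps({"method": method, "args": args}).encode("utf-8")

    def decode(self, payload: bytes) -> Invocation:
        message = json.loads(payload.decode("utf-8"))
        return message["method"], message["args"]


class LengthPrefixedHubProtocol:
    """Toy binary format: 2-byte big-endian name length, name, JSON-encoded args."""

    def encode(self, method: str, args: List[Any]) -> bytes:
        name = method.encode("utf-8")
        return struct.pack(">H", len(name)) + name + json.dumps(args).encode("utf-8")

    def decode(self, payload: bytes) -> Invocation:
        (name_length,) = struct.unpack_from(">H", payload, 0)
        name = payload[2:2 + name_length].decode("utf-8")
        args = json.loads(payload[2 + name_length:].decode("utf-8"))
        return name, args


def to_text_frame(payload: bytes) -> str:
    """Base64 encode a payload so it can cross a text-only transport."""
    return base64.b64encode(payload).decode("ascii")


def from_text_frame(frame: str) -> bytes:
    """Reverse of to_text_frame."""
    return base64.b64decode(frame.encode("ascii"))


def dispatch(protocol: Any, payload: bytes, hub: Dict[str, Callable[..., Any]]) -> Any:
    """The hub layer only needs the method name and arguments, whatever the format."""
    method, args = protocol.decode(payload)
    return hub[method](*args)
```

Swapping `JsonHubProtocol` for `LengthPrefixedHubProtocol` changes the bytes on the wire but not the call to `dispatch`, which is the property the paragraph above is after.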

There are also a few things that we decided not to support in the new SignalR. One of the biggest ones was the ability to re-establish a connection automatically if the client loses its connection to the server. While it may not be obvious, the reconnect feature has a huge impact on the design, complexity and performance of SignalR. Looking at what happens during reconnect should make it clear. When a client loses a connection, it tries to re-establish it by sending the reconnect request to the server. The reconnect request contains the id of the last message the client received and the groups token containing the information about the groups the client belongs to. This means that the server needs to send the message id with each message so the client can tell the server which message it received last. The more topics the client is subscribed to, the bigger the message id gets, up to the point where the message id is much bigger than the actual message.

Now, when the server receives a reconnect request it reads the message id and tries to resend all the messages the client missed. To be able to do that the server needs to keep track of all messages sent to each client and buffer at least some recent messages so that it can resend them when needed. Indeed, the server has a buffer per connection which it uses to store recent messages. The default size of that buffer is 1000 messages which creates a lot of memory pressure. The size of the buffer can be configured to make it smaller but this will increase the probability of losing messages when a reconnect happens.
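The buffering and replay described above can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not the actual implementation: message ids are plain integers here (the real message id was a more complex cursor), and `ConnectionBuffer` and `replay_after` are names invented for the example.

```python
from collections import deque
from typing import Deque, List, Tuple


class ConnectionBuffer:
    """Per-connection buffer of recent messages (simplified model)."""

    def __init__(self, capacity: int = 1000) -> None:
        # A deque with maxlen silently drops the oldest entry when it is full.
        self._messages: Deque[Tuple[int, str]] = deque(maxlen=capacity)
        self._next_id = 1

    def send(self, payload: str) -> int:
        """Record a message sent to the client and return its id."""
        message_id = self._next_id
        self._next_id += 1
        self._messages.append((message_id, payload))
        return message_id

    def replay_after(self, last_seen_id: int) -> Tuple[List[str], int]:
        """Return (payloads newer than last_seen_id that are still buffered,
        number of missed messages that already fell out of the buffer)."""
        missed = [payload for message_id, payload in self._messages
                  if message_id > last_seen_id]
        expected = max(0, (self._next_id - 1) - last_seen_id)
        return missed, expected - len(missed)
```

With a capacity of 3 and five messages sent, a client that received nothing gets only the last three back and two are gone for good. That is the trade-off behind the buffer size setting: a smaller buffer uses less memory per connection but loses messages sooner.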

The groups token has similar issues – the more groups the client belongs to the bigger the token gets. It needs to be sent to the client each time the client joins or leaves a group so the client can send it back in case of reconnects to re-establish group membership. The size of the token limits the number of groups a client can belong to – if the groups token gets too big the reconnect attempt will fail due to the URL being bigger than the limit.
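A rough Python illustration of why the token size caps the number of groups follows. Everything here is an assumption made for the example: the real token is encrypted rather than just base64 encoded, the 2048 character limit is only a placeholder (actual limits depend on the server and browser), and the URL shape is simplified.

```python
import base64
import json
from typing import List
from urllib.parse import quote

ASSUMED_URL_LIMIT = 2048  # placeholder; real limits vary by server and browser


def make_groups_token(groups: List[str]) -> str:
    """Stand-in for the groups token: base64 of the JSON list of group names."""
    raw = json.dumps(sorted(groups)).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def reconnect_url(base_url: str, groups: List[str]) -> str:
    """Build a simplified reconnect URL carrying the token in the query string."""
    return f"{base_url}/reconnect?groupsToken={quote(make_groups_token(groups))}"


def fits_in_url(groups: List[str], base_url: str = "https://example.test/signalr") -> bool:
    """True if the reconnect URL stays under the assumed length limit."""
    return len(reconnect_url(base_url, groups)) <= ASSUMED_URL_LIMIT
```

A handful of groups fits easily, while a few hundred groups with short names already push the URL past the assumed limit.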

While auto-reconnect will no longer be supported in SignalR, users can build their own solution to this problem. Even today people try restarting their connection if it was closed by adding a handler to the Closed event in which they start a new connection. It can be done in a similar fashion in SignalR Core. It’s true that the client will no longer receive messages it missed, but this could happen even in the old SignalR: if the number of messages the client missed was greater than the size of the message buffer, the newest messages would overwrite the oldest messages (the message buffer is a ring buffer), so the client would never receive the oldest messages.
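Here is a sketch of the "restart in the Closed handler" approach, written in Python against a fake connection object. `FakeConnection`, `on_closed` and `restart_on_close` are hypothetical names for this illustration and do not correspond to any SignalR client API. The retry count and the exponential backoff are my own additions.

```python
import time
from typing import Callable, List


class FakeConnection:
    """Stand-in for a client connection; not a real SignalR client."""

    def __init__(self) -> None:
        self.connected = False
        self.start_attempts = 0
        self.failures_left = 0  # how many upcoming start() calls should fail
        self._closed_handlers: List[Callable[[], None]] = []

    def on_closed(self, handler: Callable[[], None]) -> None:
        self._closed_handlers.append(handler)

    def start(self) -> None:
        self.start_attempts += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("server unreachable")
        self.connected = True

    def drop(self) -> None:
        """Simulate the transport closing unexpectedly."""
        self.connected = False
        for handler in list(self._closed_handlers):
            handler()


def restart_on_close(connection: FakeConnection, max_attempts: int = 5,
                     base_delay: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Register a closed handler that starts a brand new connection with backoff."""

    def handler() -> None:
        for attempt in range(max_attempts):
            try:
                connection.start()
                # Missed messages are not replayed; the app must refresh its own state.
                return
            except ConnectionError:
                sleep(base_delay * (2 ** attempt))

    connection.on_closed(handler)
```

After a successful restart the application has to fetch whatever state it needs again, since nothing it missed while disconnected will be resent.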

Another scenario we decided not to support in the new version of SignalR was allowing clients to jump servers (a multi-server scenario). Before, the client could connect to any server and then reconnect or send data to any other server in the farm. This required that all servers had all the data to be able to handle requests from any client. The way it was implemented was that when a server received a message it would publish it to all the other SignalR servers via MessageBus. This resulted in a huge number of messages being sent between SignalR servers.

(Side note. Interestingly, the scenario of reconnecting to a different server than the one the client was originally connected to often did not work correctly due to server misconfiguration. The connection token and groups token are encrypted by the server before sending them to the client. When the server receives the connection token and/or groups token it needs to be able to decrypt it. If it cannot, it rejects the request with the 400 (Bad Request) error. The server uses the machine key to encrypt/decrypt the data, so all machines in the farm must have the same machine key, or the connection token (which is included in each request) encrypted on one server can’t be decrypted on another server and the request fails. What I have seen several times was that servers in the farm had different machine keys, so reconnecting to a different server did not actually work.)

In the new SignalR the idea is that the client sticks to only one server. In multi-server scenarios there is a client-to-server map stored externally which tells which client is connected to which server. When a server needs to send a message to a client, it no longer needs to send the message to all other servers just because the client might be connected to one of them. Rather, it checks which server the client is connected to and sends the message only to that server, so the traffic among SignalR servers is greatly reduced.
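The routing idea can be modeled in Python roughly as follows. This is an in-memory toy, not how SignalR implements it: `ConnectionDirectory` stands in for the externally stored client-to-server map, and `Server`, `farm` and `forwarded` are names made up for the example.

```python
from typing import Dict, List, Optional


class ConnectionDirectory:
    """Stand-in for the external client-to-server map."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def register(self, client_id: str, server_name: str) -> None:
        self._owner[client_id] = server_name

    def unregister(self, client_id: str) -> None:
        self._owner.pop(client_id, None)

    def lookup(self, client_id: str) -> Optional[str]:
        return self._owner.get(client_id)


class Server:
    def __init__(self, name: str, directory: ConnectionDirectory,
                 farm: Dict[str, "Server"]) -> None:
        self.name = name
        self.directory = directory
        self.farm = farm
        self.inboxes: Dict[str, List[str]] = {}  # messages delivered per local client
        self.forwarded = 0  # messages received from other servers
        farm[name] = self

    def connect(self, client_id: str) -> None:
        self.inboxes[client_id] = []
        self.directory.register(client_id, self.name)

    def send_to_client(self, client_id: str, message: str) -> bool:
        """Deliver locally or forward to the single server that owns the client."""
        owner = self.directory.lookup(client_id)
        if owner is None:
            return False  # client is not connected anywhere
        if owner == self.name:
            self.inboxes[client_id].append(message)
        else:
            self.farm[owner].receive_forwarded(client_id, message)
        return True

    def receive_forwarded(self, client_id: str, message: str) -> None:
        self.forwarded += 1
        if client_id in self.inboxes:
            self.inboxes[client_id].append(message)
```

In a farm of three servers, sending to a client connected to the second server produces exactly one server-to-server message, and the third server never sees it.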

The last change I want to talk about, somewhat related to the previous topic, is removing the built-in scale-out support. In the previous version of SignalR there was one official way of scaling out SignalR servers: a scale-out provider would subclass the ScaleoutMessageBus and leave all the heavy lifting to SignalR. It sounds good in theory, but over time it became apparent that with regard to scale-out there is no “one size fits all” solution. Scaling out an application turned out to be very specific to the application’s goals and design. As a result, many users had to implement their own solution to scaling out their applications yet still paid the cost of the built-in scale-out (even when using just one server there is an in-memory message bus all messages go through). While scale-out support is no longer built in, the project contains a Redis based scale-out solution that can be used as-is or as guidance for creating a custom scale-out solution.

These are, I think, the biggest design/architecture decisions we have made. I believe that they will allow us to make SignalR simpler, more reliable and more performant.

9 thoughts on “SignalR Core Part 1/3: Design Considerations”

  1. […] test some of the scenarios described in the first part of this mini-series I came up with an idea for a relatively simple application where users can report to the server […]

  2. […] SignalR Core Part 1/3: Design Considerations […]

  3. Adam Shortland says:

    I’ve been reading a lot and it seems the trade-offs made to improve compatibility are not really worth it. SignalR was great because of its simplicity to implement. The examples available now are significantly more complex and don’t seem to be any more functional than what we had in previous versions. It makes me question why I would even bother using SignalR instead of just managing all the connections myself and using sockets or websockets the way we did before SignalR. This seems like a giant leap backwards.

  4. Aaron says:

    Awesome

    Didn’t read anything about dropping jQuery dependency. Can we assume it’s been dropped?

    • moozzyk says:

      The jQuery dependency was just an implementation detail (albeit a heavy one) so I did not mention it in the post, but yes, the new client no longer depends on jQuery or other 3rd party libraries.

      Thanks,
      Pawel

  5. Erik Taylor says:

    The overhead for auto-reconnect is there, I’ve had to do a similar thing when implementing push-notifications in WCF on duplex channels. Buuut, it’s really great to have that, because in the other scenario, you will definitely lose messages more often than not and what’s worse on the client you’ll have no way to recover the lost messages and without a message id of some kind you won’t know if you did or didn’t miss anything, thus must assume you did and will bear the cost of a full reload to restore state. Is it absolutely necessary to have a buffer per client? Would it be possible to share a large ring buffer and have each client index it, then use a head index to delta the lost messages? Having to do more work on the client for a less robust solution doesn’t sound like a win. Maybe an opt in on the global message buffer or some such to allow recovery in situations that warrant it? Me wants me cake and eats it too.

    • moozzyk says:

      Whether losing messages is important or not depends on the scenario. For instance, if you broadcast some kind of status every second to a client, losing 10 messages might not be a problem at all. Your client will have stale status for a while and then will be updated when you reconnect. If you want to have guaranteed delivery, even now you would have to number messages and implement resending messages that have been missed, because messages can be lost not only when the client cannot re-connect but also when the client is re-connecting.
      I am not sure if the idea of a common buffer could work, or how complicated it would get. Note that typically clients receive different sets of messages: clients can belong to different groups and should not receive (nor track) messages sent to groups they don’t belong to. Clients can also receive direct messages, i.e. messages sent only to a specific client and not to anyone else. So using a common buffer would invert the problem: instead of tracking messages per client you would have to track clients per message (i.e. each message would have a list of clients who are/were supposed to receive it). Maybe it could save some memory but I don’t know if the difference would be that dramatic.
      Opt-in sounds good in theory but (as we all learned) built-in reconnect with message replay affects the design of the entire server while not really solving the problem.

      Thanks,
      Pawel

      • Erik Taylor says:

        I agree it is scenario dependent. Most often I am passing state modifications/changes instead of passing the entire state, which is what I think of when I think of status messages: the entire state of the status is contained in each message. Pushing changes to clients that are database records with timestamp concurrency tokens is a good example where I see issues with lost messages. I have a set of records, a user can edit them, but if there has been a change that is lost, then I incur the costs of sending the update, then when the update fails, returning the new record, prompting the user regarding the concurrency error and then making awkward decisions about what to do next. Overwrite the current user’s changes locally with the updated record from the database and then have them reinput their changes, just bring in the updated concurrency token and overwrite with the local changes for the other fields, or perhaps I do some kind of merge where I compare field by field which are different and then, where there is a conflict, prefer the local changes or give them an interface to choose the merge. All of these choices are not great, for various reasons; the best thing is to have the list stay up to date by pushing the changes in as they are made. I still run the risk of having a record pushed in while a user is editing it, but this collision has been really rare, so forcing a reinput isn’t much of a cost. If the list integrity is slowly slipping because of intermittent connection issues, I won’t know which items are impacted and end up having to poll the whole list occasionally anyway, which seems to devalue the push notifications altogether. I don’t see many other options when I’m trying to keep the data as fresh as possible and it doesn’t change at regular, short intervals. If all of the data was changing at regular intervals, then I could be reasonably sure, like in your status example, that soon enough a fresh push would update the stale data, probably before I run into too many of the above issues.

        In my old duplex channel implementation, I kept a buffer of messages per client, but only added to it when a disconnect was detected, which was reasonable since I could determine disconnects server side, upper bounds for concurrent users and message size so the memory usage only grew relative to the number of disconnected users, not connected ones. When I switched to SignalR, I was not aware of the implementation details, just the intent of the auto reconnect and push of missed messages. I don’t understand your assertion that even now I would have to number messages and implement resending, when the current implementation handles that with its buffer per client model and auto reconnect.

        I’m not sure the design considerations about group tokens and message id are very compelling reasons to drop the whole feature.

        With the groups token, since the server is already maintaining state, couldn’t the reconnect just ‘page’ the groups token to get around the URL size limit: here are the first 20 groups I’m a part of and there are two more pages of groups I will send; when you get the last page, go ahead and do normal things after a reconnect.

        With the messageIds growing with each topic, this seems like transmitting additional irrelevant message state for other topics. What’s the rationale for doing so? Isn’t the client keeping track of the last message id received for each topic anyway? Maybe there’s something I’m missing here, but when I look at the GitHub issues linked in the post above, I see the chain of growing cursors with loads of repetitive data that could be stored by the client per topic and only updated when a specific topic message gets pushed. This isn’t a purely stateless system, each side is retaining information about what has transpired, so I don’t understand the need to be shuttling data about all the topics/groups back and forth in a single message. Why doesn’t a message come with a group/topic/message token, which uniquely maps internally to the same hierarchy in the client, so that when a reconnect is needed, you just go over that hierarchy composing last-successful-receive messages to send up to the server (for group 1, topic 4, last message was 35)? On the server a similar hierarchy and buffer per group or topic would allow it to quickly determine where a client was and catch them up.

        I understand that not having any buffers or any history uses less memory, but with structured indexes and shared buffers, even if a pointer per client per topic was necessary, it seems like you’d get around the 1000 message buffer per client issue with a memory growth that was more dependent on group/topic/message density than on N, the number of clients.

        The case of clients per message sounds like a situation where you’d just need to have the option for a bit of extra data on a message in the buffer at the server to handle when it’s not all the subscribed clients. When reconstructing after a reconnect, you’d look at that and, when it’s not all clients, check the list of client ids for the reconnecting client id to determine if the send is warranted. A bit of overhead, but not for every message, just those that are specific to certain clients. For users of the library, they could make a decision about whether it was more advantageous to determine ahead of time if the partitioning of messages would be better served by groups and topics, such that the list of clients per message is completely determined by the subscription of the group/topic. I think it would be the edgier case where a set of users are subscribed to the same groups/topics but all need to receive different subsets of the messages from those groups/topics and the groups/topics couldn’t just be repartitioned to represent the implied subsets more concretely.

        Chewing up a bit more cloud memory seems like a smaller price to pay than having an unreliable reconnect scheme that in many scenarios forces entire dataset reloads and at least makes the freshness of the data always at question whenever a disconnect is detected.

        Also, I apologize for this extremely long comment 🙂
