Disclaimer: SignalR Core is still in early stages and changing rapidly. All information in this post is subject to change.
A few months ago, we started working on the new version of SignalR that will be part of the ASP.Net Core framework. Originally we just wanted to port existing code and iterate on it. However, we soon realized that doing so would prevent us from enabling new scenarios we wanted to support. The original version of SignalR was designed around long polling (note that back in the day support for websockets was not as common as it is today – it was not supported by many web browsers, it was not supported in .NET Framework 4, it was not (and still isn’t) supported natively on Windows 7 and Windows 2008 R2). A JSON based protocol was baked in and could not be replaced which blocked a possibility of using other (e.g. binary) formats. Starting the connection was heavy and complicated – it required sending 3 HTTP requests whose responses had to be correlated with messages sent over the newly created transport (you can find a detailed description of the protocol in SignalR on the wire – an informal description of the SignalR protocol – a post I wrote on this very subject). This basically meant that a dedicated client was required to talk to a SignalR server. In the old design the server was centered around MessageBus – all messages and actions had to go through the message bus. This made the code very complex and error prone especially in scale-out scenarios where all the servers were required to have the same data. The state (e.g. cursors/message ids, groups tokens etc.) was kept on the client which would then send it back to the server when needed (e.g. when reconnecting). The need of keeping the state up-to-date significantly increased the size of the messages exchanged between the server and the client in most of the non-trivial scenarios.
In the new version of SignalR we wanted to remove some of the limitations of the old version. First, we decided to no longer use long polling as the model transport. Rather, we started with a premise that a full duplex channel is available. While this might sound a lot of like websockets we are thinking that it will be possible to take it further in the future and support other protocols like TCP/IP. Note, it does not mean that the long polling and server sent events transports are going away. Only, that we would not drag better transports down to the standards of worse transports (e.g. websockets supported binary format but long polling (until XmlHttpRequest2) and server sent events didn’t so in the old version of SignalR there was no support for binary messages. In the new version we’d rather base64 encode messages if needed and let users use what websockets offers). Second, we did not want to bake in any specific protocol or message format. Sure, for hub invocations we will still need to be able to get the name of the hub method and the arguments but we will no longer care how this is represented on the wire. This opens the way to using custom (including binary) formats. Third, establishing the connection should be lightweight and connection negotiation can be skipped for persistent duplex transports (like websockets). If a transport is not persistent or uses separate channels for sending and receiving data connection negotiation is required – it creates a connection id which will be used to identify all the requests sent by a given client. However, if there are no multiple requests because the transport is full duplex and persistent (like in the case of websockets) the connection id is not needed – once the connection is established in the first request it is used to transfer the data in both directions. In practice, this means that you can connect to a SignalR server without a SignalR client – just by using bare websockets.
There are also a few things that we decided not to support in the new SignalR. One of the biggest ones was the ability re-establish a connection automatically if the client loses a connection to the server. While it may not be obvious, the reconnect feature has a huge impact on the design, complexity and performance of SignalR. Looking at what happens during reconnect should make it clear. When a client loses a connection, it tries to re-establish it by sending the reconnect request to the server. The reconnect request contains the id of the last message the client received and the groups token containing the information about groups the client belongs to. This means that the server needs to send the message id with each message so the client can tell the server what was the last message it received. The more topics the client is subscribed to the bigger the message id gets up to the point where the message id is much bigger than the actual message.
Now, when the server receives a reconnect request it reads the message id and tries to resend all the messages the client missed. To be able to do that the server needs to keep track of all messages sent to each client and buffer at least some recent messages so that it can resend them when needed. Indeed, the server has a buffer per connection which it uses to store recent messages. The default size of that buffer is 1000 messages which creates a lot of memory pressure. The size of the buffer can be configured to make it smaller but this will increase the probability of losing messages when a reconnect happens.
The groups token has similar issues – the more groups the client belongs to the bigger the token gets. It needs to be sent to the client each time the client joins or leaves a group so the client can send it back in case of reconnects to re-establish group membership. The size of the token limits the number of groups a client can belong to – if the groups token gets too big the reconnect attempt will fail due to the URL being bigger than the limit.
While auto-reconnect will no longer be supported in SignalR users can build their own solution to this problem. Even today people try restarting their connection if it was closed by adding a handler to the Closed event in which they start a new connection. It can be done in a similar fashion in SignalR core. It’s true that the client will no longer receive messages it missed but this could happen even in the old SignalR – if the number of message the client missed was greater than the size of the message buffer the newest messages would overwrite the oldest messages (message buffer is a ring buffer) so the client would never receive the oldest messages.
Another scenario we decided not to support in the new version of SignalR was allowing clients to jump servers (a multi-server scenarios). Before, the client could connect to any server and then reconnect or send a data to any other server in the farm. This required that all servers had all the data to be able to handle requests from any client. The way it was implemented was that when a server receive a message it would publish it to all the other SignalR servers via MessageBus. This resulted in a huge number of messages being sent between SignalR servers.
(Side note. Interestingly, the scenario of reconnecting to a different server than the one the client was originally connected to often did not work correctly due to server misconfiguration. The connection token and groups token are encrypted by the server before sending them to the client. When the server receives the connection token and/or groups token it needs to be able decrypt it. If it cannot, it rejects the request with the 400 (Bad Request) error. The server uses the machine key to encrypt/decrypt the data so, all machines in the farm must have the same machine key or the connection token (which is included in each request) encrypted on one server can’t be decrypted on another server and the request fails. What I have seen several times was that servers in the farm had different machine keys so, reconnecting to a different server did not actually work.)
In the new SignalR the idea is that the client sticks only to one server. In multi-server scenarios there is a client to server map stored externally which tells which client is connected to which server. When a server needs to send a message to a client it no longer needs to send the message to all other servers because the client might be connected to one of them. Rather, it checks what server the client is connected to and sends the message only to this client thus the traffic among SignalR server is greatly reduced.
The last change I want to talk about, somewhat related to the previous topic, is removing the built-in scale-out support. In the previous version of SignalR there was one official way of scaling out SignalR servers – a scale out provider would subclass the
ScaleoutMessageBus and leave all the heavy lifting to SignalR. It sounds good in theory but with the time it became apparent that with regards to scale-out there is no “one size fits all” solution. Scaling out an applications turned out to be very specific to the application goals and design. As a result, many users had to implement their own solution to scaling out their applications yet still paid the cost of the built-in scale-out (even when using just one server there is an in-memory message bus all messages go through). While, scale-out support is no longer built-in the project contains a Redis based scale-out solution that can be used as-is or as a guidance to create a custom scale-out solution.
These are I think the biggest design/architecture decision we have made. I believe that they will allow to make SignalR simpler, more reliable and performant.