LOAD BALANCING MONGODB


The MongoDB docs recommend running mongos on each application server, which is of course desirable to minimize latency.

This is a great option if you are running a few apps; however, if your app count reaches double digits, I would reconsider and think about more scalable options.

If you are running a MongoDB 3.2 driver, you can rely on the driver's latency window algorithm to pick the most suitable server to route requests to. Depending on how many operations per second you are processing, you may have to scale your mongos vertically. Bear in mind that each connection takes up about 1 MB of memory.
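For example, a connection string that lists several mongos instances and leans on the latency window might look like this (the hostnames here are placeholders, and 15 ms is the driver's default window):

mongodb://mongos1.example.net:27017,mongos2.example.net:27017,mongos3.example.net:27017/?localThresholdMS=15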

The other option is to scale horizontally. As with any computer system, it is cheaper, and often more performant, to double our resources with commodity servers instead of purchasing ever more expensive ones.

This can be achieved with a load balancer that supports TCP load balancing with an IP-hashing profile, so that you can configure session affinity.

If you are running MongoDB in AWS, you cannot use Amazon's load balancer for this as of this posting, since it only supports session affinity with HTTP load balancing.

However, you do have the option of an F5 load balancer AMI from the AWS Marketplace, though it is very expensive in terms of both the instance and the license. You can also opt for an open-source software load balancer such as nginx or HAProxy; just make sure you configure sticky sessions by using IP hashing.

One thing to note here is that whenever you use a load balancer, all traffic terminates at the load balancer and is then proxied to your backend mongos. What does this mean? It means that when you are troubleshooting and need a specific client's IP address from the mongos log, you will only see the load balancer's address.

Unfortunately, with nginx 1.11.x you cannot forward the client address to backend servers when doing strictly layer 4 load balancing. nginx can prepend PROXY protocol information to the upstream connection, but the backend has to understand the PROXY protocol to make use of it, and mongos does not speak it.

If you had a way to modify the TCP packets and inject the original source IP into the header information after the connection has been terminated at the load balancer, you could achieve this. So for now, if you want to use nginx to load balance your cluster, you will have to live without the original source IP in your mongos logs.

With all that being said, nginx is great and can handle massive amounts of traffic; you just have to fine-tune its configuration and the kernel's TCP stack to achieve the best performance.

You know I wouldn't leave without an example of how to achieve this.

Tuned nginx configuration:

user nginx nginx;

worker_rlimit_nofile 64000;   # file descriptor limit per worker, per the MongoDB docs;
                              # should be at least worker_processes * worker_connections (8 * 8000 = 64000)
worker_processes 8;           # should equal the number of processors

events {
    multi_accept on;          # accept multiple connections waiting in the listen queue at once
    use epoll;                # efficient event notification mechanism on Linux
    worker_connections 8000;  # number of concurrent connections per worker
    accept_mutex on;          # workers take turns accepting new connections
}

stream {
    server {
        listen *:27017;       # MongoDB default port
        proxy_pass mongos;
    }

    upstream mongos {
        hash $remote_addr consistent;  # IP hashing for session affinity
        server 192.168.198.12:27016 max_fails=2 fail_timeout=10;  # if a server fails twice within 10 seconds, mark it as down
        server 192.168.198.2:27016  max_fails=2 fail_timeout=10;
        server 192.168.198.11:27016 max_fails=2 fail_timeout=10;
    }
}
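Before putting this in front of traffic, it's worth validating the configuration and confirming nginx is listening on the MongoDB port, for example:

sudo nginx -t               # test the configuration syntax
sudo nginx -s reload        # reload nginx with the new configuration
ss -ltn 'sport = :27017'    # confirm nginx is listening on 27017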

Kernel’s Network Stack Configuration:

You should consider tuning the TCP buffers so you can keep more data in flight on high-speed networks, since the buffer sizes are capped by the operating system. These caps are an upper bound on the amount of memory available to each TCP connection. Why is this important? TCP communicates over a socket interface, and its throughput depends on the product of the round-trip time (RTT) and the link bandwidth, also known as the bandwidth-delay product (BDP). This, of course, measures the amount of data needed to fill the TCP pipe.

When you open a socket, the kernel attaches a receive buffer and a send buffer to it. These buffers must be large enough to carry your TCP segments along with the OS overhead. The BDP determines the buffer space required at the sender and the receiver to obtain maximum TCP throughput.

Your buffer size can be determined by:

buffer size = network capacity * round trip time

Example: if your ping (round-trip) time is 60 milliseconds and you have a 1 Gbit/s Ethernet card, this works out to:

buffer size = 0.06 * (1024 / 8)
            = 7.68 MB
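The same calculation as a quick sanity check in the shell:

echo 'scale=2; .06 * 1024 / 8' | bc
7.68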

To see the overall memory limits the kernel places on TCP:

 $ sysctl net.ipv4.tcp_mem
net.ipv4.tcp_mem = 188319 251092 376638

The three values read: low, pressure, and maximum thresholds (in memory pages, not bytes) for the memory TCP may use across all connections. You shouldn't have to tune these values. Since kernel version 2.6.17, Linux has shipped with an auto-tuning feature that sizes the buffers automatically. You can confirm this by entering the following:

$ sysctl net.ipv4.tcp_moderate_rcvbuf
net.ipv4.tcp_moderate_rcvbuf = 1

A value of 1 means auto-tuning is enabled.

You can also check the send and receive buffers:

sysctl net.ipv4.tcp_rmem

This lists the minimum, default, and maximum sizes of the TCP receive buffers.

sysctl net.ipv4.tcp_wmem

This lists the minimum, default, and maximum sizes of the TCP send buffers.

The following settings cap the receive and send window sizes:

sysctl net.core.rmem_max
net.core.rmem_max = 212992

sysctl net.core.wmem_max
net.core.wmem_max = 212992

Set them to the following:

$ sudo sysctl -w net.core.rmem_max=16777216
net.core.rmem_max = 16777216

$ sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
net.ipv4.tcp_rmem = 4096 87380 16777216

$ sudo sysctl -w net.core.wmem_max=16777216
net.core.wmem_max = 16777216

$ sudo sysctl -w net.ipv4.tcp_wmem='4096 16384 16777216'
net.ipv4.tcp_wmem = 4096 16384 16777216

You want to make these changes because if the maximum receive window is smaller than the sender's window, the sender cannot put more data on the wire than the receiver advertises, ultimately leading to suboptimal performance. 16 MB is a good maximum for these settings.

Another interesting setting is the MSL (Maximum Segment Lifetime). This governs the maximum time a TCP connection sits in the TIME_WAIT state, covering the trip from sender to receiver and back, and it can have a performance impact. The kernel default is 60 seconds, which amounts to 2 minutes when you count 60 seconds in each direction. Wow! What do I mean? Well, the longer a connection is held in the TIME_WAIT state, the more resources sit unutilized, ultimately leading to poor performance.
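To get a feel for how many sockets are sitting in TIME_WAIT on your load balancer right now, you can count them with ss:

ss -t state time-wait | wc -l    # subtract one for the header line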

You can see the kernel's current setting by entering:

sysctl net.ipv4.tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 60

I would suggest bringing this down to 40 seconds in total instead:

sudo sysctl -w net.ipv4.tcp_fin_timeout=20

Remember, it is 20 seconds each way, so a total of 40 seconds in TIME_WAIT.

You can also reduce TCP overhead by tuning TCP's behavior with tcp_tw_reuse. This option lets TCP reuse sockets that are in the TIME_WAIT state for new outgoing connections instead of performing a fresh handshake. You configure this through net.ipv4.tcp_tw_reuse; by default it is set to 0.

You can check the current value by entering:

sysctl net.ipv4.tcp_tw_reuse

Set the value to 1:

sysctl -w net.ipv4.tcp_tw_reuse=1

Alternatively, you could set tcp_tw_recycle, which recycles connections out of the TIME_WAIT state even faster. Be careful with it, though: it is known to break clients behind NAT and was removed from the kernel in Linux 4.12.

Now that we have optimized our TCP settings, we need to raise the OS resource ceilings so that TCP can actually use those resources.

First on the list is the queue size.

The TCP stack tries to keep this queue small by processing data as quickly as possible. If processing is slow, packets pile up in the queue. Fortunately for us, the kernel imposes a hard limit that we can adjust to our benefit.
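You can check whether this queue is already overflowing: in /proc/net/softnet_stat (one row per CPU, hexadecimal counters) the second column counts packets dropped because the backlog queue was full.

cat /proc/net/softnet_stat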

The value can be checked with:

sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 300

Let's increase this value to 12000:

sysctl -w net.core.netdev_max_backlog=12000

The kernel also puts a limit on the listen queue size. Let's increase that as well.

Check queue size:

sysctl net.core.somaxconn
net.core.somaxconn = 128

Let's make this a bit larger:

sysctl -w net.core.somaxconn=2048

Another setting to configure is the limit on half-open connections. When a client requests a new connection by sending a SYN, the server replies with a SYN-ACK and waits for the client's final acknowledgement; the connection remains half-open until that acknowledgement arrives. OK, so what's the problem? Well, the kernel limits the number of connections that can be held in this state and will refuse further connections once that limit is exceeded.
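You can count how many half-open connections the box is holding at a given moment by looking for sockets in the SYN-RECV state:

ss -t state syn-recv | wc -l    # subtract one for the header line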

To view the kernel's limit on half-open connections, enter:

sysctl net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 256

To increase this value:

sysctl -w net.ipv4.tcp_max_syn_backlog=2048

Ephemeral ports are another bottleneck in the TCP stack that should be addressed for best performance. What are ephemeral ports? Glad you asked. These are the ports randomly assigned on the client side when a connection is opened and a session starts, and they last only for the duration of that session.

Let's check the port range the kernel allows:

sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 61000

The values above are the minimum and maximum of the 65,535 ports available on any system. To derive the number of available ephemeral ports on your system, subtract the minimum value from the maximum.

Total available ports:

61000 - 32768 = 28232

To find out how many ports can be served at any given moment, do the following: take the total number of available ports from above and divide it by the TIME_WAIT duration:

28232 / 60 = 470

Finally, divide 470 by 2, because we are proxying each connection to an upstream backend (one port facing the client, one facing mongos):

470 / 2 = 235

So roughly 235 ports can be served at any given moment. Please note that this is not the number of concurrent connections.
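The same arithmetic in the shell:

echo $(( (61000 - 32768) / 60 / 2 ))
235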

To get closer to the 64,000 open files we allowed in the nginx config, we need to widen the port range set by the kernel. We can achieve this with:

sysctl -w net.ipv4.ip_local_port_range='1000 65000'

This gives you 64,000 TCP ports to use.

The following are the recommended kernel settings for your nginx load balancer for MongoDB:

net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_tw_reuse = 1
net.core.netdev_max_backlog = 12000
net.ipv4.ip_local_port_range = 1000 65000
net.core.somaxconn = 2048

To persist the changes above across reboots, write them to /etc/sysctl.conf and then reload the file with sysctl -p.
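For example, appending them to /etc/sysctl.conf and reloading (a drop-in file under /etc/sysctl.d works just as well):

echo 'net.core.rmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
# ...repeat for the rest of the settings listed above...
sudo sysctl -p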

Please don’t forget to leave a comment if you like my post.
