FreeBSD forwarding Performance

There are lots of guides about tuning FreeBSD TCP performance (where the FreeBSD host is an end-point of the TCP session), but that is not the same as tuning forwarding performance (where the FreeBSD host doesn't have to read the TCP information of the packets being forwarded) or firewalling performance.

Concepts

How to bench a router

Definition

A clear definition of the relationship between bandwidth and frame rate is mandatory:

Benchmarks

Cisco or Linux

FreeBSD

Bench lab

The bench lab should permit measuring the pps. For obtaining accurate results, RFC 2544 (Benchmarking Methodology for Network Interconnect Devices) is a good reference. If switches are used, they need to be properly configured too; refer to the BSDRP performance lab for some examples.

Tuning FreeBSD

Literature

Here is a list of sources about optimizing/analyzing forwarding performance under FreeBSD.

How to bench or tune the network stack:

FreeBSD Experimental high-performance network stacks:

Enable fastforwarding

By default, fastforwarding is disabled on FreeBSD (and it is incompatible with IPSec usage). The first step is to enable fastforwarding with:

echo "net.inet.ip.fastforwarding=1" >> /etc/sysctl.conf
sysctl net.inet.ip.fastforwarding=1

Here is an example of the difference without and with fastforwarding: impact of ipfw and pf on a 4-core Xeon 2.13GHz with a 10-Gigabit Intel X540-AT2

Entropy harvest impact

Lots of tuning guides suggest disabling the following (an example snippet follows the list):

  • kern.random.sys.harvest.ethernet
  • kern.random.sys.harvest.point_to_point
  • kern.random.sys.harvest.interrupt
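
For reference, here is what those guides typically do, in the same style as the fastforwarding example above (these are the sysctl names as they exist on FreeBSD 10.x; newer releases expose the harvesting knobs differently):

# disable selected entropy sources (FreeBSD 10.x sysctl names)
echo "kern.random.sys.harvest.ethernet=0" >> /etc/sysctl.conf
echo "kern.random.sys.harvest.point_to_point=0" >> /etc/sysctl.conf
echo "kern.random.sys.harvest.interrupt=0" >> /etc/sysctl.conf
sysctl kern.random.sys.harvest.ethernet=0
sysctl kern.random.sys.harvest.point_to_point=0
sysctl kern.random.sys.harvest.interrupt=0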

But what about the REAL impact on a router (value in pps):

x harvest DISABLED
+ harvest ENABLED (default)
+--------------------------------------------------------------------------------+
|+                   x          x    x        x+        +   +   +               x|
|                    |_______________M_____A______________________|              |
|                   |_________________________A_________M______________|         |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1918159       2036665       1950208       1963257     44988.621
+   5       1878893       2005333       1988952     1967850.8     51378.188
No difference proven at 95.0% confidence

⇒ There is no difference on FreeBSD 10.2 and earlier, so we can keep the default value (enabled).

But since FreeBSD 11 (head) this is no longer true, and removing some entropy sources (interrupt and ethernet) can have a very big impact:

Impact of disabling some entropy source on FreeBSD forwarding performance

NIC drivers tuning

Network cards have become very complex and provide lots of tuning parameters that can have a huge performance impact.

Multi-queue

First, the multi-queue feature of all modern NICs can be limited in the number of queues (and thus CPUs) to use. You need to test this impact on your own hardware, because it's not always a good idea to use the default value (which is number of queues = number of CPUs): Bad default NIC queue number with 8 cores or more

This graph shows that, in this specific case, playing with the "max interrupt rate" parameter didn't help.

Still from this graph, we can see that for this setup the best configuration was limiting the driver to 4 queues. This is correct for a router… but for a firewall this parameter isn't optimal: Impact of ipfw and pf on throughput with an 8-core Intel Atom C2758 running FreeBSD 10-STABLE r262743
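
If you want to experiment with the queue count, it is usually set through a loader tunable. A minimal sketch for the igb(4) driver used above (the tunable name is driver-specific, so check your NIC driver's man page; a reboot is needed for it to take effect):

# /boot/loader.conf: limit igb(4) to 4 queue pairs (example value from the graphs above)
hw.igb.num_queues=4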

Descriptors per queue and maximum number of received packets to process at a time

Regarding some other driver parameters, here are the potential impacts of the maximum number of input packets to process at a time and the size of the descriptor rings:
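
As a hedged illustration of where these knobs live, here are the igb(4) loader tunables controlling them (names and default values vary between drivers and FreeBSD versions, so verify against your driver's man page and benchmark any change):

# /boot/loader.conf: example igb(4) values, to be validated on your own hardware
hw.igb.rxd=2048               # RX descriptors per queue
hw.igb.txd=2048               # TX descriptors per queue
hw.igb.rx_process_limit=500   # max RX packets processed per interrupt (-1 = unlimited)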

Disabling LRO and TSO

All modern NICs support the LRO and TSO features, which need to be disabled on a router (see the ifconfig example after this list):

  1. By waiting to aggregate multiple packets at the NIC level before handing them up to the stack, these features add latency; and because all the packets need to be sent out again, the stack has to split them back into separate packets before handing them down to the NIC. The Intel drivers' readme includes this note: "The result of not disabling LRO when combined with ip forwarding or bridging can be low throughput or even a kernel panic."
  2. They break the end-to-end principle.
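
Disabling them is one ifconfig call per interface; for example, assuming an igb(4) NIC named igb0 (adapt the interface name, and merge the flags into any existing ifconfig_igb0 line rather than adding a second one):

ifconfig igb0 -tso -lro
echo 'ifconfig_igb0="up -tso -lro"' >> /etc/rc.conf   # make it persistent across reboots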

Disabling these features has no real impact on pps, though:

x tso.lro.enabled
+ tso.lro.disabled
+--------------------------------------------------------------------------+
|   +  +     x+    *                          x+                    x     x|
|               |___________________________A_M_________________________|  |
||____________M___A________________|                                       |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1724046       1860817       1798145       1793343     61865.164
+   5       1702496       1798998       1725396     1734863.2     38178.905
No difference proven at 95.0% confidence

Summary

Default FreeBSD parameters are meant for a generic server (end host) and are not tuned for router usage, and tuning parameters that suit a router don't always suit firewall usage.

Where is the bottleneck?

Tools:

Packet traffic

Display information regarding packet traffic, refreshed each second.

Here is a first example:

[root@BSDRP3]~# netstat -i -h -w 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      370k     0     0        38M       370k     0        38M     0
      369k     0     0        38M       368k     0        38M     0
      370k     0     0        38M       370k     0        38M     0
      373k     0     0        38M       376k     0        38M     0
      370k     0     0        38M       368k     0        38M     0
      368k     0     0        38M       368k     0        38M     0
      368k     0     0        38M       369k     0        38M     0

⇒ This system is forwarding 370Kpps (in and out) without any in/out errs (the packet generator used was netblast with 64B packet size at 370Kpps).

Don’t use “netstat -h” on a standard FreeBSD: This option has a bug

Here is a second example:

[root@BSDRP3]~# netstat -ihw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      399k  915k     0        25M       395k     0        24M     0
      398k  914k     0        24M       398k     0        24M     0
      399k  915k     0        25M       399k     0        25M     0
      398k  915k     0        24M       397k     0        24M     0
      399k  914k     0        25M       398k     0        24M     0
      398k  914k     0        24M       400k     0        25M     0
      398k  915k     0        24M       396k     0        24M     0
      400k  915k     0        25M       401k     0        25M     0
      397k  914k     0        24M       397k     0        24M     0
      398k  914k     0        24M       399k     0        25M     0
      400k  914k     0        25M       401k     0        25M     0
      398k  914k     0        24M       397k     0        24M     0

⇒ This system is forwarding about 400Kpps (in and out), but it’s overloaded because it drops (errs) about 914Kpps (the generator used netmap pkt-gen with 64B packet size at a rate of 1.34Mpps).

Interrupt usage

Report on the number of interrupts taken by each device since system startup.

Here is a first example:

[root@BSDRP3]~# vmstat -i
interrupt                          total       rate
irq4: uart0                         6670          5
irq14: ata0                            5          0
irq16: bge0                           27          0
irq17: em0 bge1                  5209668       4510
cpu0:timer                       1299291       1124
irq256: ahci0                       1172          1
Total                            6516833       5642

⇒ Notice that em0 and bge1 are sharing the same IRQ. That's not good news.

Here is a second example:

[root@BSDRP3]# vmstat -i
interrupt                          total       rate
irq4: uart0                        17869          0
irq14: ata0                            5          0
irq16: bge0                            1          0
irq17: em0 bge1                        2          0
cpu0:timer                     214331752       1125
irq256: ahci0                       1725          0
Total                          214351354       1126

⇒ Almost-zero rates and counters for the NIC IRQs mean that polling is enabled. Note that the IRQ management of current NICs avoids the need for polling.

Memory Buffer

Show statistics recorded by the memory management routines. The network manages a private pool of memory buffers.

[root@BSDRP3]~# netstat -m
5220/810/6030 mbufs in use (current/cache/total)
5219/675/5894/512000 mbuf clusters in use (current/cache/total/max)
5219/669 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/256000 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/128000 9k jumbo clusters in use (current/cache/total/max)
0/0/0/64000 16k jumbo clusters in use (current/cache/total/max)
11743K/1552K/13295K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Or more verbose:

[root@BSDRP3]~# vmstat -z | head -1 ; vmstat -z | grep -i mbuf
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:            256,      0,    5221,     667,414103198,   0,   0
mbuf:                   256,      0,       1,     141,     135,   0,   0
mbuf_cluster:          2048, 512000,    5888,       6,    5888,   0,   0
mbuf_jumbo_page:       4096, 256000,       0,       0,       0,   0,   0
mbuf_jumbo_9k:         9216, 128000,       0,       0,       0,   0,   0
mbuf_jumbo_16k:       16384,  64000,       0,       0,       0,   0,   0
mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0

⇒ No “failed” here.

CPU / NIC

top can give very useful information regarding the CPU/NIC affinity:

[root@BSDRP]/# top -nCHSIzs1
last pid:  1717;  load averages:  7.39,  2.01,  0.78  up 0+00:18:58    21:51:08
148 processes: 18 running, 85 sleeping, 45 waiting

Mem: 13M Active, 9476K Inact, 641M Wired, 128K Cache, 9560K Buf, 7237M Free
Swap:


  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root       -92    -     0K   864K CPU2    2   0:01  98.39% intr{irq259: igb0:que}
   11 root       -92    -     0K   864K CPU5    5   0:38  97.07% intr{irq262: igb0:que}
   11 root       -92    -     0K   864K WAIT    7   0:38  96.68% intr{irq264: igb0:que}
   11 root       -92    -     0K   864K WAIT    3   0:39  96.58% intr{irq260: igb0:que}
   11 root       -92    -     0K   864K CPU6    6   0:38  96.48% intr{irq263: igb0:que}
   11 root       -92    -     0K   864K WAIT    4   0:38  96.00% intr{irq261: igb0:que}
   11 root       -92    -     0K   864K RUN     0   0:40  95.56% intr{irq257: igb0:que}
   11 root       -92    -     0K   864K WAIT    1   0:37  95.17% intr{irq258: igb0:que}
   11 root       -92    -     0K   864K WAIT    1   0:01   0.98% intr{irq276: igb2:que}
   11 root       -92    -     0K   864K RUN     3   0:00   0.88% intr{irq278: igb2:que}
   11 root       -92    -     0K   864K WAIT    0   0:01   0.78% intr{irq275: igb2:que}
   11 root       -92    -     0K   864K WAIT    4   0:00   0.78% intr{irq279: igb2:que}
   11 root       -92    -     0K   864K RUN     7   0:00   0.59% intr{irq282: igb2:que}
   11 root       -92    -     0K   864K RUN     6   0:00   0.59% intr{irq281: igb2:que}
   11 root       -92    -     0K   864K RUN     5   0:00   0.29% intr{irq280: igb2:que}

Drivers

Depending on the NIC driver used, some counters are available:

[root@BSDRP3]~# sysctl dev.em.0.mac_stats. | grep -v ': 0'
dev.em.0.mac_stats.missed_packets: 221189883
dev.em.0.mac_stats.recv_no_buff: 94987654
dev.em.0.mac_stats.total_pkts_recvd: 351270928
dev.em.0.mac_stats.good_pkts_recvd: 130081045
dev.em.0.mac_stats.bcast_pkts_recvd: 1
dev.em.0.mac_stats.rx_frames_64: 2
dev.em.0.mac_stats.rx_frames_65_127: 130081043
dev.em.0.mac_stats.good_octets_recvd: 14308901524
dev.em.0.mac_stats.good_octets_txd: 892
dev.em.0.mac_stats.total_pkts_txd: 10
dev.em.0.mac_stats.good_pkts_txd: 10
dev.em.0.mac_stats.bcast_pkts_txd: 2
dev.em.0.mac_stats.mcast_pkts_txd: 5
dev.em.0.mac_stats.tx_frames_64: 2
dev.em.0.mac_stats.tx_frames_65_127: 8

⇒ Notice the high values of missed_packets and recv_no_buff. This indicates a performance problem with the NIC or its driver (in this example, the packet generator sends packets at a rate of about 1.38Mpps).

pmcstat

During high load on your router/firewall, load the hwpmc(4) module:

kldload hwpmc

Time used by process

Now you can display the processes consuming the most time with:

pmcstat -TS instructions -w1

That will display this output:

PMC: [INSTR_RETIRED_ANY] Samples: 36456 (100.0%) , 29616 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 56.6 pf.ko      pf_test              pf_check_in:29.0 pf_check_out:27.6
 13.5 pf.ko      pf_find_state        pf_test_state_udp
  7.7 pf.ko      pf_test_state_udp    pf_test
  7.5 pf.ko      pf_pull_hdr          pf_test
  4.0 pf.ko      pf_check_out
  2.5 pf.ko      pf_normalize_ip      pf_test
  2.3 pf.ko      pf_check_in
  1.5 libpmc.so. pmclog_read
  1.3 hwpmc.ko   pmclog_process_callc pmc_process_samples
  0.8 libc.so.7  bcopy

In this case, the bottleneck is pf(4).

CPU cycles spent

To display where the most CPU cycles are being spent, we first need a partition of about 200MB that includes the debug kernel:

system expand-data-slice
mount /data
fetch http://BSDRP-release-debug
tar xzfv BSDRP-release-debug.tar.xz

Then, during high load, start collecting (for about 5 seconds):

pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out

Then analyze the output with:

pmcannotate /data/pmc.out /data/debug/boot/kernel/kernel.symbols

ref:https://bsdrp.net/documentation/technical_docs/performance

Updating FreeBSD with Make World

When a new version of FreeBSD is released, or a number of security updates have been released, it is necessary to update FreeBSD with a make world. The process may seem very complex if you have never done one before, but overall it is very straightforward and painless.

Step 1. Getting the new source

To see how to download the most recent source see the Cvsup Tutorial.

Step 2. Building the new world

This step is not strictly necessary, since the world will be automatically built when you do the install if it has not been built already. The reason for running this command separately is so that it can run while the machine is running normally instead of running while the machine is in single user mode. Running it separately also allows you to fix any errors that occurred before installing the world.

# cd /usr/src
# make buildworld

The time buildworld takes varies greatly depending on the speed of the machine. On a 90MHz machine it could take up to 24 hours, whereas on a 3GHz machine the time would be under 45 minutes.

If you experience problems during the buildworld, you need to run the following commands to clean up so you can start over:

# cd /usr/obj
# chflags -R noschg *
# rm -rf *

Step 3. Recompiling the Kernel

The kernel needs to be built from the same source as the new world or strange things will happen. To do this, see the building a new kernel tutorial.

Step 4. Installing the new world

The new world should be installed with as little else running as possible, so booting into single user mode is the best way. To do this, reboot and either choose to boot into single user mode from the menu in 5.x, or hit space during the 10-second countdown at boot and type

boot -s

Hit enter to choose /bin/sh as your shell and then enter the following commands to mount the needed drives and add swap space.

# mount -u /
# mount -a -t ufs
# swapon -a

Now you are ready to install the new world.

# cd /usr/src
# make installworld

Step 5. Merging the config file

Some updates to the source require updates to the config files in /etc. It is a good idea to back up your /etc directory. Once you have done this, run mergemaster:

# mergemaster -v

Mergemaster will compare the new etc files to what you have in /etc. It will present any differences to you and ask if you want to merge, skip, or just install the new file. For the most part you should install the new version of every file, unless it is a file that you know you have edited such as make.conf, rc.conf, or master.passwd.

After you have finished merging the etc files you can reboot back into multi-user mode.

Performing a Make World remotely

Although it is not recommended, a make world can be done remotely. You just need to kill every unnecessary process and be really careful, since if there is a problem the machine will not come back online after the reboot.
ref:http://www.freebsdmadeeasy.com/tutorials/freebsd/performing-a-make-world-in-freebsd.php

Recompiling The Kernel In FreeBSD

Unless you have built your own custom kernel in FreeBSD, you are using the GENERIC kernel. The GENERIC kernel contains everything you need to get the machine up and running the first time and covers a wide range of hardware. Some common reasons for rebuilding the kernel are:

  • To speed it up by taking out unused modules
  • Add support for new hardware
  • Update the kernel with the new source during a Make World

Creating the Kernel Config

The first step to building a new kernel is to copy the GENERIC config file to your own. This new config file is generally given the hostname of the machine.

# cd /usr/src/sys/i386/conf
# cp GENERIC NEWKERNEL

Once you have done this, open the new kernel config and begin commenting out modules that are not needed. For example, if you are building this kernel for a PIII you do not need the I486_CPU or the I586_CPU in your kernel, and you can take them out like this:

machine i386
#cpu I486_CPU
#cpu I586_CPU
cpu I686_CPU
ident GENERIC

You should also change the ident to the name of the file so that when booting up it will show that it is booting your kernel and not the GENERIC one.

machine i386
#cpu I486_CPU
#cpu I586_CPU
cpu I686_CPU
ident MIDNIGHT

On machines without SCSI controllers everything in the SCSI section can be commented out; the same is true of the RAID, USB, and FireWire sections. If something needed is removed, the machine may not be able to boot with the new kernel and the previous one will need to be loaded instead. Notes on what the different modules in the config file are responsible for can be found in /usr/src/sys/i386/conf/NOTES.

Building your new kernel the traditional way

# /usr/sbin/config NEWKERNEL
# cd ../../compile/NEWKERNEL
# make depend
# make
# make install

To save time the last three commands can be done as

# make depend && make && make install && reboot

Building your new kernel the new way

Recently FreeBSD has added a new way to build and install the kernel, and it can be done like so:

# cd /usr/src
# make buildkernel KERNCONF=NEWKERNEL
# make installkernel KERNCONF=NEWKERNEL

Here KERNCONF is just a variable that refers to the name of the kernel config file you want to use. To store the name so you no longer need the last part of these commands, you can put the following into your /etc/make.conf:

KERNCONF=NEWKERNEL
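
With KERNCONF stored in /etc/make.conf, the build and install steps can then be run without repeating the name:

# cd /usr/src
# make buildkernel
# make installkernel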

Rebooting

Once you have successfully built and installed the kernel, you will need to reboot the machine:

# reboot

Errors

If the machine reboots successfully but gives errors, it may be because the kernel was compiled with source code that is newer than that of the world. To rebuild with the new source code, read about performing a make world.

A NAT Router Firewall IPSec Gateway with FreeBSD 5.1-RELEASE

A typical setup for home users and small businesses is to have a single machine connected to the internet as a router that serves as gateway for the private network behind it. Obviously, this router has to “hide” the whole net behind its own external address, which can even be dynamically assigned via an ISP’s DHCP service.

This article describes the steps necessary for setting this up on a machine with two network cards running FreeBSD 5.1-RELEASE. It is largely equivalent to the setup described for the same machine running Linux in another article here. Additionally, the FreeBSD Router serves as IPSec gateway, encrypting traffic to networks behind other routers on the internet.

Warning!

This article contains just the steps that it took me to set up the router. Neither does it cover the whole underlying concepts of networking, security, IPSec, and the state of the world as such, nor does it necessarily protect your machines from security risks (even though it is designed to do so). I might have overlooked something, might have opened a security hole in my configuration. I might be an absolute nincompoop. So do not simply read the text and follow the steps. Follow the links in it, try Google and freebsd.org to understand for yourself what it is you’re doing. If you find any flaw in my configuration, please let me know (address at bottom).

You’ve been warned.

  1. Preparing the system

    1.1. Install FreeBSD 5.1-RELEASE. I used the ISO image here. Do read the Handbook! Make sure to install the kernel sources and the ports collection.

    1.2. Recompile the kernel. How to do this is described in detail in the Handbook! Because stock 5.1-RELEASE comes without firewall and IPSec support in the kernel, we have to compile that in. These are the necessary options:

        # Firewall support added 
        options         IPFIREWALL
        # Divert support added (necessary for natd)
        options         IPDIVERT
        # IPSec support added
        options         IPSEC
        options         IPSEC_ESP
        options         IPSEC_DEBUG
    

    Compile, install and reboot.

  2. Set up the firewall
    (This setup uses ipfw, a dynamically assigned external address and a local net of 192.168.1.0/24.)

    2.1. Enable all necessary services and functions in /etc/rc.conf (adapt the addresses and interface names to your setup):

        # use DHCP for external interface
        ifconfig_ep0="DHCP"
        # static address for internal interface
        ifconfig_ep1="inet 192.168.1.1 netmask 255.255.255.0 \
    	broadcast 192.168.1.255"
    
        # enable IP forwarding
        gateway_enable="YES"
    
        # enable firewall
        firewall_enable="YES"
        # set path to custom firewall config
        firewall_type="/etc/fw/rc.firewall.rules"
        # be non-verbose? set to YES after testing
        firewall_quiet="NO"
    
        # enable natd, the NAT daemon
        natd_enable="YES"
        # which is the interface to the internet that we hide behind?
        natd_interface="ep0"
        # flags for natd
        natd_flags="-f /etc/fw/natd.conf"
    

    2.2 Edit /etc/fw/rc.firewall.rules (or whatever you set firewall_type to)

        # be quiet and flush all rules on start
        -q flush
        
        # allow local traffic, deny RFC 1918 addresses on the outside
        add 00100 allow ip from any to any via lo0
        add 00110 deny ip from any to 127.0.0.0/8
        add 00120 deny ip from any to any not verrevpath in
        add 00301 deny ip from 10.0.0.0/8 to any in via ep0
        add 00302 deny ip from 172.16.0.0/12 to any in via ep0
        add 00303 deny ip from 192.168.0.0/16 to any in via ep0
                                                                                        
        # check if incoming packets belong to a natted session, allow through if yes
        add 01000 divert natd ip from any to me in via ep0
        add 01001 check-state
        
        # allow some traffic from the local net to the router 
        # SSH
        add 04000 allow tcp from 192.168.1.0/24 to me dst-port 22 in via ep1 setup keep-state
        # ICMP
        add 04001 allow icmp from 192.168.1.0/24 to me in via ep1
        # NTP
        add 04002 allow tcp from 192.168.1.0/24 to me dst-port 123 in via ep1 setup keep-state
        add 04003 allow udp from 192.168.1.0/24 to me dst-port 123 in via ep1 keep-state
        # DNS
        add 04006 allow udp from 192.168.1.0/24 to me dst-port 53 in via ep1
        
        # drop everything else
        add 04009 deny ip from 192.168.1.0/24 to me
             
        # pass outgoing packets (to be natted) on to a special NAT rule
        add 04109 skipto 61000 ip from 192.168.1.0/24 to any in via ep1 keep-state
                                                                                        
        # allow all outgoing traffic from the router (maybe you should be more restrictive)
        add 05010 allow ip from me to any out keep-state
        
        # drop everything that has come so far. This means it doesn't belong to an
        # established connection; don't log the most noisy scans.
        add 59998 deny icmp from any to me
        add 59999 deny ip from any to me dst-port 135,137-139,445,4665
        add 60000 deny log tcp from any to any established
        add 60000 deny log ip from any to any
        
        # this is the NAT rule. Only outgoing packets from the local net will come here.
        # First, nat them, then pass them on (again, you may choose to be more restrictive)
        add 61000 divert natd ip from 192.168.1.0/24 to any out via ep0
        add 61001 allow ip from any to any
    

    2.3 Edit /etc/fw/natd.conf (or whatever you set natd_flags -f to)

        unregistered_only
        interface ep0
        use_sockets
        dynamic
        # dynamically open fw for ftp, irc
        punch_fw 2000:50
    

    2.4. Start the firewall

    Run /etc/rc.d/ipfw start. If you have the DHCP client running, you should have an external address and your firewall router will be functional. Otherwise, start it with /etc/rc.d/dhclient start or reboot. Rebooting would be a good idea at this point anyway.
    Time to get yourself a nice cold beer. You've earned it!

    2.5. Tuning

    It may be useful to set some connection related parameters in the kernel. I did this in /etc/sysctl.conf (you can of course call sysctl directly).

     
        security.bsd.see_other_uids=0
        net.inet.ip.fw.dyn_ack_lifetime=3600
        net.inet.ip.fw.dyn_udp_lifetime=10
        net.inet.ip.fw.dyn_buckets=1024
    

    This sets the session lifetime for TCP sessions without any traffic to 1 hour, for UDP to 10 seconds.

  3. Set up IPSec

    3.1. Get and install racoon

    (Also take a look at http://www.x-itec.de/projects/tuts/ipsec-howto.txt)

    Go to /usr/ports/security/racoon.
    Enter make all install clean. This will install racoon and clean up afterwards. Racoon is used for managing the key exchange needed for IPSec. The encryption itself is done in the kernel.

    3.2. Enable racoon

    Add the following line to /etc/rc.conf

        racoon_enable="YES"
    

    This will start racoon at boot time by running /usr/local/etc/rc.d/racoon.sh. I simply edited this file and set my options there: it now calls /usr/local/sbin/racoon -f /etc/ipsec/racoon.conf -l /var/log/racoon. This means that the config is in /etc/ipsec and logs go to /var/log.

    3.3. Configure racoon

    Now we have to edit racoon's config file. In my case this is /etc/ipsec/racoon.conf; by default it is /usr/local/etc/racoon/racoon.conf.

    Very good documentation has been written on how to configure racoon. Google knows it, and you might also take a look at http://www.kame.net/newsletter/20001119/. Below is just what I changed in /etc/ipsec/racoon.conf.

        [....]
        path certificate "/etc/ipsec/cert" ;
        [....]
        log info;
        [....]
    
        # this is the other gateway's address
        remote aaa.bbb.ccc.ddd 
    {
    	# it's freeswan, so it doesn't support aggressive mode
            exchange_mode main,aggressive; 
            doi ipsec_doi;
            situation identity_only;
                                                                                    
            my_identifier asn1dn ;
    	# Subject of other gateway's certificate
            peers_identifier asn1dn "C=XY/O=XY Org/CN=xy.org.org";
    	# my own X.509 certificate and key
            certificate_type x509 "mycert.crt" "mykey.key";
     
            nonce_size 16;
            lifetime time 1 min;    # sec,min,hour
            initial_contact on;
            support_mip6 on;
            proposal_check obey;    # obey, strict or claim
     
            proposal {
                    encryption_algorithm 3des;
                    hash_algorithm sha1;
                    authentication_method rsasig ;
                    dh_group 2 ;
            }
    }
    

    3.4. Install the certificates

    Also take a look at http://www.kame.net/newsletter/20001119b/

    Copy the following files to /etc/ipsec/cert/ (or whatever you set path certificate to above):

    • Your certificate (named "mycert.crt" above)
    • Your private key (mykey.key; make this -rw------- root)
    • The other gateway’s CA’s public certificate.

    Then make a link to the other gateway’s CA’s certificate (I assume in the following example that the certificate file is named ca.pem). This link must be named after the hash value of the file itself. You can create it with

        # ln -s ca.pem `openssl x509 -noout -hash -in ca.pem`.0
    

    3.5. Tell the kernel to use IPSec

    You should have read the above links before you proceed.

    Now you must tell the kernel to use IPSec when communicating with the other gateway. You do this by creating a file which is then used by /etc/rc.d/ipsec on startup. I put the file in /etc/ipsec/ipsec.conf and entered the following in /etc/rc.conf

        ipsec_enable="YES"
        ipsec_file="/etc/ipsec/ipsec.conf"
    

    /etc/ipsec/ipsec.conf contains the parameters for the setkey (8) command that adds, updates, dumps, or flushes Security Association Database (SAD) entries as well as Security Policy Database (SPD) entries in the kernel.

        # First, flush all SAD and SPD
                                                                                    
        flush;
        spdflush;
                                                                                    
        # Then, set up SPD for the local net and the net behind the other gateway
        # www.xxx.yyy.zzz is my own external address
        # aaa.bbb.ccc.ddd is the other gateway's address as given in
        # /etc/ipsec/racoon.conf
    
        spdadd 192.168.1.0/24 192.168.100.0/24 any -P out ipsec \
            esp/tunnel/www.xxx.yyy.zzz-aaa.bbb.ccc.ddd/require ;
        spdadd 192.168.100.0/24 192.168.1.0/24 any -P in ipsec \
            esp/tunnel/aaa.bbb.ccc.ddd-www.xxx.yyy.zzz/require ;
    

    This tells the kernel to use the ESP tunnel between www.xxx.yyy.zzz (the router) and aaa.bbb.ccc.ddd (the other gateway) when routing between the subnets 192.168.1.0/24 (local) and 192.168.100.0/24 (remote).

    This is the only place in the whole configuration where my own external IP address is used. Since this is a dynamically assigned address which can change, I had to provide some mechanism to restart IPSec when the address changes. I did this by configuring dhclient to run a script which rewrites /etc/ipsec/ipsec.conf and restarts /etc/rc.d/ipsec; a sketch of such a script follows below. I found it easiest to name the script /etc/dhclient-exit-hooks, as the ISC dhclient will look for this file every time it has to update anything (see dhclient-script(8)). Somebody is likely to come up with a better solution.
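
    For illustration, here is a minimal sketch of such a script. It assumes a hypothetical template file /etc/ipsec/ipsec.conf.template that still contains the placeholder www.xxx.yyy.zzz; the $reason and $new_ip_address variables are set by dhclient-script(8) before this file is sourced.

        # /etc/dhclient-exit-hooks (sketch) -- sourced by dhclient-script(8)
        # after each lease event; rewrite the SPD file and restart IPSec.
        case "$reason" in
        BOUND|REBIND|REBOOT|RENEW)
            # ipsec.conf.template is a hypothetical copy of ipsec.conf with
            # www.xxx.yyy.zzz left in place of the external address.
            sed "s/www\.xxx\.yyy\.zzz/${new_ip_address}/g" \
                /etc/ipsec/ipsec.conf.template > /etc/ipsec/ipsec.conf
            /etc/rc.d/ipsec restart
            ;;
        esac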

    3.6. Amend the firewall rules for IPSec

    Add the following lines to your firewall config (/etc/fw/rc.firewall.rules in this example)

        add 05020 allow udp from any 500 to me dst-port 500 in via ep0 keep-state
        add 05021 allow esp from any to me in via ep0 keep-state
        add 05022 allow ah from any to me in via ep0 keep-state
    

    This will allow racoon to exchange keys on UDP port 500 and the kernel to talk via ESP and AH with the outside world (AH is actually not needed in my setup, but I keep it there for completeness and portability).

    3.7. Start all required services and enjoy

    You may want to add a DHCP, NTP, and DNS server to your router (I did, as the above firewall rules imply). I will not cover this here, though. The configuration files are about the same as in the Linux setup article.

    Now start all services manually with their respective scripts in /etc/rc.d/ or simply reboot. This should do.

    In case you’re also in charge of the other IPSec gateway, here are its /etc/ipsec.conf and /etc/ipsec.secrets. In my case it was RedHat Linux, Kernel 2.4.18-3, and the super-freeswan-1.99.8 patch from http://www.freeswan.ca/.
    ref : http://lugbe.ch/lostfound/contrib/freebsd_router/

What is a VLAN?

Normally, if we already understand what a LAN is, a VLAN becomes much easier to understand. A LAN is the interconnection of network devices within a single location, which could be the same building, the same floor, or the same area.
Devices connected within the same LAN can communicate with each other, and all of those devices are in the same broadcast domain.
A broadcast domain is the boundary or area that broadcast traffic can reach; it is delimited by the interfaces of Layer 3 (or higher) devices.
The more hosts there are on a LAN, the more broadcast traffic there is, which makes the network slower. Besides slowing the network down, it also affects the devices themselves: I once ran into a customer problem on a very large LAN where a printer stopped working, and it turned out there was so much broadcast traffic that the printer could not keep up with processing it.
If we want to separate networks, or put simply, split the broadcast domain into two or more segments, we can do so by moving hosts to a separate switch, as in the figure below.
From that figure we can indeed separate the segments, i.e. separate the broadcast domains, but it wastes switches, doesn't it? This is where a VLAN helps: it gives the same result as splitting hosts across separate switches, but can be done with a single switch.

A VLAN, or Virtual LAN, is a switch capability that defines broadcast domain boundaries at Layer 2, meaning that a single switch can be split into several broadcast domains, in other words several subnets.

What are the main benefits of a VLAN?

  • Reduces the amount of broadcast traffic on the network
  • Reduces risk by keeping flooding contained within a single VLAN
  • Increases security, because hosts in different VLANs cannot talk to each other directly
  • Adds flexibility: to move a host, just change the switch port configuration to the desired VLAN, with no need to move cables

Let's look at how to configure a VLAN.

1. Create the VLAN on the switch
Switch# configure terminal
Switch(config)# vlan 10
!! create VLAN number 10
Switch(config-vlan)# name Sales
!! name VLAN 10 "Sales"

The VLAN number can be chosen from the allowed VLAN ranges. Note: on every Cisco switch, VLAN 1 exists by default and every port starts out as a member of VLAN 1.
Verify the VLAN creation with the command "show vlan brief".
Looking at the output, you can see that VLAN 10 has now been created on the switch, while every switch port is still a member of VLAN 1 by default. That means that if we connect a device to a port, that device is immediately in VLAN 1. If we want a device to be in VLAN 10, we have to configure its port as a member of VLAN 10; see the next step.
2. Assign a switch port as a member of the VLAN
Switch# configure terminal
Switch(config)# interface Fa0/1
Switch(config-if)# switchport mode access
Switch(config-if)# switchport access vlan 10
Verify with the command "show vlan brief".
You will notice that port Fa0/1 has now been moved into VLAN 10.
Port Fa0/1 is now a member of VLAN 10, which means that any device connected to Fa0/1 is in VLAN 10 and can no longer talk to machines connected to ports Fa0/2, Fa0/3, and Fa0/4, which are still in VLAN 1.
That's it for VLANs. I hope this was useful to anyone interested; see you in the next article.
ref:http://netprime-system.com/vlan/

IP Address Calculation Techniques

There is hardly any need nowadays to explain why the IP address (Internet Protocol Address) matters and how it relates to us. An IP address is a number assigned to a computer or to network devices such as routers, switches, firewalls, IP cameras, IP phones, access points, and so on, and before long almost every electrical or communication device sold will ship with an IP address from the factory. The IP addresses in use today are of the type called IPv4 (IP version 4), which is no longer sufficient, so IPv6 (IP version 6) was developed to support the new devices and technologies that need IP addresses to communicate; in Thailand several organizations already use IPv6. The organization that allocates IP addresses in the Asia Pacific region is APNIC; Internet service providers (ISPs) request addresses from APNIC and then distribute them to their customers. Anyone preparing for certification exams such as CCNA, CCNP, LPI, Security+, or CWNA needs to know about IP addresses, and in particular must be able to calculate IPv4 addresses quickly and accurately.

IPv4
An IPv4 address consists of 32 binary bits (4 bytes, 8 bits = 1 byte), divided into 4 groups of 8 bits each, with the groups separated by a dot (.).
Smallest value, all bits 0: 00000000.00000000.00000000.00000000
Largest value, all bits 1: 11111111.11111111.11111111.11111111
Converted to decimal:
Smallest value, all bits 0: 0.0.0.0
Largest value, all bits 1: 255.255.255.255
So the possible IPv4 addresses run from 0.0.0.0 to 255.255.255.255.

Before doing IP calculations, to speed things up, write out the bit-value reference table below.
[Figure: bit-value / block-size reference table]

The possible IPv4 addresses run from 0.0.0.0 to 255.255.255.255.
IPv4 can be divided into 5 classes, based on the first byte, as follows:
Class A: the first bit of byte 1 is 0
Class B: the first bit of byte 1 is 1, the second bit is 0
Class C: the first 2 bits of byte 1 are 1, the third bit is 0
Class D: the first 3 bits of byte 1 are 1, the fourth bit is 0
Class E: the first 4 bits of byte 1 are 1
This gives the result shown below.

[Figure: class bit patterns and first-byte ranges]

The IP ranges of each class are:
Class A: 0.0.0.0 to 127.255.255.255
Class B: 128.0.0.0 to 191.255.255.255
Class C: 192.0.0.0 to 223.255.255.255
Class D: 224.0.0.0 to 239.255.255.255
Class E: 240.0.0.0 to 255.255.255.255
The IPs that can be assigned to devices or hosts come from three classes: A, B and C. Class D is reserved for multicast applications, and Class E is reserved for research or future use.
IPv4 is also divided into two types: public IPs ("real" IPs) and private IPs.
A public IP can be assigned to network equipment such as a server or a router, and it can communicate with other public IPs, i.e. reach the Internet, directly.
A private IP can be assigned to PCs or office equipment, but it cannot reach public IPs or the Internet directly; a gateway device such as a router, server, or DSL modem running the NAT (Network Address Translation) service is needed before it can reach the Internet.
Private IP ranges exist only in Classes A, B and C, as follows:
Class A: 10.x.x.x (10.0.0.0 – 10.255.255.255)
Class B: 172.16.x.x – 172.31.x.x (172.16.0.0 – 172.31.255.255)
Class C: 192.168.x.x (192.168.0.0 – 192.168.255.255)
Calculating IPv4
Given an IP address, you should be able to work out:
which IP address is the subnet mask
which IP address is the network IP
which IP address is the broadcast IP
the range of host IPs that can actually be used
the number of subnets and the number of hosts per subnet
The subnet mask is what divides a network into smaller parts. It looks like an IP address, i.e. four numbers separated by dots, for example 255.255.255.0. Whether two computers are in the same network (the same subnet) or not is determined by the subnet mask.

How to find the subnet mask
/30 means the first 30 bits are masked
/27 means the first 27 bits are masked
/20 means the first 20 bits are masked
Convert the given mask length into a subnet mask: the bits before the mask boundary are set to 1, the bits after it are set to 0.
Ex /30
/30 → 11111111.11111111.11111111.111111/00

[Figure: bit-value / block-size reference table]

This gives the subnet mask:
/30 → 255.255.255.252
11111111.11111111.11111111.111111/00
Use the reference table to speed this up: eight 1 bits give 255, and six 1 bits give 252. Alternatively, count on from the first 24 bits, which are already all 1s; continuing to bit 30 gives exactly 252.
Ex /27
/27 → 11111111.11111111.11111111.111/00000
This gives the subnet mask:
/27 → 255.255.255.224

Ex /20
/20 → 11111111.11111111.1111/0000.00000000
This gives the subnet mask:
/20 → 255.255.240.0
Some common subnet masks are listed below.
The default masks of the IP classes are:
Class A = mask 8 bits = 255.0.0.0
Class B = mask 16 bits = 255.255.0.0
Class C = mask 24 bits = 255.255.255.0
Other subnet masks:
Mask /10 = 255.192.0.0     Mask /21 = 255.255.248.0
Mask /11 = 255.224.0.0     Mask /22 = 255.255.252.0
Mask /12 = 255.240.0.0     Mask /23 = 255.255.254.0
Mask /13 = 255.248.0.0     Mask /25 = 255.255.255.128
Mask /14 = 255.252.0.0     Mask /26 = 255.255.255.192
Mask /15 = 255.254.0.0     Mask /27 = 255.255.255.224
Mask /17 = 255.255.128.0   Mask /28 = 255.255.255.240
Mask /18 = 255.255.192.0   Mask /29 = 255.255.255.248
Mask /19 = 255.255.224.0   Mask /30 = 255.255.255.252
Mask /20 = 255.255.240.0   Mask /31 = 255.255.255.254

Note: to convert binary to decimal more quickly, use the reference table; for example, all eight 1 bits give 255, four leading 1 bits give 240, and all 0 bits give 0.

Once you have the subnet mask, the next step is to find the Network IP and the Broadcast IP.
The Network IP is the first IP of a subnet; it is normally used when announcing routes and cannot be assigned to a device or a PC.
The Broadcast IP is the last IP of a subnet and cannot be assigned to a device or a PC either.
Ex.1 192.168.22.50/30
From the problem, /30 converted to a subnet mask gives 255.255.255.252.
Use what was written above: eight 1 bits give 255 (converting binary to decimal),
and six 1 bits give 252, so the subnet mask is 255.255.255.252.
Next, work out how many IPs there are per subnet from that subnet mask.
Look at the remaining 2 bits, which can be anything; the possible combinations are 00, 01, 10 and 11, i.e. 4 of them.
Converting 00, 01, 10 and 11 to decimal gives:
00 in decimal is 0
01 in decimal is 1
10 in decimal is 2
11 in decimal is 3
In short, the number of IPs per subnet when the subnet mask is 255.255.255.252 is 4.
Or use the shortcut from the reference table: the number written above 252 is 4.

So for /30 the possible IP blocks per subnet, looking only at the last octet,
are 0-3, 4-7, 8-11, …, 252-255, or written as IPv4:
192.168.22.0 – 192.168.22.3
192.168.22.4 – 192.168.22.7
192.168.22.8 – 192.168.22.11
———–
192.168.22.48 – 192.168.22.51
———
192.168.22.252 – 192.168.22.255

Note: the first three octets stay the same because of the bitwise AND between the mask and the address: the first three octets of the mask are all 1 bits, and ANDing any number with all 1s gives that same number, so the first three octets keep their original decimal values.
The first IP address of each subnet is called the Network IP and the last IP address of each subnet is called the Broadcast IP. So, for 192.168.22.50/30:
1. Which IP address is the Network IP?
Answer: 192.168.22.48
2. Which IP address is the Broadcast IP?
Answer: 192.168.22.51
3. Which host IPs can actually be used (hosts per subnet)?
Answer: 192.168.22.49 – 192.168.22.50, i.e. 2 IPs can be assigned to hosts.
Besides writing it out as above, the Network IP can also be found in other ways:
The standard method is to AND the subnet mask with the given IP address; the result is the Network IP. Many books already explain this method.
The division method: take the relevant octet of the given IP and divide it by the number of IPs per subnet; e.g. for 192.168.22.50/30, divide 50 by 4 as shown below.

[Figure: long division of 50 by 4, giving the Network IP]

Once you have the Network IP, you get the same answers as above, since we already know from the table that a /30 subnet contains 4 IPs in total.
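
The same division shortcut can be scripted. Here is a minimal /bin/sh sketch with the values of this example (192.168.22.50/30) hard-coded:

#!/bin/sh
# Division method for 192.168.22.50/30: a /30 leaves 2 host bits, so 2^2 = 4 addresses per subnet
blocksize=4
host=50                                        # last octet of the given address
network=$(( host / blocksize * blocksize ))    # integer division drops the remainder -> 48
broadcast=$(( network + blocksize - 1 ))       # last address of the block -> 51
echo "Network:   192.168.22.$network"
echo "Broadcast: 192.168.22.$broadcast"
echo "Hosts:     192.168.22.$((network + 1)) - 192.168.22.$((broadcast - 1))"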

Ex.2 192.168.5.33/27 — which IP address should be assigned to the PC host?
A. 192.168.5.5
B. 192.168.5.32
C. 192.168.5.40
D. 192.168.5.63
E. 192.168.5.75
From the problem, /27 means:
11111111.11111111.11111111.111/XXXXX — the first 27 bits must be 1, and the last 5 bits can be anything.
/27 converted to decimal gives 255.255.255.224.

Or use the shortcut from the reference table: the relevant row is the sum of the bit values in the last octet, and the three 1 bits mean 128 + 64 + 32 = 224.
Once you have the subnet mask, you know there are 32 IPs per subnet, or read it off the table: the number written above 224 is 32.
From the problem 192.168.5.33/27, use whichever method you like to find the network first.
192.168.5.33/27 means the block 192.168.5.32 – 192.168.5.63.
The first IP is the Network IP (192.168.5.32) and the last IP is the Broadcast IP (192.168.5.63), neither of which can be assigned to a PC, so the IPs that can be assigned to a PC are 192.168.5.33 – 192.168.5.62.
The answer is therefore C, 192.168.5.40.
Ex.3 Can the IP 10.10.10.0/13 be assigned to a host?
An IP that can be assigned to a host, i.e. actually used, must be neither the Network IP nor the Broadcast IP.
To work it out, first convert /13 (a 13-bit mask) into a subnet mask:

11111111.11111/XXX.XXXXXXXX.XXXXXXXX — the first 13 bits must be 1, and the remaining bits can be anything.
/13 converted to decimal gives 255.248.0.0.
The problem can therefore be rewritten as: IP 10.10.10.0, subnet mask 255.248.0.0.
The next step is to find the IP range from the subnet mask we obtained, 255.248.0.0.
The first octet is fixed at 10, and the third and fourth octets can be anything from 0 to 255.
The second octet has to be calculated, so leave it aside for now and write the range as:
10.X.0.0 – 10.X.255.255

Considering only 248 (the second octet), the reference table shows the block size is 8, i.e. the blocks are 0-7, 8-15, 16-23, …, 248-255, or written in full:
10.0.0.0 – 10.7.255.255
10.8.0.0 – 10.15.255.255 ————> 10.10.10.0 from the problem falls in this block
10.16.0.0 – 10.23.255.255
————
10.248.0.0 – 10.255.255.255
So 10.10.10.0/13 falls in the block 10.8.0.0 – 10.15.255.255.
1. Which IP address is the Network IP?
Answer: 10.8.0.0
2. Which IP address is the Broadcast IP?
Answer: 10.15.255.255
3. Which host IPs can actually be used?
Answer: 10.8.0.1 – 10.15.255.254, so the IP 10.10.10.0/13 can indeed be used; it is a valid host address.

Finding the number of subnets and the number of hosts per subnet
To find the number of hosts per subnet from a given subnet mask, use the formula
2^n − 2
where n is the number of bits after the mask boundary; the 2 that is subtracted accounts for the Network IP and the Broadcast IP.

Ex.1 /30 → 11111111.11111111.11111111.111111/00
or 255.255.255.252, which gives
hosts/subnet = 2^n − 2 = 2^2 − 2 = 4 − 2 = 2

Ex.2 /20 → 11111111.11111111.1111/0000.00000000
or 255.255.240.0
hosts/subnet = 2^n − 2 = 2^12 − 2 = 4096 − 2 = 4094

To find the number of subnets from a given subnet mask, the formula used nowadays is
2^n, with no 2 subtracted, because nowadays every subnet can be used, and Cisco routers already enable "ip subnet-zero".
Here n is the number of bits counted from the nearest dot (octet boundary), or from the specified original mask, up to the mask boundary.

Ex.3 /30 → 11111111.11111111.11111111.111111/00
or 255.255.255.252, which gives
number of subnets = 2^n = 2^6 = 64

Ex.4 /20 → 11111111.11111111.1111/0000.00000000
or 255.255.240.0
number of subnets = 2^n = 2^4 = 16

Ex.5 Subnetting /20 into /27 gives how many subnets? Here the original mask is specified, so:
11111111.11111111.1111/1111.111/00000
number of subnets = 2^n = 2^7 = 128

Terms you should know
Classful and Classless
Classful addressing looks mainly at the class of the IP and ignores the mask; you only look at which class the IP address falls in, i.e. Class A, B or C:
Class A (0.0.0.0 – 127.255.255.255)
Class B (128.0.0.0 – 191.255.255.255)
Class C (192.0.0.0 – 223.255.255.255)
In the early days of IP addressing, addressing was classful, and classful addressing uses these default subnet masks:
A /8 255.0.0.0
B /16 255.255.0.0
C /24 255.255.255.0
So if we follow the classful principle, we cannot subnet differently from the default subnet mask.
Examples of classful routing protocols:
• RIP Version 1 (RIPv1)
• IGRP
Classless is the opposite of classful: it ignores the class of the IP and looks mainly at the mask, as in the calculations in the examples above, following the principle of Classless Inter-Domain Routing (CIDR). The mask can therefore be anything, regardless of which class the IP belongs to.
Examples of classless routing protocols:
• RIP Version 2 (RIPv2)
• EIGRP
• OSPF
• IS-IS
Variable Length Subnet Masks (VLSM)
In practice, the networks we use do not all have to be the same size (they do not have to have the same mask). For example, a point-to-point link only needs 2 IPs, so it should be masked with 30 bits (/30), i.e. a subnet mask of 255.255.255.252, while a LAN with only 20 machines should be masked with 27 bits (/27), i.e. a subnet mask of 255.255.255.224, and so on. The example in the figure below follows the VLSM principle: each subnet has a different mask, chosen to fit its size, which saves IPs, in other words uses the address space efficiently.

The default subnet mask of each class is as follows:
• Class A has a subnet mask of 255.0.0.0, or in binary:
11111111.00000000.00000000.00000000
(adding up all the 1 bits of an octet gives 255)

• Class B has a subnet mask of 255.255.0.0, or in binary:
11111111.11111111.00000000.00000000

• Class C has a subnet mask of 255.255.255.0, or in binary:
11111111.11111111.11111111.00000000

credit : http://forums.dp-server.com/topics/%7Bdp-server%7D-67-1-1.html

New Delphi Seattle MongoDB Sample

I created some more Delphi 10 Seattle samples to show off MongoDB and FireDAC functionality: LocalSQL, Indexing & Geospatial.

FireDAC MongoDB NoSQL

The first one queries some data from MongoDB allowing you to specify the match, sort and projection, then it stores the results in a DataSet. At that point you can use LocalSQL to write a SQL query against the result set. While FireDAC gives you full native support for MongoDB, it also puts the SQL back into NoSQL.

MongoDB FireDAC LocalSQL

Indexing is used to improve your query performance. It is really easy to work with MongoDB queries with FireDAC.

MongoDB FireDAC Indexes

And one of the cool features of MongoDB is that you can do spatial queries. Here is an example that shows how to create a Spatial index and then do a spatial query with FireDAC. This uses the restaurant data that is included with the shipping samples, so make sure you load the restaurant data first.

Geospatial MongoDB FireDAC

If you missed my previous post I had a MongoDB FireDAC and C++Builder sample.

[You can download my new samples here.]

See more:
https://forums.embarcadero.com/thread.jspa?messageID=714438 — a discussion of inserting data into MongoDB via FireDAC

 

Why You Should Never Use MongoDB

Disclaimer: I do not build database engines. I build web applications. I run 4-6 different projects every year, so I build a lot of web applications. I see apps with different requirements and different data storage needs. I’ve deployed most of the data stores you’ve heard about, and a few that you probably haven’t.

I’ve picked the wrong one a few times. This is a story about one of those times — why we picked it originally, how we discovered it was wrong, and how we recovered. It all happened on an open source project called Diaspora.

The project

Diaspora is a distributed social network with a long history. Waaaaay back in early 2010, four undergraduates from New York University made a Kickstarter video asking for $10,000 to spend the summer building a distributed alternative to Facebook. They sent it out to friends and family, and hoped for the best.

But they hit a nerve. There had just been another Facebook privacy scandal, and when the dust settled on their Kickstarter, they had raised over $200,000 from 6400 different people for a software project that didn’t yet have a single line of code written.

Diaspora was the first Kickstarter project to vastly overrun its goal. As a result, they got written up in the New York Times – which turned into a bit of a scandal, because the chalkboard in the backdrop of the team photo had a dirty joke written on it, and no one noticed until it was actually printed. In the NEW YORK TIMES. The fallout from that was actually how I first heard about the project.

As a result of their Kickstarter success, the guys left school and came out to San Francisco to start writing code. They ended up in my office. I was working at Pivotal Labs at the time, and one of the guys’ older brothers also worked there, so Pivotal offered them free desk space, internet, and, of course, access to the beer fridge. I worked with official clients during the day, then hung out with them after work and contributed code on weekends.

They ended up staying at Pivotal for more than two years. By the end of that first summer, though, they already had a minimal but working (for some definition) implementation of a distributed social network built in Ruby on Rails and backed by MongoDB.

That’s a lot of buzzwords. Let’s break it down.

“Distributed social network”

If you’ve seen the Social Network, you know everything you need to know about Facebook. It’s a web app, it runs on a single logical server, and it lets you stay in touch with people. Once you log in, Diaspora’s interface looks structurally similar to Facebook’s:

A screenshot of the Diaspora interface

A screenshot of the Diaspora user interface

There’s a feed in the middle showing all your friends’ posts, and some other random stuff along the sides that no one has ever looked at. The main technical difference between Diaspora and Facebook is invisible to end users: it’s the “distributed” part.

The Diaspora infrastructure is not located behind a single web address. There are hundreds of independent Diaspora servers. The code is open source, so if you want to, you can stand up your own server. Each server, called a pod, has its own database and its own set of users, and will interoperate with all the other Diaspora pods that each have their own database and set of users.

The Diaspora Ecosystem

Pods of different sizes communicate with each other, without a central hub.

Each pod communicates with the others through an HTTP-based API. Once you set up an account on a pod, it’ll be pretty boring until you follow some other people. You can follow other users on your pod, and you can also follow people who are users on other pods. When someone you follow on another pod posts an update, here’s what happens:

1. The update goes into the author’s pod’s database.

2. Your pod is notified over the API.

3. The update is saved in your pod’s database.

4. You look at your activity feed and see that post mixed in with posts from the other people you follow.

Comments work the same way. On any single post, some comments might be from people on the same pod as the post’s author, and some might be from people on other pods. Everyone who has permission to see the post sees all the comments, just as you would expect if everyone were on a single logical server.

Who cares?

There are technical and legal advantages to this architecture. The main technical advantage is fault tolerance.

Here is a very important fault tolerant system that every office should have.

If any one of the pods goes down, it doesn’t bring the others down. The system survives, and even expects, network partitioning. There are some interesting political implications to that — for example, if you’re in a country that shuts down outgoing internet to prevent access to Facebook and Twitter, your pod running locally still connects you to other people within your country, even though nothing outside is accessible.

The main legal advantage is server independence. Each pod is a legally separate entity, governed by the laws of wherever it’s set up. Each pod also sets their own terms of service. On most of them, you can post content without giving up your rights to it, unlike on Facebook. Diaspora is free software both in the “gratis” and the “libre” sense of the term, and most of the people who run pods care deeply about that sort of thing.

So that’s the architecture of the system. Let’s look at the architecture within a single pod.

It’s a Rails app.

Each pod is a Ruby on Rails web application backed by a database, originally MongoDB. In some ways the codebase is a ‘typical’ Rails app — it has both a visual and programmatic UI, some Ruby code, and a database. But in other ways it is anything but typical.

The internal structure of one Diaspora pod

The visual UI is of course how website users interact with Diaspora. The API is used by various Diaspora mobile clients — that part’s pretty typical — but it’s also used for “federation,” which is the technical name for inter-pod communication. (I asked where the Romulans’ access point was once, and got a bunch of blank looks. Sigh.) So the distributed nature of the system adds layers to the codebase that aren’t present in a typical app.

And of course, MongoDB is an atypical choice for data storage. The vast majority of Rails applications are backed by PostgreSQL or (less often these days) MySQL.

So that’s the code. Let’s consider what kind of data we’re storing.

I Do Not Think That Word Means What You Think That Means

“Social data” is information about our network of friends, their friends, and their activity. Conceptually, we do think about it as a network — an undirected graph in which we are at the center, and our friends radiate out around us.

Photos all from rubyfriends.com. Thanks Matt Rogers, Steve Klabnik, Nell Shamrell, Katrina Owen, Sam Livingston-Grey, Josh Susser, Akshay Khole, Pradyumna Dandwate, and Hephzibah Watharkar for contributing to #rubyfriends!

When we store social data, we’re storing that graph topology, as well as the activity that moves along those edges.

For quite a few years now, the received wisdom has been that social data is not relational, and that if you store it in a relational database, you’re doing it wrong.

But what are the alternatives? Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production. Other folks say that document databases are perfect for social data, and those are mainstream enough to actually be used. So let’s look at why people think social data fits more naturally in MongoDB than in PostgreSQL.

How MongoDB Stores Data

MongoDB is a document-oriented database. Instead of storing your data in tables made out of individual rows, like a relational database does, it stores your data in collections made out of individual documents. In MongoDB, a document is a big JSON blob with no particular format or schema.

Let’s say you have a set of relationships like this that you need to model. This is quite similar to a project that came through Pivotal that used MongoDB, and was the best use case I’ve ever seen for a document database.

At the root, we have a set of TV shows. Each show has many seasons, each season has many episodes, and each episode has many reviews and many cast members. When users come into this site, typically they go directly to the page for a particular TV show. On that page they see all the seasons and all the episodes and all the reviews and all the cast members from that show, all on one page. So from the application perspective, when the user visits a page, we want to retrieve all of the information connected to that TV show.

There are a number of ways you could model this data. In a typical relational store, each of these boxes would be a table. You’d have a tv_shows table, a seasons table with a foreign key into tv_shows, an episodes table with a foreign key into seasons, and reviews and cast_members tables with foreign keys into episodes. So to get all the information for a TV show, you’re looking at a five-table join.
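
As a sketch of what that five-table layout looks like in ActiveRecord (the table names are the ones above; the exact association details are an assumption):

class TvShow < ActiveRecord::Base
  has_many :seasons
end

class Season < ActiveRecord::Base
  belongs_to :tv_show
  has_many :episodes
end

class Episode < ActiveRecord::Base
  belongs_to :season
  has_many :reviews
  has_many :cast_members
end

class Review < ActiveRecord::Base
  belongs_to :episode
end

class CastMember < ActiveRecord::Base
  belongs_to :episode
end

# One show's page needs data from all five tables.
show = TvShow.includes(seasons: { episodes: [:reviews, :cast_members] }).find(show_id)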

We could also model this data as a set of nested hashes. The set of information about a particular TV show is one big nested key/value data structure. Inside a TV show, there’s an array of seasons, each of which is also a hash. Within each season, an array of episodes, each of which is a hash, and so on. This is how MongoDB models the data. Each TV show is a document that contains all the information we need for one show.

Here’s an example document for one TV show, Babylon 5.
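
Abridged, and with the values invented for illustration, its shape is roughly this:

babylon_5 = {
  "title"   => "Babylon 5",
  "network" => "PTEN",
  "seasons" => [
    {
      "season_number" => 1,
      "episodes" => [
        {
          "episode_number" => 1,
          "title"   => "Midnight on the Firing Line",
          "reviews" => [
            { "rating" => 9, "text" => "A strong start." }
          ],
          "cast_members" => [
            { "name" => "Michael O'Hare", "character" => "Jeffrey Sinclair" }
            # ...more cast members...
          ]
        }
        # ...more episodes...
      ]
    }
    # ...more seasons...
  ]
}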

It’s got some title metadata, and then it’s got an array of seasons. Each season is itself a hash with metadata and an array of episodes. In turn, each episode has some metadata and arrays for both reviews and cast members.

It’s basically a huge fractal data structure.

Sets of sets of sets of sets. Tasty fractals.

All of the data we need for a TV show is under one document, so it’s very fast to retrieve all this information at once, even if the document is very large. There’s a TV show here in the US called “General Hospital” that has aired over 12,000 episodes over the course of 50+ seasons. On my laptop, PostgreSQL takes about a minute to get denormalized data for 12,000 episodes, while retrieval of the equivalent document by ID in MongoDB takes a fraction of a second.

So in many ways, this application presented the ideal use case for a document store.

Ok. But what about social data?

Right. When you come to a social networking site, there’s only one important part of the page: your activity stream. The activity stream query gets all of the posts from the people you follow, ordered by most recent. Each of those posts has nested information within it, such as photos, likes, reshares, and comments.

The nested structure of activity stream data looks very similar to what we were looking at with the TV shows.

Users have friends, friends have posts, posts have comments and likes, each comment has one commenter and each like has one liker. Relationship-wise, it’s not a whole lot more complicated than TV shows. And just like with TV shows, we want to pull all this data at once, right after the user logs in. Furthermore, in a relational store, with the data fully normalized, it would be a seven-table join to get everything out.
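
A hedged sketch of that query in ActiveRecord, against a hypothetical fully normalized schema (the table names and the followed_user_ids helper are assumptions, not Diaspora’s actual schema):

# posts, users (the author), followings, photos, comments, likes, and
# users again for commenters and likers: roughly seven tables for one load.
stream = Post.where(author_id: current_user.followed_user_ids)
             .includes(:author, :photos, comments: :author, likes: :author)
             .order(created_at: :desc)
             .limit(50)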

Seven-table joins. Ugh. Suddenly storing each user’s activity stream as one big denormalized nested data structure, rather than doing that join every time, seems pretty attractive.

In 2010, when the Diaspora team was making this decision, Etsy’s articles about using document stores were quite influential, although they’ve since publicly moved away from MongoDB for data storage. Likewise, at the time, Facebook’s Cassandra was also stirring up a lot of conversation about leaving relational databases. Diaspora chose MongoDB for their social data in this zeitgeist. It was not an unreasonable choice at the time, given the information they had.

What could possibly go wrong?

There is a really important difference between Diaspora’s social data and the Mongo-ideal TV show data that no one noticed at first.

With TV shows, each box in the relationship diagram is a different type. TV shows are different from seasons are different from episodes are different from reviews are different from cast members. None of them is even a sub-type of another type.

But with social data, some of the boxes in the relationship diagram are the same type. In fact, all of these green boxes are the same type — they are all Diaspora users.

A user has friends, and each friend may themselves be a user. Or, they may not, because it’s a distributed system. (That’s a whole layer of complexity that I’m just skipping for today.) In the same way, commenters and likers may also be users.

This type duplication makes it way harder to denormalize an activity stream into a single document. That’s because in different places in your document, you may be referring to the same concept — in this case, the same user. The user who liked that post in your activity stream may also be the user who commented on a different post.

Duplicate data

We can represent this in MongoDB in a couple of different ways. Duplication is an easy option. All the information for that friend is copied and saved to the like on the first post, and then a separate copy is saved to the comment on the second post. The advantage is that all the data is present everywhere you need it, and you can still pull the whole activity stream back as a single document.

Here’s what this kind of fully denormalized stream document looks like.
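
Abridged, and with the names and URLs invented, it is shaped roughly like this:

joes_stream = {
  "name" => "Joe",
  "url"  => "https://pod.example.org/u/joe",          # Joe's own user data, inlined at the top
  "stream" => [
    {
      "author" => { "name" => "Jane", "url" => "https://pod.example.org/u/jane" },
      "body"   => "First snowfall of the year!",
      "likes"  => [
        { "name" => "Joe", "url" => "https://pod.example.org/u/joe" }   # a second, separate copy of Joe
      ],
      "comments" => []
    }
    # ...more posts from people Joe follows...
  ]
}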

Here we have copies of user data inlined. This is Joe’s stream, and it has a copy of his user data, including his name and URL, at the top level. His stream, just underneath, contains Jane’s post. Joe has liked Jane’s post, so under likes for Jane’s post, we have a separate copy of Joe’s data.

You can see why this is attractive: all the data you need is already located where you need it.

You can also see why this is dangerous. Updating a user’s data means walking through all the activity streams that they appear in to change the data in all those different places. This is very error-prone, and often leads to inconsistent data and mysterious errors, particularly when dealing with deletions.

Is there no hope?

There is another approach you can take to this problem in MongoDB, which will be more familiar if you have a relational background. Instead of duplicating user data, you can store references to users in the activity stream documents.

With this approach, instead of inlining this user data wherever you need it, you give each user an ID. Once users have IDs, we store the user’s ID every place that we were previously inlining data. New IDs are in green below.

MongoDB actually uses BSON IDs, which are strings sort of like GUIDs, but to make these samples easier to read I’m just using integers.
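
With that caveat, the referenced version of the same stream looks roughly like this (values invented as before):

# In the users collection, each user document now has an ID of its own.
joe  = { "_id" => 1, "name" => "Joe",  "url" => "https://pod.example.org/u/joe"  }
jane = { "_id" => 2, "name" => "Jane", "url" => "https://pod.example.org/u/jane" }

# Joe's stream document stores those IDs instead of inlined user data.
joes_stream = {
  "user_id" => 1,
  "stream"  => [
    {
      "author_id" => 2,
      "body"      => "First snowfall of the year!",
      "likes"     => [{ "user_id" => 1 }],
      "comments"  => []
    }
  ]
}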

This eliminates our duplication problem. When user data changes, there’s only one document that gets rewritten. However, we’ve created a new problem for ourselves. Because we’ve moved some data out of the activity streams, we can no longer construct an activity stream from a single document. This is less efficient and more complex. Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars.

What’s missing from MongoDB is a SQL-style join operation, which is the ability to write one query that mashes together the activity stream and all the users that the stream references. Because MongoDB doesn’t have this ability, you end up manually doing that mashup in your application code, instead.
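
Concretely, that application-side “join” ends up looking something like this with the Ruby MongoDB driver (the collection and field names follow the sketches above, and are assumptions):

require 'mongo'

client          = Mongo::Client.new('mongodb://127.0.0.1:27017/diaspora_dev')
current_user_id = 1   # whoever is logged in

# 1) Retrieve the stream document itself.
stream = client[:streams].find('user_id' => current_user_id).first

# 2) Collect every user ID it references...
user_ids = stream['stream'].flat_map do |post|
  [post['author_id']] + post['likes'].map { |like| like['user_id'] }
end.uniq

# ...fetch those user documents in one query...
users_by_id = client[:users].find('_id' => { '$in' => user_ids })
                            .each_with_object({}) { |user, index| index[user['_id']] = user }

# ...and mash the two together by hand to fill in names and avatars.
stream['stream'].each do |post|
  post['author'] = users_by_id[post['author_id']]
end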

Simple Denormalized Data

Let’s return to TV shows for a second. The set of relationships for a TV show doesn’t have a lot of complexity. Because all the boxes in the relationship diagram are different entities, the entire query can be denormalized into one document with no duplication and no references. Within that one document there are no links to other documents, so no joins are required.

On a social network, however, nothing is that self-contained. Any time you see something that looks like a name or a picture, you expect to be able to click on it and go see that user, their profile, and their posts. A TV show application doesn’t work that way. If you’re on season 1 episode 1 of Babylon 5, you don’t expect to be able to click through to season 1 episode 1 of General Hospital.

Don’t. Link. The. Documents.

Once we started doing ugly MongoDB joins manually in the Diaspora code, we knew it was the first sign of trouble. It was a sign that our data was actually relational, that there was value to that structure, and that we were going against the basic concept of a document data store.

Whether you’re duplicating critical data (ugh), or using references and doing joins in your application code (double ugh), when you have links between documents, you’ve outgrown MongoDB. When the MongoDB folks say “documents,” in many ways, they mean things you can print out on a piece of paper and hold. A document may have internal structure — headings and subheadings and paragraphs and footers — but it doesn’t link to other documents. It’s a self-contained piece of semi-structured data.

If your data looks like that, you’ve got documents. Congratulations! It’s a good use case for Mongo. But if there’s value in the links between documents, then you don’t actually have documents. MongoDB is not the right solution for you. It’s certainly not the right solution for social data, where links between documents are actually the most critical data in the system.

So social data isn’t document-oriented. Does that mean it’s actually…relational?

That Word Again

When people say “social data isn’t relational,” that’s not actually what they mean. They mean one of these two things:

1. “Conceptually, social data is more of a graph than a set of tables.”

This is absolutely true. But there are actually very few concepts in the world that are naturally modeled as normalized tables. We use that structure because it’s efficient, because it avoids duplication, and because when it does get slow, we know how to fix it.

2. “It’s faster to get all the data from a social query when it’s denormalized into a single document.”

This is also absolutely true. When your social data is in a relational store, you need a many-table join to extract the activity stream for a particular user, and that gets slow as your tables get bigger. However, we have a well-understood solution to this problem. It’s called caching.

At the All Your Base Conf in Oxford earlier this year, where I gave the talk version of this post, Neha Narula had a great talk about caching that I recommend you watch once it’s posted. In any case, caching in front of a normalized data store is a complex but well-understood problem. I’ve seen projects cache denormalized activity stream data into a document database like MongoDB, which makes retrieving that data much faster. The only problem they have then is cache invalidation.

“There are only two hard problems in computer science: cache invalidation and naming things.”

Phil Karlton

It turns out cache invalidation is actually pretty hard. Phil Karlton wrote most of SSL version 3, X11, and OpenGL, so he knows a thing or two about computer science.

Cache Invalidation As A Service

But what is cache invalidation, and why is it so hard?

Cache invalidation is just knowing when a piece of your cached data is out of date, and needs to be updated or replaced. Here’s a typical example that I see every day in web applications. We have a backing store, typically PostgreSQL or MySQL, and then in front of that we have a caching layer, typically Memcached or Redis. Requests to read a user’s activity stream go to the cache rather than the database directly, which makes them very fast.

Typical cache and backing store setup

Application writes are more complicated. Let’s say a user with two followers writes a new post. The first thing that happens (step 1) is that the post data is copied into the backing store. Once that completes, a background job (step 2) appends that post to the cached activity stream of both of the users who follow the author.
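
A minimal sketch of that write path, assuming ActiveRecord for the backing store and the redis gem for the cache (the key scheme and the follower lookup are invented):

require 'redis'
require 'json'

REDIS = Redis.new

def create_post(author, body)
  # Step 1: the post goes into the consistent backing store (MySQL/PostgreSQL).
  post = Post.create!(author: author, body: body)

  # Step 2 (normally a background job): append the post to the cached
  # activity stream of each of the author's followers.
  author.followers.each do |follower|
    REDIS.lpush("stream:#{follower.id}", post.to_json)
  end

  post
end

# Reads go straight to the cache, which is what makes them fast.
def activity_stream(user)
  REDIS.lrange("stream:#{user.id}", 0, 49).map { |raw| JSON.parse(raw) }
end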

This pattern is quite common. Twitter holds recently-active users’ activity streams in an in-memory cache, which they append to when someone they follow posts something. Even smaller applications that employ some kind of activity stream will typically end up here (see: seven-table join).

Back to our example. When the author changes an existing post, the update process is essentially the same as for a create, except instead of appending to the cache, it updates an item that’s already there.

What happens if that step 2 background job fails partway through? Machines get rebooted, network cables get unplugged, applications restart. Instability is the only constant in our line of work. When that happens, you’ll end up with invalid data in your cache. Some copies of the post will have the old title, and some copies will have the new title. That’s a hard problem, but with a cache, there’s always the nuclear option.

Always an option >_<

You can always delete the entire activity stream record out of your cache and regenerate it from your consistent backing store. It may be slow, but at least it’s possible.
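
In code, the nuclear option is short (assuming the key scheme from the sketch above; recent_posts_for stands in for the real backing-store query):

def rebuild_stream!(user)
  REDIS.del("stream:#{user.id}")                    # throw away the possibly-inconsistent cache entry
  recent_posts_for(user).each do |post|             # hypothetical query against the backing store
    REDIS.rpush("stream:#{user.id}", post.to_json)  # rebuild the stream from consistent data
  end
end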

What if there is no backing store? What if you skip step 1? What if the cache is all you have?

When MongoDB is all you have, it’s a cache with no backing store behind it. It will become inconsistent. Not eventually consistent — just plain, flat-out inconsistent, for all time. At that point, you have no options. Not even a nuclear one. You have no way to regenerate the data in a consistent state.

When Diaspora decided to store social data in MongoDB, we were conflating a database with a cache. Databases and caches are very different things. They have very different ideas about permanence, transience, duplication, references, data integrity, and speed.

The Conversion

Once we figured out that we had accidentally chosen a cache for our database, what did we do about it?

Well, that’s the million-dollar question. But I’ve already answered the billion-dollar question. In this post I’ve talked about how we used MongoDB vs. how it was designed to be used. I’ve talked about it as though all that information were obvious, and the Diaspora team just failed to research adequately before choosing.

But this stuff wasn’t obvious at all. The MongoDB docs tell you what it’s good at, without emphasizing what it’s not good at. That’s natural. All projects do that. But as a result, it took us about six months, a lot of user complaints, and a lot of investigation to figure out that we were using MongoDB the wrong way.

There was nothing to do but take the data out of MongoDB and move it to a relational store, dealing as best we could with the inconsistent data we uncovered along the way. The data conversion itself — export from MongoDB, import to MySQL — was straightforward. For the mechanical details, you can see my slides from All Your Base Conf 2013.

The Damage

We had eight months of production data, which turned into about 1.2 million rows in MySQL. We spent four pair-weeks developing the code for the conversion, and when we pulled the trigger, the main site had about two hours of downtime. That was more than acceptable for a project that was in pre-alpha. We could have reduced that downtime further, but we had budgeted for eight hours of downtime, so two actually seemed fantastic.

NOT BAD

Epilogue

Remember that TV show application? It was the perfect use case for MongoDB. Each show was one document, perfectly self-contained. No references to anything, no duplication, and no way for the data to become inconsistent.

About three months into development, it was still humming along nicely on MongoDB. One Monday, at the weekly planning meeting, the client told us about a new feature that one of their investors wanted: when they were looking at the actors in an episode of a show, they wanted to be able to click on an actor’s name and see that person’s entire television career. They wanted a chronological listing of all of the episodes of all the different shows that actor had ever been in.

We stored each show as a document in MongoDB containing all of its nested information, including cast members. If the same actor appeared in two different episodes, even of the same show, their information was stored in both places. We had no way to tell, aside from comparing the names, whether they were the same person. So to implement this feature, we needed to search through every document to find and de-duplicate instances of the actor that the user clicked on. Ugh. At a minimum, we needed to de-dup them once, and then maintain an external index of actor information, which would have the same invalidation issues as any other cache.
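
For a sense of what that scan involved, here is a rough sketch (the field names follow the TV show document sketch above and are assumptions):

require 'mongo'

client     = Mongo::Client.new('mongodb://127.0.0.1:27017/tv_shows_dev')
actor_name = "Michael O'Hare"

# Dot notation descends through the nested arrays, so this finds every
# show document in which the actor appears anywhere at all.
shows = client[:tv_shows].find('seasons.episodes.cast_members.name' => actor_name)

# But we still have to walk each matching document to pull out the
# individual episodes, and the name is the only thing identifying the person.
appearances = shows.flat_map do |show|
  show['seasons'].flat_map do |season|
    season['episodes']
      .select { |ep| ep['cast_members'].any? { |cm| cm['name'] == actor_name } }
      .map    { |ep| [show['title'], season['season_number'], ep['title']] }
  end
end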

You See Where This Is Going

The client expected this feature to be trivial. If the data had been in a relational store, it would have been. As it was, we first tried to convince the PM they didn’t need it. After that failed, we offered some cheaper alternatives, such as linking to an IMDB search for the actor’s name. The company made money from advertising, though, so they wanted users to stay on their site rather than going off to IMDB.

This feature request eventually prompted the project’s conversion to PostgreSQL. After a lot more conversation with the client, we realized that the business saw lots of value in linking TV shows together. They envisioned seeing other shows a particular director had been involved with, and episodes of other shows that were released the same week this one was, among other things.

This was ultimately a communication problem rather than a technical problem. If these conversations had happened sooner, if we had taken the time to really understand how the client saw the data and what they wanted to do with it, we probably would have done the conversion earlier, when there was less data, and it was easier.

Always Be Learning

I learned something from that experience: MongoDB’s ideal use case is even narrower than our television data. The only thing it’s good at is storing arbitrary pieces of JSON. “Arbitrary,” in this context, means that you don’t care at all what’s inside that JSON. You don’t even look. There is no schema, not even an implicit schema, as there was in our TV show data. Each document is just a blob whose interior you make absolutely no assumptions about.

At RubyConf this weekend, I ran into Conrad Irwin, who suggested this use case. He’s used MongoDB to store arbitrary bits of JSON that come from customers through an API. That’s reasonable. The CAP theorem doesn’t matter when your data is meaningless. But in interesting applications, your data isn’t meaningless.

I’ve heard many people talk about dropping MongoDB into their web application as a replacement for MySQL or PostgreSQL. There are no circumstances under which that is a good idea. Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value. If you have an implicit schema — meaning, if there are things you are expecting in that JSON — then MongoDB is the wrong choice. I suggest taking a look at PostgreSQL’s hstore (now apparently faster than MongoDB anyway), and learning how to make schema changes. They really aren’t that hard, even in large tables.
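
If what you want from “schemaless” storage is a handful of free-form keys on an otherwise relational model, here is a hedged sketch of the hstore route in Rails (the model, column, and key names are invented):

class AddPreferencesToUsers < ActiveRecord::Migration
  def change
    enable_extension 'hstore'                     # PostgreSQL extension providing the hstore type
    add_column :users, :preferences, :hstore
    add_index  :users, :preferences, using: :gin  # the column stays indexable and queryable
  end
end

class User < ActiveRecord::Base
  store_accessor :preferences, :theme, :locale    # free-form keys; everything else stays relational
end

# Querying inside the hstore column still works:
# User.where("preferences -> 'theme' = ?", "dark")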

Find The Value

When you’re picking a data store, the most important thing to understand is where in your data — and where in its connections — the business value lies. If you don’t know yet, which is perfectly reasonable, then choose something that won’t paint you into a corner. Pushing arbitrary JSON into your database sounds flexible, but true flexibility is easily adding the features your business needs.

Make the valuable things easy.

The End.