Thursday, 15 November 2012

Usage quotas

In an attempt to distract me from OSPF, I have been playing with ways to handle usage quotas.

The idea is that on some services, some customers would like a way to pre-set a cap on usage, whether it is a data SIM or a broadband line.

The catch with such ideas is that the usage metering is by RADIUS accounting, which we run every hour. Now, we could do this more often, but even a short interval can cover a very large amount of usage on some services (especially with things like 330Mb/s FTTP lines in the pipeline).
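To put rough numbers on that, here is a small Python sketch of how much data a line running flat out could move between two accounting snapshots. The function name and the use of decimal units (1 Mb/s = 1,000,000 bits/s) are assumptions made purely for illustration.

```python
def max_bytes_between_updates(line_rate_mbps: float, interval_seconds: int) -> float:
    """Upper bound on bytes transferred in one accounting interval.

    Assumes decimal units (1 Mb/s = 1,000,000 bits/s) and a line that is
    saturated for the whole interval, which is the worst case.
    """
    bits = line_rate_mbps * 1_000_000 * interval_seconds
    return bits / 8


if __name__ == "__main__":
    # Compare hourly accounting with a 5 minute interval on a 330Mb/s line.
    for label, seconds in (("hourly", 3600), ("5 minute", 300)):
        gigabytes = max_bytes_between_updates(330, seconds) / 1e9
        print(f"{label} interval at 330Mb/s: up to {gigabytes:.1f} GB unmetered")
```

On those assumptions an hourly snapshot can miss up to roughly 148 GB, and even a 5 minute one can miss about 12 GB, which is why polling more often does not really solve the problem.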

So the trick is to tell the LNS a quota for a connection and have it spot, in something like real time, that the usage has exceeded it.

Now, RADIUS already has something like that, but it was designed for dialup and is a Session-Timeout or Idle-Timeout which is based on time not data usage. We already support Session-Timeout (though not Idle-Timeout as that makes little sense on broadband). I can't find a RADIUS AVP for data quota limit, so I am using the Filter settings (like most of the other special settings we use).

First snag is what you meter. For data SIMs the usage is Tx and Rx (total) but for ADSL it is only Tx that matters. So we have to support a choice of quota metering type.
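The choice of metering type can be sketched in a few lines of Python. The names below (`QuotaMeter`, `metered_usage`, `quota_exceeded`) are hypothetical and are not the LNS's actual filter settings; treating "reached" as "greater than or equal to" is also an assumption.

```python
from enum import Enum


class QuotaMeter(Enum):
    """Which counters count towards the quota (hypothetical names)."""
    TX_ONLY = "tx"        # e.g. ADSL, where only Tx matters
    TX_PLUS_RX = "tx+rx"  # e.g. data SIMs, where the total matters


def metered_usage(tx_bytes: int, rx_bytes: int, meter: QuotaMeter) -> int:
    """Return the usage figure that should be compared with the quota."""
    if meter is QuotaMeter.TX_ONLY:
        return tx_bytes
    return tx_bytes + rx_bytes


def quota_exceeded(tx_bytes: int, rx_bytes: int,
                   meter: QuotaMeter, quota_bytes: int) -> bool:
    """True once the metered usage has reached the quota."""
    return metered_usage(tx_bytes, rx_bytes, meter) >= quota_bytes


if __name__ == "__main__":
    # The same counters can be over quota for one service type and not another.
    print(quota_exceeded(700, 400, QuotaMeter.TX_ONLY, 1000))
    print(quota_exceeded(700, 400, QuotaMeter.TX_PLUS_RX, 1000))
```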

Then we have the question of what to do when we reach a limit. Well, RADIUS has Terminate-Action AVP which is used with Session-Timeout and allows either ending the session or resending an Access-Request. Terminating the session is messy. Many routers reconnect within a few seconds but some take minutes. It would be neater not to drop the PPP session.

Sadly the idea of re-authentication is somewhat flawed. For a start, the way we work, we throw away all the PPP negotiation and authentication data once the connection is completed, so we can't re-auth. We could try a new CHAP challenge, but many routers and systems barf at that (including mobile data links) even though the spec says that should work, and anyway this does not work with PAP. Even if we solved these self-imposed issues, the RADIUS server would not have the up-to-date data to decide what to do, as the last accounting could have been up to an hour earlier. The solution is not to re-authenticate, but to send an intermediate RADIUS accounting packet instead.

This fits well as the RADIUS accounting server then has the up to date information to decide if over quota, but it can also check the customer database to confirm if there are changes (billing errors, start of new month, top-up, etc), and decide if the line needs to be locked down or not, and what quota it now needs. Sending an accounting update as soon as we hit the quota will allow the accounting server to know we are over limit immediately. It can then use a RADIUS CoA (Change of Authorisation) to change the line in some way if needed.

The CoA can be used to disconnect the line, clamp it to a low speed, or force it on to a special routing table to hit a captive portal prompting people to top-up. It can also un-do these effects, and all without ever dropping the PPP link. It can update the quota as we roll over to a new month. All handled in one place. As the saying goes, "simples!".
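A minimal sketch of the decision an accounting server might make when such an update arrives. Every name here (`CustomerState`, `LineAction`, `CoaDecision`, `decide_on_interim_update`) is hypothetical and not taken from any real system or RADIUS library, and actually sending the CoA is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LineAction(Enum):
    """Things a CoA could ask the LNS to do (hypothetical names)."""
    NONE = "none"
    CLAMP = "clamp"
    CAPTIVE_PORTAL = "captive-portal"
    DISCONNECT = "disconnect"
    RESTORE = "restore"  # undo an earlier clamp or portal diversion


@dataclass
class CustomerState:
    allowance_bytes: int  # current allowance from the customer database
    used_bytes: int       # usage so far, as of this accounting packet
    locked_down: bool     # is the line currently restricted?
    over_quota_action: LineAction = LineAction.CAPTIVE_PORTAL


@dataclass
class CoaDecision:
    action: LineAction
    new_quota_bytes: Optional[int]  # remaining quota to hand to the LNS


def decide_on_interim_update(state: CustomerState) -> CoaDecision:
    """Decide what, if anything, to change on the line."""
    remaining = state.allowance_bytes - state.used_bytes
    if remaining > 0:
        # Top-up, new month or billing fix: lift any restriction, set new quota.
        action = LineAction.RESTORE if state.locked_down else LineAction.NONE
        return CoaDecision(action, remaining)
    if state.locked_down:
        # Already restricted and still over quota: nothing more to do.
        return CoaDecision(LineAction.NONE, None)
    return CoaDecision(state.over_quota_action, None)


if __name__ == "__main__":
    print(decide_on_interim_update(CustomerState(10_000, 12_000, False)))
    print(decide_on_interim_update(CustomerState(20_000, 12_000, True)))
```

The point of the design is that the same code path handles both directions: locking a line down when it runs out, and restoring it with a fresh quota once the customer database says there is allowance again.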

So, having changed the LNS to support Terminate-Action choice of hang-up or accounting update, and to handle filters for Tx limit or Tx+Rx limit, we can now make some new service options around that. Perhaps pre-pay data SIMs? Or usage capped broadband services... Watch this space.

10 comments:

  1. To limit sessions, we use RADIUS POD (packet of disconnect). Each time we get an accounting update (every 5 minutes), we check to see if the customer has reached his transfer limit and, if so, we send a packet to the LNS.

    We have our own RADIUS server (as everyone should), but if you do not, have a look at Radiator, which may perhaps be extended with some Perl in its configuration file to do something similar.

    The advantage is that it is a cross vendor solution.

    Replies
    1. Yeh, we have POD and CoA support, but run RADIUS on the hour (snapshot exactly on the hour, which is nice). We could do every 5 mins, but there is not really a need, especially now we have this set up... Useful feedback though, Thomas. We can do 5 min RADIUS if a customer wants an LNS that does that, obviously - this is neater :-)

  2. Your way is clearly less demanding on RADIUS and, as RADIUS is UDP, less prone to problems in case of a networking issue. That said, some implementations do kill the L2TP session if the accounting packet is not acked, so you can be sure the billing is never missed - stable connection, correct billing, packet loss, pick two :)
    We use 5 minute sampling as we are collecting the information and then generating per-DSL usage graphs our customers can consult through our portal. Have a good week-end (if you are not working :D), see you Monday.

    Replies
    1. Yes, I have been meaning to add a "kill if no ack". At present we try several times, and then carry on anyway, so it would mean the over-usage is picked up on the next hour. As for graphing, the LNS makes real-time graphs on 100 second samples with loss, latency (min, max, ave), and tx/rx throughput.

  3. Personally, on our combination of FreeRADIUS/MySQL and Cisco BRAS we found with the FTTC services that RADIUS accounting became increasingly unreliable the more data the user consumed.

    We started running Netflow and found that on our low end users, data matched perfectly. If RADIUS said they used 10GB in the last week, Netflow agreed. However on the high end users, we found that if RADIUS said they used 100GB in a week, Netflow would say they'd used 300GB. We ended up having to double check it and confirm it via SNMP interface monitoring to be sure.

    We're not C programmers, so we can't confirm if it was FreeRADIUS recording wrong information or the Cisco sending wrong information (I suspect an integer overflow somewhere), but we run our user statistics from Netflow now.

    Replies
    1. That sounds like a simple matter of using 32 bit counters, not 64. It means you lose about 4.3GB (2^32 bytes) on a line each time it exceeds that between RADIUS updates. If the Cisco can send 64 bit counters and FreeRADIUS can use them, you should be able to solve that. RADIUS should be 100% accurate.

    2. You would need to check to see if you are getting Gigaword attributes back in your RADIUS acct packets.

      For example, this is what we see reported by a FireBrick to our FreeRADIUS setup:

      Fri Nov 16 15:14:04 2012
      Acct-Status-Type = Interim-Update
      Acct-Delay-Time = 3
      Event-Timestamp = "Nov 16 2012 15:14:00 GMT"
      Acct-Input-Octets = 305193102
      Acct-Input-Gigawords = 0
      Acct-Output-Octets = 1139388226
      Acct-Output-Gigawords = 2
      Acct-Session-Time = 112007
      Acct-Input-Packets = 5488461
      Acct-Output-Packets = 6794151
      [snip]
      NAS-Port = 205
      Acct-Unique-Session-Id = "9216a739606ff380"
      Timestamp = 1353078844

      ... and from MPD5 to the same FreeRADIUS setup:

      Fri Nov 9 23:00:18 2012
      [snip]
      Acct-Session-Id = "2250857-L4-16"
      NAS-Port = 16
      NAS-Port-Type = Virtual
      Service-Type = Framed-User
      Framed-Protocol = PPP
      [snip]
      mpd-link = "L4-16"
      Tunnel-Type:0 = L2TP
      Tunnel-Medium-Type:0 = IPv4
      [snip]
      Acct-Multi-Session-Id = "2250857-B4-16"
      mpd-bundle = "B4-16"
      mpd-iface = "ng15"
      mpd-iface-index = 19
      Acct-Link-Count = 1
      Acct-Authentic = RADIUS
      Acct-Status-Type = Interim-Update
      Acct-Session-Time = 251163
      Acct-Input-Octets = 238698546
      Acct-Input-Packets = 2705651
      Acct-Input-Gigawords = 0
      Acct-Output-Octets = 821505488
      Acct-Output-Packets = 3761764
      Acct-Output-Gigawords = 1
      Acct-Unique-Session-Id = "9cea8121641c890e"
      Timestamp = 1352502018

      RADIUS doesn't use 64-bit counters although most L2TP implementations use them; it uses a 32-bit integer for Input and Output-Octets respectively and in order to cope with overflows, it uses 32-bit integers for Input and Output-Gigawords to count how many times each of the Input and Output-Octets values have overflowed and been reset to zero.

      You get a 64-bit counter but in the form of two 32-bit counters - a necessary hack to ensure backwards compatibility with legacy RADIUS implementations.
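As a small worked example of combining the two attributes, here is a Python sketch. The helper name `total_octets` is made up for illustration; the input values are the ones from the two accounting records quoted above.

```python
WRAP = 2 ** 32  # the Octets attributes are 32-bit, so they wrap at this value


def total_octets(octets: int, gigawords: int = 0) -> int:
    """Combine a 32-bit Octets value and its Gigawords wrap count."""
    return gigawords * WRAP + octets


if __name__ == "__main__":
    # FireBrick record: Acct-Output-Octets = 1139388226, Acct-Output-Gigawords = 2
    firebrick_out = total_octets(1139388226, 2)
    print(firebrick_out)  # 9729322818

    # MPD5 record: Acct-Output-Octets = 821505488, Acct-Output-Gigawords = 1
    print(total_octets(821505488, 1))  # 5116472784

    # Bytes that go missing if the Gigawords attribute is ignored.
    print(firebrick_out - total_octets(1139388226))  # 8589934592
```

A server or database column that only keeps the Octets value would record about 1.1GB for the FireBrick session instead of about 9.7GB.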

      FreeRADIUS has had support for Gigawords since 1.1.7 (IIRC).

      If I were you, I would check that, one, the RADIUS server understands Gigawords and, two, that it is taking those values into consideration when calculating total usage, and that the backend database in which FreeRADIUS stores the values can handle more than 32 bits in the relevant columns.

  4. Where is your greatest cost for transit? Getting traffic back to the customer through BT's network, or traffic out onto t'internet?

    If the former, then someone could rack up a massive bill (for you?) by repeatedly trying to download stuff.

    Replies
    1. Obviously a quota system is not a problem here - what could be is what happens when bouncing someone to a captive portal or some such when they reach the quota.

      The cost is, indeed, the link to BT, but it is all in the download (a single user or even a large group of users could not upload enough to ever make that dominant in costs). So it is down to us what we choose to send down the line.

      When "locked down" like this, they just get to the portal, and that will count as their usage still. The portal is likely to be a small page. They will also be clamped, probably to 128Kb/s. So yes, they could hammer that and tie up some of their usage beyond what they wanted to or paid for.

      What was more of a concern was tunnelling by DNS. We are not going to fudge the DNS (for so many good reasons), just divert the IP traffic to the portal. We should have necessary rate limiting in the DNS anyway to stop that, but also, as I say, line capped to 128Kb/s anyway.

  5. Pre-pay data SIMs would be great for small projects, especially if you could accept debit card payments. My employer's admin system doesn't allow direct debits.
