
Most Efficient way of implementing a BlackList

Asked: 2011-10-27T07:28:50 · Author: bratao


I am developing an IP filter and was wondering how I could, using any kind of data structure, build a very efficient and fast blacklist filter.

What I want to do is simple: for every incoming/outgoing connection, I have to check against a list of blocked IPs.

The IPs are scattered, and memory use should be at most linear in the size of the blocked list, because I want to run this on limited systems (homebrew routers).

I have time and could build anything from scratch. Difficulty is not a concern for me. If you could use anything, what would you do?

Author: bratao. Reproduced under the CC 4.0 BY-SA license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/7910130/most-efficient-way-of-implementing-a-blacklist
Salvatore Previti :

Hashtables are the way to go. They have average O(1) complexity for lookup, insertion and deletion! They tend to occupy more memory than trees but are much faster.

Since you are just working with 32-bit integers (you can of course convert an IP to a 32-bit integer), things will be amazingly simple and fast.

You could also just use a sorted array. Insertion and removal cost O(n), but lookup is O(log n), and above all the memory cost is just 4 bytes for each IP. The implementation is very simple, perhaps too much so :D

Binary trees have O(log n) complexity for lookup, insertion and deletion. A simple binary tree would not be sufficient, however; you need an AVL tree or a red-black tree, which can be very annoying and complicated to implement. AVL and red-black trees are able to balance themselves, and we need that because an unbalanced tree has a worst-case lookup complexity of O(n), the same as a simple linked list!

If instead of single IPs you need to ban IP ranges, you probably need a Patricia trie, also called a radix tree; they were invented for word dictionaries and for IP dictionaries. However, these trees can be slower if not well written/balanced. Hashtables are always better for simple lookups! They are too fast to be real :)

Now about synchronization:

If you fill the blacklist only once at application startup, you can use a plain read-only hashtable (or radix tree) that has no multithreading or locking problems.

If you need to update it only occasionally, I would suggest using reader-writer locks.

If you need very frequent updates, I would suggest using a concurrent hashtable. Warning: don't write your own; they are very complicated and bug-prone, so find an implementation on the web! They make heavy use of the (relatively) new atomic CAS operations of modern processors (CAS means compare-and-swap). These are a special set of instructions, or sequences of instructions, that allow 32-bit or 64-bit fields in memory to be compared and swapped in a single atomic operation without locking. Using them can be complicated because you have to know your processor, your operating system and your compiler very well, and the algorithm itself is counterintuitive. See http://en.wikipedia.org/wiki/Compare-and-swap for more information about CAS.

Concurrent AVL trees have been invented, but they are so complicated that I really don't know what to say about them :) For example: http://hal.inria.fr/docs/00/07/39/31/PDF/RR-2761.pdf

I just found that concurrent radix trees exist: ftp://82.96.64.7/pub/linux/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf but they are quite complicated too.

Concurrent sorted arrays don't exist, of course; you need a reader-writer lock for updates.

Consider also that the amount of memory required for a non-concurrent hashtable can be quite small: for each IP you need 4 bytes for the IP plus a pointer. You also need a big array of pointers (or 32-bit integers with some tricks) whose size should be a prime number greater than the number of items to be stored. Hashtables can of course also resize themselves when required if you want, and they can store more items than that prime number, at the cost of slower lookups.

For both trees and hashtables, the space complexity is linear.

I hope this is a multithreaded application and not a multiprocess one (fork). If it is not multithreaded, you cannot share a portion of memory in a fast and reliable way.
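The hashtable-of-32-bit-integers idea above can be sketched as a small open-addressing hash set in C. The capacity, the multiplicative hash, and the power-of-two table size (used here for cheap masking, instead of the prime size the answer suggests) are illustrative choices, not part of the original answer:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: fixed-size open-addressing hash set for IPv4 addresses
 * stored as host-order 32-bit integers. CAP must stay comfortably
 * larger than the number of stored IPs, or probing degrades. */
#define CAP 4096u            /* power of two, so we can mask instead of mod */
#define EMPTY 0u             /* sentinel: 0.0.0.0 cannot be stored */

static uint32_t table[CAP];

/* Knuth-style multiplicative hash for integers. */
static uint32_t hash_ip(uint32_t ip) { return ip * 2654435761u; }

static void blacklist_add(uint32_t ip) {
    uint32_t i = hash_ip(ip) & (CAP - 1);
    while (table[i] != EMPTY && table[i] != ip)
        i = (i + 1) & (CAP - 1);          /* linear probing */
    table[i] = ip;
}

static int blacklist_contains(uint32_t ip) {
    uint32_t i = hash_ip(ip) & (CAP - 1);
    while (table[i] != EMPTY) {
        if (table[i] == ip) return 1;
        i = (i + 1) & (CAP - 1);
    }
    return 0;
}

/* Pack dotted-quad octets into a host-order 32-bit integer. */
static uint32_t make_ip(int a, int b, int c, int d) {
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  |  (uint32_t)d;
}
```

Memory use matches the answer's point: 4 bytes per slot, no per-entry pointers at all in this open-addressing variant, and linear space overall.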
2011-10-26T23:36:15
caf :

One way to improve the performance of such a system is to use a Bloom filter. This is a probabilistic data structure, taking up very little memory, in which false positives are possible but false negatives are not.

When you want to look up an IP address, you first check the Bloom filter. If there's a miss, you can allow the traffic right away. If there's a hit, you need to check your authoritative data structure (e.g. a hash table or prefix tree).

You could also create a small cache of "hits in the Bloom filter but actually allowed" addresses that is checked after the Bloom filter but before the authoritative data structure.

Basically, the idea is to speed up the fast path (IP address allowed) at the expense of the slow path (IP address denied).
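A minimal sketch of this fast path in C, using a small bit array and two hash functions; the filter size and the particular hash mixes are assumptions for illustration, not from the answer:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: Bloom filter over IPv4 addresses (host-order uint32),
 * with k = 2 hash functions. BLOOM_BITS is a power of two so the
 * hash can be reduced with a mask. */
#define BLOOM_BITS 8192u

static uint8_t bloom[BLOOM_BITS / 8];

static uint32_t h1(uint32_t ip) { return ip * 2654435761u; }
static uint32_t h2(uint32_t ip) { return (ip ^ 0x9e3779b9u) * 40503u; }

static void bloom_add(uint32_t ip) {
    uint32_t a = h1(ip) & (BLOOM_BITS - 1);
    uint32_t b = h2(ip) & (BLOOM_BITS - 1);
    bloom[a >> 3] |= (uint8_t)(1u << (a & 7));
    bloom[b >> 3] |= (uint8_t)(1u << (b & 7));
}

/* Returns 0 only when the IP was definitely never added (no false
 * negatives); a return of 1 may be a false positive and must be
 * confirmed against the authoritative blacklist. */
static int bloom_maybe_contains(uint32_t ip) {
    uint32_t a = h1(ip) & (BLOOM_BITS - 1);
    uint32_t b = h2(ip) & (BLOOM_BITS - 1);
    return ((bloom[a >> 3] >> (a & 7)) & 1u) &&
           ((bloom[b >> 3] >> (b & 7)) & 1u);
}
```

On a miss the packet is allowed immediately, touching only one cache line or two of the 1 KiB bit array; only hits fall through to the slower authoritative lookup.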
2011-10-27T05:14:37