An Efficient Way to Look Up IP Geolocation Information

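A while ago I wrote several security reports. Before writing each one I had to analyze a large number of IP addresses, and one of the more tedious parts was tallying the geographic distribution of those IPs.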

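Generally there are only two ways to look up IP geolocation: (1) call the API of a lookup site (such as ip138.com or ip.cn); (2) download an IP database and write a local program to match against it. The first has a low development cost, but it is constrained by API rate limits and network latency and cannot cope with hundreds of thousands of IPs, so the second was the only real option.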

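The public IP databases available online vary widely in quality, and I could not find one with particularly recent data. I later found the Taobao IP database on the company intranet. Its data is comprehensive and it is updated weekly, so I decided to use it.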

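Next comes the most important part: the search algorithm, and how to make it efficient. The IP database file has one IP segment per line, more than 300,000 lines in total. Each line includes the segment's start and end as integer addresses, its start and end as dotted-decimal addresses, and the segment's geographic information. The key to the search is therefore determining which segment an IP belongs to.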

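The segments in the database file are sorted, starting at 0.0.0.0 and ending at 255.255.255.255. If we store the integer start address of each segment in a list, in order, we get a sorted sequence of integers. Each pair of adjacent values in this sequence delimits exactly one IP segment, and the geographic information for that segment is keyed by the left value.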
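As a minimal sketch of this layout, assuming a toy table of three made-up segments (the labels and the choice of addresses below are hypothetical, not taken from the Taobao database), the sorted list and the lookup dictionary could be built like this:

```python
# Hypothetical rows: (integer start address, location label).
# Real rows come from the database file; these values are made up for illustration.
rows = [
    (0, "segment A"),
    (16777216, "segment B"),   # 1.0.0.0
    (16777472, "segment C"),   # 1.0.1.0
]

ip_start = []      # sorted integer start addresses
ip_location = {}   # start address -> location info

for start, label in rows:
    ip_start.append(start)
    ip_location[start] = label

# Adjacent values delimit one segment: [ip_start[i], ip_start[i + 1]).
segments = list(zip(ip_start, ip_start[1:]))
print(segments)
```

Keeping the start addresses in a plain list and the location data in a separate dictionary means the search only ever touches integers, while the dictionary lookup afterwards is a single hash access.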

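Because the sequence is sorted, the problem becomes very simple: find where a value would be inserted into a sorted sequence and return that position. There are many ways to do this; I used binary search. Here is the implementation in Python: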

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# author: ghy459@hack0nair.me

import sys
import bisect


class IPLocation(object):
    def __init__(self):
        super(IPLocation, self).__init__()
        self.ip_start = []     # start address of each IP segment
        self.ip_location = {}  # segment start address -> geographic info
        self.ipdata_file = 'ipdata_geo_isp_code.txt.utf8'
        with open(self.ipdata_file, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                # Split each line into fields, dropping the newline and quotes
                fields = line.replace('\n', '').replace('"', '').split(',')
                start = int(fields[0])
                self.ip_start.append(start)  # store the segment start address
                # Key: segment start address; value: that segment's geographic info
                self.ip_location[start] = (fields[4], fields[8], fields[10], fields[14])

    def str2int(self, ip_str):
        """ Dotted-decimal IP address -> integer IP address """
        ip = 0
        for s in ip_str.split('.'):
            ip = (ip << 8) + int(s)
        return ip

    def getIPLoc(self, ip):
        ip_int = self.str2int(ip)  # convert the given IP address to an integer
        # Binary search. bisect_right returns the index just past the last start
        # address <= ip_int, so an IP equal to a segment start stays in that
        # segment (bisect_left would assign it to the previous one).
        point = bisect.bisect_right(self.ip_start, ip_int)
        if point == 0:
            # Below the first start address: fall back to the first segment
            return self.ip_location[self.ip_start[0]]
        else:
            # Geographic info for the segment start to the left of point
            return self.ip_location[self.ip_start[point - 1]]


if __name__ == '__main__':
    iploc = IPLocation()
    for ip in open(sys.argv[1], 'r'):
        country, province, city, isp = iploc.getIPLoc(ip.replace('\n', ''))
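One detail worth checking in a lookup like this is the behavior at segment boundaries. The sketch below compares `bisect.bisect_left` and `bisect.bisect_right` from the standard library on a small list of made-up start addresses; the helper name `find_segment_start` and the values are hypothetical and exist only for illustration.

```python
import bisect

# Hypothetical sorted start addresses (not from the real database).
ip_start = [0, 100, 200, 300]

def find_segment_start(ip_int, search=bisect.bisect_right):
    """Return the start address of the segment that `search` selects for ip_int."""
    point = search(ip_start, ip_int)
    # The element just before the insertion point is the candidate segment start.
    return ip_start[max(point - 1, 0)]

# Inside a segment, both variants agree.
print(find_segment_start(150), find_segment_start(150, bisect.bisect_left))  # 100 100
# Exactly on a segment start, bisect_left falls back to the previous segment.
print(find_segment_start(100), find_segment_start(100, bisect.bisect_left))  # 100 0
```

If segments are treated as half-open ranges that include their own start address, `bisect_right` appears to be the variant that matches that definition.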

In a local test, looking up 1,000,000 unordered IPs with this code took 4.9482 s.
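A rough timing harness for a measurement of this kind might look like the following. It is only a sketch under stated assumptions: the start addresses are synthetic (one segment every 16,384 addresses, about 260,000 segments), the queries are random integers rather than real IPs, and the query count is reduced to 100,000 to keep the run short. Its output is therefore not comparable to the figure reported here.

```python
import bisect
import random
import time

STEP = 2 ** 14  # assumed segment size, chosen only to get ~260,000 segments

# Synthetic sorted start addresses covering the whole IPv4 range.
ip_start = list(range(0, 2 ** 32, STEP))

# Random, unordered query addresses (as integers); seeded for repeatability.
random.seed(0)
queries = [random.randrange(2 ** 32) for _ in range(100_000)]

t0 = time.perf_counter()
hits = [ip_start[bisect.bisect_right(ip_start, q) - 1] for q in queries]
elapsed = time.perf_counter() - t0

print("%d lookups in %.4fs" % (len(queries), elapsed))
```

This times only the binary search itself; parsing the input file and converting dotted-decimal strings to integers would add to the total in a full run.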

Finally, here is the download link for the IP database and the code: http://pan.baidu.com/s/1dDnR3Uh password: gxe5
