簡(jiǎn)單高效的短鏈接生成服務(wù)C#實(shí)現(xiàn)

2015-06-24 10:10:38

項(xiàng)目中有一處需求，需要把長(zhǎng)網(wǎng)址縮為短網(wǎng)址，把結(jié)果通過(guò)短信、微信等渠道推送給客戶。剛開始直接使用網(wǎng)上現(xiàn)成的開放服務(wù)，然后在某個(gè)周末突然手癢想自己動(dòng)手實(shí)現(xiàn)一個(gè)別具特色的長(zhǎng)網(wǎng)址（文本）縮短服務(wù)。

由于以前做過(guò)socket服務(wù)，對(duì)數(shù)據(jù)包的封裝排列還有些印象，因此，短網(wǎng)址服務(wù)我第一反應(yīng)是先設(shè)計(jì)數(shù)據(jù)的存儲(chǔ)格式，我這里沒(méi)有采用數(shù)據(jù)庫(kù)，而是使用2個(gè)文件來(lái)實(shí)現(xiàn)：

Url.db存儲(chǔ)用戶提交的長(zhǎng)網(wǎng)址文本，Url.idx 存儲(chǔ)數(shù)據(jù)索引，記錄每次提交數(shù)據(jù)的位置（Begin）與長(zhǎng)度（Length），還有一些附帶信息（Hits，DateTime）。由于每次添加長(zhǎng)網(wǎng)址，對(duì) 兩個(gè)文件都是進(jìn)行Append操作，因此即使這兩個(gè)文件體積很大（比如若干GB），也沒(méi)有太大的IO壓力。

再看看Url.idx文件的結(jié)構(gòu)，ID是主鍵，設(shè)為Int64類型，轉(zhuǎn)換為字節(jié)數(shù)組后的長(zhǎng)度為8，緊跟的是Begin，該值是把長(zhǎng)網(wǎng)址數(shù)據(jù)續(xù)寫到 Url.db文件之前，Url.db文件的長(zhǎng)度，同樣設(shè)為Int64類型。長(zhǎng)網(wǎng)址的字符串長(zhǎng)度有限，Int16足夠使用了，Int16.MaxValue==65536，比Url規(guī)范定義的4Kb長(zhǎng)度還大，Int16轉(zhuǎn)換為字節(jié)數(shù)組后長(zhǎng)度為2字節(jié)。Hits表示短網(wǎng)址的解析次數(shù)，設(shè)為Int32，字節(jié)長(zhǎng)度為4，DateTime 設(shè)為Int64，長(zhǎng)度8。由于ID不會(huì)像數(shù)據(jù)庫(kù)那樣自動(dòng)遞增，因此需要手工實(shí)現(xiàn)。因此在開始寫入U(xiǎn)rl.idx前，需要預(yù)先讀取最后一行（行是虛的，其實(shí) 就是最后30字節(jié)）中的的ID值，遞增后才開始寫入新的一行。

也就是說(shuō)每次提交一個(gè)長(zhǎng)網(wǎng)址，不管數(shù)據(jù)有多長(zhǎng)（最大不能超過(guò)65536字節(jié)），Url.idx 文件都固定增加 30 字節(jié)。

數(shù)據(jù)結(jié)構(gòu)一旦明確下來(lái)，整個(gè)網(wǎng)址縮短服務(wù)就變得簡(jiǎn)單明了。例如連續(xù)兩次提交長(zhǎng)網(wǎng)址，可能得到的短網(wǎng)址為http://域名/1000，與http://域名/1001，結(jié)果顯然很丑陋，域名后面的ID全是數(shù)字，而且遞增關(guān)系明顯，很容易暴力枚舉全部的數(shù)據(jù)。而且10進(jìn)制的數(shù)字容量有限，一次提交100萬(wàn)條的長(zhǎng)網(wǎng)址，產(chǎn)生的短網(wǎng)址越來(lái)越長(zhǎng)，失去意義。

因此下面就開始對(duì)ID進(jìn)行改造，改造的目標(biāo)有2：

1、增加混淆機(jī)制，相鄰兩個(gè)ID表面上看不出區(qū)別。

2、增加容量，一次性提交100萬(wàn)條長(zhǎng)網(wǎng)址，ID的長(zhǎng)度不能有明顯變化。

最簡(jiǎn)單最直接的混淆機(jī)制，就是把10進(jìn)制轉(zhuǎn)換為62進(jìn)制（0-9a-zA-Z），由于順序的abcdef…也很容易猜到下一個(gè)ID，因此62進(jìn)制字符序列隨機(jī)排列一次：

/// <summary>
    /// 生成隨機(jī)的0-9a-zA-Z字符串
    /// </summary>
    /// <returns></returns>
    public static string GenerateKeys()
    {
        string[] Chars = "0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z".Split(',');
        int SeekSeek = unchecked((int)DateTime.Now.Ticks);
        Random SeekRand = new Random(SeekSeek);
        for (int i = 0; i < 100000; i++)
        {
            int r = SeekRand.Next(1, Chars.Length);
            string f = Chars[0];
            Chars[0] = Chars[r - 1];
            Chars[r - 1] = f;
        }
        return string.Join("", Chars);
    }

運(yùn)行一次上面的方法，得到隨機(jī)序列：

string Seq = "s9LFkgy5RovixI1aOf8UhdY3r4DMplQZJXPqebE0WSjBn7wVzmN2Gc6THCAKut";

用這個(gè)序列字符串替代0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ，具有很強(qiáng)的混淆特性。一個(gè)10進(jìn)制的數(shù)字按上面的序列轉(zhuǎn)換為62進(jìn)制，將變得面目全非，附轉(zhuǎn)換方法：

/// <summary>
    /// 10進(jìn)制轉(zhuǎn)換為62進(jìn)制
    /// </summary>
    /// <param name="id"></param>
    /// <returns></returns>
    private static string Convert(long id)
    {
        if (id < 62)
        {
            return Seq[(int)id].ToString();
        }
        int y = (int)(id % 62);
        long x = (long)(id / 62);

return Convert(x) + Seq[y];
}

    /// <summary>
    /// 將62進(jìn)制轉(zhuǎn)為10進(jìn)制
    /// </summary>
    /// <param name="Num"></param>
    /// <returns></returns>
    private static long Convert(string Num)
    {
        long v = 0;
        int Len = Num.Length;
        for (int i = Len - 1; i >= 0; i--)
        {
            int t = Seq.IndexOf(Num[i]);
            double s = (Len - i) - 1;
            long m = (long)(Math.Pow(62, s) * t);
            v += m;
        }
        return v;
    }

例如執(zhí)行 Convert(123456789) 得到 RYswX，執(zhí)行 Convert(123456790) 得到 RYswP。

如果通過(guò)分析大量的連續(xù)數(shù)值，還是可以暴力算出上面的Seq序列值，進(jìn)而猜測(cè)到某個(gè)ID左右兩邊的數(shù)值。下面進(jìn)一步強(qiáng)化混淆，ID每次遞增的單位不是固定的1，而是一個(gè)隨機(jī)值，比如1000,1005,1013,1014,1020，毫無(wú)規(guī)律可言。

private static Int16 GetRnd(Random seekRand)
    {
        Int16 s = (Int16)seekRand.Next(1, 11);
        return s;
    }

即使把62進(jìn)制的值逆向計(jì)算出10進(jìn)制的ID值，也難于猜測(cè)到左右兩邊的值，大大增加暴力枚舉的難度。難度雖然增加，但是連續(xù)產(chǎn)生的2個(gè)62進(jìn)制值如前面的RyswX與RyswP，僅個(gè)位數(shù)不同，還是很像，因此我們?cè)龠M(jìn)行第三次簡(jiǎn)單的混淆，把62進(jìn)制字符向左（右）旋轉(zhuǎn)一定次數(shù)（解析時(shí)反向旋轉(zhuǎn)同樣的次數(shù)）：

/// <summary>
    /// 混淆id為字符串
    /// </summary>
    /// <param name="id"></param>
    /// <returns></returns>
    private static string Mixup(long id)
    {
        string Key = Convert(id);
        int s = 0;
        foreach (char c in Key)
        {
            s += (int)c;
        }
        int Len = Key.Length;
        int x = (s % Len);
        char[] arr = Key.ToCharArray();
        char[] newarr = new char[arr.Length];
        Array.Copy(arr, x, newarr, 0, Len - x);
        Array.Copy(arr, 0, newarr, Len - x, x);
        string NewKey = "";
        foreach (char c in newarr)
        {
            NewKey += c;
        }
        return NewKey;
    }

    /// <summary>
    /// 解開混淆字符串
    /// </summary>
    /// <param name="Key"></param>
    /// <returns></returns>
    private static long UnMixup(string Key)
    {
        int s = 0;
        foreach (char c in Key)
        {
            s += (int)c;
        }
        int Len = Key.Length;
        int x = (s % Len);
        x = Len - x;
        char[] arr = Key.ToCharArray();
        char[] newarr = new char[arr.Length];
        Array.Copy(arr, x, newarr, 0, Len - x);
        Array.Copy(arr, 0, newarr, Len - x, x);
        string NewKey = "";
        foreach (char c in newarr)
        {
            NewKey += c;
        }
        return Convert(NewKey);
    }

執(zhí)行 Mixup(123456789)得到wXRYs，假如隨機(jī)遞增值為7，則下一條記錄的ID執(zhí)行 Mixup(123456796)得到swWRY，肉眼上很難再聯(lián)想到這兩個(gè)ID值是相鄰的。

以上講述了數(shù)據(jù)結(jié)構(gòu)與ID的混淆機(jī)制，下面講述的是短網(wǎng)址的解析機(jī)制。

得到了短網(wǎng)址，如wXRYs，我們可以通過(guò)上面提供的UnMixup()方法，逆向計(jì)算出ID值，由于ID不是遞增步長(zhǎng)為1的數(shù)字，因此不能根據(jù)ID馬上計(jì)算出記錄在索引文件中的位置（如：ID * 30）。由于ID是按小到大的順序排列，因此在索引文件中定位ID，非二分查找法莫屬。

//二分法查找的核心代碼片段
FileStream Index = new FileStream(IndexFile, FileMode.OpenOrCreate, FileAccess.ReadWrite);
            long Id =;//解析短網(wǎng)址得到的真實(shí)ID
            long Left = 0;
            long Right = (long)(Index.Length / 30) - 1;
            long Middle = -1;
            while (Left <= Right)
            {
                Middle = (long)(Math.Floor((double)((Right + Left) / 2)));
                if (Middle < 0) break;
                Index.Position = Middle * 30;
                Index.Read(buff, 0, 8);
                long val = BitConverter.ToInt64(buff, 0);
                if (val == Id) break;
                if (val < Id)
                {
                    Left = Middle + 1;
                }
                else
                {
                    Right = Middle - 1;
                }
            }

Index.Close();

二分法查找的核心是不斷移動(dòng)指針，讀取中間的8字節(jié)，轉(zhuǎn)換為數(shù)字后再與目標(biāo)ID比較的過(guò)程。這是一個(gè)非常高速的算法，如果有接近43億條短網(wǎng)址記錄，查找某一個(gè)ID，最多只需要移動(dòng)32次指針（上面的while循環(huán)32次）就能找到結(jié)果，因?yàn)?^32=4294967296。

用二分法查找是因?yàn)榍懊媸褂昧穗S機(jī)遞增步長(zhǎng)，如果遞增步長(zhǎng)設(shè)為1，則二分法可免，直接從 ID*30 就能一次性精準(zhǔn)定位到索引文件中的位置。

下面是完整的代碼，封裝了一個(gè)ShortenUrl類：

using System;
using System.Linq;
using System.Web;
using System.IO;
using System.Text;

/// <summary>
/// ShortenUrl 的摘要說(shuō)明
/// </summary>
public class ShortenUrl
{
const string Seq = "s9LFkgy5RovixI1aOf8UhdY3r4DMplQZJXPqebE0WSjBn7wVzmN2Gc6THCAKut";

    private static string DataFile
    {
        get { return HttpContext.Current.Server.MapPath("/Url.db"); }
    }

    private static string IndexFile
    {
        get { return HttpContext.Current.Server.MapPath("/Url.idx"); }
    }

    /// <summary>
    /// 批量添加網(wǎng)址，按順序返回Key。如果輸入的一組網(wǎng)址中有不合法的元素，則返回?cái)?shù)組的相同位置（下標(biāo)）的元素將為null。
    /// </summary>
    /// <param name="Url"></param>
    /// <returns></returns>
    public static string[] AddUrl(string[] Url)
    {
        FileStream Index = new FileStream(IndexFile, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        FileStream Data = new FileStream(DataFile, FileMode.Append, FileAccess.Write);
        Data.Position = Data.Length;
        DateTime Now = DateTime.Now;
        byte[] dt = BitConverter.GetBytes(Now.ToBinary());
        int _Hits = 0;
        byte[] Hits = BitConverter.GetBytes(_Hits);
        string[] ResultKey = new string[Url.Length];
        int seekSeek = unchecked((int)Now.Ticks);
        Random seekRand = new Random(seekSeek);
        string Host = HttpContext.Current.Request.Url.Host.ToLower();
        byte[] Status = BitConverter.GetBytes(true);
        //index: ID(8) + Begin(8) + Length(2) + Hits(4) + DateTime(8) = 30
        for (int i = 0; i < Url.Length && i<1000; i++)
        {
            if (Url[i].ToLower().Contains(Host) || Url[i].Length ==0 || Url[i].Length > 4096) continue;
            long Begin = Data.Position;
            byte[] UrlData = Encoding.UTF8.GetBytes(Url[i]);
            Data.Write(UrlData, 0, UrlData.Length);
            byte[] buff = new byte[8];
            long Last;
            if (Index.Length >= 30) //讀取上一條記錄的ID
            {
                Index.Position = Index.Length - 30;
                Index.Read(buff, 0, 8);
                Index.Position += 22;
                Last = BitConverter.ToInt64(buff, 0);
            }
            else
            {
                Last = 1000000; //起步ID，如果太小，生成的短網(wǎng)址會(huì)太短。
                Index.Position = 0;
            }
            long RandKey = Last + (long)GetRnd(seekRand);
            byte[] BeginData = BitConverter.GetBytes(Begin);
            byte[] LengthData = BitConverter.GetBytes((Int16)(UrlData.Length));
            byte[] RandKeyData = BitConverter.GetBytes(RandKey);

            Index.Write(RandKeyData, 0, 8);
            Index.Write(BeginData, 0, 8);
            Index.Write(LengthData, 0, 2);
            Index.Write(Hits, 0, Hits.Length);
            Index.Write(dt, 0, dt.Length);
            ResultKey[i] = Mixup(RandKey);
        }
        Data.Close();
        Index.Close();
        return ResultKey;
    }

    /// <summary>
    /// 按順序批量解析Key，返回一組長(zhǎng)網(wǎng)址。
    /// </summary>
    /// <param name="Key"></param>
    /// <returns></returns>
    public static string[] ParseUrl(string[] Key)
    {
        FileStream Index = new FileStream(IndexFile, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        FileStream Data = new FileStream(DataFile, FileMode.Open, FileAccess.Read);
        byte[] buff = new byte[8];
        long[] Ids = Key.Select(n => UnMixup(n)).ToArray();
        string[] Result = new string[Ids.Length];
        long _Right = (long)(Index.Length / 30) - 1;
        for (int j = 0; j < Ids.Length; j++)
        {
            long Id = Ids[j];
            long Left = 0;
            long Right = _Right;
            long Middle = -1;
            while (Left <= Right)
            {
                Middle = (long)(Math.Floor((double)((Right + Left) / 2)));
                if (Middle < 0) break;
                Index.Position = Middle * 30;
                Index.Read(buff, 0, 8);
                long val = BitConverter.ToInt64(buff, 0);
                if (val == Id) break;
                if (val < Id)
                {
                    Left = Middle + 1;
                }
                else
                {
                    Right = Middle - 1;
                }
            }
            string Url = null;
            if (Middle != -1)
            {
                Index.Position = Middle * 30 + 8; //跳過(guò)ID
                Index.Read(buff, 0, buff.Length);
                long Begin = BitConverter.ToInt64(buff, 0);
                Index.Read(buff, 0, buff.Length);
                Int16 Length = BitConverter.ToInt16(buff, 0);
                byte[] UrlTxt = new byte[Length];
                Data.Position = Begin;
                Data.Read(UrlTxt, 0, UrlTxt.Length);
                int Hits = BitConverter.ToInt32(buff, 2);//跳過(guò)2字節(jié)的Length
                byte[] NewHits = BitConverter.GetBytes(Hits + 1);//解析次數(shù)遞增, 4字節(jié)
                Index.Position -= 6;//指針撤回到Length之后
                Index.Write(NewHits, 0, NewHits.Length);//覆蓋老的Hits
                Url = Encoding.UTF8.GetString(UrlTxt);
            }
            Result[j] = Url;
        }
        Data.Close();
        Index.Close();
        return Result;
    }

return Convert(x) + Seq[y];
}

    /// <summary>
    /// 返回隨機(jī)遞增步長(zhǎng)
    /// </summary>
    /// <param name="SeekRand"></param>
    /// <returns></returns>
    private static Int16 GetRnd(Random SeekRand)
    {
        Int16 Step = (Int16)SeekRand.Next(1, 11);
        return Step;
    }
}

本方案的優(yōu)點(diǎn)：

把10進(jìn)制的ID轉(zhuǎn)換為62進(jìn)制的字符，6位數(shù)的62進(jìn)制字符容量為 62^6約為568億，如果每次隨機(jī)遞增值為1~10（取平均值為5），6位字符的容量仍然能容納113.6億條！這個(gè)數(shù)據(jù)已經(jīng)遠(yuǎn)遠(yuǎn)大于一般的數(shù)據(jù)庫(kù)承受能力。由于每次提交長(zhǎng)網(wǎng)址采用Append方式寫入，因此寫入性能也不會(huì)差。在解析短網(wǎng)址時(shí)由于采用二分法查找，僅移動(dòng)文件指針與讀取8字節(jié)的緩存，性能上依然非常優(yōu)秀。

缺點(diǎn)：在高并發(fā)的情況下，可能會(huì)出現(xiàn)文件打開失敗等IO異常，如果改用單線程的Node.js來(lái)實(shí)現(xiàn)，或許可以杜絕這種情況。

責(zé)任編輯：王雪燕來(lái)源：蔡大衛(wèi)