十億QQ號如何去重？你知道嗎？

作者：蘇三 2025-07-17 14:33:51

開發架構

10億QQ號去重的本質，是將問題拆解到每個計算單元都能高效處理的粒度。今天這篇文章跟大家一起分享一些常見的解決方案，希望對你會有所幫助。

前言

最近在網上看到一個問題：10億QQ號如何去重？

我覺得挺有意思的。

今天這篇文章跟大家一起分享一些常見的解決方案，希望對你會有所幫助。

一、技術難點

1.數據規模分析

原始數據：10億×8字節 = 8GB
HashSet去重：至少16GB內存（Java對象開銷）
理想方案：<1GB內存

2. 核心挑戰

二、單機解決方案：位圖法

1.算法原理

利用位數組表示數字存在性：

public class BitMap {  
    privatefinalbyte[] bits;  
    
    public BitMap(int maxNum) {  
        this.bits = newbyte[(maxNum >> 3) + 1]; // 每byte存儲8個數字  
    }  
    
    public void add(int num) {  
        int arrayIndex = num >> 3;  // num/8  
        int position = num & 0x07;  // num%8  
        bits[arrayIndex] |= 1 << position;  
    }  
    
    public boolean contains(int num) {  
        int arrayIndex = num >> 3;  
        int position = num & 0x07;  
        return (bits[arrayIndex] & (1 << position)) != 0;  
    }  
}

2.QQ號范圍優化

QQ號范圍：10000（5位） - 9999999999（10位）位圖內存計算：

(10^10 - 10^4) / 8 / 1024/1024 ≈ 1.16GB

優化方案：

// 偏移量優化：存儲(qq - 10000)  
public void add(long qq) {  
    long num = qq - 10000;  
    int arrayIndex = (int)(num >> 3);  
    int position = (int)(num & 7);  
    bits[arrayIndex] |= 1 << position;  
}

三、進階方案：布隆過濾器

1.應對內存限制

2.參數設計與實現

public class BloomFilter {  
    privatefinal BitSet bitset;  
    privatefinalint size;  
    privatefinalint[] seeds;  
    
    public BloomFilter(int size, int hashCount) {  
        this.bitset = new BitSet(size);  
        this.size = size;  
        this.seeds = newint[hashCount];  
        for (int i = 0; i < hashCount; i++) {  
            seeds[i] = i * 31;  
        }  
    }  
    
    public void add(String qq) {  
        for (int seed : seeds) {  
            int hash = hash(qq, seed);  
            bitset.set(Math.abs(hash % size), true);  
        }  
    }  
    
    public boolean contains(String qq) {  
        for (int seed : seeds) {  
            int hash = hash(qq, seed);  
            if (!bitset.get(Math.abs(hash % size))) {  
                returnfalse;  
            }  
        }  
        returntrue;  
    }  
    
    private int hash(String value, int seed) {  
        // MurmurHash 實現  
        int result = 0;  
        for (char c : value.toCharArray()) {  
            result = seed * result + c;  
        }  
        return result;  
    }  
}

3.內存優化效果

方案	內存消耗	誤差率
原始存儲	8 GB	0%
位圖法	1.16 GB	0%
布隆過濾器(0.1%)	171 MB	0.001

四、磁盤方案：外部排序與多路歸并

1.處理流程

2.關鍵代碼實現

// 外部排序  
public void externalSort(String input, String output) throws IOException {  
    List<File> chunks = splitAndSort(input, 100_000_000); // 每個文件1千萬  
    mergeFiles(chunks, output);  
}  

// 多路歸并  
void mergeFiles(List<File> files, String output) {  
    PriorityQueue<MergeEntry> queue = new PriorityQueue<>();  
    List<BufferedReader> readers = new ArrayList<>();  

    // 初始化堆  
    for (File file : files) {  
        BufferedReader reader = new BufferedReader(new FileReader(file));  
        readers.add(reader);  
        String line = reader.readLine();  
        if (line != null) {  
            queue.add(new MergeEntry(line, reader));  
        }  
    }  

    try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) {  
        long last = -1;  
        while (!queue.isEmpty()) {  
            MergeEntry entry = queue.poll();  
            long qq = Long.parseLong(entry.value);  

            // 去重：只寫入不重復的QQ號  
            if (qq != last) {  
                writer.write(entry.value);  
                writer.newLine();  
                last = qq;  
            }  

            // 讀取下一行  
            String next = entry.reader.readLine();  
            if (next != null) {  
                queue.add(new MergeEntry(next, entry.reader));  
            }  
        }  
    } finally {  
        readers.forEach(r -> { try { r.close(); } catch (IOException e) {}});  
    }  
}  

class MergeEntry implements Comparable<MergeEntry> {  
    String value;  
    BufferedReader reader;  

    public MergeEntry(String value, BufferedReader reader) {  
        this.value = value;  
        this.reader = reader;  
    }  

    @Override
    public int compareTo(MergeEntry o) {  
        returnthis.value.compareTo(o.value);  
    }  
}

五、分布式解決方案

1.分片策略設計

2.Spark實現方案

val qqRDD = spark.read.textFile("hdfs://qq_data/*.txt")  
  .map(_.toLong)  
  .repartition(1000)  // 分為1000個分區  

// 每個分區內部去重  
val distinctRDD = qqRDD.mapPartitions { iter =>  
val bitmap = newRoaringBitmap()  
  iter.foreach(qq => bitmap.add(qq.toInt))  
  bitmap.iterator.asScala.map(_.toLong)  
}  

// 全局去重（可選）  
val globalDistinct = distinctRDD.distinct()  

globalDistinct.saveAsTextFile("hdfs://result/")

3.內存優化：RoaringBitmap

存儲優勢對比：

普通位圖：10^10 / 8 / 1024/1024 ≈ 1.16 GB  
RoaringBitmap：稀疏數據下可壓縮至100-300 MB

六、生產級架構：Lambda架構

1.系統架構圖

2.各層技術選型

架構層	技術棧	處理目標
批處理層	Spark + HDFS	全量數據去重
速度層	Flink + Redis	實時增量去重
服務層	Spring Boot + HBase	統一查詢接口

3.實時去重實現

public class QQDeduplication {  
    privatestaticfinal String REDIS_KEY = "qq_set";  

    public boolean isDuplicate(String qq) {  
        try (Jedis jedis = jedisPool.getResource()) {  
            // 使用HyperLogLog進行基數估計  
            if (jedis.pfcount(REDIS_KEY) > 1_000_000_000L) {  
                returntrue; // 已超過10億，直接返回重復  
            }  
            return jedis.sadd(REDIS_KEY, qq) == 0;  
        }  
    }  
}

七、終極方案：分層位圖索引

1.架構設計

2.存儲計算

QQ號范圍：10000 - 9999999999（約100億）分層設計：

第一層分片：100個區間（每區間1億）
第二層分片：100個子區間（每區間100萬）
第三層存儲：RoaringBitmap（每分區1.2MB）

總內存需求：

100 × 100 × 1.2MB = 12GB（分布式存儲可行）

3.Java實現

public class LayeredBitmap {  
    privatefinal RoaringBitmap[][][] bitmaps;  
    privatestaticfinalint L1 = 100; // 一級分片  
    privatestaticfinalint L2 = 100; // 二級分片  

    public LayeredBitmap() {  
        bitmaps = new RoaringBitmap[L1][L2][];  
    }  

    public void add(long qq) {  
        int l1Index = (int)((qq - 10000) / 100_000_000);  
        long remainder = (qq - 10000) % 100_000_000;  
        int l2Index = (int)(remainder / 1_000_000);  
        int value = (int)(remainder % 1_000_000);  

        if (bitmaps[l1Index][l2Index] == null) {  
            bitmaps[l1Index][l2Index] = new RoaringBitmap();  
        }  
        bitmaps[l1Index][l2Index].add(value);  
    }  

    public boolean contains(long qq) {  
        // 類似add的分片計算  
        RoaringBitmap bitmap = bitmaps[l1Index][l2Index];  
        return bitmap != null && bitmap.contains(value);  
    }  
}

八、方案對比與選型建議

方案	適用場景	內存/存儲	時間復雜度	精度
單機位圖	<1億數據	O(n)	O(1)	100%
布隆過濾器	百億級容忍誤差	O(1)	O(k)	<99.9%
外部排序	單機磁盤處理	磁盤空間	O(n log n)	100%
Spark分布式	海量數據批量處理	集群存儲	O(n)	100%
Redis實時去重	增量數據實時處理	O(n)	O(1)	100%
分層位圖索引	超大規模精準去重	O(n)壓縮存儲	O(1)	100%

九、實戰經驗與避坑指南

1.數據傾斜解決方案

問題場景：部分QQ號段過于集中（如100000-199999）

解決策略：

// 動態分片函數  
int getShardId(long qq, int shardCount) {  
    String str = String.valueOf(qq);  
    // 取后6位作為分片依據  
    int suffix = Integer.parseInt(str.substring(Math.max(0, str.length() - 6)));  
    return suffix % shardCount;  
}

2.去重精度保障

3.成本優化建議

冷熱分離：熱數據用內存位圖，冷數據存磁盤
壓縮存儲：使用RoaringBitmap替代普通位圖
分級存儲：

a.最近3個月數據：內存存儲

b.歷史數據：HBase+壓縮

總結

分治思想：10億問題拆解為1000個100萬問題
空間換時間：位圖法用存儲空間換取O(1)時間復雜度
概率智慧：布隆過濾器用可控誤差換取千倍空間壓縮
分層設計：億級→百萬級→萬級分層處理
動靜分離：批處理處理歷史數據，實時處理增量數據

10億QQ號去重的本質，是將問題拆解到每個計算單元都能高效處理的粒度。

責任編輯：姜華來源：蘇三說技術