Huffman编码压缩文件时的文件储存结构

本文最后更新于：9 个月前

文件结构

说明

该文件结构针对ASCII码设计，字符的种类数设为 $N$ ，由于ASCII码中可显示字符十进制编码范围是 $[32, 126]$ , 共 $95$ 种，考虑进少量不可见字符则 $N m a x = 127$ ，又因为对于Huffman编码，如果 $N \leq 2$ 则编码没有任何效果，所以可设置范围 $N \in [3, 127]$ 。
设源文件大小不超过4GB( $232 bytes$ )，又因为 $N > 2$ ，所以在平均情况下所有文本中的字符的频数都可以用32位整数表示为 $M i$ 。
由于Huffman编码方式确定，源文件文本对应的Huffman编码不可变，则设Huffman编码的长度为 $H L e n$ 。
用存储所有字符-频数键值对的方式间接表示原Huffman树结构，解压时先根据键值对构建Huffman树再进行解压操作以得到原文件。
Huffman编码中最后一个字节设为 $H u f f L a s t$ ，由于$HuffLast $中存在编码余位不足而补位的情况，故设中存在编码余位不足 8 而补位的情况，故设$ HuffLast[7:loc] $为编码，而为编码，而$ HuffLast[loc-1,0] $为补位，其中为补位，其中$ loc $，特别地，当，特别地，当$ loc=0 $时表示时表示$ HuffLast$中没有补位。
设每一个字符-频数对为 $(K i, M i)$ ，其中 $M i$ 经过计算后可得到等价的 $V i$ ，所以字符-频数对 $(K i, M i)$ 可转换成键值对 $(K i, V i)$ ，则 $P a i r = {(K 1, V 1), (K 2, V 2), \dots, (K N, V N)}$ 。
在该文件结构下，文件大小为 $(2 \cdot N + H L e n / 8 - 1)$ bytes.

整体结构

名称(简称)	占用空间
所有键值对集合( $P a i r$ )	$(2 \cdot N - 1)$ bytes
文本Huffman编码( $H u f f$ )	$(H L e n / 8)$ bytes

$P a i r$

$P a i r$ 为所有字符-频数变形后的键值对的集合。集合大小 $N > 2$ 。

对于 $P a i r$ 中的除 $(K 1, V 1)$ 外每一个元素 $P a i r [i] = (K i, V i)$ ，分配2个字节。

第一个字节表示该字符即 $K i$ 。
第二个字节表示该字符对应原文本中出现的频数 $M i$ 的变形 $V i$ 。

$(K 1, V 1)$ 只分配一个字节存 $K 1$ ，不存 $V 1$ 。

由于每个字符在原文本中出现的频数 $M i$ 属于32位整数，占四个字节，但是结合Huffman树构建的原理可将频数压缩至只占一个字节的 $V i$ 。

$P a i r$ 的计算与存储

定义 $(K i, M i) < (K j, M j) : M i < M j or (M i = M j and K i < K j)$ 。将字符-频数集合按升序排序后得到新的序列 ${(K 1, M 1), (K 2, M 2), \dots, (K N, M N)}$ 。
设函数 $f (M i)$ : $f (M i) = {1 i = 1 M i - 1 / M i i \in [2, N]$
> 当 $i = 1$ 时，设 $其中为常数 V i = C \cdot f (M i) = C 其中C为常数$ 。 > > 当 $i > 1$ 时， $0 < f (M i) < 1$ ，设 $V i$ 表示 $f (M i)$ 小数点后前8位小数，精度为 $1 / 28 = 0.0039$ 。

综上可得到有序序列 $V = {i \in [1, N] ∣ V i}$ 。
由于 $K i$ 表示字符种类，范围为ASCII码前127位，则 $K i$ 最高位 $K i [7]$ 不参与字符种类表示。由于键值对个数为 $N$ ，对 $K i [7]$ 进行标记，使得以下关系成立： $((i < N - 1 \land \forall x \in [1, i]) \to K x [7] = 0) \land K N - 1 [7] = 1$
即对于前 $N - 1$ 个键值对除了 $K N - 1 [7] = 1$ ，其余 $K$ 的最高为都为0。通过这种方法就可以判断出 $N$ 的大小。
对于 $K N [7]$ 进行标记。

$K N [7] = {0 l o c = 0 1 l o c > 0$
即可以来表示 $H u f f L a s t$ 中是否有补位。
综上所述可以得到$Pair={i(K_i, V_i)} $，在存储时由于，在存储时由于$ V_1=C $，，$ C $可以任取，故存储时略过可以任取，故存储时略过$ V_1 $。可得到占用空间为 $2 \cdot N - 1$ bytes。

$H u f f$

对于长度为 $H L e n$ 的二进制编码，每8位为一个字节以unsigned char类型储存进文件，对于最后不足8位的编码，长度满足 $ HLen 8 = (8 - loc) 8 $。由。由$ K_N[7] $可得可得$ loc $得两种状态，得两种状态，$ loc=0 $时时$ HuffLast $全是编码不用特殊处理；全是编码不用特殊处理；$ loc=1 $时对补位时对补位$ HuffLast[loc-1, 0] $进行处理，使以下关系成立进行处理，使以下关系成立:$ $((l o c > 1 x) H u f f L a s t [x] = 1) ((l o c) H u f f L a s t [l o c - 1] = 0)$ $

即

$H u f f L a s t [x] = {1 l o c > 1 and x \in [0, l o c - 2] 0 l o c \geq 1 and x = l o c - 1$

用这种方法可以仅用一个额外的bit( $K N [7]$ )，配合补位的内容就可以把 $H u f f L a s t$ 中补位的个数表示出来。

总结

用 $V i$ 表示 $M i$ 时由于二进制小数位数的限制会有精度的损失，故 $V i$ 可能略小于$ M_{i-1}/M_{i} $，而且，而且$ V_i $的值仅和的值仅和$ M_{i-1} $有关，而还原时是进行有关，而还原时是进行$ M^{d}i=M^{d}{i-1}/V_i $，会对通过，会对通过$ M^{d}_{i-1} $对对$ V_i $精度的误差进行累积，这两点共同导致还原出的精度的误差进行累积，这两点共同导致还原出的$ M^{d}_i $略大于略大于$ M_i $。

但是由于Huffman编码过程并不需要准确的 $M i$ ，仅仅需要保存 $i M i$ 各有可能的不相交子集元素之和大小关系，故这种方式在大概率下不会导致所建Huffman树与原树不同，也就保证了解压后文本的正确性。

代码片段(Huff.h)

2018-12-06更新: 感谢室友的测试，已经更改了一些bug - 修复了各种状态转移间的信息未清空的问题。 - 修复了二进制小数转化的部分的临界问题。 - 修复了如果huffman编码部分没有补位，文件末尾仍有一个空字符的问题。 - 通过测试采用initC = 2替换原来的4。

#pragma once

#include <iostream>
#include <fstream>
#include <algorithm>
#include <queue>
#include <string>
#include <cmath>
#include <map>


#define SHOW_DETAIL

#define COMPRESS   0x1
#define DECOMPRESS 0x2
#define EXIT       0x3

#define BITSIZE    0x8

const int maxN = 1e4 + 1;
const int initC = 2;
int timestamp;


inline unsigned char dbToCh(double);
inline double chToDb(unsigned char);

struct VarAndFrec {
    char var;
    int frec;
    bool operator < (const VarAndFrec &obj) const {
        return this->frec < obj.frec;
    }
    VarAndFrec(char v = '\0', int f = 0) :
        var(v), frec(f) { }
    ~VarAndFrec() { }
};

struct Node {
    int frec;
    bool isLeaf;
    char var;
    Node *left;
    Node *right;

    Node(int frec = 0, char var = '\0') :
        frec(frec), var(var), isLeaf(true), left(NULL), right(NULL) { }
    ~Node() { }
};


class HuffMan {
public:
    HuffMan() {
        root = new Node;
    }
    ~HuffMan() { removeAll(root); }

    void Operate();

private:
    Node *root;
    std::map<char, int> varToFrec;
    std::map<char, std::string> codeList;
    std::ifstream input;
    std::ofstream output;
    std::string texts;

    void removeAll(Node *);
    void coding(Node *, std::string);

    bool getFilestream();
    void buildTree(bool);
    void compress();
    void decompress();
};

void HuffMan::removeAll(Node *root) {
    if (root == NULL) return;

    removeAll(root->left);
    removeAll(root->right);

    delete root;
}

void HuffMan::coding(Node *root, std::string code) {
    if (root->isLeaf) {
        codeList[root->var] = code;
        return;
    }
    coding(root->left, code + "0");
    coding(root->right, code + "1");
}


bool HuffMan::getFilestream() {
    std::string getin = "", getout = "";

    printf("Please input source file name(size less than 4GB)\n>> ");
    getline(std::cin, getin);
    printf("Please input code file name\n>> ");
    getline(std::cin, getout);
    printf("\n");

    input.open(getin.c_str(), std::ios::in | std::ios::binary);
    output.open(getout.c_str(), std::ios::out | std::ios::binary);

    return input.good();
}

void HuffMan::buildTree(bool sign = true) {
    // read text
    if (sign) {
        char alpha = '\0';
        varToFrec.clear();
        while (input.get(alpha)) {
            texts += alpha;
            varToFrec.find(alpha) != varToFrec.end() ?
                varToFrec[alpha] ++ : varToFrec[alpha] = 1;
        }
        if (varToFrec.size() < 3) {
            printf("File no need to be compressed!\n");
            return;
        }
    }
    //
#ifdef SHOW_DETAIL
    std::map<char, int>::iterator mpiter = varToFrec.begin();
    printf("char     frec\n");
    for (; mpiter != varToFrec.end(); mpiter++) {
        printf(" %c        %d\n", mpiter->first, mpiter->second);
    }
#endif // SHOWDETAIL

    // build queue
    removeAll(root);
    root = new Node;
    std::map<std::pair<int, int>, Node *> priQ;
    // init
    std::map<char, int>::iterator iter = varToFrec.begin();
    for (; iter != varToFrec.end(); iter++) {
        priQ[std::pair<int, int>(iter->second, timestamp++)] = new Node(iter->second, iter->first);
    }
    while (priQ.size() > 1) {
        Node *fir = priQ.begin()->second; priQ.erase(priQ.begin());
        Node *sec = priQ.begin()->second; priQ.erase(priQ.begin());

        Node *inter = new Node(fir->frec + sec->frec);
        inter->isLeaf = false;
        inter->left = fir->frec < sec->frec ? fir : sec;
        inter->right = fir->frec < sec->frec ? sec : fir;

        priQ[std::pair<int, int>(inter->frec, timestamp++)] = inter;
    }

    root = priQ.begin()->second; priQ.clear();
}

void HuffMan::compress() {
    codeList.clear();
    coding(root, "");

#ifdef SHOW_DETAIL
    std::map<char, std::string>::iterator iter = codeList.begin();
    for (; iter != codeList.end(); iter++) {
        printf(" %c     %s\n", iter->first, iter->second.c_str());
    }
#endif // SHOW_DETAIL

    std::string LastCode = "";
    for (std::string::iterator iter = texts.begin(); iter != texts.end(); iter++) {
        LastCode += codeList[*iter];
    }

    int recLen = LastCode.length();
    /************ Pair ****************/
    int PairNum = varToFrec.size();
    VarAndFrec FTV[maxN];
    std::map<char, int>::iterator iterC = varToFrec.begin();
    for (int inc = 0; iterC != varToFrec.end(); iterC++, inc++) {
        FTV[inc] = VarAndFrec(iterC->first, iterC->second);
    }
    std::sort(FTV, FTV + PairNum);

    for (int idx = 0; idx < PairNum; idx++) {
        unsigned char K_i = FTV[idx].var, V_i = '\0';
        if (idx == PairNum - 1) {
            K_i |= (recLen % BITSIZE != 0) << (BITSIZE - 1);
        }
        else {
            K_i |= (idx == PairNum - 2) ? 0x80 : 0x00;
        }

        output << K_i;
        if (idx) {
            V_i = dbToCh(1.0 * FTV[idx - 1].frec / FTV[idx].frec);
            output << V_i;
        }
    }
    /************ Pair ****************/

    int len = recLen / BITSIZE;
    for (int inc = 0; inc < len; inc++) {
        unsigned char Tm = '\0';
        for (int snc = 0; snc < BITSIZE; snc++) {
            Tm |= (int(*(LastCode.begin() + inc * BITSIZE + snc) - '0') << snc);
        }
        output << Tm; 
    }

    int Modlen = LastCode.length() % BITSIZE;
    unsigned char Ch = '\0';
    for (int inc = 0; inc < BITSIZE; inc++) {
        if (inc < Modlen)
            Ch |= (int(*(LastCode.begin() + len * BITSIZE + inc) - '0') << inc);
        else if (inc > Modlen)
            Ch |= (0x1 << inc);
    }

    if (Modlen)
        output << Ch;

    printf("Compress succeed!\n");

    input.close();
    output.close();
}


void HuffMan::decompress() {
    std::string line;
    std::string text;
    bool needEnter = false;
    while (std::getline(input, line)) {
        if (needEnter) line = "\n" + line;
        text += line;
        needEnter = true;
    }

    int KVpairNum = 0, loc = 0;
    int nextStop = 0;
    // Read Pair
    varToFrec.clear();
    for (std::string::iterator iter = text.begin(); iter != text.end(); iter++) {
        if (iter != text.begin()) {
            varToFrec[(*iter) & 0x7F] = std::floor(varToFrec[(*(iter - 1 - (varToFrec.size() >= 2))) & 0x7F] / chToDb(*(iter + 1)));
            if (nextStop) {
                loc = ((*iter) >> (BITSIZE - 1)) & 0x01;
                break;
            }
            else {
                nextStop = ((*iter) >> (BITSIZE - 1)) & 0x01;
            }
            iter++;
        }
        else {
            nextStop = ((*iter) >> (BITSIZE - 1)) & 0x01;
            varToFrec[(*iter) & 0x7F] = initC;
        }
    }
    buildTree(false);
    unsigned char HFLast = *text.rbegin();
    if (loc) {
        loc = BITSIZE;
        while (HFLast & (1 << (--loc)));
    }
    else {
        loc = BITSIZE;
    }

    int PairSize = 2 * varToFrec.size() - 1;
    // Read Huffman Code
    Node *curr = root;
    for (std::string::iterator iter = text.begin() + PairSize; iter != text.end(); iter++) {
        unsigned char Tm = *iter;
        int Range = (iter + 1 == text.end() ? loc : BITSIZE);
        for (int inc = 0; inc < Range; inc++) {
            curr = ((Tm >> inc) & 0x1) ? curr->right : curr->left;
            if (curr->isLeaf) {
                output << curr->var;
                curr = root;
            }
        }
    }

    printf("Decompress succeed!\n");

    input.close();
    output.close();
}

void HuffMan::Operate() {
    bool quit = false;
    int  opeNum = 0;
    while (!quit) {
        printf("-------------------\n");
        printf("1. Compress   \n");
        printf("2. Decompress \n");
        printf("3. Exit       \n");
        printf("-------------------\n");
        printf(">> ");
        scanf("%d", &opeNum);
        getchar();

        if (opeNum == COMPRESS) {
            if (getFilestream()) {
                buildTree();
                compress();
            }
        }
        else if (opeNum == DECOMPRESS) {
            if (getFilestream()) {
                decompress();
            }
        }
        else if (opeNum == EXIT) {
            quit = true;
        }
    }
}

inline unsigned char dbToCh(double db) {
    unsigned char Ch = '\0';
    for (int mv = 0; mv < 8; mv++) {
        Ch |= (unsigned char)((db * 2 >= 1) << (7 - mv));
        db = 2 * db >= 1 ? 2 * db - 1 : 2 * db;
    }
    return Ch;
}

inline double chToDb(unsigned char Ch) {
    double db = 0;
    for (int mv = 0; mv < 8; mv++) {
        db += pow(2, mv - 8) * ((Ch >> mv) & 0x01);
    }
    return db;
}
C++

#算法

Huffman编码压缩文件时的文件储存结构

https://blog.scubot.com/article/c83f/

作者

贺翔/CarOL

发布于

2018年12月4日

许可协议

基本空间向量搜索引擎理论(译) 上一篇

基于ID3算法的决策树实现下一篇

Huffman编码压缩文件时的文件储存结构

文件结构

说明

整体结构

Pair

Pair的计算与存储

Huff

总结

代码片段(Huff.h)

$P a i r$

$P a i r$ 的计算与存储

$H u f f$