ONNX Model Library

Documentation

Llama 3 8B Instruct, compressed in one shot with SparseGPT, SmoothQuant, and GPTQ to 50% sparsity with INT8 weights and activations.
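As a toy illustration of the INT8 scheme the recipe below requests for Linear layers (symmetric, per-channel weights), here is a minimal NumPy sketch of symmetric per-channel quantization and dequantization. This is not SparseML's implementation, only the underlying arithmetic:

```python
# Toy sketch (NumPy, not SparseML) of symmetric per-channel INT8
# weight quantization: one scale per output channel (row), with the
# largest-magnitude weight in each row mapped to 127.
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    # One scale per output channel.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step per element.
assert np.max(np.abs(w - w_hat)) <= scale.max()
```

Per-channel (rather than per-tensor) scales are what the recipe's `strategy: channel` asks for; each row gets its own dynamic range, which is why the card cites better accuracy.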

Built with SparseML and DeepSparse v1.7. Install with: pip install deepsparse~=1.7 "sparseml[transformers]"~=1.7 "numpy<2"

Below is the script used for compression with SparseML:

from datasets import load_dataset
from sparseml.transformers import (
    SparseAutoModelForCausalLM,
    SparseAutoTokenizer,
    compress,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto"
)
tokenizer = SparseAutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
dataset = load_dataset("garage-bAInd/Open-Platypus")


def format_data(data):
    instruction = tokenizer.apply_chat_template(
        [{"role": "user", "content": data["instruction"]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    return {"text": instruction + data["output"]}


dataset = dataset.map(format_data)

recipe = """
compression_stage:
    run_type: oneshot
    oneshot_modifiers:
        QuantizationModifier:
            ignore:
                # These operations don't make sense to quantize
                - LlamaRotaryEmbedding
                - LlamaRMSNorm
                - SiLUActivation
                - QuantizableMatMul
                # Skip quantizing the layers with the most sensitive activations
                - model.layers.1.mlp.down_proj
                - model.layers.31.mlp.down_proj
                - model.layers.14.self_attn.q_proj
                - model.layers.14.self_attn.k_proj
                - model.layers.14.self_attn.v_proj
            post_oneshot_calibration: true
            scheme_overrides:
                # Enable channelwise quantization for better accuracy
                Linear:
                    weights:
                        num_bits: 8
                        symmetric: true
                        strategy: channel
                # For the embeddings, only weight-quantization makes sense
                Embedding:
                    input_activations: null
                    weights:
                        num_bits: 8
                        symmetric: false
        SparseGPTModifier:
            sparsity: 0.5
            quantize: True
            targets: ['re:model.layers.\\d*$']
\"\"\"

compress(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    recipe=recipe,
    output_dir="./one-shot-checkpoint",
)
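The recipe's `sparsity: 0.5` means half the weights in each targeted layer end up zero. SparseGPT selects which weights to zero with a Hessian-based criterion; the sketch below uses plain magnitude pruning instead, only to make the 50% target concrete:

```python
# Toy illustration (NumPy, not SparseGPT) of pruning a weight matrix
# to 50% sparsity by zeroing the smallest-magnitude entries.
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    k = int(w.size * sparsity)                    # number of weights to zero
    threshold = np.sort(np.abs(w), axis=None)[k]  # k-th smallest magnitude
    return np.where(np.abs(w) < threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_sparse = magnitude_prune(w, 0.5)
assert np.mean(w_sparse == 0) >= 0.5  # at least half the entries are zero
```

DeepSparse exploits this unstructured sparsity at inference time by skipping the zeroed multiplications.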

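For reference, the `format_data` helper in the script above concatenates the chat-templated instruction with the expected output. For Llama 3 Instruct, `apply_chat_template(..., add_generation_prompt=True)` produces roughly the string built below; the exact special tokens are hand-written here as an assumption and should be checked against the template in the model's tokenizer_config.json:

```python
# Hand-written approximation (assumption) of the Llama 3 Instruct chat
# template that format_data() relies on via apply_chat_template().
def llama3_user_prompt(instruction: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{instruction}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

# format_data() appends the reference answer after the generation prompt.
text = llama3_user_prompt("What is sparsity?") + "Sparsity is ..."
```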
mgoin/Meta-Llama-3-8B-Instruct-pruned50-quant-ds

Author: mgoin

text-generation transformers

Created: 2024-06-28 16:13:17+00:00

Updated: 2024-06-28 16:59:41+00:00

View on Hugging Face

Files (9)

.gitattributes
README.md
config.json
model-orig.onnx ONNX
model.data
model.onnx ONNX
special_tokens_map.json
tokenizer.json
tokenizer_config.json