How to Deploy Llama on Your Local Machine

February 2nd, 2024


This step-by-step guide shows you how to deploy and use Llama, Meta's cutting-edge language model, fueling professional innovation and unlocking new possibilities.

Master text generation, translation, code-writing, and more - all from the comfort of your local environment.

I. Introduction

In the era of generative AI, and large language models (LLMs) in particular, there are both paid options, such as ChatGPT and OpenAI's API, and several free, open-source LLMs. Some of these open-source models perform on par with, or even better than, ChatGPT (specifically GPT-3.5). One of them is Meta's powerful large language model, Llama.

This article presents a practical method for running Llama on your personal machine. This approach lets you customize Llama with your own data while addressing data-privacy concerns.

This article is a good fit for you if:

  • You want to try running Llama on your machine.
  • You are concerned about data privacy when using third-party LLM models.

II. Implementing LLMs on Your Own Machine

Running a large language model normally requires a lot of GPU memory and a strong CPU: a standard LLM stored at 32 bits per parameter needs roughly 280 GB of VRAM for a 70B model, or about 28 GB for a 7B model. For most companies, investing in hardware powerful enough to run LLMs in production is too expensive. Fortunately, there is ongoing research into shrinking LLMs while preserving their performance.

The idea behind running LLMs on your local machine is to lower the precision of each parameter, a technique known as quantization, going from the full model (32 bits per parameter) down to 8-bit or 4-bit parameters. This reduces the model size by a factor of 4 or 8 compared with the standard model. Quantized models can perform somewhat worse than the original: roughly a 12% drop in performance in exchange for about a 75% reduction in memory usage.
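
As a quick sanity check on these numbers, the short sketch below estimates the weight-only memory footprint of a 7B-parameter model at different precisions (actual usage is higher once activations and framework overhead are included):


# Rough, weights-only memory estimate for a 7B-parameter model.
# Real usage is higher: activations, the KV cache, and framework overhead add to this.
PARAMS = 7_000_000_000

for bits in (32, 16, 8, 4):
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:.1f} GB")

# Approximate output:
# 32-bit weights: ~28.0 GB
# 16-bit weights: ~14.0 GB
#  8-bit weights: ~7.0 GB
#  4-bit weights: ~3.5 GB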

Run an LLM locally with 100% CPU inference, using GPT4All, Hugging Face, and Gradio.

In this guide, I'll walk through the process of running an LLM on your personal computer. First, make sure GPT4All, Gradio, and the Hugging Face Hub client are installed on your machine:


!pip install gpt4all

!pip install gradio

!pip install huggingface_hub[cli,torch]

Additional details: GPT4All lets models run entirely on the CPU, Hugging Face hosts the open-source models available for download, and Gradio provides the basic user interface for interacting with the LLM. A minimal CPU-only sanity check with GPT4All is sketched below.
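
If you just want to confirm that CPU-only inference works before building the full app, here is a minimal sketch using the gpt4all package. The model filename is a placeholder assumption: pick any quantized Llama-family GGUF file from the GPT4All model catalog, and it will be downloaded automatically on first use.


from gpt4all import GPT4All

# The filename below is a placeholder; choose a quantized Llama GGUF model
# from the GPT4All catalog. It is downloaded automatically on first use.
model = GPT4All("llama-2-7b-chat.Q4_0.gguf", device="cpu")

with model.chat_session():
    reply = model.generate("Explain quantization in one sentence.", max_tokens=128)
    print(reply)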

After that, copy this code into a file named main.py:


import os
from threading import Thread
from typing import Iterator

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from transformers import BitsAndBytesConfig

# Load model weights in 4-bit NF4 precision (requires a CUDA GPU with bitsandbytes installed)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

MAX_MAX_NEW_TOKENS = 2048
DEFAULT_MAX_NEW_TOKENS = 1024
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))

DESCRIPTION = """\
# Llama-2 7B Chat
This Space demonstrates model [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat) by Meta, a Llama 2 model with 7B parameters fine-tuned for chat instructions. Feel free to play with it, or duplicate to run generations without a queue! If you want to run your own service, you can also [deploy the model on Inference Endpoints](https://huggingface.co/inference-endpoints).
🔎 For more details about the Llama 2 family of models and how to use them with `transformers`, take a look [at our blog post](https://huggingface.co/blog/llama2).
🔨 Looking for an even more powerful model? Check out the [13B version](https://huggingface.co/spaces/huggingface-projects/llama-2-13b-chat) or the large [70B model demo](https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI).
"""

LICENSE = """
<p/>
---
As a derivate work of [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat) by Meta,
this demo is governed by the original [license](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/main/LICENSE.txt) and [acceptable use policy](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/main/USE_POLICY.md).
"""

if not torch.cuda.is_available():
    DESCRIPTION += "\n<p>Running on CPU 🥶 This demo does not work on CPU.</p>"


if torch.cuda.is_available():
    model_id = "meta-llama/Llama-2-7b-chat-hf"
    # Load the gated Llama 2 chat model in 4-bit precision, spread across available GPUs
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quantization_config, device_map="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.use_default_system_prompt = False


def generate(
    message: str,
    chat_history: list[tuple[str, str]],
    system_prompt: str,
    max_new_tokens: int = 1024,
    temperature: float = 0.6,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.2,
) -> Iterator[str]:
    conversation = []
    if system_prompt:
        conversation.append({"role": "system", "content": system_prompt})
    for user, assistant in chat_history:
        conversation.extend([{"role": "user", "content": user}, {"role": "assistant", "content": assistant}])
    conversation.append({"role": "user", "content": message})

    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")
    if input_ids.shape[1] > MAX_INPUT_TOKEN_LENGTH:
        input_ids = input_ids[:, -MAX_INPUT_TOKEN_LENGTH:]
        gr.Warning(f"Trimmed input from conversation as it was longer than {MAX_INPUT_TOKEN_LENGTH} tokens.")
    input_ids = input_ids.to(model.device)

    # Stream generated tokens back to the UI as they are produced
    streamer = TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        {"input_ids": input_ids},
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=top_p,
        top_k=top_k,
        temperature=temperature,
        num_beams=1,
        repetition_penalty=repetition_penalty,
    )
    # Run generation in a background thread so tokens can be yielded as they arrive
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    outputs = []
    for text in streamer:
        outputs.append(text)
        yield "".join(outputs)


chat_interface = gr.ChatInterface(
    fn=generate,
    additional_inputs=[
        gr.Textbox(label="System prompt", lines=6),
        gr.Slider(
            label="Max new tokens",
            minimum=1,
            maximum=MAX_MAX_NEW_TOKENS,
            step=1,
            value=DEFAULT_MAX_NEW_TOKENS,
        ),
        gr.Slider(
            label="Temperature",
            minimum=0.1,
            maximum=4.0,
            step=0.1,
            value=0.6,
        ),
        gr.Slider(
            label="Top-p (nucleus sampling)",
            minimum=0.05,
            maximum=1.0,
            step=0.05,
            value=0.9,
        ),
        gr.Slider(
            label="Top-k",
            minimum=1,
            maximum=1000,
            step=1,
            value=50,
        ),
        gr.Slider(
            label="Repetition penalty",
            minimum=1.0,
            maximum=2.0,
            step=0.05,
            value=1.2,
        ),
    ],
    stop_btn=None,
    examples=[
        ["Hello there! How are you doing?"],
        ["Can you explain briefly to me what is the Python programming language?"],
        ["Explain the plot of Cinderella in a sentence."],
        ["How many hours does it take a man to eat a Helicopter?"],
        ["Write a 100-word article on 'Benefits of Open-Source in AI research'"],
    ],
)

with gr.Blocks(css="style.css") as demo:
    gr.Markdown(DESCRIPTION)
    gr.DuplicateButton(value="Duplicate Space for private use", elem_id="duplicate-button")
    chat_interface.render()
    gr.Markdown(LICENSE)

if __name__ == "__main__":
    demo.queue(max_size=20).launch(server_name='0.0.0.0')
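
One note before running it: the meta-llama/Llama-2-7b-chat-hf weights are gated on the Hugging Face Hub, so you must accept Meta's license on the model page and authenticate with an access token. A minimal sketch, assuming you already have a token with access (the token value below is a placeholder); running huggingface-cli login in a terminal works just as well:


from huggingface_hub import login

# Authenticate so from_pretrained can download the gated Llama 2 weights.
# Replace the placeholder with your own token from huggingface.co/settings/tokens.
login(token="hf_your_token_here")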

Run the command below to start the server; the app will then be available at http://0.0.0.0:7860/ (open http://localhost:7860/ in your browser).

python main.py
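
Once the server is up, you can also query it programmatically instead of through the browser. The sketch below uses gradio_client; the exact endpoint name and parameter order depend on your Gradio version, so call client.view_api() to see what your server actually exposes before relying on it.


from gradio_client import Client

client = Client("http://0.0.0.0:7860/")
client.view_api()  # lists the available endpoints and their parameters

# For a gr.ChatInterface the endpoint is typically "/chat"; the extra arguments
# mirror the additional_inputs defined in main.py (system prompt, max new tokens,
# temperature, top-p, top-k, repetition penalty). Adjust to match view_api().
result = client.predict(
    "Hello there! How are you doing?",  # message
    "",                                 # system prompt
    1024,                               # max new tokens
    0.6,                                # temperature
    0.9,                                # top-p
    50,                                 # top-k
    1.2,                                # repetition penalty
    api_name="/chat",
)
print(result)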

III. Conclusion

By deploying Llama, you've equipped yourself with an engine of professional innovation. Leverage its diverse capabilities to create top-notch solutions, augment your expertise, and fuel a culture of continuous learning.

Tags
Artificial Intelligence
Emerging Technologies
Technology