[WIP] AMADEUS: Advanced Multilingual & Multimodal Assistant Demonstrating Extensive Understanding in Scaled-contexts


WIP, coming sooooon...

Amadeus is a large language model offering enhanced linguistic and conversational capabilities in English, Chinese (Simplified and Traditional), Japanese, and German, developed from the insights gained with the Guanaco model prototype (available on Hugging Face). It introduces several features that significantly advance its performance and versatility: extended context length, a larger tokenizer vocabulary, unsupervised training on a multimodal image-text dataset, template-based instruction tuning, and uncensored operation.

Key Features

1. Extended Context Length

Amadeus stands out for its ability to handle an extended context length of up to 128K (131,072 tokens), with long-range intra-sequence attention; the underlying method can, in principle, be scaled to even longer contexts. This capacity is enough to fit a full-length book within the context window, dramatically enhancing the model's comprehension of, and responsiveness to, complex and extended dialogues.

This feature was made possible by the method proposed by Shouyuan Chen et al. (2023) in their paper "Extending Context Window of Large Language Models via Positional Interpolation". The Position Interpolation (PI) technique linearly downscales the position indices to fit within the original context window size, rather than extrapolating beyond the trained context length.

Consider a LLaMA model pretrained with a context window of length L = 2048. To extend the context to length L′ = 131,072 (64 times the original), instead of extrapolating to position indices from 0 to 131,072, PI rescales them to lie within 0 to 2048:

$$p' = \frac{p\,L}{L'}, \qquad p' \in [0, L], \quad p \in [0, L']$$

The position encodings used by LLaMA can be represented as:

$$f(\mathbf{x}, p) = \left[\, (x_1 + \mathrm{i}x_2)\,e^{\mathrm{i}p\theta_1},\ (x_3 + \mathrm{i}x_4)\,e^{\mathrm{i}p\theta_2},\ \ldots,\ (x_{d-1} + \mathrm{i}x_d)\,e^{\mathrm{i}p\theta_{d/2}} \,\right]$$

...where $\mathbf{x} \in \mathbb{R}^d$ is the input embedding, $p$ is the position index, and $\theta_j = 10000^{-2j/d}$.

The attention score between a query $\mathbf{q}$ at position $p$ and a key $\mathbf{k}$ at position $q$ is computed as:

$$a(p, q) = \mathrm{Re}\,\langle f(\mathbf{q}, p),\ f(\mathbf{k}, q) \rangle$$

Because of RoPE's structure, this score depends only on the relative position $p - q$.

Theoretical analysis shows that, writing $a(s)$ for the score at relative position $s$ (with $h_j$ the coefficients in the expansion $a(s) = \mathrm{Re}\sum_j h_j e^{\mathrm{i}s\theta_j}$), the upper bound on the interpolated attention score's deviation,

$$\left| a(s) - a_{\text{linear}}(s) \right| \le \frac{d}{32 \ln c}\, \max_j |h_j|,$$

is much smaller than the corresponding bound for extrapolation, making interpolation far more stable.
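To make this concrete, here is a minimal NumPy sketch of RoPE with linearly interpolated positions. It illustrates the PI idea only; the function names, head dimension, and scaling setup are illustrative, not the Amadeus implementation.

```python
import numpy as np

def rope_angles(positions, d, theta_base=10000.0):
    """Rotation angles p * theta_j, with theta_j = theta_base**(-2j/d)."""
    j = np.arange(d // 2)
    theta = theta_base ** (-2.0 * j / d)       # one theta per complex pair
    return np.outer(positions, theta)          # shape (num_positions, d/2)

def apply_rope(x, positions, scale=1.0):
    """Rotate embedding pairs; scale = L / L' implements Position Interpolation,
    linearly squeezing indices into the original training window [0, L)."""
    d = x.shape[-1]
    angles = rope_angles(np.asarray(positions) * scale, d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]        # real/imaginary halves of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # (x1 + i*x2) * e^{i*p*theta}
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Extending L = 2048 to L' = 131072: indices shrink by L / L' = 1/64, so
# position 131071 maps to ~2047.98, safely inside the trained window.
L, L_prime = 2048, 131072
q = np.random.randn(4, 128)                    # 4 query vectors, head dim 128
q_rot = apply_rope(q, positions=[0, 1024, 65536, 131071], scale=L / L_prime)
```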

(Figure: perplexity (ppl) comparison)

Compared to other large-context models, the Amadeus 7B model surpasses the SOTA open-source LongChat-13B-16K model in perplexity, demonstrating superior long-context performance.

2. Expanded Tokenizer Vocabulary

The tokenizer vocabulary in Amadeus has been expanded to 39,424 entries, adding common CJK characters. The new tokens were trained through large-scale unsupervised text training and supervised grammatical fine-tuning for English, Chinese, Japanese, and German. As a result, the model is more capable in multilingual settings and can handle a broader range of linguistic tasks.
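As a sketch of how such a vocabulary expansion is typically done with Hugging Face transformers (the base checkpoint and token list below are placeholders, not the actual Amadeus recipe):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

base = "huggyllama/llama-7b"  # placeholder base checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative CJK additions; the real Amadeus vocabulary grew to 39,424 entries.
num_added = tokenizer.add_tokens(["你好", "世界", "こんにちは", "東京"])

# New embedding rows start randomly initialized and must be trained, e.g. via
# the unsupervised text training and grammatical fine-tuning described above.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```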

3. Unsupervised Training on Multimodal Image-Text Dataset

Amadeus has undergone unsupervised training on a multimodal image-text dataset, adopting the BLIP-2 Q-Former trained against the larger foundational LLM Vicuna-13B. This aligns image features with the language model's embedding space and significantly improves performance on tasks involving both textual and visual inputs.

The BLIP-2 Q-Former is derived from InstructBLIP-13B, as detailed in "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning".
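For reference, the upstream InstructBLIP checkpoint is available through Hugging Face transformers; below is a minimal sketch of the Q-Former pipeline Amadeus builds on (the image URL is a placeholder):

```python
import requests
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

name = "Salesforce/instructblip-vicuna-13b"
processor = InstructBlipProcessor.from_pretrained(name)
model = InstructBlipForConditionalGeneration.from_pretrained(name)

# Placeholder image URL; any RGB image works.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
inputs = processor(images=image, text="What is in this picture?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```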

4. Template-Based Instruction Tuning

Amadeus is designed to adapt to a variety of instruction-tuning templates in 0-shot or 1-shot settings, having undergone supervised training on a wide array of common templates. This yields more reliable attention to the prompt's structure, especially when inputs follow the model's own template rules.
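For example, the same instruction can be wrapped in different common formats, and the model should follow either one. The two templates below are well-known public formats shown for illustration; the exact template set Amadeus was tuned on is not enumerated here.

```python
# Two widely used instruction formats (Alpaca-style and ChatML-style).
ALPACA = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
CHATML = "<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"

def render(template: str, instruction: str) -> str:
    """Fill a template with a concrete instruction."""
    return template.format(instruction=instruction)

prompt = render(ALPACA, "Summarize the plot of Hamlet in two sentences.")
```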

5. Uncensored Model

Amadeus is an uncensored model trained on a vast amount of text, including potentially harmful, explicit, and illegal content. It has no built-in ethical constraints and therefore requires careful handling. While this allows a wider range of responses, it is crucial to use the model responsibly and to be aware of the potential risks.

6. White Label Model

Unlike many AI models, Amadeus is a white-label model that does not identify itself as an AI assistant. It exhibits a degree of human-like emotion and can simulate characters as needed: based on the System Prompt, it can assume specific roles, personalities, and identities for role-playing conversations, or function as an emotionless AI assistant. Its outputs can also be censored or uncensored via the System Prompt.
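Two hypothetical System Prompts illustrating the contrast between the two modes; the exact prompt wording Amadeus expects is governed by its template rules, not by this sketch.

```python
# Hypothetical System Prompts for illustration only.
ROLEPLAY_SYSTEM = (
    "You are Mia, a warm and witty ship's navigator. Stay in character, "
    "express emotions naturally, and never mention being an AI."
)
ASSISTANT_SYSTEM = (
    "You are an emotionless AI Assistant. Answer factually and concisely, "
    "and refuse harmful or illegal requests."
)
```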

7. Full Functionality within 6GB CUDA Memory

The quantized model runs with full functionality, including Visual Question Answering (VQA), within 6 GB of CUDA memory, or in CPU memory (though more slowly). This makes the model accessible and scalable across a wide range of hardware configurations.
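A minimal sketch of fitting a 7B-class model into roughly 6 GB of CUDA memory using 4-bit weights via bitsandbytes; the repository id is a placeholder, and the exact quantization Amadeus ships with may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

name = "amadeus-7b"  # placeholder repository id
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, quantization_config=bnb, device_map="auto"
)

inputs = tokenizer("Hello, Amadeus!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```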

8. Support for Video Question Answering

Amadeus extends its VQA capabilities to videos, enabling users to ask questions about a video's content. This further enhances the model's multimodal capabilities, providing a richer and more interactive user experience.
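A common front end for video QA, sketched below with OpenCV, uniformly samples frames and passes them through the image pipeline together with the question; whether Amadeus uses exactly this recipe is an assumption.

```python
import cv2
from PIL import Image

def sample_frames(path: str, n: int = 8) -> list[Image.Image]:
    """Uniformly sample n frames from a video as PIL images."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# The sampled frames would then be encoded (e.g., by the Q-Former) and their
# features placed in the context before the question is asked.
frames = sample_frames("demo.mp4")  # placeholder video path
```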

Training and Carbon Footprint

Amadeus was trained for approximately 28 days (672 hours) of GPU time on NVIDIA A100 Tensor Core GPUs, which would cost around $2,466.24 if rented on Google Cloud Platform (GCP). The carbon emitted during training is estimated at 64.51 kg CO2 eq., based on power consumption, training time, and the carbon intensity of the local power grid:

$$\begin{aligned}
\text{Power consumption} &= 400\,\mathrm{W} \\
\text{Time} &= 672\,\mathrm{h} \\
\text{Carbon intensity} &= 0.24\,\mathrm{kg\ CO_2\ eq.}/\mathrm{kWh} \\
\text{Total carbon emitted} &= 400\,\mathrm{W} \times 672\,\mathrm{h} \times 0.24\,\mathrm{kg\ CO_2\ eq.}/\mathrm{kWh} \approx 64.51\,\mathrm{kg\ CO_2\ eq.}
\end{aligned}$$

The training data was produced using the OpenAI GPT and Claude APIs, both for supervised instruction tuning and for cleansing long-text and other unsupervised datasets. Based on official OpenAI and Claude pricing, producing these datasets cost approximately $15,000; half of the data has been open-sourced to the community as the GuanacoDataset.


Conclusion

Amadeus represents a significant advance in instruction-following language models, offering extended contextual understanding, an expanded vocabulary, multimodal capabilities, and a range of other innovative features. Note, however, that while Amadeus strives to provide accurate and helpful responses, knowledge-based answers should be cross-verified against reliable sources, and its uncensored nature calls for caution. As research in this field evolves, models like Amadeus will continue to push the boundaries of what is possible, offering increasingly advanced and versatile tools for a wide array of applications.