KAIRA 5B | Documentation

Abstract: This document describes the technical architecture and "Curriculum Learning" syllabus of the KAIRA (Knowledge Augmented Intelligent Rational Agent) project. The project aims to train a 5 Billion (5B) parameter, Gemma-2 based model from scratch on a 150 Billion token dataset and a 104,000-entry structured morphological dictionary. The goal is to compete with 100B+ models in conversational ability through Knowledge Distillation and Dictionary Injection.
                


1. Vision and Approach

KAIRA is intended to be more than a standard LLM: it is designed to internalize, in its learned representations, the morphological structure and cultural depth (idioms, sarcasm, emotion) of the Turkish language.

Key Strategies

Efficiency-First Architecture: A 5B parameter model small enough to target mobile devices, aiming for the kind of strength-per-compute efficiency associated with a 3200 Elo chess engine.
Knowledge Distillation: Transferring knowledge from 9B+ teacher models through their output logits.
Dictionary Injection: Integrating the 104,000-entry custom labeled dictionary into the tokenizer and embedding layers.

2. Phase 0: Tokenizer Surgery

Before training, the model's vocabulary is modified based on the project's greatest asset: the Custom Dictionary.

Operation: Multi-word idioms in the 104k dictionary (e.g., "Bambu ağacından takım kim ben kim") are added as single tokens.
Goal: Have the model perceive each idiom as a single concept rather than as fragmented words.
Smart Initialization: Embedding weights for new tokens are initialized by averaging the embeddings of their constituent words rather than randomly.
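
The smart initialization step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `add_idiom_token`, the toy vocabulary, and the 2-dimensional embedding values are all hypothetical, and it assumes every whitespace-separated word of the idiom already has its own row in the embedding table (a real subword tokenizer would average over subword pieces instead).

```python
from statistics import fmean


def add_idiom_token(vocab, embeddings, idiom):
    """Register `idiom` as one token whose embedding is the mean of its words' embeddings."""
    if idiom in vocab:
        return vocab[idiom]
    parts = idiom.split()
    missing = [word for word in parts if word not in vocab]
    if missing:
        raise KeyError(f"constituent words not in vocabulary: {missing}")
    rows = [embeddings[vocab[word]] for word in parts]
    # Column-wise mean over the constituent rows (instead of random init).
    mean_row = [fmean(column) for column in zip(*rows)]
    new_id = len(embeddings)
    vocab[idiom] = new_id
    embeddings.append(mean_row)
    return new_id


# Toy vocabulary with 2-dimensional embeddings (hypothetical values).
vocab = {"lafa": 0, "bakılmaz": 1}
embeddings = [[1.0, 3.0], [3.0, 5.0]]
idiom_id = add_idiom_token(vocab, embeddings, "lafa bakılmaz")
print(idiom_id, embeddings[idiom_id])
```

Starting a new token at the mean of its parts places it near meanings the model can already represent, which is the usual motivation for preferring this over random initialization.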

Dictionary Data Structure

The core of KAIRA's intelligence lies in its structured dictionary data. The raw sample entries here, one English and one Turkish, show the fields recorded for a single concept.

                
                    <entry>
                    <WORD>Actions speak louder than words</WORD>
                    <DEFINITION>What people do is more important/revealing than what they say.</DEFINITION>
                    <EXAMPLE>He kept promising to help, but actions speak louder than words.</EXAMPLE>
                    <SYNONYM>deeds, execution</SYNONYM>
                    <ANTONYM>empty promises, lip service</ANTONYM>
                    <CONCEPT>integrity and reality</CONCEPT>
                    <EQUIVALENT>Ayinesi iştir kişinin lafa bakılmaz</EQUIVALENT>
                    <SENTIMENT>trustworthy</SENTIMENT>
                    <LICENSE>CC BY-NC - Umut Kökgöz</LICENSE>
                    <MORPHOLOGY>
                    <PART><TOKEN>Action:Noun</TOKEN><ANALYSIS>action:Noun+Plural</ANALYSIS></PART>
                    <PART><TOKEN>speak:Verb</TOKEN><ANALYSIS>speak:Verb+Pres</ANALYSIS></PART>
                    ... (Morphological Analysis Continues) ...
                    </MORPHOLOGY>
                    </entry>
                
                
                    <kelime>
                    <KELIME>Allah bir dediğinden başka sözüne inanılmaz</KELIME>
                    <ANLAM>İnancın veya bir şeyin mutlak doğruluğuna inanmak anlamına gelir.</ANLAM>
                    <ORNEK>Allah bir dediğinden başka sözüne inanılmaz, o halde onu takip etmeliyiz.</ORNEK>
                    <BENZER>inandırmak, mutlak olmak</BENZER>
                    <ZIT>şüphe etmek, kuşkulanmak</ZIT>
                    <KAVRAM>inandırıcılık ve mutlaklık</KAVRAM>
                    <ESDEGER>to have absolute faith in, to believe unconditionally</ESDEGER>
                    <DUYGU>inandırıcı</DUYGU>
                    <LISANS>CC BY-NC - Umut Kökgöz</LISANS>
                    <MORFOLOJI>
                    <PARCA><TOKEN>Allah:Noun,Prop</TOKEN><ANALIZ>allah:Noun+A3sg</ANALIZ></PARCA>
                    <PARCA><TOKEN>bir:Adj</TOKEN><ANALIZ>bir:Adj</ANALIZ></PARCA>
                    <PARCA><TOKEN>bir:Num,Card</TOKEN><ANALIZ>bir:Num</ANALIZ></PARCA>
                    ... (Morphological Analysis Continues) ...
                    </MORFOLOJI>
                    </kelime>
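
As a rough illustration of how such an entry could be consumed, the sketch below parses a shortened copy of the English sample with Python's standard `xml.etree.ElementTree`. The function name `parse_entry` and the shape of the returned dictionary are assumptions made for this example; the Turkish entries use different tag names (`KELIME`, `ANLAM`, `MORFOLOJI`, ...) and would need their own mapping.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the English sample entry.
ENTRY_XML = """
<entry>
<WORD>Actions speak louder than words</WORD>
<DEFINITION>What people do is more important/revealing than what they say.</DEFINITION>
<SYNONYM>deeds, execution</SYNONYM>
<EQUIVALENT>Ayinesi iştir kişinin lafa bakılmaz</EQUIVALENT>
<SENTIMENT>trustworthy</SENTIMENT>
<MORPHOLOGY>
<PART><TOKEN>Action:Noun</TOKEN><ANALYSIS>action:Noun+Plural</ANALYSIS></PART>
<PART><TOKEN>speak:Verb</TOKEN><ANALYSIS>speak:Verb+Pres</ANALYSIS></PART>
</MORPHOLOGY>
</entry>
"""


def parse_entry(xml_text):
    """Turn one <entry> element into a plain dict (hypothetical layout)."""
    root = ET.fromstring(xml_text.strip())
    record = {}
    for child in root:
        if child.tag == "MORPHOLOGY":
            continue  # handled separately below
        record[child.tag.lower()] = (child.text or "").strip()
    # Comma-separated synonyms become a list.
    raw_synonyms = record.pop("synonym", "")
    record["synonyms"] = [s.strip() for s in raw_synonyms.split(",") if s.strip()]
    # Each <PART> becomes a (token, analysis) pair.
    record["morphology"] = [
        (part.findtext("TOKEN"), part.findtext("ANALYSIS"))
        for part in root.iter("PART")
    ]
    return record


entry = parse_entry(ENTRY_XML)
print(entry["word"], entry["synonyms"], entry["morphology"])
```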
                
            

3. Training Curriculum

To maximize learning capacity, the 150 Billion token dataset is presented in stages rather than all at once. Three training phases are described below.

Phase 1: Apprenticeship (Construction)

Volume: 60 Billion Tokens (First 40%)
Source: YouTube Transcripts, Clean Web Data.
Method: Classic training on hard labels (next-token targets). No teacher model.
Goal: Learning grammar, sentence structure, and fluency.

Phase 2: Journeyman (Transfer)

Volume: 60 Billion Tokens (Middle 40%)
Source: Books, Articles, Encyclopedic Knowledge.
Method: Knowledge Distillation with a KL-divergence loss against the teacher model's output distribution.
Goal: Logical reasoning, cause-and-effect, deep understanding.
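
The difference between the hard-label loss of Phase 1 and the distillation loss of Phase 2 can be illustrated for a single token position. This is a sketch under assumptions the source does not state: the temperature value, the T² scaling (a common convention in the distillation literature), and the toy logits are all illustrative, and a real run would compute this over batches with a tensor library.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_id):
    """Hard-label loss: negative log-probability of the correct token."""
    return -math.log(softmax(student_logits)[target_id])


def kd_kl_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


# Toy logits over a 3-token vocabulary (hypothetical values).
teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 0.2]
hard = cross_entropy(student, target_id=0)
soft = kd_kl_loss(student, teacher)
same = kd_kl_loss(teacher, teacher)
print(hard, soft, same)
```

The hard-label loss only rewards probability on the one correct token, while the KL term also pushes the student to match how the teacher ranks the remaining tokens; it is zero when the two distributions agree.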

Phase 3: Mastery & Culture

Volume: 25 Billion Tokens (Last ~17%)
Source: 104k Custom Dictionary (Weighted) + Quality Dialogues.
Goal: Emotional intelligence, idiom usage, a "One of Us" speaking style, and fewer hallucinations.
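
One way to picture the curriculum is as a lookup from the number of tokens consumed so far to the active phase. The sketch below uses only the volumes listed above; the names `PHASES`, `BOUNDARIES`, and `phase_at` are hypothetical. Note that the three listed volumes sum to 145 Billion tokens, so the function returns `None` beyond that point rather than guessing how the remainder of the 150 Billion token dataset is used.

```python
from bisect import bisect_right
from itertools import accumulate

# (phase name, token budget) taken from the phase descriptions.
PHASES = [
    ("Phase 1: Apprenticeship", 60_000_000_000),
    ("Phase 2: Journeyman (Transfer)", 60_000_000_000),
    ("Phase 3: Mastery & Culture", 25_000_000_000),
]
# Cumulative token counts at which each phase ends.
BOUNDARIES = list(accumulate(budget for _, budget in PHASES))


def phase_at(tokens_seen):
    """Return the phase that the next training token belongs to, or None past the listed budgets."""
    if tokens_seen < 0:
        raise ValueError("tokens_seen must be non-negative")
    index = bisect_right(BOUNDARIES, tokens_seen)
    if index >= len(PHASES):
        return None
    return PHASES[index][0]


print(phase_at(0))
print(phase_at(75_000_000_000))
print(phase_at(130_000_000_000))
```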

Figure 2: Surgical Data Cleaning Pipeline

Figure 3: Massive Data Collection Strategy

4. Advanced Tuning: The "Reflection" Protocol

Standard LLMs often hallucinate, in part because they commit to an answer immediately. KAIRA is being fine-tuned on a specialized "System 2" dataset that trains the model to think, criticize its own reasoning, and correct itself before giving a final answer.

We are generating 20,000+ high-quality synthetic samples in which the Teacher model demonstrates Self-Correction in both Turkish and English contexts.

Sample Training Data Structure (JSONL)

                
                    {
                    "instruction": "There are 3 cats in a room. 2 leave, 1 enters. How many living beings remain in the room?",
                    "output":
                    "<thought>
                    Mathematically: Start 3. Left 2 (3-2=1). Entered 1 (1+1=2). Result should be 2 cats.
                    </thought>

                    <reflection>
                    Wait a minute! The question doesn't ask 'how many cats', it asks 'how many living beings'. Am I (the observer) in the room?
                    Not specified. Cats are living. Did the ones who left leave the room entirely or are they at the door?
                    Logic implies inside the room. I should count only the cats.
                    </reflection>

                    <correction>
                    The number of cats in the room is 2. Since no other living being is specified, the answer is 2.
                    </correction>

                    <answer>
                    Based on the information provided, there are 2 living beings (cats) in the room.
                    </answer>"
                    }
                
                
                    {
                    "instruction": "Bir odada 3 kedi var, 2'si çıktı, 1'i geri girdi. Odada kaç canlı kaldı?",
                    "output":
                    "<dusunce>
                    Matematiksel olarak bakalım: Başlangıç 3. Çıkan 2 (3-2=1). Giren 1 (1+1=2). Sonuç 2 kedi olmalı.
                    </dusunce>

                    <refleksiyon>
                    Dur bir dakika! Soru 'kaç kedi' demiyor, 'kaç canlı' diyor. Odada ben (gözlemci) var mıyım?
                    Soruda belirtilmemiş. Kediler canlıdır. Peki çıkanlar odayı terk etti mi yoksa kapıda mı duruyor?
                    Genel mantık kurallarına göre odanın içi soruluyor. Sadece kedileri saymalıyım.
                    </refleksiyon>

                    <duzeltme>
                    Odadaki kedi sayısı 2'dir. Başka bir canlı belirtilmediği için cevap 2 olmalıdır.
                    </duzeltme>

                    <cevap>
                    Verilen bilgilere göre odada 2 canlı (kedi) vardır.
                    </cevap>"
                    }
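
A sample in this format can be checked mechanically before it enters the fine-tuning set. The sketch below is an assumption about how such a check might look, not part of the KAIRA pipeline: `split_reflection` is a hypothetical helper that requires the four sections to appear in order and returns their text, or `None` if any is missing. The same function handles the Turkish tag names when they are passed explicitly.

```python
import re

EN_SECTIONS = ("thought", "reflection", "correction", "answer")
TR_SECTIONS = ("dusunce", "refleksiyon", "duzeltme", "cevap")


def split_reflection(output, sections=EN_SECTIONS):
    """Return {tag: text} if every section appears in order, otherwise None."""
    result = {}
    position = 0
    for tag in sections:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        match = pattern.search(output, position)
        if match is None:
            return None
        result[tag] = match.group(1).strip()
        position = match.end()  # the next section must start after this one
    return result


sample = (
    "<thought>Start 3. Left 2 (3-2=1). Entered 1 (1+1=2).</thought>\n"
    "<reflection>The question asks about living beings, not only cats.</reflection>\n"
    "<correction>No other living being is specified, so the answer is 2.</correction>\n"
    "<answer>There are 2 living beings (cats) in the room.</answer>"
)
parsed = split_reflection(sample)
incomplete = split_reflection("<answer>2</answer>")
turkish = split_reflection(
    "<dusunce>a</dusunce><refleksiyon>b</refleksiyon><duzeltme>c</duzeltme><cevap>d</cevap>",
    TR_SECTIONS,
)
print(parsed["answer"])
```

The same split could also be used at inference time to show the user only the final answer section.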
                
            

5. Technical Configuration (A100)

Parameter	Value
Architecture	Gemma-2 (Decoder-Only)
Parameter Count	~5 Billion (5B)
Context Window	4096 tokens
Hidden Size	4096
Layers	32
Attention Heads	32
Tokenizer	SentencePiece + Custom Tokens
Training Precision	BF16 (Bfloat16)
Optimizer	AdamW (Fused)
Estimated Training Compute	~3 ZettaFLOPs
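
For orientation, the compute and memory figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses the widely quoted approximation of about 6 floating-point operations per parameter per training token; this rule of thumb is an assumption here and not necessarily how the figure in the table was derived. Under it, 5B parameters and 150 Billion tokens give roughly 4.5 ZettaFLOPs, the same order of magnitude as the tabulated value; the gap may reflect a different counting convention.

```python
def training_flops(parameters, tokens, flops_per_param_token=6):
    """Rule-of-thumb total training compute: ~6 FLOPs per parameter per token."""
    return flops_per_param_token * parameters * tokens


def weight_memory_gb(parameters, bytes_per_param=2):
    """Memory for the weights alone; BF16 stores each parameter in 2 bytes."""
    return parameters * bytes_per_param / 1e9


PARAMETERS = 5_000_000_000   # ~5B, from the table
TOKENS = 150_000_000_000     # 150 Billion token dataset

flops = training_flops(PARAMETERS, TOKENS)
zetta = flops / 1e21
bf16_gb = weight_memory_gb(PARAMETERS)
print(f"{zetta:.1f} ZettaFLOPs, {bf16_gb:.0f} GB of BF16 weights")
```

The weight figure excludes optimizer state, gradients, and activations, which dominate memory use during training.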

KAIRA 5B Strategy
