Krira Chunker

Beta

High-Performance Rust Chunking Engine for RAG Pipelines

Process gigabytes of text in seconds. 40x faster than LangChain, with constant (O(1)) memory usage.

Installation

```bash
pip install krira-augment
```

Quick Usage

```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig, SplitStrategy

config = PipelineConfig(
    chunk_size=512,
    strategy=SplitStrategy.SMART,
    clean_html=True,
    clean_unicode=True,
)

pipeline = Pipeline(config=config)

result = pipeline.process("sample.csv", output_path="output.jsonl")

print(result)
print(f"Chunks Created: {result.chunks_created}")
print(f"Execution Time: {result.execution_time:.2f}s")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
print(f"Preview: {result.preview_chunks[:3]}")
```
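The `output_path` file is newline-delimited JSON. The exact record schema isn't documented here, but the integration examples below read each line as an object with a `text` field and optional `metadata`. A minimal sketch of reading it back, under that assumption (it writes a tiny illustrative file first so the snippet runs on its own):

```python
import json

# Illustrative records only: each line of the real output file is assumed
# to be a JSON object with a "text" field and optional "metadata".
sample = [
    {"text": "first chunk", "metadata": {"source": "sample.csv"}},
    {"text": "second chunk"},
]
with open("output.jsonl", "w") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# Read the chunks back, one JSON object per line
with open("output.jsonl") as f:
    chunks = [json.loads(line) for line in f]

print(len(chunks))        # 2
print(chunks[0]["text"])  # first chunk
```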

Performance Benchmark

Processing 42,448,765 chunks in 113.79 seconds (47.51 MB/s):

```text
============================================================
✅ KRIRA AUGMENT - Processing Complete
============================================================
📊 Chunks Created: 42,448,765
⏱️ Execution Time: 113.79 seconds
🚀 Throughput: 47.51 MB/s
📁 Output File: output.jsonl
============================================================

📝 Preview (Top 3 Chunks):
------------------------------------------------------------
[1] event_time,event_type,product_id,category_id,...
[2] 2019-10-01 00:00:00 UTC,view,44600062,...
[3] 2019-10-01 00:00:00 UTC,view,3900821,...
```

Complete Example: Local (ChromaDB) - FREE

No API keys required. Runs entirely on your machine.

```bash
pip install sentence-transformers chromadb
```
```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig
from sentence_transformers import SentenceTransformer
import chromadb
import json

# Step 1: Chunk the file (Rust core)
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)
result = pipeline.process("sample.csv", output_path="chunks.jsonl")

# Step 2: Embed and store (local)
print("Loading model...")
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.get_or_create_collection("my_rag_db")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        embedding = model.encode(chunk["text"])

        metadata = chunk.get("metadata")
        collection.add(
            ids=[f"chunk_{line_num}"],
            embeddings=[embedding.tolist()],
            metadatas=[metadata] if metadata else None,
            documents=[chunk["text"]],
        )
```

Cloud Integrations

Swap the local embedding step for one of these integrations if you have API keys. The example below uses OpenAI embeddings with Pinecone.

```python
from openai import OpenAI
from pinecone import Pinecone
import json

client = OpenAI(api_key="sk-...")
pc = Pinecone(api_key="pcone-...")
index = pc.Index("my-rag")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)

        response = client.embeddings.create(
            input=chunk["text"],
            model="text-embedding-3-small"
        )
        embedding = response.data[0].embedding

        index.upsert(vectors=[(
            f"chunk_{line_num}",
            embedding,
            chunk.get("metadata", {})
        )])
```

Streaming Mode (No Files)

Process chunks without writing anything to disk, which suits real-time pipelines.

```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig
from sentence_transformers import SentenceTransformer

# Configure pipeline and embedding model
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Stream and embed (no file created)
for chunk in pipeline.process_stream("data.csv"):
    # process each chunk directly
    embedding = model.encode(chunk["text"])
    # store immediately...
```

When to Use Streaming vs File-Based

Use streaming when:

- You need maximum speed (no disk writes)
- You are feeding a real-time pipeline
- Disk space is limited

Use file-based when:

- You want to inspect or debug chunks
- You need to re-process the same data
- You are sharing chunks with others

Supported Formats

| Format | Extension | Method |
|--------|-----------|--------|
| CSV | `.csv` | Direct processing |
| Text | `.txt` | Direct processing |
| JSONL | `.jsonl` | Direct processing |
| JSON | `.json` | Auto-flattening |
| PDF | `.pdf` | pdfplumber extraction |
| Word | `.docx` | python-docx extraction |
| Excel | `.xlsx` | openpyxl extraction |
| XML | `.xml` | ElementTree parsing |
| URLs | `http://` | BeautifulSoup scraping |
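The library's exact auto-flattening behavior for `.json` files isn't shown here, but the general idea is to turn nested structures into flat key-value records that can be chunked as text. A rough illustrative sketch using dotted key paths (the `flatten` helper is hypothetical, not the library's implementation):

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into dotted key-value pairs.
    Illustrative only: the library's actual auto-flattening may differ."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{i}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items

doc = json.loads('{"user": {"name": "Ada", "tags": ["a", "b"]}}')
print(flatten(doc))
# {'user.name': 'Ada', 'user.tags.0': 'a', 'user.tags.1': 'b'}
```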

Provider Comparison

| Provider | Cost |
|----------|------|
| OpenAI + Pinecone | Paid |
| OpenAI + Qdrant | Paid |
| SentenceTransformers + ChromaDB | FREE |
| Hugging Face + FAISS | FREE |

Development

Open Source on GitHub
github.com/Krira-Labs/krira-chunker

Local Development

```bash
# Clone the repository
git clone https://github.com/Krira-Labs/krira-chunker.git
cd krira-chunker

# Install Maturin (Rust-Python build tool)
pip install maturin

# Build and install locally
maturin develop
```