Krira Chunker

Beta

High-Performance Rust Chunking Engine for RAG Pipelines

Process gigabytes of text in seconds. 40x faster than LangChain, with constant (O(1)) memory usage.

Installation

```bash
pip install krira-augment
```

Quick Usage

```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig, SplitStrategy

config = PipelineConfig(
    chunk_size=512,
    strategy=SplitStrategy.SMART,
    clean_html=True,
    clean_unicode=True,
)

pipeline = Pipeline(config=config)

result = pipeline.process("sample.csv", output_path="output.jsonl")

print(result)
print(f"Chunks Created: {result.chunks_created}")
print(f"Execution Time: {result.execution_time:.2f}s")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
print(f"Preview: {result.preview_chunks[:3]}")
```
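The `output_path` file is newline-delimited JSON. The exact record schema isn't documented here, but the integration examples below read each line as an object with a `text` field and optional `metadata`. A minimal sketch of reading it back, under that assumption (it writes a tiny illustrative file first so the snippet runs on its own):

```python
import json

# Illustrative records only: each line of the real output file is assumed
# to be a JSON object with a "text" field and optional "metadata".
sample = [
    {"text": "first chunk", "metadata": {"source": "sample.csv"}},
    {"text": "second chunk"},
]
with open("output.jsonl", "w") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# Read the chunks back, one JSON object per line
with open("output.jsonl") as f:
    chunks = [json.loads(line) for line in f]

print(len(chunks))        # 2
print(chunks[0]["text"])  # first chunk
```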

Performance Benchmark

Processing 42,448,765 chunks in 113.79 seconds (47.51 MB/s):

```text
============================================================
✅ KRIRA AUGMENT - Processing Complete
============================================================
📊 Chunks Created: 42,448,765
⏱️ Execution Time: 113.79 seconds
🚀 Throughput: 47.51 MB/s
📁 Output File: output.jsonl
============================================================

📝 Preview (Top 3 Chunks):
------------------------------------------------------------
[1] event_time,event_type,product_id,category_id,...
[2] 2019-10-01 00:00:00 UTC,view,44600062,...
[3] 2019-10-01 00:00:00 UTC,view,3900821,...
```

Complete Example: Local (ChromaDB) - FREE

No API keys required. Runs entirely on your machine.

```bash
pip install sentence-transformers chromadb
```
```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig
from sentence_transformers import SentenceTransformer
import chromadb
import json

# Step 1: Chunk the file (Rust core)
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)
result = pipeline.process("sample.csv", output_path="chunks.jsonl")

# Step 2: Embed and store (local)
print("Loading model...")
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.get_or_create_collection("my_rag_db")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        embedding = model.encode(chunk["text"])

        metadata = chunk.get("metadata")
        collection.add(
            ids=[f"chunk_{line_num}"],
            embeddings=[embedding.tolist()],
            metadatas=[metadata] if metadata else None,
            documents=[chunk["text"]],
        )
```

Cloud Integrations

Swap the local embedding step for one of these integrations if you have API keys. The example below uses OpenAI embeddings with Pinecone.

```python
from openai import OpenAI
from pinecone import Pinecone
import json

client = OpenAI(api_key="sk-...")
pc = Pinecone(api_key="pcone-...")
index = pc.Index("my-rag")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)

        response = client.embeddings.create(
            input=chunk["text"],
            model="text-embedding-3-small"
        )
        embedding = response.data[0].embedding

        index.upsert(vectors=[(
            f"chunk_{line_num}",
            embedding,
            chunk.get("metadata", {})
        )])
```

Streaming Mode (No Files)

Process chunks without writing anything to disk, which suits real-time pipelines.

```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig
from sentence_transformers import SentenceTransformer

# Configure pipeline and embedding model
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Stream and embed (no file created)
for chunk in pipeline.process_stream("data.csv"):
    # process each chunk directly
    embedding = model.encode(chunk["text"])
    # store immediately...
```

When to Use Streaming vs File-Based

Use streaming when:

- You need maximum speed (no disk writes)
- You are feeding a real-time pipeline
- Disk space is limited

Use file-based when:

- You want to inspect or debug chunks
- You need to re-process the same data
- You are sharing chunks with others

Supported Formats

| Format | Extension | Method |
|--------|-----------|--------|
| CSV | `.csv` | Direct processing |
| Text | `.txt` | Direct processing |
| JSONL | `.jsonl` | Direct processing |
| JSON | `.json` | Auto-flattening |
| PDF | `.pdf` | pdfplumber extraction |
| Word | `.docx` | python-docx extraction |
| Excel | `.xlsx` | openpyxl extraction |
| XML | `.xml` | ElementTree parsing |
| URLs | `http://` | BeautifulSoup scraping |
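The library's exact auto-flattening behavior for `.json` files isn't shown here, but the general idea is to turn nested structures into flat key-value records that can be chunked as text. A rough illustrative sketch using dotted key paths (the `flatten` helper is hypothetical, not the library's implementation):

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into dotted key-value pairs.
    Illustrative only: the library's actual auto-flattening may differ."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{i}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items

doc = json.loads('{"user": {"name": "Ada", "tags": ["a", "b"]}}')
print(flatten(doc))
# {'user.name': 'Ada', 'user.tags.0': 'a', 'user.tags.1': 'b'}
```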

Provider Comparison

| Provider | Cost |
|----------|------|
| OpenAI + Pinecone | Paid |
| OpenAI + Qdrant | Paid |
| SentenceTransformers + ChromaDB | FREE |
| Hugging Face + FAISS | FREE |

Development

Open Source on GitHub
github.com/Krira-Labs/krira-chunker

Local Development

```bash
# Clone the repository
git clone https://github.com/Krira-Labs/krira-chunker.git
cd krira-chunker

# Install Maturin (Rust-Python build tool)
pip install maturin

# Build and install locally
maturin develop
```