Krira Chunker
High-Performance Rust Chunking Engine for RAG Pipelines (Beta)
Process gigabytes of text in seconds. 40x faster than LangChain with O(1) memory usage.
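The O(1) memory claim reflects a streaming design: the engine holds only the current window of text in memory rather than the whole file. A toy plain-Python illustration of that idea (not the Rust core, just the principle):

```python
import io

def stream_chunks(reader, chunk_size=16):
    """Yield fixed-size chunks without loading the whole input.

    Toy sketch: memory stays bounded by chunk_size regardless of
    total input length, which is the essence of O(1) chunking.
    """
    buf = ""
    for line in reader:
        buf += line
        while len(buf) >= chunk_size:
            yield buf[:chunk_size]
            buf = buf[chunk_size:]
    if buf:
        yield buf  # flush the remainder

data = io.StringIO("a" * 50)
chunks = list(stream_chunks(data, chunk_size=16))
print([len(c) for c in chunks])  # → [16, 16, 16, 2]
```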
Installation
```bash
pip install krira-augment
```
Quick Usage
```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig, SplitStrategy

config = PipelineConfig(
    chunk_size=512,
    strategy=SplitStrategy.SMART,
    clean_html=True,
    clean_unicode=True,
)

pipeline = Pipeline(config=config)

result = pipeline.process("sample.csv", output_path="output.jsonl")

print(result)
print(f"Chunks Created: {result.chunks_created}")
print(f"Execution Time: {result.execution_time:.2f}s")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
print(f"Preview: {result.preview_chunks[:3]}")
```
Performance Benchmark
Processing 42.4 million chunks in 113.79 seconds (47.51 MB/s).
```text
============================================================
✅ KRIRA AUGMENT - Processing Complete
============================================================
📊 Chunks Created: 42,448,765
⏱️ Execution Time: 113.79 seconds
🚀 Throughput: 47.51 MB/s
📁 Output File: output.jsonl
============================================================

📝 Preview (Top 3 Chunks):
------------------------------------------------------------
[1] event_time,event_type,product_id,category_id,...
[2] 2019-10-01 00:00:00 UTC,view,44600062,...
[3] 2019-10-01 00:00:00 UTC,view,3900821,...
```
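As a quick sanity check on the report above, throughput times elapsed time recovers the input volume:

```python
# Sanity-check the benchmark arithmetic: throughput x time = data volume
throughput_mb_s = 47.51
elapsed_s = 113.79
total_mb = throughput_mb_s * elapsed_s
print(f"~{total_mb:,.0f} MB (~{total_mb / 1000:.1f} GB) processed")  # → ~5,406 MB (~5.4 GB) processed
```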
Complete Example: Local (ChromaDB) - FREE
No API keys required. Runs entirely on your machine.
```bash
pip install sentence-transformers chromadb
```
```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig
from sentence_transformers import SentenceTransformer
import chromadb
import json

# Step 1: Chunk the file (Rust Core)
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)
result = pipeline.process("sample.csv", output_path="chunks.jsonl")

# Step 2: Embed and store (Local)
print("Loading model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.Client()
collection = client.get_or_create_collection("my_rag_db")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        embedding = model.encode(chunk["text"])

        collection.add(
            ids=[f"chunk_{line_num}"],
            embeddings=[embedding.tolist()],
            metadatas=[chunk["metadata"]] if chunk.get("metadata") else None,
            documents=[chunk["text"]],
        )
```
Cloud Integrations
Swap the local embedding step for these integrations if you have API keys.
```python
from openai import OpenAI
from pinecone import Pinecone
import json

client = OpenAI(api_key="sk-...")
pc = Pinecone(api_key="pcone-...")
index = pc.Index("my-rag")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)

        response = client.embeddings.create(
            input=chunk["text"],
            model="text-embedding-3-small"
        )
        embedding = response.data[0].embedding

        index.upsert(vectors=[(
            f"chunk_{line_num}",
            embedding,
            chunk.get("metadata", {})
        )])
```
Streaming Mode (No Files)
Process chunks without saving to disk, for maximum efficiency in real-time pipelines.
```python
from krira_augment.krira_chunker import Pipeline, PipelineConfig
from sentence_transformers import SentenceTransformer

# Configure pipeline
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

# Embedding model used in the loop below
model = SentenceTransformer('all-MiniLM-L6-v2')

# Stream and embed (no file created)
for chunk in pipeline.process_stream("data.csv"):
    # process chunk directly
    embedding = model.encode(chunk["text"])
    # store immediately...
```
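In streaming mode, one embedding call per chunk can dominate runtime; most embedding APIs accept batches. A minimal batching helper (plain Python; in a real pipeline the sample list would be `pipeline.process_stream(...)`):

```python
from itertools import islice

def batched(iterable, batch_size):
    """Group any iterable into lists of at most batch_size items."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# Stand-in for pipeline.process_stream("data.csv")
chunks = [{"text": f"chunk {i}"} for i in range(10)]
for batch in batched(chunks, 4):
    texts = [c["text"] for c in batch]
    # embeddings = model.encode(texts)  # one call per batch, not per chunk
    print(len(texts))  # → 4, then 4, then 2
```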
When to Use Streaming vs File-Based
Use Streaming When
- Maximum speed matters (no disk writes)
- Building real-time pipelines
- Disk space is limited

Use File-Based When
- Inspecting or debugging chunks
- Re-processing the same data
- Sharing chunks with other systems
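When you do write chunks to disk, the JSONL output is easy to inspect. A small stdlib-only script for eyeballing chunk counts and sizes (assumes the `text` field layout shown earlier; the sample file here is built inline so the snippet is self-contained):

```python
import json

def chunk_stats(path):
    """Return (chunk count, average text length) for a JSONL chunk file."""
    lengths = []
    with open(path) as f:
        for line in f:
            chunk = json.loads(line)
            lengths.append(len(chunk["text"]))
    return len(lengths), sum(lengths) / len(lengths)

# Build a tiny sample file so the script runs standalone
with open("chunks_sample.jsonl", "w") as f:
    for text in ["alpha", "beta gamma", "delta"]:
        f.write(json.dumps({"text": text}) + "\n")

count, avg = chunk_stats("chunks_sample.jsonl")
print(f"{count} chunks, avg {avg:.1f} chars")  # → 3 chunks, avg 6.7 chars
```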
Supported Formats
| Format | Extension | Method |
|---|---|---|
| CSV | .csv | Direct processing |
| Text | .txt | Direct processing |
| JSONL | .jsonl | Direct processing |
| JSON | .json | Auto-flattening |
| PDF | .pdf | pdfplumber extraction |
| Word | .docx | python-docx extraction |
| Excel | .xlsx | openpyxl extraction |
| XML | .xml | ElementTree parsing |
| URLs | http:// | BeautifulSoup scraping |
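The "auto-flattening" entry for JSON means nested objects are flattened into plain key paths before chunking. Krira's exact scheme isn't documented here; a common approach looks like this (hypothetical sketch, not the library's implementation):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into dotted key-path pairs."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), obj

doc = {"user": {"name": "Ada", "tags": ["x", "y"]}}
print(dict(flatten(doc)))
# → {'user.name': 'Ada', 'user.tags.0': 'x', 'user.tags.1': 'y'}
```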
Provider Comparison
| Provider | Cost | Streaming |
|---|---|---|
| OpenAI + Pinecone | Paid | ✓ |
| OpenAI + Qdrant | Paid | ✓ |
| SentenceTransformers + ChromaDB | FREE | ✓ |
| Hugging Face + FAISS | FREE | ✓ |
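The free Hugging Face + FAISS row has no example above. At its core, a flat FAISS index is brute-force nearest-neighbour search over stored vectors; a dependency-free sketch of the same idea (toy 2-D vectors; a real setup would use `faiss.IndexFlatIP` with model embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, vectors, k=2):
    """Brute-force search: conceptually what a flat FAISS index does."""
    scored = sorted(enumerate(vectors),
                    key=lambda iv: cosine(query, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], vectors))  # → [0, 2]
```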
Development
Open Source on GitHub
github.com/Krira-Labs/krira-chunker
Local Development
```bash
# Clone the repository
git clone https://github.com/Krira-Labs/krira-chunker.git
cd krira-chunker

# Install Maturin (Rust-Python build tool)
pip install maturin

# Build and install locally
maturin develop
```