About ILM

What is ILM?

The Insight Language Machine is a diachronic lexical knowledge graph for Malayalam — the language of Kerala, spoken by 38 million people.

ILM contains 136,629 lemmas and 122,230 senses, spanning classical Malayalam to contemporary usage. It provides glosses in 17 languages and is designed for AI training, machine translation, and computational linguistics.

Architecture

ILM is structured as a concept-centric knowledge graph comparable to WordNet and ConceptNet:

  • Lemma layer — orthographic forms, POS, frequency, script variants
  • Sense layer — definitions (ML + EN), domain, register, examples
  • Gloss layer — multilingual translations across 17 languages
  • Etymology layer — Gundert 1872, DED Dravidian roots, loanword tracking
  • Corpus layer — frequency data from 13 Malayalam text corpora
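As a rough illustration of how the five layers could fit together, the sketch below models one entry as nested Python records. It is a simplification under stated assumptions, not the ILM schema: every class name, field name, and helper function (`Lemma`, `Sense`, `Etymology`, `corpus_counts`, `glosses_for`) is hypothetical, and the corpus counts in the sample entry are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sense:
    """Sense layer: definitions in Malayalam and English plus usage metadata."""
    definition_ml: str
    definition_en: str
    domain: Optional[str] = None
    register: Optional[str] = None
    examples: List[str] = field(default_factory=list)
    # Gloss layer (hypothetical shape): language code -> translation.
    glosses: Dict[str, str] = field(default_factory=dict)


@dataclass
class Etymology:
    """Etymology layer: where the analysis comes from and what it says."""
    source: str                      # e.g. "Gundert 1872" or "DED"
    root: Optional[str] = None       # reconstructed root, if any
    loan_from: Optional[str] = None  # donor language for loanwords


@dataclass
class Lemma:
    """Lemma layer, with links to the other layers."""
    form: str
    pos: str
    script_variants: List[str] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)
    etymology: Optional[Etymology] = None
    # Corpus layer (hypothetical shape): corpus id -> raw occurrence count.
    corpus_counts: Dict[str, int] = field(default_factory=dict)

    def total_frequency(self) -> int:
        """Sum the occurrence counts over all corpora."""
        return sum(self.corpus_counts.values())


def glosses_for(lemma: Lemma, lang: str) -> List[str]:
    """Collect the translations of every sense that has a gloss in `lang`."""
    return [s.glosses[lang] for s in lemma.senses if lang in s.glosses]


# Sample entry; the corpus ids and counts are placeholders, not ILM data.
example = Lemma(
    form="വെള്ളം",
    pos="noun",
    senses=[
        Sense(
            definition_ml="ജലം",
            definition_en="water",
            glosses={"hi": "पानी", "ta": "தண்ணீர்"},
        )
    ],
    corpus_counts={"corpus_01": 12, "corpus_02": 30},
)

print(glosses_for(example, "hi"))
print(example.total_frequency())
```

Nesting senses under a lemma keeps the example short. A concept-centric graph would more likely store senses as shared nodes that several lemmas point to, so treat the nesting here as a convenience of the sketch.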

Sources

Malayalam Wiktionary dump
Gundert Dictionary (1872)
Malayalam Lexicon
Shabdatharavali
Aithihyamala citations
DED Dravidian Etymology
Kittel (Kannada)
Monier-Williams (Sanskrit)
13 Malayalam PDF corpora
Kerala Sahitya Charithram

Team

Sumesh

Director, Insight Publica — Project Lead

Likhitha

QC Lead

Aakash

Data Pipeline

Ismail

Data Pipeline

Nihal

Data Pipeline

License & Citation

CC BY-SA 4.0

@dataset{ilm2026,
  title  = {Insight Language Machine: A Diachronic Malayalam Lexical Knowledge Graph},
  author = {{Insight Publica}},
  year   = {2026},
  url    = {https://huggingface.co/datasets/insightpublica/ilm}
}