Package 'tipitaka.critical' reference manual

Title:	Lemmatized Critical Edition of the Pali Canon
Description:	A lemmatized critical edition of the complete Pali Canon (Tipitaka), the canonical scripture of Theravadin Buddhism. Based on a five-witness collation of the Pali Text Society (PTS) edition (via 'GRETIL'), 'SuttaCentral', the Vipassana Research Institute (VRI) Chattha Sangayana edition, the Buddha Jayanti Tipitaka (BJT), and the Thai Royal Edition. All text is lemmatized using the 'Digital Pali Dictionary', grouping inflected forms by dictionary headword. Covers all three pitakas (Sutta, Vinaya, Abhidhamma) with 5,777 individual text units. The companion package 'tipitaka' provides the original VRI edition data and Pali text tools. For background on the collation method, see Zigmond (2026) <https://github.com/dangerzig/tipitaka.critical>.
Authors:	Dan Zigmond [aut, cre]
Maintainer:	Dan Zigmond <[email protected]>
License:	CC0
Version:	1.0.0
Built:	2026-06-21 10:30:16 UTC
Source:	https://github.com/dangerzig/tipitaka.critical

Document-Term Matrix (Sparse)

Description

Sparse document-term matrix computed from lemmas. Each row is a text unit, each column is a lemma, and values are frequencies (proportions). Stored as a dgCMatrix from the Matrix package. Computed on first access.

Usage

dtm
dtm

Format

A sparse matrix of class dgCMatrix with text unit IDs as row names and lemma headwords as column names.

Examples


# Sparse document-term matrix
dim(dtm)

# Hierarchical clustering of text units
d <- dist(dtm[1:20, ])
plot(hclust(d))


# Sparse document-term matrix
dim(dtm)

# Hierarchical clustering of text units
d <- dist(dtm[1:20, ])
plot(hclust(d))

Lemma Frequency Table

Description

Lemma frequency table computed from the lemmatized text. Tokenizes texts$text_lemmatized and counts word frequencies per text unit on first access (~5 seconds).

Usage

lemmas
lemmas

Format

A data frame with the variables:

word: Lemma (dictionary headword)
n: Count of this lemma in this text unit
total: Total lemma tokens in this text unit
freq: Frequency (n/total)
id: Text unit ID
collection: Collection code
pitaka: Pitaka name

Examples


# First access triggers computation (~5 seconds)
head(lemmas)

# Most frequent lemmas across the entire canon
totals <- tapply(lemmas$n, lemmas$word, sum)
head(sort(totals, decreasing = TRUE), 20)


# First access triggers computation (~5 seconds)
head(lemmas)

# Most frequent lemmas across the entire canon
totals <- tapply(lemmas$n, lemmas$word, sum)
head(sort(totals, decreasing = TRUE), 20)

Search for Lemma Occurrences

Description

Finds all text units containing a specific lemma, sorted by frequency (most frequent first).

Usage

search_lemma(lemma)
search_lemma(lemma)

Arguments

lemma

Character string of the lemma to search for.

Value

A data frame of occurrences with columns: word, n, total, freq, id, collection, pitaka. Returns an empty data frame if the lemma is not found.

Examples


# Find texts mentioning "nibbana"
nibbana <- search_lemma("nibbana")
head(nibbana)

# Find texts mentioning "dhamma"
dhamma <- search_lemma("dhamma")
head(dhamma[, c("id", "collection", "n", "freq")])


# Find texts mentioning "nibbana"
nibbana <- search_lemma("nibbana")
head(nibbana)

# Find texts mentioning "dhamma"
dhamma <- search_lemma("dhamma")
head(dhamma[, c("id", "collection", "n", "freq")])

Full Text of the Pali Canon (Critical Edition)

Description

Surface-form and lemmatized text for every text unit in the Tipitaka. This is the only dataset shipped with the package; all other data is computed on demand from this text.

Usage

texts
texts

Format

A data frame with 5,777 rows and 6 columns:

id: Text unit ID (e.g., "dn1", "mn1", "sn1.1", "mahavagga")
collection: Collection code (dn, mn, sn, an, kn, vinaya, abhidhamma)
pitaka: Pitaka name (sutta, vinaya, abhidhamma)
title: Pali title of the text
text: Full surface-form Pali text
text_lemmatized: Same text with each word replaced by its lemma headword

Source

Critical edition based on five-witness collation of PTS/GRETIL, SuttaCentral, VRI (Chattha Sangayana), Buddha Jayanti Tipitaka (BJT), and Thai Royal Edition. Lemmatization via the Digital Pali Dictionary.

Examples

# Number of text units per pitaka
table(texts$pitaka)

# Get text of the Brahmajala Sutta (DN 1)
dn1 <- texts[texts$id == "dn1", ]
cat(substr(dn1$text, 1, 200), "...\n")

# Number of text units per pitaka
table(texts$pitaka)

# Get text of the Brahmajala Sutta (DN 1)
dn1 <- texts[texts$id == "dn1", ]
cat(substr(dn1$text, 1, 200), "...\n")

tipitaka.critical: Lemmatized Critical Edition of the Pali Canon

Description

A lemmatized critical edition of the complete Pali Canon (Tipitaka) based on a five-witness collation with the Digital Pali Dictionary.

Details

This package ships the full text data (texts) and computes derived data on first access:

lemmas: lemma frequency table
dtm: sparse document-term matrix
search_lemma: search for a lemma across all texts

For the original VRI edition and Pali text tools, see the companion package tipitaka.

Author(s)

Maintainer: Dan Zigmond [email protected]

Package 'tipitaka.critical'

Help Index

Document-Term Matrix (Sparse)

Description

Usage

Format

Examples

Lemma Frequency Table

Description

Usage

Format

Examples

Search for Lemma Occurrences

Description

Usage

Arguments

Value

Examples

Full Text of the Pali Canon (Critical Edition)

Description

Usage

Format

Source

Examples

tipitaka.critical: Lemmatized Critical Edition of the Pali Canon

Description

Details

Author(s)

See Also