| Title: | Lemmatized Critical Edition of the Pali Canon |
|---|---|
| Description: | A lemmatized critical edition of the complete Pali Canon (Tipitaka), the canonical scripture of Theravadin Buddhism. Based on a five-witness collation of the Pali Text Society (PTS) edition (via 'GRETIL'), 'SuttaCentral', the Vipassana Research Institute (VRI) Chattha Sangayana edition, the Buddha Jayanti Tipitaka (BJT), and the Thai Royal Edition. All text is lemmatized using the 'Digital Pali Dictionary', grouping inflected forms by dictionary headword. Covers all three pitakas (Sutta, Vinaya, Abhidhamma) with 5,777 individual text units. The companion package 'tipitaka' provides the original VRI edition data and Pali text tools. For background on the collation method, see Zigmond (2026) <https://github.com/dangerzig/tipitaka.critical>. |
| Authors: | Dan Zigmond [aut, cre] |
| Maintainer: | Dan Zigmond <[email protected]> |
| License: | CC0 |
| Version: | 1.0.0 |
| Built: | 2026-05-22 18:51:50 UTC |
| Source: | https://github.com/dangerzig/tipitaka.critical |
Sparse document-term matrix computed from lemmas.
Each row is a text unit, each column is a lemma, and values are
frequencies (proportions). Stored as a dgCMatrix from the
Matrix package. Computed on first access.
dtm
A sparse matrix of class dgCMatrix
with text unit IDs as row names and lemma headwords as column names.
    # Sparse document-term matrix
    dim(dtm)

    # Hierarchical clustering of text units
    d <- dist(dtm[1:20, ])
    plot(hclust(d))
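As a rough illustration of the structure described above, the following sketch builds a small sparse document-term matrix of proportions from a long-format frequency table. It is not the package's own code: the table `toy`, its values, and the name `toy_dtm` are invented for the example, and the only library call assumed is `Matrix::sparseMatrix()`.

```r
library(Matrix)

# Toy long-format table standing in for a lemma frequency table.
# All IDs and values are invented for illustration.
toy <- data.frame(
  id   = c("u1", "u1", "u2", "u2", "u2"),
  word = c("dhamma", "buddha", "dhamma", "sangha", "buddha"),
  freq = c(0.75, 0.25, 0.5, 0.25, 0.25),
  stringsAsFactors = FALSE
)

ids   <- unique(toy$id)          # one row per text unit
words <- sort(unique(toy$word))  # one column per lemma

# Build the sparse matrix from (row, column, value) triplets
toy_dtm <- sparseMatrix(
  i = match(toy$id, ids),
  j = match(toy$word, words),
  x = toy$freq,
  dims = c(length(ids), length(words)),
  dimnames = list(ids, words)
)

class(toy_dtm)    # compressed sparse column storage
rowSums(toy_dtm)  # proportions within each text unit sum to 1
```

Lemmas that do not occur in a text unit are simply not stored, which is why a sparse representation suits a matrix with thousands of rows and a large lemma vocabulary.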
Lemma frequency table computed from the lemmatized text.
Tokenizes texts$text_lemmatized and counts word
frequencies per text unit on first access (~5 seconds).
lemmas
A data frame with the variables:
word: Lemma (dictionary headword)
n: Count of this lemma in this text unit
total: Total lemma tokens in this text unit
freq: Frequency (n/total)
id: Text unit ID
collection: Collection code
pitaka: Pitaka name
    # First access triggers computation (~5 seconds)
    head(lemmas)

    # Most frequent lemmas across the entire canon
    totals <- tapply(lemmas$n, lemmas$word, sum)
    head(sort(totals, decreasing = TRUE), 20)
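The tokenize-and-count step described above can be sketched in base R on toy data. This is only an illustration under stated assumptions: it assumes lemmas are separated by whitespace, and `toy_texts`, `count_unit`, and `toy_lemmas` are hypothetical names, not part of the package.

```r
# Toy stand-in for the lemmatized text column (invented content)
toy_texts <- data.frame(
  id = c("u1", "u2"),
  text_lemmatized = c("buddha dhamma dhamma", "sangha dhamma"),
  stringsAsFactors = FALSE
)

# Count lemmas in one text unit, assuming whitespace-separated tokens
count_unit <- function(id, text) {
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  counts <- table(tokens)
  data.frame(
    word  = names(counts),
    n     = as.integer(counts),
    total = length(tokens),
    freq  = as.integer(counts) / length(tokens),
    id    = id,
    stringsAsFactors = FALSE
  )
}

# Apply to every text unit and stack the per-unit tables
toy_lemmas <- do.call(
  rbind,
  Map(count_unit, toy_texts$id, toy_texts$text_lemmatized)
)
rownames(toy_lemmas) <- NULL
toy_lemmas
```

Each row of the result pairs one lemma with one text unit, so the `freq` values within a unit sum to 1.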
Finds all text units containing a specific lemma, sorted by frequency (most frequent first).
search_lemma(lemma)
lemma: Character string of the lemma to search for.
A data frame of occurrences with columns: word, n, total, freq, id, collection, pitaka. Returns an empty data frame if the lemma is not found.
    # Find texts mentioning "nibbana"
    nibbana <- search_lemma("nibbana")
    head(nibbana)

    # Find texts mentioning "dhamma"
    dhamma <- search_lemma("dhamma")
    head(dhamma[, c("id", "collection", "n", "freq")])
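A lookup of this kind amounts to filtering the frequency table and ordering the hits. The sketch below shows one way it might work on toy data; `toy_search` and `toy_lemmas` are hypothetical names, and sorting on `freq` rather than on the raw count `n` is an assumption, since the description does not say which is used.

```r
# Toy frequency table (invented values)
toy_lemmas <- data.frame(
  word  = c("dhamma", "buddha", "dhamma", "sangha"),
  n     = c(2L, 1L, 1L, 1L),
  total = c(3L, 3L, 2L, 2L),
  id    = c("u1", "u1", "u2", "u2"),
  stringsAsFactors = FALSE
)
toy_lemmas$freq <- toy_lemmas$n / toy_lemmas$total

# Keep rows for the requested lemma, highest relative frequency first.
# Assumption: ordering is by `freq`, not by raw count `n`.
toy_search <- function(tbl, lemma) {
  hits <- tbl[tbl$word == lemma, , drop = FALSE]
  hits[order(hits$freq, decreasing = TRUE), , drop = FALSE]
}

toy_search(toy_lemmas, "dhamma")         # u1 first, then u2
nrow(toy_search(toy_lemmas, "nibbana"))  # no match gives zero rows
```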
Surface-form and lemmatized text for every text unit in the Tipitaka. This is the only dataset shipped with the package; all other data is computed on demand from this text.
texts
A data frame with 5,777 rows and 6 columns:
Text unit ID (e.g., "dn1", "mn1", "sn1.1", "mahavagga")
Collection code (dn, mn, sn, an, kn, vinaya, abhidhamma)
Pitaka name (sutta, vinaya, abhidhamma)
Pali title of the text
Full surface-form Pali text
Same text with each word replaced by its lemma headword
Critical edition based on five-witness collation of PTS/GRETIL, SuttaCentral, VRI (Chattha Sangayana), Buddha Jayanti Tipitaka (BJT), and Thai Royal Edition. Lemmatization via the Digital Pali Dictionary.
    # Number of text units per pitaka
    table(texts$pitaka)

    # Get text of the Brahmajala Sutta (DN 1)
    dn1 <- texts[texts$id == "dn1", ]
    cat(substr(dn1$text, 1, 200), "...\n")
A lemmatized critical edition of the complete Pali Canon (Tipitaka), based on a five-witness collation and lemmatized with the Digital Pali Dictionary.
This package ships the full text data (texts) and
computes derived data on first access:
lemmas: lemma frequency table
dtm: sparse document-term matrix
It also provides one function:
search_lemma: search for a lemma across all texts
For the original VRI edition and Pali text tools, see the companion package tipitaka.
Maintainer: Dan Zigmond [email protected]
Useful links:
Report bugs at https://github.com/dangerzig/tipitaka.critical/issues