Title: | A Text Mining Toolkit for Chinese Document |
---|---|
Description: | The CTM package provides text mining tools designed specifically for Chinese documents. |
Authors: | Jim Liu, Quan Gu |
Maintainer: | Jim Liu <[email protected]> |
License: | GPL-3 |
Version: | 0.2 |
Built: | 2025-02-25 05:00:55 UTC |
Source: | https://github.com/cran/CTM |
Constructs Document-Term Matrix from Chinese Text Documents.
CDTM(doc, weighting, EngTermDeleted = TRUE, NumTermDeleted = TRUE, shortTermDeleted = TRUE)
doc | The Chinese text documents. A vector of Chinese strings. |
weighting | Weighting scheme for the matrix entries: one of binary, count, tf, or tfidf. See details. |
EngTermDeleted | If TRUE, remove English terms from the text documents. |
NumTermDeleted | If TRUE, remove numbers from the text documents. |
shortTermDeleted | If TRUE, delete short terms, i.e. those with nchar < 2. |
This function runs Chinese word segmentation with jiebaR and builds a document-term matrix. Four weighting schemes are available: "binary" sets an entry to 1 if the term occurs in the document, "count" is the number of times the term occurs in the document, "tf" is the term frequency, and "tfidf" is the term frequency-inverse document frequency.
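The four weighting schemes can be illustrated on a toy corpus in base R. This is a minimal sketch, not the package's implementation: whitespace splitting stands in for jiebaR segmentation, and the formulas used here (tf as count divided by document length, idf as the natural log of N/df) are assumptions, since the exact normalization and log base are not stated above.

```r
# Toy corpus; whitespace splitting stands in for jiebaR segmentation (assumption).
docs <- c(d1 = "hello taiwan", d2 = "world of tank",
          d3 = "taiwan weather", d4 = "local weather")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# "count": documents in rows, terms in columns.
count <- t(vapply(tokens,
                  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                  integer(length(vocab))))
colnames(count) <- vocab

# "binary": 1 if the term occurs in the document, 0 otherwise.
binary <- (count > 0) * 1L

# "tf": count divided by the number of terms in the document (assumed definition).
tf <- count / rowSums(count)

# "tfidf": tf scaled by log(N / df), with df the number of documents
# containing the term (assumed definition; natural log chosen here).
df    <- colSums(count > 0)
idf   <- log(nrow(count) / df)
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))
```

Terms that occur in several documents, such as "taiwan" and "weather", receive a smaller idf than terms confined to a single document, so they are down-weighted under "tfidf".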
Jim Liu, Quan Gu
library(CTM)
# Four short sample documents
a1 <- "hello taiwan"
b1 <- "world of tank"
c1 <- "taiwan weather"
d1 <- "local weather"
text1 <- t(data.frame(a1, b1, c1, d1))
# Build a tf-idf weighted document-term matrix, keeping English and short terms
dtm1 <- CDTM(doc = text1, weighting = "tfidf", EngTermDeleted = FALSE, shortTermDeleted = FALSE)
Constructs Term-Document Matrix from Chinese Text Documents.
CTDM(doc, weighting, EngTermDeleted = TRUE, NumTermDeleted = TRUE, shortTermDeleted = TRUE)
doc | The Chinese text documents. A vector of Chinese strings. |
weighting | Weighting scheme for the matrix entries: one of binary, count, tf, or tfidf. See details. |
EngTermDeleted | If TRUE, remove English terms from the text documents. |
NumTermDeleted | If TRUE, remove numbers from the text documents. |
shortTermDeleted | If TRUE, delete short terms, i.e. those with nchar < 2. |
This function runs Chinese word segmentation with jiebaR and builds a term-document matrix. Four weighting schemes are available: "binary" sets an entry to 1 if the term occurs in the document, "count" is the number of times the term occurs in the document, "tf" is the term frequency, and "tfidf" is the term frequency-inverse document frequency.
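A term-document matrix holds the same values as a document-term matrix with the axes swapped: terms in rows, documents in columns. The base R sketch below shows that orientation on two toy documents. It assumes whitespace tokenization in place of jiebaR and plain counts as the weighting, and it does not reproduce the object class that CTDM returns.

```r
# Two toy documents; whitespace splitting stands in for jiebaR (assumption).
docs   <- c(d1 = "taiwan weather", d2 = "local weather")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Document-term layout: one row per document, one column per term.
dtm <- t(vapply(tokens,
                function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                integer(length(vocab))))
colnames(dtm) <- vocab

# Term-document layout: the transpose, one row per term.
tdm <- t(dtm)

print(dim(dtm))  # documents x terms
print(dim(tdm))  # terms x documents
print(tdm)
```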
Jim Liu, Quan Gu
library(CTM)
# Four short sample documents
a1 <- "hello taiwan"
b1 <- "world of tank"
c1 <- "taiwan weather"
d1 <- "local weather"
text1 <- t(data.frame(a1, b1, c1, d1))
# Build a tf-idf weighted term-document matrix, keeping English and short terms
tdm1 <- CTDM(doc = text1, weighting = "tfidf", EngTermDeleted = FALSE, shortTermDeleted = FALSE)
Computes Term Counts from Text Documents.
termCount(doc, EngTermDeleted = TRUE, NumTermDeleted = TRUE, shortTermDeleted = TRUE)
doc | The Chinese text documents. |
EngTermDeleted | If TRUE, remove English terms from the text documents. |
NumTermDeleted | If TRUE, remove numbers from the text documents. |
shortTermDeleted | If TRUE, delete short terms, i.e. those with nchar < 2. |
This function runs Chinese word segmentation with jiebaR and computes the count of each term across all of the text documents.
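The idea of a corpus-wide term count can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: whitespace splitting replaces jiebaR segmentation, the nchar filter only mimics what shortTermDeleted = TRUE is described as doing, and the structure of the value returned by termCount may differ.

```r
# Toy corpus; whitespace splitting stands in for jiebaR (assumption).
docs  <- c("hello taiwan", "world of tank", "taiwan weather", "local weather")
terms <- unlist(strsplit(docs, " ", fixed = TRUE))

# Mimic shortTermDeleted = TRUE: drop terms shorter than two characters.
terms <- terms[nchar(terms) >= 2]

# Count each term over the whole corpus, most frequent first.
counts <- sort(table(terms), decreasing = TRUE)
print(counts)
```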
Jim Liu
library(CTM)
# Four short sample documents
a1 <- "hello taiwan"
b1 <- "world of tank"
c1 <- "taiwan weather"
d1 <- "local weather"
text1 <- t(data.frame(a1, b1, c1, d1))
# Count terms across all documents, keeping English and short terms
count1 <- termCount(doc = text1, EngTermDeleted = FALSE, shortTermDeleted = FALSE)