All levels of a factor in a model matrix in R

optionbox 2020. 11. 30. 08:04

I have a data.frame consisting of numeric and factor variables, as shown below.

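# stringsAsFactors = TRUE is needed on R >= 4.0.0, where strings are no longer
# converted to factors by default; the contrasts() calls below require factors.
testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
           Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
           stringsAsFactors=TRUE)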

I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.

model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

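As expected, and just as when running lm, this leaves out one level of each factor as the reference level. However, I want to build a matrix with a dummy/indicator variable for every level of every factor. I am building this matrix for glmnet, so I am not worried about multicollinearity.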

Is there a way to have model.matrix create a dummy for every level of each factor?
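To make the behavior described in the question concrete, here is a minimal sketch using a reduced version of testFrame (three columns only, with a fixed seed; both are assumptions made for brevity). It shows that the default treatment contrasts drop the first level of each factor.

```r
# Reduced, reproducible version of the question's data (an assumption for brevity).
set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# Default contrasts: one level per factor is absorbed into the intercept.
mm_default <- model.matrix(~ First + Fourth + Fifth, data = testFrame)
colnames(mm_default)

# The reference levels have no column of their own.
setdiff(paste0("Fourth", levels(testFrame$Fourth)), colnames(mm_default))
setdiff(paste0("Fifth", levels(testFrame$Fifth)), colnames(mm_default))
```

The two setdiff() calls list the missing indicator columns, which are the ones the answers below restore.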

You need to reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data=testFrame, 
        contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F), 
                Fifth=contrasts(testFrame$Fifth, contrasts=F)))

or, with a little less typing and without the proper column names:

model.matrix(~ Fourth + Fifth, data=testFrame, 
    contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), 
            Fifth=diag(nlevels(testFrame$Fifth))))
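As a rough illustration of what "without the proper names" means, the sketch below compares the column names produced by the two variants. The single-factor data frame is an assumption made to keep the example short.

```r
# Small illustrative data set (an assumption; any factor column would do).
df <- data.frame(Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)))

# Variant 1: full-rank contrasts built by contrasts(); level names are kept.
mm_named <- model.matrix(~ Fourth, data = df,
  contrasts.arg = list(Fourth = contrasts(df$Fourth, contrasts = FALSE)))

# Variant 2: a bare identity matrix; it has no column names, so the
# indicator columns are numbered instead of being named after the levels.
mm_diag <- model.matrix(~ Fourth, data = df,
  contrasts.arg = list(Fourth = diag(nlevels(df$Fourth))))

colnames(mm_named)
colnames(mm_diag)

# The numeric content is the same; only the labels differ.
all(unname(mm_named) == unname(mm_diag))
```

If readable column names matter downstream (for example when inspecting glmnet coefficients), the first variant is probably the more convenient one.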

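(Trying to redeem myself...) In response to Jared's comment on @Fabians' answer about automating it: all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrast matrix from it. We can therefore use lapply() to run contrasts() on each factor in the data set, e.g. for the testFrame example provided: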

> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
        Alice Bob Charlie David
Alice       1   0       0     0
Bob         0   1       0     0
Charlie     0   0       1     0
David       0   0       0     1

$Fifth
        Edward Frank Georgia Hank Isaac
Edward       1     0       0    0     0
Frank        0     1       0    0     0
Georgia      0     0       1    0     0
Hank         0     0       0    1     0
Isaac        0     0       0    0     1

This slots nicely into @fabians' answer:

model.matrix(~ ., data=testFrame, 
             contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))
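The call above hard-codes the factor columns as 4:5. A minimal sketch of one way to pick them up automatically is shown below; the reduced testFrame, the fixed seed, and the variable names factor_cols and contr_list are assumptions introduced for illustration.

```r
# Reduced, reproducible version of the example data (an assumption for brevity).
set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# Find the factor columns by type instead of by position.
factor_cols <- names(testFrame)[sapply(testFrame, is.factor)]

# Indexing with [factor_cols] keeps a data frame even for a single column,
# so lapply() always returns a named list of contrast matrices.
contr_list <- lapply(testFrame[factor_cols], contrasts, contrasts = FALSE)

mm_full <- model.matrix(~ ., data = testFrame, contrasts.arg = contr_list)
colnames(mm_full)

# Each row has exactly one indicator set per factor.
table(rowSums(mm_full[, grepl("^Fourth", colnames(mm_full))]))
```

Because every factor now has a full set of indicators alongside the intercept, the matrix is rank deficient, which is acceptable for penalized fits such as glmnet but not for an unpenalized lm.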

caret implements a nice function, dummyVars, that achieves this in two lines:

library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Checking the final columns:

colnames(testFrame2)

"First"  "Second"         "Third"          "Fourth.Alice"   "Fourth.Bob"     "Fourth.Charlie" "Fourth.David"   "Fifth.Edward"   "Fifth.Frank"   "Fifth.Georgia"  "Fifth.Hank"     "Fifth.Isaac"

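The nicest point here is that you get the original data frame plus the dummy variables, with the original factor columns used for the transformation excluded.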

More info: http://amunategui.github.io/dummyVar-Walkthrough/

dummyVars from caret can also be used. http://caret.r-forge.r-project.org/preprocess.html

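OK. Just reading the above and putting it all together. Suppose you want a matrix, e.g. 'X.factors', that is multiplied by your coefficient vector to get your linear predictor. There are still a couple of extra steps: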

X.factors = 
  model.matrix( ~ ., data=X, contrasts.arg = 
    lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
                                             contrasts, contrasts = FALSE))

(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)

Then say you get something like this:

attr(X.factors,"assign")
[1]  0  1  **2**  2  **3**  3  3  **4**  4  4  5  6  7  8  9 10 #emphasis added

We want to get rid of the **'d reference levels of each factor:

att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))

Using the R package 'CatEncoders':

library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
           Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))

fit <- OneHotEncoder.fit(testFrame)

z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output

I am currently learning the Lasso model along with glmnet::cv.glmnet(), model.matrix(), and Matrix::sparse.model.matrix() (for high-dimensional matrices, using model.matrix will kill our time, as suggested by the author of glmnet).
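For the sparse route mentioned above, here is a hedged sketch that passes the same kind of full-rank contrast list to Matrix::sparse.model.matrix(). It assumes the Matrix package (a recommended package normally installed with R) is available; the reduced testFrame and the name contr_list are assumptions made for illustration.

```r
library(Matrix)

# Reduced, reproducible version of the example data (an assumption for brevity).
set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# Full-rank (all levels) contrast matrices for every factor column.
is_fac <- sapply(testFrame, is.factor)
contr_list <- lapply(testFrame[is_fac], contrasts, contrasts = FALSE)

# Dense and sparse design matrices built from the same formula and contrasts.
dense_mm  <- model.matrix(~ ., data = testFrame, contrasts.arg = contr_list)
sparse_mm <- sparse.model.matrix(~ ., data = testFrame, contrasts.arg = contr_list)

dim(dense_mm)
dim(sparse_mm)
class(sparse_mm)
```

With many factor levels most entries are zero, so the sparse form should use far less memory; glmnet accepts sparse matrices of this kind as its x argument.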

Just sharing that there is a tidy way to get the same result as @fabians' and @Gavin's answers. Meanwhile, @asdf123 introduced another package, 'CatEncoders', as well.

> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
> 
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (page 273)

A tidyverse answer:

library(dplyr)
library(tidyr)
result <- testFrame %>% 
    mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>% 
    mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

yields the desired result (same as @Gavin Simpson's answer):

> head(result, 6)
  First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1     1      5     4           0         0             1           0           0          1            0         0          0
2     1     14    10           0         0             0           1           0          0            1         0          0
3     2      2     9           0         1             0           0           1          0            0         0          0
4     2      5     4           0         0             0           1           0          1            0         0          0
5     2     13     5           0         0             1           0           1          0            0         0          0
6     2     15     7           1         0             0           0           1          0            0         0          0

model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)

model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)

should be the most straightforward

A stats package answer:

new_tr <- model.matrix(~.+0,data = testFrame)

Adding +0 (or -1) to a model formula (e.g., in lm()) in R suppresses the intercept.
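One caveat is worth checking before relying on the intercept-free formula: with default contrasts, dropping the intercept gives a full set of indicators only for the first factor in the formula, while later factors still lose their reference level. A minimal sketch on a reduced testFrame (an assumption for brevity):

```r
# Reduced, reproducible version of the example data (an assumption for brevity).
set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# No intercept: the first factor (Fourth) is expanded to all of its levels...
mm0 <- model.matrix(~ . + 0, data = testFrame)
colnames(mm0)

# ...but the second factor (Fifth) is still coded against a reference level.
"FourthAlice" %in% colnames(mm0)
"FifthEdward" %in% colnames(mm0)
```

So `+ 0` or `- 1` alone answers the question only when the formula contains a single factor; with several factors, the contrasts.arg approach is still needed.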

Please see

Reference URL: https://stackoverflow.com/questions/4560459/all-levels-of-a-factor-in-a-model-matrix-in-r
