고정 너비 텍스트 파일 읽기
이 추악한 형식의 데이터 세트를 내 R 세션에로드하려고합니다. http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for
Weekly SST data starts week centered on 3Jan1990
Nino1+2 Nino3 Nino34 Nino4
Week SST SSTA SST SSTA SST SSTA SST SSTA
03JAN1990 23.4-0.4 25.1-0.3 26.6 0.0 28.6 0.3
10JAN1990 23.4-0.8 25.2-0.3 26.6 0.1 28.6 0.3
17JAN1990 24.2-0.3 25.3-0.3 26.5-0.1 28.6 0.3
지금까지 나는 줄을 읽을 수 있습니다.
x = readLines(path)
그러나 파일은 '공백'과 '-'를 구분 기호로 섞어서 정규식 전문가가 아닙니다. 이것을 멋지고 깨끗한 R 데이터 프레임으로 바꾸는 데 도움을 주셔서 감사합니다. 감사!
이것은 고정 너비 파일입니다. 사용 read.fwf()
읽어 :
x <- read.fwf(
file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
skip=4,
widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))
head(x)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 03JAN1990 23.4 -0.4 25.1 -0.3 26.6 0.0 28.6 0.3
2 10JAN1990 23.4 -0.8 25.2 -0.3 26.6 0.1 28.6 0.3
3 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5 31JAN1990 25.1 -0.2 25.8 -0.2 26.7 0.1 28.4 0.2
6 07FEB1990 25.8 0.2 26.1 -0.1 26.8 0.1 28.4 0.3
최신 정보
패키지 readr
(2015 년 4 월 출시)는 간단하고 빠른 대안을 제공합니다.
library(readr)
x <- read_fwf(
file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
skip=4,
fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))
속도 비교 : readr::read_fwf()
보다 빠른 ~ 2 배를했다 utils::read.fwf ()
.
너비를 결정하는 또 다른 방법 ...
df <- read.fwf(
file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
widths=c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4),
skip=4
)
The -1 in the widths argument says there is a one-character column that should be ignored,the -5 in the widths argument says there is a five-character column that should be ignored, likewise...
ref : https://www.inkling.com/read/r-cookbook-paul-teetor-1st/chapter-4/recipe-4-6
First off, that question is directly from a the Coursera "Get Data and Clean It" course by Leeks. While there is another part of the question, the tough part is reading the file.
That said, the course is mostly intended for learning.
I hate R's fixed width procedure. It is slow and for large number of variables, it very quickly becomes a pain to negate certain columns, etc.
I think its easier to use readLines()
and then from that use substr()
to make your variables
x <- readLines(con=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"))
# Skip 4 lines
x <- x[-(1:4)]
mydata <- data.frame(var1 = substr(x, 1, 10),
var2 = substr(x, 16, 19),
var3 = substr(x, 20, 23),
var4 = substr(x, 29, 32) # and so on and so on
)
You can now use the read_fwf()
function in Hadley Wickham's readr
package.
- Annoucement: http://blog.rstudio.org/2015/04/09/readr-0-1-0/
- Development page: https://github.com/hadley/readr
- CRAN page: http://cran.r-project.org/web/packages/readr/index.html
A huge performance improvement is to be expected, compared to base read.fwf()
.
I document here the list of alternatives for reading fixed-width files in R, as well as providing some benchmarks for which is fastest.
My preferred approach is to combine fread
with stringi
; it's competitive as the fastest approach, and has the added benefit (IMO) of storing your data as a data.table
:
library(data.table)
library(stringi)
col_ends <-
list(beg = c(1, 10, 15, 19, 23, 28, 32, 36,
41, 45, 49, 54, 58),
end = c(9, 14, 18, 22, 27, 31, 35,
40, 44, 48, 53, 57, 61))
data = fread(
"http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
header = FALSE, skip = 4L, sep = NULL
)[, lapply(1:(length(col_ends$beg)),
function(ii)
stri_sub(V1, col_ends$beg[ii], col_ends$end[ii]))
][ , paste0("V", c(2, 5, 8, 11)) := NULL]
# V1 V3 V4 V6 V7 V9 V10 V12 V13
# 1: 03JAN1990 23.4 -0.4 25.1 -0.3 26.6 0.0 28.6 0.3
# 2: 10JAN1990 23.4 -0.8 25.2 -0.3 26.6 0.1 28.6 0.3
# 3: 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
# 4: 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
# 5: 31JAN1990 25.1 -0.2 25.8 -0.2 26.7 0.1 28.4 0.2
# ---
# 1365: 24FEB2016 27.1 0.9 28.4 1.8 29.0 2.1 29.5 1.4
# 1366: 02MAR2016 27.3 1.0 28.6 1.8 28.9 1.9 29.5 1.4
# 1367: 09MAR2016 27.7 1.2 28.6 1.6 28.9 1.8 29.6 1.5
# 1368: 16MAR2016 27.5 1.0 28.8 1.7 28.9 1.7 29.6 1.4
# 1369: 23MAR2016 27.2 0.9 28.6 1.4 28.8 1.5 29.5 1.2
Note that fread
automatically strips leading and trailing whitespace -- sometimes, this is undesirable, in which case set strip.white = FALSE
.
We could also have started with a vector of column widths ww
by doing:
ww <- c(9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4)
nd <- cumsum(ww)
col_ends <-
list(beg = c(1, nd[-length(nd)]+1L),
end = nd)
And we could have picked which columns to exclude more robustly by using negative indices like:
col_ends <-
list(beg = c(1, -10, 15, 19, -23, 28, 32, -36,
41, 45, -49, 54, 58),
end = c(9, 14, 18, 22, 27, 31, 35,
40, 44, 48, 53, 57, 61))
Then replace col_ends$beg[ii]
with abs(col_ends$beg[ii])
and in the next line:
paste0("V", which(col_ends$beg < 0))
Lastly, if you want the column names to be read programatically as well, you could clean up with readLines
:
cols <-
gsub("\\s", "",
sapply(1:(length(col_ends$beg)),
function(ii)
stri_sub(readLines(URL, n = 4L)[4L],
col_ends$beg[ii]+1L,
col_ends$end[ii]+1L)))
cols <- cols[cols != ""]
(note that combining this step with fread
would require creating a copy of the table in order to remove the header row, and would thus be inefficient for large data sets)
I don't know a thing about R, but I can provide you with a regex that will match such lines:
\s[0-9]{2}[A-Z]{3}[0-9]{4}(\s{5}[0-9]+\.[0-9]+[ -][0-9]+\.[0-9]+){4}
참고URL : https://stackoverflow.com/questions/14383710/read-fixed-width-text-file
'developer tip' 카테고리의 다른 글
iOS에서 문자열을 하위 문자열로 분할하는 방법은 무엇입니까? (0) | 2020.09.20 |
---|---|
TCP 옵션 SO_LINGER (영)-필요한 경우 (0) | 2020.09.20 |
키워드 'const'는 값을 변경 불가능하게 만들지 않습니다. (0) | 2020.09.20 |
WPF Textblock, Text 속성의 줄 바꿈 (0) | 2020.09.20 |
날짜 문자열 또는 개체의 NSArray 정렬 (0) | 2020.09.20 |