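I received a batch of data in Parquet format. After some searching, I found that the read_parquet function in the arrow package can read these files. I checked https://arrow.apache.org/docs/r/articles/install.html for installation instructions and found that there are no precompiled binaries for Linux, so installation differs from other systems. I first followed Method 1: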
install.packages("arrow", repos = "https://packagemanager.rstudio.com/all/__linux__/bionic/latest")
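At first there was a brotli_ep error. I suspected a gcc/g++ version mismatch on the server; after I made the two versions consistent, this error no longer appeared.
Then came errors that I suspect are download failures, for example: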
[ 1%] Performing download step (download, verify and extract) for 're2_ep'
CMake Error at utf8proc_ep-stamp/utf8proc_ep-download-RELEASE.cmake:37 (message):
Command failed: 1
'/usr/local/bin/cmake' '-Dmake=' '-Dconfig=' '-P' '/tmp/Rtmp2YramV/filec14344f44c6c/utf8proc_ep-prefix/src/utf8proc_ep-stamp/utf8proc_ep-download-RELEASE-impl.cmake'
See also
/tmp/Rtmp2YramV/filec14344f44c6c/utf8proc_ep-prefix/src/utf8proc_ep-stamp/utf8proc_ep-download-*.log
-- stdout output is:
-- Downloading...
dst='/tmp/Rtmp2YramV/filec14344f44c6c/utf8proc_ep-prefix/src/v2.7.0.tar.gz'
timeout='none'
inactivity timeout='none'
-- Using src='https://github.com/JuliaStrings/utf8proc/archive/v2.7.0.tar.gz'
-- stderr output is:
CMake Error at utf8proc_ep-stamp/download-utf8proc_ep.cmake:170 (message):
Each download failed!
error: downloading 'https://github.com/JuliaStrings/utf8proc/archive/v2.7.0.tar.gz' failed
status_code: 35
status_string: "SSL connect error"
log:
--- LOG BEGIN ---
Trying 20.205.243.166:443...
Connected to github.com (20.205.243.166) port 443 (#0)
ALPN, offering h2
ALPN, offering http/1.1
Cipher selection:
ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
successfully set certificate verify locations:
CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
TLSv1.2 (OUT), TLS header, Certificate Status (22):
[5 bytes data]
TLSv1.2 (OUT), TLS handshake, Client hello (1):
[512 bytes data]
[5 bytes data]
TLSv1.2 (IN), TLS handshake, Server hello (2):
[70 bytes data]
[5 bytes data]
TLSv1.2 (IN), TLS handshake, Certificate (11):
[2454 bytes data]
[5 bytes data]
TLSv1.2 (IN), TLS handshake, Server key exchange (12):
[148 bytes data]
[5 bytes data]
TLSv1.2 (IN), TLS handshake, Server finished (14):
[4 bytes data]
[5 bytes data]
TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
[70 bytes data]
[5 bytes data]
TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
[1 bytes data]
[5 bytes data]
TLSv1.2 (OUT), TLS handshake, Finished (20):
[16 bytes data]
OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to github.com:443
Closing connection 0
--- LOG END ---
CMake Error at utf8proc_ep-stamp/utf8proc_ep-download-RELEASE-impl.cmake:9 (message):
Command failed (1):
'/usr/local/bin/cmake' '-P' '/tmp/Rtmp2YramV/filec14344f44c6c/utf8proc_ep-prefix/src/utf8proc_ep-stamp/download-utf8proc_ep.cmake'
CMake Error at utf8proc_ep-stamp/utf8proc_ep-download-RELEASE.cmake:47 (message):
Stopping after outputting logs.
CMakeFiles/utf8proc_ep.dir/build.make:97: recipe for target 'utf8proc_ep-prefix/src/utf8proc_ep-stamp/utf8proc_ep-download' failed
make[2]: *** [utf8proc_ep-prefix/src/utf8proc_ep-stamp/utf8proc_ep-download] Error 1
CMakeFiles/Makefile2:685: recipe for target 'CMakeFiles/utf8proc_ep.dir/all' failed
make[1]: *** [CMakeFiles/utf8proc_ep.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
Then I decided to switch to Method 1b - R source package with libarrow binary:
Sys.setenv("NOT_CRAN" = TRUE)
install.packages("arrow")
Then another download error:
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
trying URL 'https://github.com'
Error in download.file(from_url, to_file, quiet = quietly) :
cannot open URL 'https://github.com'
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
For a faster, more complete installation, set the environment variable NOT_CRAN=true before installing
See install vignette for details:
https://cran.r-project.org/web/packages/arrow/vignettes/install.html
*** Building with MAKEFLAGS= -j2
*** Building C++ library from source, but downloading thirdparty dependencies
is not possible, so this build will turn off all thirdparty features.
See install vignette for details:
https://cran.r-project.org/web/packages/arrow/vignettes/install.html
**** arrow with SOURCE_DIR='tools/cpp' BUILD_DIR='/tmp/RtmpPV9kWB/filee63d7da8ad9e' DEST_DIR='libarrow/arrow-8.0.0' CMAKE='/usr/local/bin/cmake' EXTRA_CMAKE_FLAGS=' -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE' CC='gcc' CXX='g++ -std=gnu++11' LDFLAGS='-L/usr/local/lib' ARROW_S3='OFF' ARROW_MIMALLOC='OFF' ARROW_JEMALLOC='OFF' ARROW_JSON='OFF' ARROW_PARQUET='OFF' ARROW_DATASET='OFF' ARROW_WITH_BROTLI='OFF' ARROW_WITH_BZ2='OFF' ARROW_WITH_LZ4='OFF' ARROW_WITH_SNAPPY='OFF' ARROW_WITH_ZLIB='OFF' ARROW_WITH_ZSTD='OFF' ARROW_WITH_RE2='OFF' ARROW_WITH_UTF8PROC='OFF'
This way the installation does succeed:
* DONE (arrow)
The downloaded source packages are in
‘/tmp/RtmpycB2I8/downloaded_packages’
But the result is basically unusable. When I try to read a file:
read_parquet("test.parquet", as_data_frame = FALSE)
Error in parquet___arrow___ArrowReaderProperties__Make(isTRUE(use_threads)) :
Cannot call parquet___arrow___ArrowReaderProperties__Make(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries.
# Check which features are available:
arrow_info()
Arrow package version: 8.0.0
Capabilities:
dataset FALSE
substrait FALSE
parquet FALSE
json FALSE
s3 FALSE
utf8proc FALSE
re2 FALSE
snappy FALSE
gzip FALSE
brotli FALSE
zstd FALSE
lz4 FALSE
lz4_frame FALSE
lzo FALSE
bz2 FALSE
jemalloc FALSE
mimalloc FALSE
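Not a single one is available.
The website also describes an offline installation method, so I tried building the package on my own computer and then copying it to the server: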
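The same check can be done programmatically before attempting to read anything. This is a minimal sketch, assuming `arrow_info()$capabilities` is the named logical vector shown in the printout above; `missing_capabilities()` is a hypothetical helper, not part of arrow.

```r
# Hypothetical helper: report which required features a build lacks.
# `capabilities` is a named logical vector, e.g. arrow::arrow_info()$capabilities.
missing_capabilities <- function(capabilities, required) {
  available <- names(capabilities)[capabilities]
  setdiff(required, available)
}

# Only query arrow if it is installed at all.
if (requireNamespace("arrow", quietly = TRUE)) {
  caps <- arrow::arrow_info()$capabilities
  missing <- missing_capabilities(caps, c("parquet", "snappy"))
  if (length(missing) > 0) {
    message("arrow was built without: ", paste(missing, collapse = ", "))
  }
}
```

With the build above, both `parquet` and `snappy` would be reported as missing, which matches the `read_parquet` failure.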
source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")
create_package_with_all_dependencies("my_arrow_pkg.tar.gz")
Downloading Arrow source file
trying URL 'https://mirrors.sustech.edu.cn/CRAN/src/contrib/arrow_8.0.0.tar.gz'
Content type 'application/octet-stream' length 4796875 bytes (4.6 MB)
downloaded 4.6 MB
Error: '\U' used without hex digits in character string starting ""C:\U"
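The `'\U' used without hex digits` message is what the R parser reports when a string literal contains a backslash followed by `U`, which is typical of Windows paths such as `C:\Users`. The sketch below only illustrates that mechanism and one possible workaround; it is not the code from install-arrow.R, and `to_forward_slashes()` and the example path are hypothetical.

```r
# A Windows-style path; in R source the backslashes must be doubled.
win_path <- "C:\\Users\\me\\arrow"

# Pasting the raw path into generated R code leaves single backslashes,
# so the parser sees \U and expects a Unicode escape.
bad_code <- paste0('"', win_path, '"')
res <- tryCatch(parse(text = bad_code), error = function(e) conditionMessage(e))
print(res)  # an error message rather than a parsed expression

# Hypothetical helper: switch to forward slashes, which R accepts on Windows.
to_forward_slashes <- function(path) gsub("\\", "/", path, fixed = TRUE)

good_code <- paste0('"', to_forward_slashes(win_path), '"')
fixed_path <- eval(parse(text = good_code))
print(fixed_path)
```

Doubling the backslashes before the path is written into code would work as well; forward slashes are simply easier to read.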
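The source function is unreliable and sometimes fails to connect, but that is easy to work around: download install-arrow.R and run it locally. Then the \U problem appears. That is also easy to fix: I split the function apart and ran it line by line, and found that the path was not handled correctly in the step that writes the download command. After adjusting it by hand, everything ran perfectly. After copying the result to the server: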
install.packages("arrow_8.0.0_with_deps.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))
Installing package into ‘/home/qqq/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
inferring 'repos = NULL' from 'pkgs'
* installing *source* package ‘arrow’ ...
file ‘configure.win’ is missing
** using staged installation
ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual'
* removing ‘/home/qqq/R/x86_64-pc-linux-gnu-library/4.1/arrow’
Warning message:
In install.packages("arrow_8.0.0_with_deps.tar.gz", dependencies = c("Depends", :
installation of package ‘arrow_8.0.0_with_deps.tar.gz’ had non-zero exit status
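The `'configure' exists but is not executable` error usually means the execute permission was lost when the tarball was created, which could plausibly happen if it was built on Windows. One possible workaround, sketched under that assumption, is to unpack the tarball on the server, restore the permission, and install from the directory. `make_executable()` is a hypothetical helper, the top-level directory is assumed to be named `arrow`, and other scripts inside the package may need the same treatment.

```r
# Hypothetical helper: restore the execute bit on a package's build scripts.
make_executable <- function(pkg_dir, scripts = c("configure", "cleanup")) {
  paths <- file.path(pkg_dir, scripts)
  paths <- paths[file.exists(paths)]
  Sys.chmod(paths, mode = "0755", use_umask = FALSE)
  invisible(paths)
}

tarball <- "arrow_8.0.0_with_deps.tar.gz"
if (file.exists(tarball)) {
  unpack_dir <- tempfile("arrow_src_")
  untar(tarball, exdir = unpack_dir)
  pkg_dir <- file.path(unpack_dir, "arrow")  # assumed top-level directory name
  make_executable(pkg_dir)
  # Installing from a directory; R package dependencies must already be present.
  install.packages(pkg_dir, repos = NULL, type = "source")
}
```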
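Still no luck. I can easily install and use arrow on my own computer, but I still want to run this on the cluster. My guess is that the cluster cannot easily reach GitHub, since many of the errors above mention failed connections. Is there any way to skip or work around this?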
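To test the guess about GitHub, one could probe the URL from R on a cluster node. This is a rough sketch: `can_reach()` is a hypothetical helper, and it only exercises R's own connection code, not the cmake download step that failed in the first log.

```r
# Hypothetical helper: TRUE if R can open a connection to `u`, FALSE otherwise.
can_reach <- function(u, timeout_sec = 10) {
  old <- options(timeout = timeout_sec)   # limit how long the attempt may take
  on.exit(options(old), add = TRUE)
  tryCatch({
    con <- suppressWarnings(url(u, open = "rb"))
    close(con)
    TRUE
  }, error = function(e) FALSE)
}
```

Calling `can_reach("https://github.com")` on the cluster and on the local machine would show whether the two environments actually differ in this respect.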