-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-5488: [R] Workaround when C++ lib not available #4471
ARROW-5488: [R] Workaround when C++ lib not available #4471
Conversation
So what would occur if:
Is the installed R package broken until it is reinstalled? If the Arrow C++ headers are bundled privately with the R package then it should be able to build against them, and you can omit |
Correct, under this approach, the R package would not find the C++ library if it were installed after the R package were installed. That said, when we discussed this approach before, we proposed adding an |
@wesm I'm going to try the approach of bundling headers in I'm just a little worried about code duplication and bloat for the R package, but this is less problematic as if we were bundling all of arrow (*.h and *.cpp). If we are successful here, then I guess we can do something similar to what @jjallaire did with .onLoad <- function(libname, pkgname) {
# load tbb and tbbmalloc on supported platforms
tbb <- tbbLibPath()
if (!is.null(tbb)) {
if (!file.exists(tbb)) {
warning(paste("TBB library", tbb, "not found."))
} else {
dllInfo <<- dyn.load(tbb, local = FALSE, now = TRUE)
}
}
tbbMalloc <- tbbLibPath("malloc")
if (!is.null(tbbMalloc)) {
if (!file.exists(tbbMalloc)) {
warning(paste("TBB malloc library", tbbMalloc, "not found."))
} else {
mallocDllInfo <<- dyn.load(tbbMalloc, local = FALSE, now = TRUE)
}
}
# load the package library
library.dynam("RcppParallel", pkgname, libname)
# set default thread options
numThreads <- "auto"
numThreadsEnv <- Sys.getenv("RCPP_PARALLEL_NUM_THREADS", unset = NA)
if (!is.na(numThreadsEnv))
setThreadOptions(numThreads = as.integer(numThreadsEnv))
else
setThreadOptions(numThreads = "auto")
} |
This comment has been minimized.
This comment has been minimized.
I easily might be missing something here, so take my comments w/ a grain of salt if my mental picture is incomplete. I'm wondering a couple of things about including the headers:
|
On #1, on Linux I believe if the symbols that the Rcpp library are available in process (e.g. having been loaded with dlopen earlier) then they'll be resolved correctly (this will only work on Linux, though)? I might be wrong about this. On #2, yes if the user has some other version of the shared libraries installed then ABI mismatch will most likely cause symbol resolution failure more often than a flat out crash. The shared libraries have ABI tags (e.g. |
@wesm My worry though is that when we are building/testing on CRAN, we won't have loaded the symbols w/ w/r/t ABI issues, I'm wondering if we are always better off building against the headers that match the library (i.e. I'm not sure what including the headers in the package buys us). Forgive me for being dense if it's implicit in some previous discussion! |
ff8f2ce
to
1ed5596
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few notes. It might also be good to add a travis job to check the package without the arrow C++ present.
r/data-raw/codegen.R
Outdated
@@ -0,0 +1,160 @@ | |||
# pak::pkg_install("romainfrancois/decor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does data-raw/codegen.R get called? Can you please update the README with developer instructions (and if necessary/appropriate, add Suggests dependencies to DESCRIPTION)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will do that when I'm sure this is solid.
I guess I need to edit the
so that it says a shim package has been compiled and the user should use I'll probably do The files |
There isn't a hook like this that I'm aware of (I guess configure could do it if could successfully detect changes). There is this function from the utils package that lets you easily do before/after snapshots of a directory to detect changes: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/changedFiles.html |
You could regenerate it unconditionally in configure, it should be fast, so there is probably no harm in doing it every time configure is called, assuming the output is deterministic across platforms for the same input file. |
+1 on doing it unconditionally. Assuming it has similar overhead to
Rcpp::compileAttributes it won't be noticeable.
…On Thu, Jun 6, 2019 at 9:53 AM Jim Hester ***@***.***> wrote:
You could regenerate it unconditionally in configure, it should be fast,
so there is probably no harm in doing it every time configure is called,
assuming the output is deterministic across platforms for the same input
file.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4471?email_source=notifications&email_token=AAAZPR4G3FWFDKATTQQ5K5LPZEJGBA5CNFSM4HS44S4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXC466I#issuecomment-499502969>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAAZPR2SU5J5CYFDEGXMKQTPZEJGBANCNFSM4HS44S4A>
.
|
❯ system.time(source('~/git/apache/arrow/r/data-raw/codegen.R'))
user system elapsed
0.656 0.036 0.694 The only thing is that it uses a bunch of packages # pak::pkg_install("romainfrancois/decor")
library(decor)
library(dplyr)
library(stringr)
library(purrr)
library(glue) |
So I guess we don't want those packages to become dependencies of the arrow
package. I'd say just have a special environment variable that triggers the
use of codegen.R within configure so those packages don't have to be
declared in Suggests?
…On Thu, Jun 6, 2019 at 10:14 AM Romain François ***@***.***> wrote:
❯ system.time(source('~/git/apache/arrow/r/data-raw/codegen.R'))
user system elapsed
0.656 0.036 0.694
The only thing is that it uses a bunch of packages
# pak::pkg_install("romainfrancois/decor")
library(decor)
library(dplyr)
library(stringr)
library(purrr)
library(glue)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4471?email_source=notifications&email_token=AAAZPR3WY5TF6QGT46773MDPZELU3A5CNFSM4HS44S4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXC7HFY#issuecomment-499512215>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAAZPRYZRDMUQFKTO7WYVGLPZELU3ANCNFSM4HS44S4A>
.
|
Also, |
@nealrichardson I believe this is ready for another review now. I will open another JIRA/PR for the Code generation is triggered on each The code generation is somewhat craft right now, I'm hoping to make progress with my hobby |
@romainfrancois since we're talking about interactions with GPL2 build tools can you confirm that this patch includes no reused code from RcppCore/Rcpp? |
Yes. It generates similar code (and Rcpp is still being used in the package), but this patch does not borrow code from Rcpp. |
OK, thanks. Note that it's fine to utilize GPL tools as part of a build system (for example: GNU autotools), but it's the code reuse / copying where things become problematic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few 💅 ✍️ comments, but assuming CHECK
is passing everywhere, looks like we're really close here.
To @jjallaire 's point about the examples, that's interesting, I hadn't thought about the prospect that readers of the help pages might see the examples and then think they need to wrap everything in if(arrow_available()) {...}
, so it would be better to make them all \dontrun
. Have we seen that happen for other packages?
A counterargument would be that if we did keep the arrow_available()
check in the examples and let them run, not only would we avoid any potential CRAN litigation about \dontrun
, we would also get examples with rendered output in the pkgdown site, which would be nice.
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script and the ARROW_R_DEV
env var should be documented at least minimally for developers. Somewhere around here in the readme it should say that anyone working on the C++ code in the R package should export ARROW_R_DEV=TRUE
and will also need to remotes::install_github("romainfrancois/decor")
.
It might then be sufficient to have a brief comment here at the top of this file (below the 🐀 license) just explaining that this is a supplement to the usual Rcpp magic and wraps all functions in a check for the existence of the Arrow library, or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some words
glue_collapse(sep = "\n") | ||
|
||
writeLines(con = "R/arrowExports.R", glue::glue(' | ||
# Generated by using data-raw/codegen.R -> do not edit by hand |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor point but is "codegen" the best name for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷♂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess Rcpp::compileAttributes()
isn't all that vivid either ;)
|
||
#' Is the C++ Arrow library available | ||
#' | ||
#' @export |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function could use some more substantive docs, like to say that package users generally wouldn't need to call this, to address the concern that JJ raised about the examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we use \dontrun
extensively for TensorFlow, Spark, and reticulate. We do however have example functions for any code that doesn't have dependencies (in this case e.g. arrow_available()
) so that we don't trip the "no running examples" flag.
pkgdown is actually setup to deal with this pretty gracefully (example code is enclosed in a subtle not run treatment, e.g. see https://keras.rstudio.com/reference/keras_model.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, pkgdown handles dontrun
gracefully, but the downside is that you don't get to see the example results, like with e.g. https://dplyr.tidyverse.org/reference/select.html#examples. Provided that we build the pkgdown static site in an environment where libarrow is installed (which we will), if we used the arrow_available()
check in the examples, we'd get the example output showing. For pedagogical purposes, I'm not sure which is the better tradeoff (i.e. see real output but possibly be mislead that you always have to check if arrow is available, vs. no output but also no extraneous code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like pkgdown has a run_dont_run
argument (https://github.com/r-lib/pkgdown/blob/89af5341ef2a0eedf07770d3e98f5094dacb9029/R/build-reference.R#L86), so we can have the cleaner \dontrun
and still get example output on the pkgdown site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I'd like to document this function in a subsequent pull request, maybe together with install_arrow()
.
FWIW don't think it's worth polluting the examples with
`arrow_available()`. It's both inelegant and probably misleading as users
might add this to their code.
…On Mon, Jun 10, 2019 at 5:41 PM Neal Richardson ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In r/R/zzz.R
<#4471 (comment)>:
> @@ -18,3 +18,11 @@
#' @useDynLib arrow, .registration = TRUE
#' @importFrom Rcpp sourceCpp
NULL
+
+#' Is the C++ Arrow library available
+#'
+#' @export
Right, pkgdown handles dontrun gracefully, but the downside is that you
don't get to see the example results, like with e.g.
https://dplyr.tidyverse.org/reference/select.html#examples. Provided that
we build the pkgdown static site in an environment where libarrow is
installed (which we will), if we used the arrow_available() check in the
examples, we'd get the example output showing. For pedagogical purposes,
I'm not sure which is the better tradeoff (i.e. see real output but
possibly be mislead that you always have to check if arrow is available,
vs. no output but also no extraneous code).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4471?email_source=notifications&email_token=AAAZPR5WAV2WISIIIZCSIPDPZ3C73A5CNFSM4HS44S4KYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOB3DAYLI#discussion_r292203237>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAAZPR7EIS6EN5NOSPMR7STPZ3C73ANCNFSM4HS44S4A>
.
|
@romainfrancois just came across this script that probably needs updating with your change: https://github.com/apache/arrow/blob/master/ci/docker_build_r.sh#L26-L27 |
@jjallaire thanks, I did remove them based on your previous comment. |
Thanks @nealrichardson I did not know about these docker files, I don't really speak docker, I guess I should. @jameslamb @xhochy I updated them to the best of my abilities given other changes in this pull request, it should be fine, but you might want to check. |
…cois/arrow into ARROW-5488/workaround
As long as redoing code generation, perhaps it would be useful to spit out several files in a directory instead of the giant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; I say merge this and we can follow with any cleanups after.
glue_collapse(sep = "\n") | ||
|
||
writeLines(con = "R/arrowExports.R", glue::glue(' | ||
# Generated by using data-raw/codegen.R -> do not edit by hand |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess Rcpp::compileAttributes()
isn't all that vivid either ;)
This is very much a work-in-progress for running the R package tests on Appveyor. It's an attempt to bring in the "standard" R Appveyor build tooling (https://github.com/krlmlr/r-appveyor, Apache 2.0 licensed) into Arrow's configuration. I currently have the C++ library building and installing, and R and all R package dependencies installing, and R checks beginning, but they fail. This is because https://github.com/apache/arrow/blob/master/r/tools/winlibs.R is configured only to build with a released [sic] version of the arrow C++ binaries pre-built and hosted on rwinlib. Two immediate to-dos are: 1. get the package to install and check without libarrow at all (as #4471 now allows) 2. get the package to find the libarrow we built locally from source After that, I can reverse-engineer what appveyor does and document it for any Windows developers (https://issues.apache.org/jira/browse/ARROW-3758). There's obviously some mess I've made of appveyor.yml that I'll clean up before I'm done, and I'll do my best to document and refactor some of the non-obvious features of the setup that I've discovered (e.g. that `ci/appveyor-cpp-build-mingw.bat` also installs GLib and Ruby and runs their test suites too) while I'm at it so that the next person might be able to pick things up faster. cc @romainfrancois @javierluraschi Author: Neal Richardson <neal.p.richardson@gmail.com> Closes #4538 from nealrichardson/r-appveyor and squashes the following commits: 6d6a74e <Neal Richardson> PR feedback 84eef1b <Neal Richardson> Merge upstream/master 328bc37 <Neal Richardson> Merge remote-tracking branch 'upstream/master' into r-appveyor 53c3a20 <Neal Richardson> Makevars.win is dynamically generated now, no need to tweak it in the release script 567f041 <Neal Richardson> Cleanups f4cf5c3 <Neal Richardson> Some appveyor.yml cleanup 19b60df <Neal Richardson> Enable R package to install on windows even if rwinlib isn't found fb19911 <Neal Richardson> Try this name hack 15c0342 <Neal Richardson> Try building first, see where that goes 18803c3 <Neal Richardson> cd this way b6f6c8d <Neal Richardson> No popd? 1ba4839 <Neal Richardson> let the debugging begin d5e8f4e <Neal Richardson> Bootstrap within the package dir f7327ed <Neal Richardson> Find shell script up a level 8c3e898 <Neal Richardson> nm we already fast finish 39236f3 <Neal Richardson> Put mine first and fast-finish 94c9a42 <Neal Richardson> Debug with fewer jobs in the matrix af1852e <Neal Richardson> Throw this at appveyor and see how far off we are
This is very much a work-in-progress for running the R package tests on Appveyor. It's an attempt to bring in the "standard" R Appveyor build tooling (https://github.com/krlmlr/r-appveyor, Apache 2.0 licensed) into Arrow's configuration. I currently have the C++ library building and installing, and R and all R package dependencies installing, and R checks beginning, but they fail. This is because https://github.com/apache/arrow/blob/master/r/tools/winlibs.R is configured only to build with a released [sic] version of the arrow C++ binaries pre-built and hosted on rwinlib. Two immediate to-dos are: 1. get the package to install and check without libarrow at all (as apache/arrow#4471 now allows) 2. get the package to find the libarrow we built locally from source After that, I can reverse-engineer what appveyor does and document it for any Windows developers (https://issues.apache.org/jira/browse/ARROW-3758). There's obviously some mess I've made of appveyor.yml that I'll clean up before I'm done, and I'll do my best to document and refactor some of the non-obvious features of the setup that I've discovered (e.g. that `ci/appveyor-cpp-build-mingw.bat` also installs GLib and Ruby and runs their test suites too) while I'm at it so that the next person might be able to pick things up faster. cc @romainfrancois @javierluraschi Author: Neal Richardson <neal.p.richardson@gmail.com> Closes #4538 from nealrichardson/r-appveyor and squashes the following commits: 6d6a74e31 <Neal Richardson> PR feedback 84eef1b2d <Neal Richardson> Merge upstream/master 328bc376f <Neal Richardson> Merge remote-tracking branch 'upstream/master' into r-appveyor 53c3a2084 <Neal Richardson> Makevars.win is dynamically generated now, no need to tweak it in the release script 567f04142 <Neal Richardson> Cleanups f4cf5c3dd <Neal Richardson> Some appveyor.yml cleanup 19b60df07 <Neal Richardson> Enable R package to install on windows even if rwinlib isn't found fb199110a <Neal Richardson> Try this name hack 15c0342d1 <Neal Richardson> Try building first, see where that goes 18803c353 <Neal Richardson> cd this way b6f6c8de7 <Neal Richardson> No popd? 1ba4839e2 <Neal Richardson> let the debugging begin d5e8f4ed4 <Neal Richardson> Bootstrap within the package dir f7327ed23 <Neal Richardson> Find shell script up a level 8c3e898f8 <Neal Richardson> nm we already fast finish 39236f3f7 <Neal Richardson> Put mine first and fast-finish 94c9a42df <Neal Richardson> Debug with fewer jobs in the matrix af1852e79 <Neal Richardson> Throw this at appveyor and see how far off we are
This is very much a work in progress. The idea is to replace the code generation that is usually done by
Rcpp::compileAttributes()
with something custom. This is driven by thedata-raw/codegen.R
file, which I'll polish. I'm opening this right now for feedback opportunities.All of the
.cpp
files are guarded by :And
ARROW_R_WITH_ARROW
is defined via the configure file, if the library was indeed used.For example, this function:
triggers generation of this in the
generated.cpp
file:So the generated R api
SEXP
y functions only call the real thing when the c++ library is available, otherwise they just throw an error.and this in the
generated.R
file:This also needed some extra care in test functions so that the tests only run if Arrow is available.
@wesm comment from https://issues.apache.org/jira/browse/ARROW-5488 might be more practical and closer to what @jjallaire mentioned about how the
RcppParallel
package does it with Intel tbb:https://github.com/RcppCore/RcppParallel/blob/master/R/hooks.R