Are all default Python libraries available in Notebook also available by default in Spark Processor?

Comments

4 comments

  • Sarath Botlagunta

    We used to make some of the standard libraries available in Spark Processors, but we started running into issues: EMR/Dataproc have their own libraries and dependencies, which conflicted with the preinstalled packages we were providing. So we removed them.
    With Jupyter Notebook, on the other hand, the kernel is completely managed by Syntasa, so we have the flexibility to preinstall some libraries without any issues.

    We will have this disconnect for the foreseeable future. Eventually, though, we would like to start using a Kubernetes Spark cluster for all data processing instead of EMR/Dataproc. Once Kubernetes Spark becomes mainstream, we can try to stay in sync as much as we can.

  • Mike Z

    Thanks. Is there a way to see what libraries/dependencies and versions are available for the Spark Processor?

  • Sarath Botlagunta

    Since the Spark Processor is not an interactive experience for the user, that is not possible.

    Because the versions of the default packages are constantly being updated by EMR/Dataproc, we can't maintain a static list and show it in one of the sections of the Spark Processor.

    We could fetch the list dynamically, but only after the runtime has started. When the user is writing their Spark code, the runtime might not even be up.
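    One practical workaround, sketched below: have the Spark Processor job itself print the packages visible to the driver at runtime, so the actual EMR/Dataproc-provided versions show up in the job logs. This is a minimal, hypothetical snippet (not a Syntasa feature), using only the Python standard library:

    ```python
    # Sketch: dump every Python package visible to the Spark driver at
    # runtime, since no static list exists before the runtime starts.
    import importlib.metadata

    def installed_packages():
        """Return a {name: version} dict for all installed distributions."""
        return {
            dist.metadata["Name"]: dist.version
            for dist in importlib.metadata.distributions()
        }

    # Printing from the driver lands this inventory in the job logs.
    for name, version in sorted(installed_packages().items()):
        print(f"{name}=={version}")
    ```

    Running this once per cluster image is usually enough to know what you can rely on without reinstalling.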

  • Mike Z

    What would be the suggested best practice for the user, then?

    Two scenarios come to mind...

    • Specify whatever libraries/dependencies you need to install in the Spark Processor code?
    • Run without trying to install any and let it error out?

    Scenario two sounds like bad practice. I don't know all the ramifications of scenario one, but it sounds tolerable.
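    Scenario one could look something like the following: a small, hypothetical helper at the top of the Spark Processor code that installs a package only if it isn't already importable, so it won't clobber anything EMR/Dataproc already provides. The function name and pinning choice are illustrative, not a Syntasa convention:

    ```python
    # Sketch: install a dependency from inside the job only when it is
    # missing, leaving any preinstalled EMR/Dataproc version untouched.
    import importlib
    import subprocess
    import sys

    def ensure_package(module_name, pip_spec=None):
        """Import module_name, pip-installing pip_spec (or the module
        name itself) first if the import fails."""
        try:
            return importlib.import_module(module_name)
        except ImportError:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install",
                 pip_spec or module_name]
            )
            return importlib.import_module(module_name)

    # Example: ensure_package("requests", "requests==2.31.0")
    ```

    Pinning exact versions in the pip spec keeps the job reproducible even as the cluster image's own packages change underneath it.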

