Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Installing pandas in docker Alpine

I am having a really hard time trying to install a stable data science package configuration in Docker. This should be easier with such mainstream, relevant tools.

The following is the Dockerfile that used to work, with a bit of a hack: pandas is left out of the core package list and installed separately, pinned to pandas<0.21.0, because, allegedly, higher versions conflict with numpy.

    FROM alpine:3.6

    ENV PACKAGES="\
        dumb-init \
        musl \
        libc6-compat \
        linux-headers \
        build-base \
        bash \
        git \
        ca-certificates \
        freetype \
        libgfortran \
        libgcc \
        libstdc++ \
        openblas \
        tcl \
        tk \
        libssl1.0 \
        "

    ENV PYTHON_PACKAGES="\
        numpy \
        matplotlib \
        scipy \
        scikit-learn \
        nltk \
        "

    RUN apk add --no-cache --virtual build-dependencies python3 \
        && apk add --virtual build-runtime \
        build-base python3-dev openblas-dev freetype-dev pkgconfig gfortran \
        && ln -s /usr/include/locale.h /usr/include/xlocale.h \
        && python3 -m ensurepip \
        && rm -r /usr/lib/python*/ensurepip \
        && pip3 install --upgrade pip setuptools \
        && ln -sf /usr/bin/python3 /usr/bin/python \
        && ln -sf pip3 /usr/bin/pip \
        && rm -r /root/.cache \
        && pip install --no-cache-dir $PYTHON_PACKAGES \
        # PANDAS: installed separately, pinned below 0.21.0
        && pip3 install 'pandas<0.21.0' \
        && apk del build-runtime \
        && apk add --no-cache --virtual build-dependencies $PACKAGES \
        && rm -rf /var/cache/apk/*

    # set working directory
    WORKDIR /usr/src/app

    # add and install requirements (everything other than the data science packages goes here)
    COPY ./requirements.txt /usr/src/app/requirements.txt
    RUN pip install -r requirements.txt

    # add entrypoint.sh
    COPY ./entrypoint.sh /usr/src/app/entrypoint.sh

    RUN chmod +x /usr/src/app/entrypoint.sh

    # add app
    COPY . /usr/src/app

    # run server
    CMD ["/usr/src/app/entrypoint.sh"]

The configuration above used to work. Now the build still goes through, but pandas fails at import with the following error:

ImportError: Missing required dependencies ['numpy']

Since numpy 1.16.1 was installed, I don't know which numpy pandas is trying to find anymore...
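pandas reports "Missing required dependencies" whenever importing numpy raises an ImportError, so the message can hide the real cause (for example a broken numpy build rather than an absent one). A minimal sketch for checking this inside the container is to import numpy directly and print either its location or the full traceback. The helper name try_import is made up for illustration; it is not part of pandas or numpy.

```python
import importlib
import traceback


def try_import(module_name):
    """Return (True, file location) on success, or (False, traceback text) on failure."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Keep the whole traceback: it names the underlying failure
        # that the pandas error message does not show.
        return False, traceback.format_exc()
    return True, getattr(module, "__file__", "<built-in>")


ok, detail = try_import("numpy")
print("numpy importable:", ok)
print(detail)
```

If the import fails, the printed traceback should show whether numpy is missing altogether or present but unable to load its compiled extension.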

Does anyone know how to obtain a stable solution for this?

NOTE: A solution that pulls a turnkey Docker image for data science, with at least the packages mentioned above, into the Dockerfile above would also be very welcome.


EDIT 1:

If I move the installation of the data packages into requirements.txt, as suggested in the comments, like so:

requirements.txt

(...)
numpy==1.16.1 # or numpy==1.16.0
scikit-learn==0.20.2
scipy==1.2.1
nltk==3.4   
pandas==0.24.1 # or pandas== 0.23.4
matplotlib==3.0.2 
(...)

and Dockerfile:

# add and install requirements
COPY ./requirements.txt /usr/src/app/requirements.txt
RUN pip install -r requirements.txt

It breaks again at pandas, complaining about numpy.

Collecting numpy==1.16.1 (from -r requirements.txt (line 61))
  Downloading https://files.pythonhosted.org/packages/2b/26/07472b0de91851b6656cbc86e2f0d5d3a3128e7580f23295ef58b6862d6c/numpy-1.16.1.zip (5.1MB)
Collecting scikit-learn==0.20.2 (from -r requirements.txt (line 62))
  Downloading https://files.pythonhosted.org/packages/49/0e/8312ac2d7f38537361b943c8cde4b16dadcc9389760bb855323b67bac091/scikit-learn-0.20.2.tar.gz (10.3MB)
Collecting scipy==1.2.1 (from -r requirements.txt (line 63))
  Downloading https://files.pythonhosted.org/packages/a9/b4/5598a706697d1e2929eaf7fe68898ef4bea76e4950b9efbe1ef396b8813a/scipy-1.2.1.tar.gz (23.1MB)
Collecting nltk==3.4 (from -r requirements.txt (line 64))
  Downloading https://files.pythonhosted.org/packages/6f/ed/9c755d357d33bc1931e157f537721efb5b88d2c583fe593cc09603076cc3/nltk-3.4.zip (1.4MB)
Collecting pandas==0.24.1 (from -r requirements.txt (line 65))
  Downloading https://files.pythonhosted.org/packages/81/fd/b1f17f7dc914047cd1df9d6813b944ee446973baafe8106e4458bfb68884/pandas-0.24.1.tar.gz (11.8MB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 359, in get_provider
        module = sys.modules[moduleOrReq]
    KeyError: 'numpy'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 732, in <module>
        ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 475, in maybe_cythonize
        numpy_incl = pkg_resources.resource_filename('numpy', 'core/include')
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1144, in resource_filename
        return get_provider(package_or_requirement).get_resource_filename(
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 361, in get_provider
        __import__(moduleOrReq)
    ModuleNotFoundError: No module named 'numpy'

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-_e5z6o6_/pandas/
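The traceback shows pandas' setup.py importing numpy while pip is still collecting metadata (egg_info), before anything from the file has been installed. One possible workaround, sketched here under that assumption, is to install numpy in a first pip run and everything else in a second. The function names and the build_first parameter are hypothetical, and the parser only handles plain "name==version" lines with optional trailing comments, not pip options or URLs.

```python
import re
import subprocess
import sys


def split_requirements(lines, build_first=("numpy",)):
    """Split requirement lines into (packages to install first, the rest)."""
    first, rest = [], []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments and whitespace
        if not line:
            continue
        # The project name is whatever precedes the first specifier character.
        name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0].lower()
        (first if name in build_first else rest).append(line)
    return first, rest


def install_in_two_passes(path="requirements.txt"):
    """Run pip twice so numpy is importable when pandas' setup.py executes."""
    with open(path) as fh:
        first, rest = split_requirements(fh)
    for group in (first, rest):
        if group:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "--no-cache-dir", *group]
            )
```

Calling install_in_two_passes() from a RUN step would replace the single pip install -r requirements.txt. The compilers and headers still have to be present at that point, since both passes build from source on Alpine.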

EDIT 2:

This seems like an open pandas issue. For more details please refer to:

pandas-dev github

"Unfortunately, this means that a requirements.txt file is insufficient for setting up a new environment with pandas installed (like in a docker container)".

  **ImportError**:

  IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

  Importing the multiarray numpy extension module failed.  Most
  likely you are trying to import a failed build of numpy.
  Here is how to proceed:
  - If you're working with a numpy git repository, try `git clean -xdf`
    (removes all files not under version control) and rebuild numpy.
  - If you are simply trying to use the numpy version that you have installed:
    your installation is broken - please reinstall numpy.
  - If you have already reinstalled and that did not fix the problem, then:
    1. Check that you are using the Python you expect (you're using /usr/local/bin/python),
       and that you have no directories in your PATH or PYTHONPATH that can
       interfere with the Python and numpy versions you're trying to use.
    2. If (1) looks fine, you can open a new issue at
       https://github.com/numpy/numpy/issues.  Please include details on:
       - how you installed Python
       - how you installed numpy
       - your operating system
       - whether or not you have multiple versions of Python installed
       - if you built from source, your compiler versions and ideally a build log

EDIT 3

requirements.txt ---> https://pastebin.com/0icnx0iu


EDIT 4

As of 01/12/20, the accepted solution no longer works. The build now breaks not at pandas but at scipy, after numpy has been built, while building scipy's wheel. This is the log:

  ----------------------------------------
  ERROR: Failed building wheel for scipy
  Running setup.py clean for scipy
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s6nahssd/scipy/setup.py'"'"'; __file__='"'"'/tmp/pip-install-s6nahssd/scipy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
       cwd: /tmp/pip-install-s6nahssd/scipy
  Complete output (9 lines):

  `setup.py clean` is not supported, use one of the following instead:

    - `git clean -xdf` (cleans all files)
    - `git clean -Xdf` (cleans all versioned files, doesn't touch
                        files that aren't checked into the git repo)

  Add `--force` to your command to use it anyway if you must (unsupported).

  ----------------------------------------
  ERROR: Failed cleaning build dir for scipy
Successfully built numpy
Failed to build scipy
ERROR: Could not build wheels for scipy which use PEP 517 and cannot be installed directly

From the error, it seems that the build process is using python3.6, while I use FROM alpine:3.7.

Full log here -> https://pastebin.com/Tw4ubxSA

And this is the current Dockerfile:

https://pastebin.com/3SftEufx


1 Reply

If you're not bound to Alpine 3.6, using Alpine 3.7 (or later) should work.

On Alpine 3.6, installing matplotlib failed for me with the following:

Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/26/04/8b381d5b166508cc258632b225adbafec49bbe69aa9a4fa1f1b461428313/matplotlib-3.0.3.tar.gz (36.6MB)
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.org/simple/numpy/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    No local packages or working download links found for numpy>=1.10.0

However, on Alpine 3.7, it worked. This may be due to a numpy versioning issue (see here), but I'm not able to tell for sure. Past that problem, the packages were built and installed successfully, although it took a good while, about 30 minutes. The prebuilt wheels on PyPI target glibc and are not compatible with Alpine's musl libc, so every package installed with pip has to be built from source.

Note that one important change is needed: you should only remove the build-runtime virtual package (apk del build-runtime) after the last pip install, because every source build needs the compilers and headers it provides. Also, if applicable, you could replace numpy 1.16.1 with 1.16.2, which is the shipped version (otherwise 1.16.2 will be uninstalled and 1.16.1 built from source, further increasing the build time) - I haven't tried this, though.
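The ordering rule can be checked mechanically. The sketch below is a deliberately naive text search, not a Dockerfile parser: it reports whether an "apk del" of the virtual package appears before a later pip install, which is the situation in the question's Dockerfile. The function name and the virtual_pkg parameter are made up for illustration, and commented-out lines would fool it.

```python
def cleanup_before_last_pip(dockerfile_text, virtual_pkg="build-runtime"):
    """Return True if `apk del <virtual_pkg>` occurs before the last pip install."""
    del_pos = dockerfile_text.find("apk del " + virtual_pkg)
    if del_pos == -1:
        return False  # the package is never removed, so there is no ordering problem
    # "pip3 install" does not contain "pip install", so look for both spellings.
    last_pip = max(
        dockerfile_text.rfind("pip install"),
        dockerfile_text.rfind("pip3 install"),
    )
    return last_pip > del_pos


problematic = (
    "RUN pip3 install 'pandas<0.21.0' \\\n"
    "    && apk del build-runtime\n"
    "RUN pip install -r requirements.txt\n"
)
print(cleanup_before_last_pip(problematic))
```

A True result means some pip install still runs after the build tools are gone, so that step would have to move ahead of the apk del.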

For reference, here's my slightly modified Dockerfile and docker build output.

Note:

Usually Alpine is chosen as the base to minimize the image size (Alpine is otherwise very slick too, but has compatibility issues with mainstream Linux apps because it uses musl instead of glibc). Having to build Python packages from source rather defeats that purpose, since you get a very bloated image - 900MB before any cleanup - which also takes ages to build. The image could be compacted considerably by removing all intermediate compilation artifacts, build dependencies etc., but still.

If you can't get the Python package versions you need to work on Alpine, without having to build them from source, I would suggest trying other small and more compatible base images such as debian-slim, or even ubuntu.

Edit:

Following "Edit 3" with the added requirements, here are the updated Dockerfile and docker build output. The following packages were added to satisfy build dependencies:

postgresql-dev libffi-dev libressl-dev libxml2 libxml2-dev libxslt libxslt-dev libjpeg-turbo-dev zlib-dev

For packages that failed to build due to specific headers, I used Alpine's package contents search to locate the missing package. Specifically for cffi, the ffi.h header was missing, which needs the libffi-dev package: https://pkgs.alpinelinux.org/contents?file=ffi.h&path=&name=&branch=v3.7.
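When several headers are missing, building the search links by hand gets tedious. A small sketch that reproduces the query format of the ffi.h link above: the function name and the default branch are assumptions for illustration, and the path and name parameters are left empty exactly as in that URL.

```python
from urllib.parse import urlencode

ALPINE_CONTENTS_URL = "https://pkgs.alpinelinux.org/contents"


def contents_search_url(filename, branch="v3.7"):
    """Build a package-contents search URL for a file name on a given Alpine branch."""
    # path and name stay empty, matching the ffi.h query quoted above
    query = {"file": filename, "path": "", "name": "", "branch": branch}
    return ALPINE_CONTENTS_URL + "?" + urlencode(query)


print(contents_search_url("ffi.h"))
```

Opening the printed link in a browser lists the packages that ship the file; the -dev package among them is usually the one to add to the build dependencies.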

Alternatively, when a package build failure is not very clear, the installation instructions of the specific package could be referred to, for example, Pillow.

The new image size, before any compaction, is 1.04GB. For cutting it down a bit, you could remove the Python and pip caches:

RUN apk del build-runtime && \
    find -type d -name __pycache__ -prune -exec rm -rf {} \; && \
    rm -rf ~/.cache/pip

This will bring the image size down to 661MB when using docker build --squash.

