python - How do data scientists organize code structure in order to provide a minimal & reusable example of their data pipeline?

Inspired by the data science project structure template here: https://medium.com/swlh/how-to-structure-a-python-based-data-science-project-a-short-tutorial-for-beginners-7e00bff14f56, I was wondering how you would improve this structure to handle a minimal test example (not a unit test, but more a kind of functional test) that runs the entire project.

I can imagine a main.py module in my package that runs the entire pipeline on a provided sample of data; that way, people interested in the project could run it quickly and understand it better.
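As a rough sketch, assuming your package exposes pipeline steps like the hypothetical load_raw_data, build_features and train_and_evaluate below (these names are placeholders for your own modules), such a main.py could look like this:

    # main.py -- hypothetical entry point; the src.* modules and function
    # names are placeholders for whatever your own package provides.
    from pathlib import Path

    from src.data import load_raw_data          # e.g. reads data/raw/sample.csv
    from src.features import build_features
    from src.model import train_and_evaluate

    def run_pipeline(data_dir: Path = Path("data/raw")) -> None:
        """Run the whole pipeline end to end on the bundled sample data."""
        df = load_raw_data(data_dir / "sample.csv")
        features = build_features(df)
        metrics = train_and_evaluate(features)
        print(f"Pipeline finished, metrics: {metrics}")

    if __name__ == "__main__":
        run_pipeline()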

But where should I put this sample data? Its size would be a compromise: small enough that the "functional test" runs quickly, yet large enough that the output stays meaningful.

I have been told that versioning data is not a good idea (even a sample), and in the case of sensitive data it would not be well received by our management. Since my company maintains a shared drive with access permissions, I was thinking of keeping this test data there and providing a get_data.py in my code that downloads the data into the project's data/raw folder.
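A minimal sketch of such a get_data.py, assuming the shared drive is mounted and reachable via a plain path (the UNC path below is made up; substitute your company's actual share):

    # get_data.py -- hypothetical helper; SHARED_DRIVE is a made-up path,
    # replace it with the real location on your company's shared drive.
    import shutil
    from pathlib import Path

    SHARED_DRIVE = Path("//company-server/datasets/myproject")  # assumption
    RAW_DIR = Path("data/raw")

    def get_data(filename: str = "sample.csv") -> Path:
        """Copy the sample data from the shared drive into data/raw."""
        RAW_DIR.mkdir(parents=True, exist_ok=True)
        dst = RAW_DIR / filename
        shutil.copy2(SHARED_DRIVE / filename, dst)
        return dst

    if __name__ == "__main__":
        print(f"Downloaded sample to {get_data()}")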

How does that sound to you? What best practices do data scientists follow?

Question from: https://stackoverflow.com/questions/66060335/how-datascientists-organize-code-structure-in-order-to-provide-minimal-reusabl


1 Reply

  1. Set up a good .gitignore that makes sure you do not upload any data (or similar files) to the server. It is also good to avoid committing notebooks to the repo (see the example after this list).
  2. For generic repo contents, see the section "Generated Project Contents". I would say it is good to have tests, scripts, notebooks, and modules where you place the useful functions you could reuse in the future, and so on.
  3. It could be interesting to include some fake data. Perhaps have a look at Faker to create samples: https://pypi.org/project/Faker/ (see the sketch after this list).
  4. doctest could be good for both tests and documentation (see the example after this list).
  5. Sphinx could be good for documentation.
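For point 1, a minimal .gitignore along these lines would do it (the data/ folder name assumes the template layout discussed above):

    # keep data, notebook output and local config out of the repo
    data/
    *.csv
    *.parquet
    *.ipynb
    .ipynb_checkpoints/
    __pycache__/
    *.pyc
    .env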
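For point 3, a small sketch of generating a fake sample with Faker (the columns here are arbitrary examples, not a real schema):

    # make_fake_sample.py -- generate a small fake dataset with Faker
    import csv
    from pathlib import Path

    from faker import Faker

    fake = Faker()
    out = Path("data/raw")
    out.mkdir(parents=True, exist_ok=True)

    with open(out / "fake_sample.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "email", "signup_date"])  # arbitrary columns
        for _ in range(100):
            writer.writerow([fake.name(), fake.email(), fake.date()])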
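For point 4, doctest lets the usage examples embedded in your docstrings double as tests, for example:

    def normalize(values):
        """Scale values to the [0, 1] range.

        >>> normalize([0, 5, 10])
        [0.0, 0.5, 1.0]
        """
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    if __name__ == "__main__":
        import doctest
        doctest.testmod()  # runs the examples embedded in the docstrings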

Overall I would say most of the situations you are going to face have already been faced by other people, so perhaps try to contribute to, or follow, some good packages/libraries like the ones below.

It could be good to check what those projects are doing or have done.

For the particular case of a data pipeline, it can get complex, because different people may be involved and different technologies used (databases, Python, R, ...). In general, I think it is good to have good documentation, as above. In my experience, I would try to place everything in one place, like a single script or similar. I would also add warnings and errors to make the pipeline as robust as possible (see the sketch below).
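As an illustration of those warning/error checks, a hedged sketch (a pandas DataFrame is assumed, and the column name and threshold are made up):

    # Illustrative validation step for a pipeline; 'target' and the 10%
    # threshold are made-up examples of the kind of checks you might run.
    import warnings

    def validate(df):
        if df.empty:
            raise ValueError("Pipeline received an empty dataframe")
        missing_ratio = df["target"].isna().mean()  # placeholder column
        if missing_ratio > 0.1:
            warnings.warn(f"{missing_ratio:.0%} of 'target' is missing")
        return df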

