
python - How do data scientists organize code structure to provide a minimal & reusable example of their data pipeline?

Inspired by the data science project structure template described here: https://medium.com/swlh/how-to-structure-a-python-based-data-science-project-a-short-tutorial-for-beginners-7e00bff14f56, I was wondering how you would improve this structure to accommodate a minimal test example (not a unit test, but rather a kind of functional test) that runs the entire project.
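For reference, the kind of layout such templates (for example, cookiecutter-data-science) typically generate looks roughly like this:

```
project/
├── data/
│   ├── raw/          <- original, immutable input data
│   └── processed/    <- cleaned data ready for modelling
├── notebooks/        <- exploratory notebooks
├── src/              <- importable package with the pipeline code
├── tests/            <- unit / functional tests
└── README.md
```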

I can imagine a main.py module in my package that runs the entire pipeline on a provided sample of data; that way, people interested in the project could run it quickly and understand it better.
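A minimal sketch of what such an entry point might look like; preprocess() and train() are hypothetical placeholders for the real pipeline stages, and data/sample/sample.csv is an assumed location for the bundled sample:

```python
"""main.py - run the whole pipeline end to end on a small bundled sample.

A sketch: preprocess() and train() are placeholders for the real pipeline
stages, and data/sample/sample.csv is a hypothetical path.
"""
from pathlib import Path

import pandas as pd

SAMPLE = Path("data/sample/sample.csv")  # small sample shipped with the project


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: apply the same cleaning steps as the production pipeline.
    return df.dropna()


def train(df: pd.DataFrame) -> dict:
    # Placeholder: fit the real model here and return a few summary metrics.
    return {"rows_used": len(df)}


def main() -> None:
    df = pd.read_csv(SAMPLE)          # load the bundled sample
    metrics = train(preprocess(df))   # run every stage, start to finish
    print(f"Functional test finished: {metrics}")


if __name__ == "__main__":
    main()
```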

But where could I put this sample data? Its size has to be a compromise: small enough that the "functional test" runs quickly, yet large enough that the output stays meaningful.

I have been told that versioning data is not a good idea (even a sample); moreover, in the case of sensitive data it would not be looked on well by our manager. Since my company has a shared hard drive with permission-based access, I was thinking of keeping the test data in that location and providing in my code a get_data.py that downloads the data into the project's data/raw folder.
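A minimal sketch of such a get_data.py, assuming the shared drive exposes the file over HTTP; the URL is hypothetical, and for a mounted network share the download call could be replaced with shutil.copy:

```python
"""get_data.py - fetch the test sample into data/raw.

A sketch assuming the shared drive serves the file over HTTP; the URL is
hypothetical. For a mounted network share, replace urlretrieve with
shutil.copy.
"""
from pathlib import Path
from urllib.request import urlretrieve

SAMPLE_URL = "https://shared-drive.example.com/project/sample.csv"  # hypothetical
RAW_DIR = Path("data/raw")


def get_data(url: str = SAMPLE_URL, dest_dir: Path = RAW_DIR) -> Path:
    """Download the sample file into dest_dir and return its path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / url.rsplit("/", 1)[-1]
    if not dest.exists():           # skip the download if already fetched
        urlretrieve(url, str(dest))
    return dest


if __name__ == "__main__":
    print(f"Sample data saved to {get_data()}")
```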

How does that sound to you? What are the best practices followed by data scientists?

question from: https://stackoverflow.com/questions/66060335/how-datascientists-organize-code-structure-in-order-to-provide-minimal-reusabl


1 Reply

  1. Set up a good .gitignore to make sure you do not upload any data (or anything similar) to the server. It is also good to avoid committing notebooks to the repo.
  2. For the generic contents of the repo, see the section "Generated Project Contents". I would say it is good to have tests, scripts, notebooks, and modules where you place the reusable functions you may want again in the future.
  3. I think it could be interesting to include some fake data. Faker is worth a look for creating samples: https://pypi.org/project/Faker/ (see the sketch after this list).
  4. doctest can be good for testing and documentation at the same time (also shown in the sketch below).
  5. Sphinx could be good for documentation.
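A minimal sketch combining points 3 and 4, assuming Faker and pandas are installed (pip install faker pandas); the column names are invented for illustration:

```python
"""make_fake_sample.py - generate a small fake dataset, verified by doctests.

A sketch assuming `pip install faker pandas`; the column names are invented
for illustration.
"""
import pandas as pd
from faker import Faker


def make_fake_sample(n_rows: int = 100, seed: int = 0) -> pd.DataFrame:
    """Return a reproducible fake customer table.

    >>> df = make_fake_sample(n_rows=3)
    >>> list(df.columns)
    ['name', 'email', 'signup_date']
    >>> len(df)
    3
    """
    fake = Faker()
    Faker.seed(seed)  # seed so every run produces the same sample
    return pd.DataFrame(
        {
            "name": [fake.name() for _ in range(n_rows)],
            "email": [fake.email() for _ in range(n_rows)],
            "signup_date": [fake.date_this_decade() for _ in range(n_rows)],
        }
    )


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # run the embedded doctests as a quick check
```

Seeding Faker keeps the generated sample reproducible, so the functional test gives the same output on every machine.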

Overall, I would say that most of the situations you are going to face have already been faced by other people, so try to contribute to, or follow, some good packages and libraries, and check what those projects are doing and how they do it.

For the particular case of a data pipeline, things can get complex, because different people may be involved and different technologies used (databases, Python, R, ...). In general it is good to have good documentation, as above. In my experience, I would try to keep everything in one place, such as a single script or similar. I would also add warnings and errors to surface problems and improve the pipeline as much as possible.
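A minimal sketch of that last point, using the standard logging and warnings modules inside a pipeline stage; the 20% missing-value threshold and the check itself are illustrative:

```python
"""Sketch of defensive checks inside a pipeline stage, using the standard
logging and warnings modules; the threshold is an arbitrary illustration.
"""
import logging
import warnings

import pandas as pd

logger = logging.getLogger(__name__)


def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    if df.empty:
        # Hard failure: downstream stages cannot run on an empty frame.
        raise ValueError("Pipeline received an empty DataFrame")
    missing_ratio = df.isna().mean().mean()
    if missing_ratio > 0.2:
        # Soft failure: keep going, but make the issue visible.
        warnings.warn(f"{missing_ratio:.0%} of values are missing")
    logger.info("Validated %d rows, %d columns", *df.shape)
    return df
```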

