- Set up a good `.gitignore` to make sure you do not upload any data (or similarly large or sensitive files) to the remote repository. It is also good to avoid committing notebooks (see the `.gitignore` sketch after this list).
- For the generic contents of the repo, see here the section "Generated Project Contents". I would say it is good to have tests, scripts, notebooks, and modules where you place the useful functions you could reuse in the future, and so on.
- It could be interesting to include some fake data. Faker is worth a look for creating samples: https://pypi.org/project/Faker/ (see the sketch below this list).
- doctest could be a good choice, since it gives you tests and documentation at the same time (example below).
- Sphinx could be good for documentation.
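As a minimal sketch of the `.gitignore` point above, assuming a typical Python data project (the folder names such as `data/` are just examples and should match your own layout):

```gitignore
# Data and large artifacts -- keep them out of the repo
data/
*.csv
*.parquet

# Notebooks and their checkpoints
*.ipynb
.ipynb_checkpoints/

# Python build / cache files
__pycache__/
*.pyc
.venv/
```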
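For the Faker idea, a small sketch that generates a fake sample table; the record fields are assumptions chosen only for the example:

```python
from faker import Faker

fake = Faker()
Faker.seed(42)  # make the fake data reproducible

# Generate a few fake records resembling what the real data might look like
sample = [
    {"name": fake.name(), "email": fake.email(), "signup": fake.date_this_year()}
    for _ in range(5)
]

for row in sample:
    print(row)
```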
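And a small sketch of the doctest idea: the example in the docstring doubles as documentation and as a test. The function itself is just a hypothetical helper:

```python
def normalize(values):
    """Scale a list of numbers so they sum to 1.

    >>> normalize([1, 1, 2])
    [0.25, 0.25, 0.5]
    """
    total = sum(values)
    return [v / total for v in values]


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # or run: python -m doctest this_file.py -v
```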
Overall, I would say most of the situations you are going to face have already been faced by other people, so it may be worth contributing to or following some good packages / libraries such as the ones below.
It could be good to check what they are doing or have already done.
For the particular case of the data pipeline, things can get complex, because different people and different technologies (databases, Python, R, ...) may be involved. In general, good documentation helps a lot, as mentioned above. In my experience, I would try to keep everything in one place, e.g. a single script or similar. I would also add warnings and errors so that problems surface early and the pipeline improves over time (a minimal sketch below).
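A minimal sketch of that "warnings and errors" point; the checks, the `id` column, and the use of pandas are hypothetical, just to illustrate the pattern of failing fast on fatal issues and warning on suspicious ones:

```python
import logging
import warnings

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


def load_and_check(path: str) -> pd.DataFrame:
    """Load a CSV and fail fast (or warn) when the data looks wrong."""
    df = pd.read_csv(path)
    logger.info("Loaded %d rows from %s", len(df), path)

    # Hard error: a missing required column should stop the pipeline
    if "id" not in df.columns:
        raise ValueError(f"Expected an 'id' column in {path}")

    # Soft warning: suspicious but not necessarily fatal
    if df["id"].duplicated().any():
        warnings.warn("Duplicated ids found; downstream joins may misbehave")

    return df
```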