
I feel the preparation and loading of the dataset has been abstracted too far away. I have no idea what data format I need or how it gets loaded (is it using a pre-prepared Hugging Face dataset?). If I have local data, how should it be loaded? What does that even look like? Is it expecting some sort of JSON?

When you abstract every step down to a one-liner that downloads a prepared dataset from Hugging Face, with no example of doing the same with a custom local dataset, you've abstracted too far to be useful for anyone other than the first user.



Thanks for the question. This is built for ML researchers, so the examples use the de facto source researchers draw datasets from, the Hugging Face Hub.

However, there is a lot of documentation on the site to help guide users. This documentation page shows how you can load data from local sources as well: JSON, CSV, text files, a local HF Dataset folder, or even a Python `dict` or `list`:

https://datadreamer.dev/docs/latest/datadreamer.steps.html#t...
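To make the "what does local data even look like" part concrete: here's a minimal sketch of a local JSON Lines file of preference pairs, written and read back with only the standard library. The field names (`prompt`/`chosen`/`rejected`) are a common convention I'm assuming for illustration, not DataDreamer's required schema; check the docs page above for the exact loader arguments.

```python
import json
import os
import tempfile

# Hypothetical local preference dataset: one JSON object per line (JSONL).
records = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
    {"prompt": "Capital of France?", "chosen": "Paris", "rejected": "Lyon"},
]

path = os.path.join(tempfile.mkdtemp(), "pairs.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading it back yields the same list of dicts, which is also the shape
# a "load from a Python list" code path would accept.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)  # True
```

A JSONL file like this is the kind of thing the local-file loaders in the docs are pointed at.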

We'll definitely keep improving the documentation, guides, and examples. We have a lot already, and more to come! This has only recently become a public project :)

If anyone has any questions on using it, feel free to email me directly (email on the site and HN bio) for help in the meantime.


I did glance at the docs before commenting, but I was looking under 'datasets' to try to understand importing a potential CSV/JSON etc., and all I saw was verbiage on accessing the output.

I would not have guessed that the base input data processing would be filed under 'steps'. Now I kinda see how you're working, but I admit I'm not the target audience.

If you want this to really take off for people outside a very, very specific class of researchers... set up an example on your landing page that loads a local JSON of user prompts/accepted/rejected answers through datadreamer.steps.JSONDataSource and fine-tunes a Llama model with it. Or a txt file with the system/user/assistant prompts tagged and examples given. Yes, the 'lines of code' in your front-page example may grow a bit!

Maybe a lot of the 'ML researchers' you're targeting are used to this kind of super-abstract OOP, load-it-from-Hugging-Face API, but know that there are plenty who aren't.


That's totally fair and good feedback. It's hard to support everyone's use cases simultaneously, but from my own research and from the researchers we collaborate with, this solves and streamlines the right set of problems. That said, we want to make it as broadly useful as possible. Always happy to chat more or provide support; feel free to reach out if you try it and run into any sharp edges I could help smooth out.


The dataset is here I presume: https://huggingface.co/datasets/Intel/orca_dpo_pairs

You can look at the samples. Mostly it's questions and accepted/rejected answers.
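For readers skimming the thread: a sketch of what one such accepted/rejected pair looks like as a plain Python record. The field names below follow common DPO conventions and are an assumption for illustration; check the dataset card linked above for the actual column names.

```python
# Illustrative DPO-style record: a question plus an accepted ("chosen")
# and a rejected answer. Field names are a common convention, not
# verified against this specific dataset's columns.
pair = {
    "question": "Summarize the water cycle in one sentence.",
    "chosen": "Water evaporates, condenses into clouds, and falls back as precipitation.",
    "rejected": "The water cycle is when water goes places.",
}

# A preference-tuning trainer consuming such pairs needs all three fields.
required = {"question", "chosen", "rejected"}
assert required <= set(pair)
print(sorted(pair))
```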




