
I feel the preparation and loading of the dataset has been abstracted too far away. I have no idea what data format I need or how it gets loaded (is it using a pre-prepared Hugging Face dataset?). If I have local data, how should it be loaded? What does that even look like? Is it expecting some sort of JSON?

When you abstract every step down to a one-liner that downloads a prepared dataset from Hugging Face, with no example of doing the same with a custom local dataset, you've abstracted too far to be useful for anyone other than the first user.



Thanks for the question. This is built for ML researchers, so the examples use the de facto source researchers draw datasets from, the Hugging Face Hub.

However, there is a lot of documentation on the site to help guide users. This documentation page shows how you can load data from local sources as well: JSON, CSV, text files, a local HF Dataset folder, or even a Python `dict` or `list`:

https://datadreamer.dev/docs/latest/datadreamer.steps.html#t...
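To make the "what does local data even look like" part concrete: here's a minimal sketch of a local JSON Lines file of preference pairs, written and read back with only the standard library. The field names (`prompt`/`chosen`/`rejected`) are a common convention I'm assuming for illustration, not DataDreamer's required schema; check the docs page above for the exact loader arguments.

```python
import json
import os
import tempfile

# Hypothetical local preference dataset: one JSON object per line (JSONL).
records = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
    {"prompt": "Capital of France?", "chosen": "Paris", "rejected": "Lyon"},
]

path = os.path.join(tempfile.mkdtemp(), "pairs.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading it back yields the same list of dicts, which is also the shape
# a "load from a Python list" code path would accept.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)  # True
```

A JSONL file like this is the kind of thing the local-file loaders in the docs are pointed at.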

We'll definitely keep improving the documentation, guides, and examples. We have a lot already, and more to come! This has only recently become a public project :)

If anyone has any questions on using it, feel free to email me directly (email on the site and HN bio) for help in the meantime.


I did glance at the docs before commenting, but I was looking under 'datasets' to try to understand importing a potential CSV/JSON etc., and all I saw was verbiage on accessing the output.

I would not have guessed that the base input data processing would be filed under 'steps'. Now I kinda see how you're working, but I admit I'm not the target audience.

If you want this to really take off for people outside a very, very specific class of researchers... set up an example on your landing page that loads a local JSON of user prompts/accepted/rejected answers through datadreamer.steps.JSONDataSource and fine-tunes a Llama model with it. Or a txt file with the system/user/assistant prompts tagged and examples given. Yes, the 'lines of code' in your front-page example may grow a bit!

Maybe a lot of the 'ML researchers' you're targeting are used to this kind of super-abstract OOP, load-it-from-Hugging-Face API, but know that there are plenty who aren't.


That's totally fair and good feedback. It's hard to support everyone's use cases simultaneously, but from my own research and from the researchers we collaborate with, this solves and streamlines the right set of problems. That said, we want to make it as broadly useful as possible. Always happy to chat more or provide support; feel free to reach out if you try it and run into any sharp edges I could help smooth out.


The dataset is here I presume: https://huggingface.co/datasets/Intel/orca_dpo_pairs

You can look at the samples. Mostly it's questions and accepted/rejected answers.
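For readers skimming the thread: a sketch of what one such accepted/rejected pair looks like as a plain Python record. The field names below follow common DPO conventions and are an assumption for illustration; check the dataset card linked above for the actual column names.

```python
# Illustrative DPO-style record: a question plus an accepted ("chosen")
# and a rejected answer. Field names are a common convention, not
# verified against this specific dataset's columns.
pair = {
    "question": "Summarize the water cycle in one sentence.",
    "chosen": "Water evaporates, condenses into clouds, and falls back as precipitation.",
    "rejected": "The water cycle is when water goes places.",
}

# A preference-tuning trainer consuming such pairs needs all three fields.
required = {"question", "chosen", "rejected"}
assert required <= set(pair)
print(sorted(pair))
```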




