Once you have collected your gossip, you must format it in a way that your computer understands.
.txt file
There are many formats a digital dataset can use: JSON for the web, CSV for spreadsheets. In this example, we are going to use a .txt file, one you can create with the TextEdit app on Mac or with Notepad on Windows. Just remember to choose File > Save As and name the file gossip.txt. You can also use Google Docs and download your file as .txt.
In your app, write your gossip one entry per line, filling the text document like this:
I heard that he cheated on her sister with the cousin of her aunt
I heard that they lost the funding because of their lack of vision
I heard that the festival will run out of money
I heard that she decided to quit her studies because she couldn't pass the exams
...
Aim for 100-200 lines of gossip.
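If you would like to check your file before moving on, a few lines of Python can count your entries. This is only a minimal sketch: it assumes your file is named gossip.txt and sits in the folder where you run the script, and the helper name `count_entries` is made up for illustration.

```python
from pathlib import Path


def count_entries(text):
    """Return the non-empty lines of the dataset, stripped of extra spaces."""
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    path = Path("gossip.txt")  # assumed file name and location
    if path.exists():
        entries = count_entries(path.read_text(encoding="utf-8"))
        print(f"{len(entries)} lines of gossip")
        if not 100 <= len(entries) <= 200:
            print("Aim for 100-200 lines.")
    else:
        print("gossip.txt not found in this folder.")
```

Blank lines are ignored, so stray empty lines at the end of the file do not inflate the count.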
Language models use the same logic as your iPhone’s autocomplete: they recognise patterns of words and estimate the most probable next word.
Your model knows “I heard that” from your dataset, and it knows that it should be followed by a line of gossip.
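To make the autocomplete comparison concrete, here is a toy sketch in Python. It is far simpler than a real language model: it only counts which word follows which in a handful of example lines and picks the most frequent one. The function names and the three sample lines are chosen purely for illustration.

```python
from collections import Counter, defaultdict


def build_counts(lines):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def most_probable_next(following, word):
    """Return the word seen most often after `word`, or None if unseen."""
    options = following.get(word.lower())
    if not options:
        return None
    return options.most_common(1)[0][0]


lines = [
    "I heard that the festival will run out of money",
    "I heard that they lost the funding because of their lack of vision",
    "I heard that she decided to quit her studies because she couldn't pass the exams",
]
counts = build_counts(lines)
print(most_probable_next(counts, "heard"))  # "that" follows "heard" in every line
```

Because every sample line begins the same way, the counts for "heard" point only to "that", which is the kind of pattern a repeated sentence starter reinforces.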
To help this process, use the same sentence starter across all your lines. Start your gossip entries with a phrase like “They said that”, “I heard that”, or “A little birdie told me that”.
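A short Python check can list the lines that do not begin with your chosen starter. This is a sketch under assumptions: the function name `find_off_pattern` is hypothetical, and the starter is whichever phrase you picked for your own dataset.

```python
def find_off_pattern(lines, starter="I heard that"):
    """Return (line_number, text) for each non-empty line missing the starter."""
    off = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Compare case-insensitively so "i heard that" still counts.
        if text and not text.lower().startswith(starter.lower()):
            off.append((number, text))
    return off


sample = [
    "I heard that the festival will run out of money",
    "They lost the funding because of their lack of vision",
]
for number, text in find_off_pattern(sample):
    print(f"Line {number} does not start with the starter: {text}")
```

To run it on your own file, you could pass in the lines read from gossip.txt instead of the two-line sample.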
Now that your dataset is ready, let’s go fine-tune!