Generating & Preparing JSONL Training Data with GPT for Fine Tuning an OpenAI Model (2024)

Instructor: [0:00] When working with different AI models, you're limited to the set of data that the models were trained on. Fine tuning gives us a way to customize these models in a way that's going to train it on the data that we want it to understand and know. But before we're able to create a fine tuning mechanism, we first need to have our data prepared in a way that the fine tune will actually understand.

[0:17] In particular, we're going to use the JSONL syntax where each line will contain its own JSON object, complete with a prompt and a completion. Now to start, I created a bunch of fake data, which is something that OpenAI wouldn't know about, which is a bunch of sci-fi characters.

[0:31] To do this, I used methods similar to what we did in past lessons where I created a specific prompt that's going to ask ChatGPT to create that character data in JSON format. If you want to follow along or try this out for yourself, you can find a link to the code on the lesson video page.

[0:46] Now before we dive in too far with our specific example, you should be known that OpenAI might recommend actually using embeddings rather than using fine tuning depending on the specific use case. You should also know, the more prompts you're able to give as part of your example, the better results you're going to have with your trained model.

[1:02] For the sake of this example, I'll be using a smaller data set so I'll be limited in the amount of prompts and training data I'll be able to produce. Starting off, while you're probably able to set this up inside of a serverless function, this is really going to be a one-time use thing unless we're building a service around this.

[1:17] We're going to actually just start off by building a script. Inside of my scripts directory, I'm going to create a new file called generate training data.js. Inside that file, I added some boilerplate code including requiring my local environment variable file. I'm going to require the OpenAI SDK. I'm going to configure it and I'm going to create a new instance of OpenAI API.

[1:38] As far as what's inside of the script, I'm creating a new asynchronous function called run where I'm going to immediately invoke it, where I'm going to then use the create chat completion method in order to send my message.

[1:47] Now, as is the same for a lot of the chat completion lessons we've worked through, I'm going to first paste in the first line of our prompt where we're going to first say, you are an assistant that generates JSONL prompts based off of JSON data for fine tuning.