GPT Application Development and Thoughts

7/31/2023

Over the past few months, we seem to have been in the middle of an AI revolution. Besides OpenAI's ChatGPT, which most people know, many novel, interesting, and practical AI applications have emerged, and using them has genuinely improved my productivity.

However, there seem to be few resources on GPT application development knowledge and learning roadmaps, so I decided to organize my experience and thoughts into a series that I hope will help others.

This article mainly shares my thoughts on GPT-related application development. In April this year, I picked up GPT-related technical knowledge while developing the open-source project ChatFiles, but due to limited time and energy I never organized that knowledge into an article. Recently I had new ideas and open-sourced VectorHub, another project built on GPT prompts and embeddings. Working on it gave me a deeper understanding of GPT and embedding technologies, which led to this post.

Starting with Prompts

AI application development has attracted many developers recently. Besides the well-known ChatGPT, many valuable AI applications have emerged, such as AI translation apps like openai-translator and immersivetranslate, writing apps like Notion AI, and programming assistance apps like GitHub Copilot and GitHub Copilot Chat.

Some of these applications optimize existing experiences: GPT-based translation in openai-translator, for instance, delivers far better translation quality and reading experience than earlier machine translation. Others provide previously impossible capabilities: GitHub Copilot's code completion and generation, and GitHub Copilot Chat's ability to answer coding questions, explain code, generate unit tests, and suggest fixes for buggy code. These capabilities were unimaginable before.

Although these applications have different functions, they are all mainly implemented based on GPT Prompts. A prompt is text or instructions provided to the model to guide it in generating natural language output (completion). It provides context to the model and is crucial for the model's output results.

We know that GPT (Generative Pre-trained Transformer) is a language model built in two stages: pre-training and fine-tuning.

In the pre-training stage, the model is trained on a large-scale corpus such as Wikipedia, news articles, and novels. After training, when you input a sentence, the model makes a probabilistic prediction of which word should come next, selected according to the knowledge learned during pre-training. By repeating this next-word prediction, it can string together an entire paragraph. This is why it's called generative AI.

The input sentence is what we call a prompt, and it is the starting point of this probabilistic generation. That also explains why the same prompt can produce different results each time: each result is sampled from a probability distribution.

Now we can see why prompts matter so much for GPT application development. Besides fine-tuning, the prompt is the only way we can interact with GPT models (we can also make GPT's output more diverse or creative by adjusting the model's temperature and top_p settings, but these don't significantly affect output quality or downstream task handling). Prompts are therefore the core of GPT application development and what developers need to think about and optimize the most.
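As a minimal sketch of those sampling knobs (assuming the 2023-era openai Python SDK and an OPENAI_API_KEY environment variable; the prompt is made up):

import openai  # 2023-era SDK (openai.ChatCompletion); assumes OPENAI_API_KEY is set

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain what a prompt is in one sentence."}],
    temperature=0.2,  # lower = more deterministic, higher = more diverse/creative
    top_p=1.0,        # nucleus sampling; usually tune temperature or top_p, not both
)
print(response["choices"][0]["message"]["content"])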

After pre-training comes the fine-tuning stage, in which the GPT model is further trained on datasets for specific tasks. Fine-tuning lets the model better understand prompts and generate task-related text, adapting it to tasks like text classification, sentiment analysis, and question answering. However, fine-tuning is currently not a good choice for most GPT developers due to its high cost and unstable results, so most GPT applications today are built on prompts.

Prompt Learning Path

For basic prompt knowledge, start with Andrew Ng's ChatGPT Prompt Engineering for Developers course. In less than two hours of video, you can quickly grasp how prompts are used and why they are so powerful.

After getting the basics, I recommend the Prompt Engineering Guide. Besides covering prompt fundamentals, it collects thoughts from industry and academia about where prompting is heading, which is precious input for anyone building AI applications.

Finally, I highly recommend OpenAI's official GPT Best Practices document. It contains many prompt examples and usage tips that OpenAI distilled from partnerships, hackathons, and GPT application development across different business domains, and it is very inspiring for developers.

Prompt Best Practices

For prompt-writing best practices, the document I recommend most is OpenAI's official GPT Best Practices. Below, I share some GPT development practices that combine that guide with my own experience.

Clear and Detailed

In reality, most developers use GPT mainly to solve programming problems or ask questions, so they tend to carry their habits from Google and other search engines over to using and developing with GPT.

For example, when you want to know how to write a Fibonacci sequence in Python, if you were using Google you might type python fibonacci. That is enough, because Google is built on inverted indexes and the PageRank algorithm: a few keywords are all it takes to surface high-quality web pages.

So this two-word query is the simplest and most efficient approach. Even if you type more, like how to write python fibonacci, the difference in result quality on Google is minimal.

With GPT, however, an input like python fibonacci is very unfriendly: the model cannot clearly infer your intent, so it may produce unrelated results (this varies somewhat with model quality).

But if you input Write a Python function to efficiently calculate the Fibonacci sequence. Comment each line of code to explain what each part does and why it's written this way, the prompt is clear and detailed. GPT can understand your intent precisely, so its results will be more accurate, raising both the floor and the ceiling of output quality.

This is completely different from developers' prior experience with Google and other search engines, and it is the most easily overlooked point for GPT developers and users. When I built ChatFiles early this year, I collected anonymized user prompts and found that over 95% of users wrote extremely short prompts, as if every extra word cost them money.

So when building GPT applications, developers must pay close attention to prompt clarity and detail: iterate a few times and choose a prompt that produces stable output quality and a consistent format. This is key to the overall quality of a GPT application.
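One cheap way to check that stability is simply to run the candidate prompt several times and compare the outputs. A rough sketch (same 2023-era openai SDK assumption; the prompt is the Fibonacci example from above):

import openai

PROMPT = (
    "Write a Python function to efficiently calculate the Fibonacci sequence. "
    "Comment each line of code to explain what each part does."
)

# Run the same prompt a few times and eyeball output stability and format drift.
for i in range(3):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # minimize randomness while comparing prompt variants
    )
    print(f"--- run {i + 1} ---")
    print(resp["choices"][0]["message"]["content"])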

Handling Complex Tasks

For simple tasks, I believe any developer can design good prompts and get decent output quality by spending some time iterating. For complex tasks, however, improving GPT's output quality relies on two important techniques: making GPT reason instead of answering immediately, and breaking tasks down to guide the model.

Reasoning Instead of Answering

Reasoning instead of answering means instructing the GPT model in the prompt not to judge correctness or give an answer immediately, but to think the problem through first. You can ask it to list different views on the problem, break the task down, and explain the reasoning behind each step before reaching a conclusion. Adding step-by-step reasoning requirements to the prompt lets the language model spend more of its output on logical thinking, making the results more reliable and accurate.

Here's an example from OpenAI. Suppose you need GPT to judge whether a student's solution is correct. If the prompt is simply Judge whether the student's solution is correct, GPT has a high probability of answering wrongly on complex calculations, because rather than reasoning first and answering second as a human would, it jumps straight to a verdict, and a snap judgment can't be right (just as humans can't do complex math in an instant).

Instead, the prompt can be: First solve the problem yourself, then compare your solution with the student's solution and evaluate whether the student's solution is correct. Don't determine if the student's solution is correct before completing the problem yourself. By giving clear guidance and conditions like this, the GPT model spends more of its output deriving the answer and produces more accurate results.
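A minimal sketch of that tactic as a system prompt (wording adapted from OpenAI's example; the problem and solution placeholders are left for you to fill in):

import openai

system_prompt = (
    "First work out your own solution to the problem. Then compare your solution "
    "to the student's solution and evaluate whether the student's solution is "
    "correct. Don't decide if the student's solution is correct until you have "
    "done the problem yourself."
)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        # Placeholders: substitute the real problem statement and student answer.
        {"role": "user", "content": "Problem: ...\n\nStudent's solution: ..."},
    ],
)
print(resp["choices"][0]["message"]["content"])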

Task Breakdown

Task breakdown means splitting a complex task into multiple subtasks, guiding the GPT model to reason about each one separately, and finally integrating the subtask results into the final result. The benefit is that the model can focus on one subtask at a time, improving output quality.

Here's a simplified example: when you need to summarize a book, asking GPT for one overall summary works poorly. Instead, run a series of subtasks that summarize each part, then aggregate the generated summaries, as sketched below.
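A rough sketch of that split-then-aggregate approach (naive fixed-size splitting for brevity; book.txt and the chunk size are assumptions):

import openai

def summarize(text: str) -> str:
    """One GPT call that summarizes a single piece of text."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Summarize the following text:\n\n{text}"}],
    )
    return resp["choices"][0]["message"]["content"]

book_text = open("book.txt").read()  # hypothetical source document

# Subtasks: naive fixed-size split (a real splitter would respect chapter
# boundaries), then one summary per chunk.
chunks = [book_text[i:i + 8000] for i in range(0, len(book_text), 8000)]
partial_summaries = [summarize(chunk) for chunk in chunks]

# Integration step: aggregate the per-chunk summaries into the final summary.
final_summary = summarize("\n\n".join(partial_summaries))
print(final_summary)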

Of course, task breakdown brings new problems: when a single subtask's output quality suffers, the overall output suffers with it, and since tokens are currently expensive, the extra calls add cost. Regardless, how to design and decompose complex tasks is the core question every GPT application needs to think through, and the key design point for maintaining an AI application's moat. It is also the central design concern of current LLM frameworks like LangChain. I may write a separate article on this when I have time.

Usage Tips

Besides the important practices above, here are some small tips that are very helpful when developing applications (a combined sketch follows the list):

  • Provide few-shot examples: Give the model one or two expected input-output samples so the model understands our requirements and expected output format.
  • Require structured output in prompts: Output in JSON format for easier subsequent code processing.
  • Separators: Use separators like """ to isolate different instructions and contexts, preventing system prompts from conflicting with user input prompts.
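Here is the combined sketch mentioned above: one prompt that uses a few-shot example, requires JSON output, and isolates user input with triple-quote separators (the review text and JSON schema are made up for illustration):

import openai

user_review = "Screen broke after a week."  # hypothetical user input

# One prompt combining all three tips: a few-shot example, a required JSON
# format, and triple-quote separators isolating user input from instructions.
prompt = (
    'Extract the sentiment of the review as JSON like {"sentiment": "positive"}.\n\n'
    'Example:\n'
    'Review: """The battery lasts two days, amazing."""\n'
    'Output: {"sentiment": "positive"}\n\n'
    f'Review: """{user_review}"""\n'
    'Output:'
)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(resp["choices"][0]["message"]["content"])  # e.g. {"sentiment": "negative"}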

RAG Application Development

Above, we mainly introduced how to develop AI applications based on prompts, but sometimes we run into new problems. Large models' training data often ends months or even years in the past, so when a GPT application needs the latest data, such as answering questions about recent news or private documents, the model has never been trained on those materials and cannot answer on its own.

In this case, we can have the model use reference text to answer questions. For example, our prompt can be:

You will be given a document delimited by triple quotes and a question.

Your task is to answer the question using the provided document and cite the text passages used to answer the question.

If the document doesn't contain the information needed to answer the question, simply write: "Insufficient information."

If providing an answer to the question, it must be accompanied by citation annotations.
Use the following format to cite relevant passages ({"citation": …}).

This pattern lets GPT answer based on reference text we provide. For example, to ask who the latest World Cup champion is, we can attach recent World Cup news as the reference text; GPT will read the news first and then answer. In this way, we can work around the model's staleness and handle specific downstream tasks.

But this approach brings another problem: the length of the reference text. GPT prompts have size limits; the gpt-3.5-turbo model, for example, has a 4K token limit (roughly 3,000 words), meaning a user can supply at most about 3,000 words for GPT to read and reason over.

So once the reference text exceeds roughly 3,000 words, we can't get an answer from GPT in one pass. Instead, developers split the reference text into chunks, convert each chunk into a vector, and store the vectors in a vector database (for more background, see my other blog post, Vector Database). To answer a question about the reference text, we first convert the question into a vector, retrieve the most similar entries from the vector database, and map the retrieved vectors back to their original text. This yields reference text that both fits GPT's token limit and is relevant to the question.

We then submit the question together with this related reference text to GPT and get the answer we want. This process is the core of RAG application development: use vector retrieval to find the text segments most relevant to the question, bypassing GPT's token limits.

Embedding

The overall development process is as follows:

  1. Load documents and get target text information. For example, File Loader and Web Loader in mainstream LLM framework LangChain.
    1. File system-based File Loader, like loading PDF files, Word files, etc.
    2. Network-based Web Loader, like web pages, AWS S3, etc.
  2. Split target text into multiple paragraphs. Splitting methods mainly fall into the following types:
    1. Text quantity-based splitting, like 1000 words per paragraph. The advantage is simplicity, but the disadvantage is possibly splitting one paragraph into multiple paragraphs, reducing paragraph coherence and potentially reducing answer quality due to lack of context.
    2. Punctuation-based splitting, like using line breaks as separators. The advantage is good paragraph coherence, but the disadvantage is varying paragraph sizes that might trigger GPT token limits.
    3. GPT token limit-based splitting, like 2000 tokens per group. When querying, search for the two most relevant paragraphs, so adding them together is only 4000 tokens, not triggering the 4096 token limit.
  3. Store all split text blocks in vector database.
  4. Convert user questions to vectors, then retrieve through vector database to get the most relevant text paragraphs. (Note: this retrieval isn't traditional database fuzzy matching or inverted indexing, but semantic search capability, so retrieved text can answer user questions. For details, see another blog post Vector Database)
  5. Combine retrieved relevant text information, user questions, and system prompts into a RAG-specific prompt. For example, the prompt explicitly instructs the model to answer from the reference text rather than from its own knowledge.
  6. GPT answers the final prompt to get the final answer.

If you're interested in specific implementation code, check out LangChain's Retrieval chapter.
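Before diving into those docs, here is a minimal end-to-end sketch of steps 1-6 using 2023-era LangChain APIs (module paths may differ in later versions; report.pdf and the chunk sizes are assumptions):

# pip install langchain openai faiss-cpu pypdf; assumes OPENAI_API_KEY is set.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Steps 1-2: load the document and split it into overlapping chunks.
docs = PyPDFLoader("report.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Step 3: embed each chunk and store the vectors (FAISS as a local vector store).
db = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Steps 4-6: embed the question, retrieve the most relevant chunks, and let the
# chain build a RAG prompt from the question plus the retrieved context.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=db.as_retriever(search_kwargs={"k": 2}),  # top-2 relevant chunks
)
print(qa.run("What are the key findings of the report?"))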

RAG applications can answer questions that plain GPT cannot in certain scenarios; this is their biggest advantage. No training, no fine-tuning: just convert text to vectors and retrieve it at low cost, which makes whole categories of business scenarios feasible.

But this approach also brings new problems. RAG's text splitting and retrieval quality greatly affect the final results. How do you balance query scope, quality, and latency? What do you do when the retrieved reference text can't answer the user's question? These are all issues developers need to weigh carefully against their business requirements and scenarios.

Thinking further: throughout human history, documents were written for humans, which is not the optimal organization for vector retrieval. Will there be text written specifically for AI and retrieval in the future, to help AI understand it better and match database retrieval more closely? These questions need long-term thinking and verification, so I won't expand on them here.

GPT Agents Application Development

Besides prompt-based and RAG solutions, GPT application development has another very common need: integrating with existing systems and their APIs.

The software industry has been developing for many years, and many companies have their own systems and APIs. These APIs can greatly expand the capability boundary of GPT applications; for GPT applications to land in real-world scenarios, integrating with existing systems is inevitable.

Suppose you want to ask GPT a very simple question: What's the weather in Beijing today? From the sections above, we know GPT can't answer this by itself. But if GPT could automatically call a weather API, building such features would be very convenient.

To have GPT call a weather API, we face two problems: making the GPT application understand what the API does and call it at the right moment, and enforcing structured input/output so the overall system stays stable.

Understanding and Calling Existing APIs

To make GPT understand what an API does, the best way is for developers to give each API a name and a detailed description, including the structure of its inputs and outputs and the meaning of each field. These descriptions strongly influence GPT's judgment and its final decision on whether to call the API.

In OpenAI's function calling example, the weather API is described as follows:

function_descriptions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {
                    "type": "string",
                    "description": "The temperature unit to use. Infer this from the user's location.",
                    "enum": ["celsius", "fahrenheit"],
                },
            },
            "required": ["location"],
        },
    }
]

This description tells the model, in detail, that calling the function requires a location and optionally a temperature unit. GPT decides whether to call the function based on how well the question matches the function description. I may write another article on the code details when I have time.
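Still, a rough sketch of the full loop may help (2023-era openai SDK; function calling shipped with gpt-3.5-turbo-0613; get_current_weather here is a stubbed local function, not a real weather API):

import json
import openai

def get_current_weather(location, unit="celsius"):
    # Stub standing in for a real weather API call.
    return json.dumps({"location": location, "temperature": "22", "unit": unit})

messages = [{"role": "user", "content": "What's the weather in Beijing today?"}]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=messages,
    functions=function_descriptions,  # the descriptions defined above
    function_call="auto",             # let the model decide whether to call
)
message = response["choices"][0]["message"]

if message.get("function_call"):
    # The model returns the function name plus its arguments as a JSON string.
    args = json.loads(message["function_call"]["arguments"])
    result = get_current_weather(**args)
    # Feed the function result back so the model can phrase the final answer.
    messages.append(message)
    messages.append({"role": "function", "name": "get_current_weather", "content": result})
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=messages)
    print(final["choices"][0]["message"]["content"])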

By providing plenty of such API documentation, we help the LLM understand what each API does, how to use it, and how to combine them. The goal is to break an overall requirement into multiple subtasks, each of which interacts with a specific API, achieving automation and intelligent behavior end to end.

The existing APIs described here aren't limited to HTTP requests; they can include generating SQL queries against databases, calling SDKs for complex functionality, and, in the future, perhaps even controlling physical-world switches, robotic arms, and so on. Personally, given current GPT capabilities and trends, I believe human-computer interaction will undergo huge changes.

Structured Output

Besides making GPT understand and call existing APIs, we also need GPT's output to be understood by existing systems. This requires GPT's output to be structured, such as JSON-formatted data.

You might immediately think prompts can achieve this, right? In most cases, yes, but in some cases prompts aren't as stable as function calling, and traditional systems have very high stability requirements. Here's an example:

student_description = "Xiao Wang is a second-year student majoring in Computer Science at Peking University with a 3.8 GPA. He's very good at programming and is an active member of the university's robotics club. He hopes to pursue work in artificial intelligence after graduation."

From this paragraph, we can use prompts to require JSON format output. For example, the prompt could be:

Please extract the following information from the given text and return it as a JSON object:

name
major
school
grades
club

This is the body of text to extract the information from:
{student_description}

But it's hard to control whether GPT will ultimately output grades as 3.8 or as 3.8 GPA. The two are no different to a human, but to a computer they are completely different: the former parses as a float, the latter is a string, and in some languages the conversion would simply raise an error.

We can certainly reduce such problems with more prompt instructions, but reality is complex: it's hard to describe every requirement completely in natural language, and hard to guarantee GPT produces the same output every time. For problems like this, OpenAI's function calling can, to a large extent, bridge natural language and machine-readable formats.

The problem above can be described as a function, in which we declare the grades field with an explicit JSON-schema type to remove the ambiguity (number rather than integer, since a GPA like 3.8 isn't an integer). This structured capability is crucial for building stable systems.

student_custom_functions = [
    {
        'name': 'extract_student_info',
        'description': 'Get the student information from the body of the input text',
        'parameters': {
            'type': 'object',
            'properties': {
                'name': {
                    'type': 'string',
                    'description': 'Name of the person'
                },
                'major': {
                    'type': 'string',
                    'description': 'Major subject.'
                },
                'school': {
                    'type': 'string',
                    'description': 'The university name.'
                },
                'grades': {
                    'type': 'number',  # 'number', not 'integer': a GPA like 3.8 is not an integer
                    'description': 'GPA of the student.'
                },
                'club': {
                    'type': 'string',
                    'description': 'School club for extracurricular activities.'
                }
            }
        }
    }
]

For the complete debugging process, check out this OpenAI function calling example.
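As a short sketch of what the call itself looks like (same 2023-era SDK; forcing the function call so we always get structured arguments back):

import json
import openai

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": student_description}],
    functions=student_custom_functions,
    function_call={"name": "extract_student_info"},  # force the structured call
)
# 'arguments' is a JSON string that follows the declared schema, so 'grades'
# parses as the number 3.8 rather than the ambiguous string "3.8 GPA".
info = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
print(info["name"], info["grades"])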

GPT Application Requirements Analysis

Above, we mainly covered techniques for developing GPT applications. But to build a product, we still need to start from business requirements and think about what business value we can create to meet the needs of potential users.

2023/09/06 Update: LangChain's official documentation has been reorganized into RAG (Retrieval-Augmented Generation) and Agents sections. This suggests that, several months after the GPT explosion, the industry also sees RAG and Agents as the two main directions for implementing business requirements. RAG is what we discussed in the RAG Application Development section above, and the examples below follow these two directions. I think tracking leading frameworks like LangChain is very valuable for developers, because they distill best practices and experience from real business implementations, which helps developers understand business requirements.

Content Generation

Content generation is currently the most mainstream requirement and the highest-traffic direction among AI applications. Besides ChatGPT, which everyone knows, there are prompt-based applications like Character AI, which offers AI companions (role-playing) and draws very high traffic.

The broader content generation category also includes image generation (Midjourney), voice generation (ElevenLabs), copywriting (copy ai), and niche text generation such as the novel-writing assistant AI-Novel.

Content generation is the highest-traffic AI requirement on the internet and the easiest to implement, because its application scenarios are so broad. Since content generation assistance often directly improves productivity, willingness to pay tends to be high, and competition in this field is correspondingly the fiercest.

RAG Requirements

RAG is a direction I believe has great potential value. When ChatFiles was first open-sourced, I received inquiries asking whether it could improve existing businesses in customer service, sales, operation manuals, knowledge bases, and so on. My answer is that RAG is very promising in these scenarios.

In this field, startups like mendable ai have already captured some market share, powering document Q&A for leading GPT frameworks like LangChain and actively expanding into sales and customer-facing scenarios.

Beyond that, the highest-traffic product in this field is currently the ChatPDF website, which lets you upload a PDF and then ask questions or request summaries based on it. The direction also spawns various niche products, such as the paper-reading and collaboration assistant Jenni AI.

GPT Agents Requirements

GPT Agents requirements are diverse because they build on integration with existing systems, so we can analyze existing system APIs to determine what business value can be created. For example, many of GPT's internet-access features are implemented by integrating SerpAPI, an aggregated query service over major search engines, which lets ChatGPT answer search-dependent questions about current weather, stock markets, news, and so on.

The most famous examples are the Auto GPT and AgentGPT projects; if you're interested in GPT Agents applications, start with those two. Beyond them, companies like cal.com, which integrates AI agents so users can book meetings in natural language, may give you different inspiration.

Unstructured Input and Structured Output

Another point I think many people overlook is GPT/LLMs' ability to process unstructured data. In the past, even processing simple text took a lot of development time: for example, extracting key information such as names, phone numbers, and addresses from text messages. Because message templates differ, this information is unstructured.

There was no simple way to convert such content into structured output like JSON, so we often needed complex regular expressions or NLP techniques to extract the information. But regular expressions are hard to develop, and with many message templates that keep changing over time, they must be adjusted constantly. This consumes a great deal of development time and engineering effort.

NLP techniques, likewise, only work for specific scenarios, such as extracting phone numbers or addresses. Different scenarios need different NLP models, so once business requirements change, say, to recognizing license plates, everything must be redeveloped and retuned.

But with a unified API like OpenAI's GPT, we just need to provide a different prompt to the service, guiding GPT's reasoning toward the result we want. This greatly simplifies the development process, shortens development time, and lets us respond quickly to market changes, as in the sketch below.
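A minimal sketch of that idea (2023-era openai SDK; the message text, field names, and prompt are made up for illustration):

import openai

sms = "Hi, this is Li Lei. Please deliver to 88 Zhongguancun St, Beijing. Call 138-0000-0000."

# Swapping templates or scenarios only means editing this prompt, not rewriting regexes.
prompt = (
    'Extract the name, phone number, and address from the message below and return '
    'a JSON object with the keys "name", "phone", and "address". Use null for any '
    'missing field.\n\n'
    f'Message: """{sms}"""'
)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(resp["choices"][0]["message"]["content"])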

And thanks to GPT's powerful generalization, we only need to think about how to frame the unstructured data for each scenario. This capacity for processing unstructured data will change many established business development processes and methods, with a profound impact on the software lifecycle.

Natural Language Interaction

Finally, I want to say that GPT's ability to process natural language will deeply change human-computer interaction. From the command line to graphical interfaces to touchscreens, each shift was a revolution in human-computer interaction. Yet most people's interaction with machines still requires programmers as intermediaries: people have various needs, business analysts and developers dig out the requirements, programmers build graphical pages in code, and only then do people interact with those pages.

This process loses information and creates many limitations and costs. If machines can understand natural language and we can interact with them directly, then familiar concepts such as software, machines, and intelligence will undergo huge changes, and human-computer interaction will change dramatically.

If this uncertain change is hard to picture without a concrete reference, try the open interpreter project: it generates code from natural language and executes it directly on your computer. Although still very primitive and unstable, it offers a glimpse of the coming revolution in human-computer interaction.
