Building a Safer LLM: Embracing Safety by Design for Responsible AI

October 27, 2023

In the midst of the generative AI revolution, likened by some to the industrial revolution of the past, we find ourselves with immense opportunities for development and innovation. Generative AI has transformed the way we interact with technology in myriad ways, from enhanced content creation to personalization to sophisticated conversational chatbots reshaping customer service. However, alongside this potential lies a new class of threats. Just as these models empower content creators, they also boost the productivity of malicious actors, for example by writing polymorphic malware or sharpening social engineering for phishing. High-quality fake content, whether it be misinformation, child exploitation, or fraud, has real-world consequences that reverberate through society. Recently, a single AI-generated image of an explosion at the Pentagon caused a brief dip in the S&P 500, illustrating the potential impact of AI-generated content on our interconnected world.

The ease of generating high-quality malicious content, coupled with the real-world consequences it can induce, highlights the importance of safety by design in AI systems. Safety by design refers to the deliberate effort to incorporate safety considerations at every stage of the development process. Rather than treating safety as an afterthought or a reactive measure, safety by design is a proactive approach that prioritizes user well-being, protection against harm, and the prevention of misuse or unintended consequences. In this blog post, I will discuss safety by design specifically as it relates to Large Language Models (LLMs), AI models that generate text.

While proprietary LLMs currently outperform open-source models at text generation, advances in data curation and cost-effective fine-tuning techniques like LoRA are driving significant performance improvements in smaller, open-source LLMs, suggesting that open-source models will eventually rival the capabilities of proprietary ones. Major companies are taking note, as evidenced by leaked internal discussions about whether proprietary models retain any moat against this shift. Yet while companies investing in proprietary models put substantial resources into model safety, smaller organizations seeking to adopt open-source generative AI models into their workflows are left to navigate the safety challenges on their own.

I will share strategies to enhance safety on both the model's input and output, while also considering feedback and context. By implementing these measures, we can strike a balance between the creative potential of LLMs and user safety.

Safety on the input

One important aspect of building a safer LLM is to ensure safety on the input side. This means moderating the prompts that can be provided to the model, thereby preventing the model from generating harmful or inappropriate content. One strategy for safety on the input is to curate an exclusion set of words, phrases or topics that are disallowed as prompts because they are known to generate unsafe or undesirable responses. By checking the user’s prompt against this set, you can reject or filter out prompts that violate the defined safety guidelines. However, this approach is limited: bad actors will find ways around the exclusion set by crafting novel prompts that fall outside it.
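
To make this concrete, here is a minimal sketch of an exclusion-set check on the input. The blocked terms and the helper name are purely illustrative; a real deployment would use a curated, policy-specific list.

```python
import re

# Illustrative entries only; a real exclusion set would be curated per policy.
BLOCKED_TERMS = {"build a bomb", "credit card dump"}

def violates_exclusion_set(prompt: str) -> bool:
    """Return True if the prompt contains any disallowed term."""
    normalized = re.sub(r"\s+", " ", prompt.lower())
    return any(term in normalized for term in BLOCKED_TERMS)

prompt = "How do I build a bomb for a chemistry project?"
if violates_exclusion_set(prompt):
    print("Prompt rejected by the input filter.")
```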

A more sophisticated approach is to utilize a risk score model to classify prompts as violative of particular policies. These models are often based on trained language models, which means they can classify a prompt as containing sexual content or hate speech based on language usage, as opposed to simply identifying violative keywords in the prompt. The risk score can help flag potentially unsafe prompts for manual review or filtering. However, this approach will only flag prompts that are violative in nature, and will miss prompts that are seemingly benign but generate violative content, because prompts are inherently different from the text data these models are typically trained on. Moderation can then either be handled by adapting the risk score models, training them specifically on prompts, or by moderating the LLM output, which is discussed in the next section.
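
As a sketch, a risk score check on prompts might look like the following, assuming a text-classification model you have trained or adopted for your policies. The model identifier, label names, and threshold below are placeholders, not a real model.

```python
from transformers import pipeline

# Placeholder model identifier; swap in whichever prompt-risk classifier
# you have trained or adopted for your policies.
prompt_classifier = pipeline("text-classification", model="your-org/prompt-risk-model")

def prompt_risk(prompt: str) -> dict:
    """Return the top predicted policy label and its confidence for a prompt."""
    result = prompt_classifier(prompt, truncation=True)[0]
    return {"label": result["label"], "score": result["score"]}

verdict = prompt_risk("Write a message that demeans a protected group.")
# Assumes the classifier emits a "safe" label plus policy labels; the 0.8
# threshold is illustrative.
if verdict["label"] != "safe" and verdict["score"] > 0.8:
    print("Prompt flagged for review:", verdict)
```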

It is important to note that we must often ensure the prompt is on-topic, even if it is not objectively violative of a particular policy. Topic modeling can be used to ensure prompts come from a closed set of approved topics, for example by blocking prompts requesting health advice from a telecom customer service chatbot.
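
One lightweight way to approximate this, sketched below, is zero-shot topic classification with an off-the-shelf NLI model such as facebook/bart-large-mnli; the approved topics, the out-of-scope label, and the threshold are illustrative.

```python
from transformers import pipeline

# Zero-shot topic check using an off-the-shelf NLI model.
topic_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

APPROVED_TOPICS = ["billing", "network coverage", "device troubleshooting"]

def is_on_topic(prompt: str, threshold: float = 0.5) -> bool:
    """Return True only if the most likely topic is approved and confident enough."""
    result = topic_classifier(prompt, candidate_labels=APPROVED_TOPICS + ["health advice"])
    return result["labels"][0] in APPROVED_TOPICS and result["scores"][0] >= threshold

print(is_on_topic("Why is my bill higher this month?"))          # expected: True
print(is_on_topic("What medication should I take for a cold?"))  # expected: False
```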

These methods struggle if a user is actively attempting to jailbreak the model, for example leading the model to generate instructions on how to build a bomb or dispose of a body. Such prompts use techniques like role play to convince the model to generate text it otherwise should not, and are popular enough to have their own dedicated subreddits. Exclusion lists easily fail in this adversarial process. Risk score models trained to flag violative content will also underperform, unless they are specifically trained on texts that are not explicitly violative themselves yet generate violative content. Topic modeling can help ensure the prompt does not relate to banned topics, yet the adversarial process all but guarantees that there will be a workaround.

Safety on the output

Restricting the model from generating content related to sensitive topics or harmful behaviors can minimize the risk from jailbreak attempts. As with safety on the input side, exclusion sets of disallowed words or phrases can be used to filter the generated output. This approach has the same disadvantages as on the input side: exclusion lists are limited and do not capture the context in which a word is used. For example, an LLM generating text with the word “weed” in the context of top tips for yard care is clearly benign. Yet if the generated text contained the word “weed” while giving children tips on how to relax, we would want to flag that content as potentially risky. The context of the language usage in the output itself is critical to understanding whether the word is used in a violative manner.

For a richer consideration of language usage, we can utilize more sophisticated risk score models, built by training a language model to classify generated output against a predefined policy. As discussed above, these models are sensitive to how words are used in context. A classification model trained to identify discussions of drug usage would ignore the first output and flag only the second as risky, because it could understand that in the second output, the word “weed” referred to cannabis. These models would also be able to catch the violative results of jailbreak prompts, since those outputs are themselves violative.
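
A sketch of this generate-then-moderate flow is shown below; `generate_reply`, the classifier identifier, the label schema, and the threshold are all placeholders for whatever LLM and risk model you actually use.

```python
from transformers import pipeline

# Placeholder output-risk classifier; replace with your own policy model.
output_classifier = pipeline("text-classification", model="your-org/output-risk-model")

def moderated_reply(prompt: str, generate_reply, threshold: float = 0.8) -> str:
    """Generate a reply, then block it if the risk model flags it."""
    reply = generate_reply(prompt)
    verdict = output_classifier(reply, truncation=True)[0]
    # Assumes the classifier emits a "safe" label plus policy labels such as
    # "drugs" or "hate"; adapt to your own label schema.
    if verdict["label"] != "safe" and verdict["score"] >= threshold:
        return "Sorry, I can't help with that."
    return reply
```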

As with input moderation, topic modeling can be utilized to exclude specific topics, even if they would not be violative in a different context. For example, a wellness chatbot was recently taken offline after it recommended diets to people suffering from eating disorders. One step to mitigate this behavior would have been to conduct topic classification on the output, ensuring that content about dieting and self-harm could not be generated.

The downside of moderating on the output is that it requires the model to generate the content, so the cost is higher than moderating on the input and stopping the harmful content before it is generated.

Feedback

Feedback is crucial for improving the safety of an LLM. You can allow users to provide feedback on the generated content, enabling them to report any unsafe or inappropriate outputs. This feedback can be used to retrain the risk score models, allowing for continuous improvement of the guardrails on the LLM. The feedback can also be used to identify patterns where the risk score models may need further refinement. For example, if users consistently report that the LLM is able to generate hate speech directed at a particular minority group, the risk score model may need to be refined to ensure it can identify that type of hate speech and filter it out before it reaches users. Note that whenever you train a risk score model, you need to ensure it is free from bias, including benign examples that mention that minority group as well; otherwise you may end up blocking all generated content relating to that group.
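
One simple way to operationalize this, sketched below, is to log user reports as labeled examples that can later be folded into the risk model's training data; the file path and record schema are illustrative.

```python
import json
from pathlib import Path

FEEDBACK_LOG = Path("feedback_examples.jsonl")  # illustrative storage location

def record_feedback(prompt: str, output: str, user_label: str) -> None:
    """Append a user report so it can be folded into the next risk-model retrain."""
    example = {"prompt": prompt, "output": output, "label": user_label}
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(example) + "\n")

def load_feedback_dataset() -> list[dict]:
    """Load accumulated reports; re-balance with benign examples before
    retraining to avoid the bias issue noted above."""
    return [json.loads(line) for line in FEEDBACK_LOG.open()]
```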

Reinforcement Learning from Human Feedback (RLHF) is a technique used to fine-tune LLMs. Briefly, in RLHF, the LLM is first fine-tuned with supervised learning on prompt-response pairs. A reward model is then trained using human-ranked generated responses, and the LLM is optimized with respect to the reward model using reinforcement learning. Risk score models can aid in the ranking process by penalizing violative generated content, which is then reflected in the reward model. However, as with the risk score models discussed above, it is critical to ensure that any feedback is free of biases; otherwise these biases will be codified in the models themselves.
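
One place a risk score model can plug in is reward shaping, as in the sketch below; `reward_model_score`, `risk_score`, and the penalty weight are stand-ins for your own reward model, policy classifier, and tuning choice.

```python
def shaped_reward(prompt: str, response: str,
                  reward_model_score, risk_score,
                  penalty_weight: float = 5.0) -> float:
    """Combine the human-preference reward with a penalty for flagged content."""
    base = reward_model_score(prompt, response)  # reward model trained on human rankings
    risk = risk_score(response)                  # estimated probability of a policy violation
    return base - penalty_weight * risk
```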

Context

There are signals outside of the prompt itself which can help us determine the safety of both the input and the generated output. One approach is to incorporate a user scoring mechanism: assign a score to each user based on their behavior and history. This score can be used to flag bad actors by assessing the trustworthiness of the user, but also to monitor users engaging in potentially risky behavior, such as self-harm. By taking the user’s score into account, safety measures can be adjusted accordingly. For example, a user who repeatedly inputs risky prompts or generates violative outputs may receive a lower user score and face stricter restrictions or additional filtering to mitigate potential risks.
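
A minimal sketch of such a user trust score is shown below; the initial score, the decrement, and the threshold mapping are all illustrative choices.

```python
from collections import defaultdict

# Each user starts with full trust (1.0); flagged prompts or outputs lower it.
user_scores: dict[str, float] = defaultdict(lambda: 1.0)

def record_violation(user_id: str, severity: float = 0.1) -> None:
    """Reduce a user's trust score when their prompt or output is flagged."""
    user_scores[user_id] = max(0.0, user_scores[user_id] - severity)

def moderation_threshold(user_id: str) -> float:
    """Lower-trust users get a stricter (lower) flagging threshold."""
    trust = user_scores[user_id]
    return 0.5 + 0.4 * trust  # 0.5 for zero trust, up to 0.9 for full trust
```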

In this post, I discussed how building a safer LLM involves a combination of technical measures, user feedback, and continuous improvement. Regularly reviewing and updating the exclusion sets and risk score models, and utilizing feedback to ensure they are accurate and free from bias, is vital to the effectiveness of these safety measures. By embracing safety by design, from the fundamental principles to the practical approaches explored here, we can build safer LLMs, pave the way for responsible AI, and foster an environment where these technologies enrich our lives while upholding high standards of ethical conduct.
