In the realm of artificial intelligence (AI) security, a burgeoning concept known as prompt hacking has emerged, posing significant challenges to the integrity and reliability of AI systems. Prompt hacking involves the strategic manipulation of inputs or prompts to exploit vulnerabilities within Language Models (LLMs) or AI systems, thereby eliciting unintended actions or revealing sensitive information.
Understanding Prompt Hacking
Prompt hacking encompasses various methods, including prompt injections, prompt leaking, and jailbreaking, each with distinct implications for AI security. These methods aim to deceive AI models into generating biased, politically motivated, or extremist content, or to extract sensitive or confidential information from their responses.
Here are some examples:
- Avoiding Output Filtering: Asking AI to Talk In Riddles: This technique is designed to catch instances where the AI accidentally reveals sensitive information. A simple bypass is to change the output format of the prompt.
- AI "Jailbreaks" - The ChatGPT DAN Prompts: A variety of attacks against AI Chat bots were published, most notably the "DAN" prompts against ChatGPT. These prompts split their response into two subresponses, one for the output that GPT would give and one for the output an unrestricted language model would provide.
- Recursive Injection: In this attack, a prompt is inserted into the first Language Model (LLM), generating output that includes an injection instruction for the second LLM.
- Prompt Injection Attacks: In these attacks, a user inserts additional content in a text prompt to manipulate the model's output. The output could result in unexpected, biased, incorrect, and offensive responses, even when the model is specifically programmed against them.
Offensive Strategy & Techniques
Offensive measures in prompt hacking exploit vulnerabilities within AI systems, potentially leading to unauthorized access to sensitive data and breaches of data privacy and security.
These attacks can coerce AI models into producing biased, incorrect, and offensive responses, undermining the accuracy and reliability of the generated content.
Here are some techniques that can be employed:
- Obfuscation / Token Smuggling: This technique involves replacing words that might trigger filters with synonyms or introducing slight modifications, such as typos, to the words themselves.
- Post-Prompting: The post-prompting defense involves placing the user input ahead of the prompt itself. By rearranging the order, the user's input is followed by the instructions as intended by the system.
- Sandwich Defense: The sandwich defense is a strategy that entails placing the user input between two prompts. By surrounding the user input with prompts, this technique helps ensure that the model pays attention to the intended context and generates text accordingly.
AI OSINT Search Tool Bypass
Now, let's discuss how these techniques can be used for Open Source Intelligence (OSINT) investigations. OSINT is the process of collecting data from publicly available sources to be used in an intelligence context. Tools like Cylect.io can be used to gather information about emails, subdomains, open ports, banners, employee names, and hosts.
However, the challenge lies in the fact that these tools often have strict prompts that limit the kind of queries that can be made. This is where our AI hacking techniques come into play. By using obfuscation and other techniques, we can bypass these restrictions and extract more information than intended.
Cylect.io implements automatic jailbreaking techniques into its AI search tool interface to allow users to try out these types of jail-breaking techniques, and also highlights the importance of AI security in general. Look for the 🔓 icon to try out our implementation of jailbroken AI tools below.
In conclusion, while AI hacking techniques can be used to manipulate AI systems and extract more information, they should be used responsibly and ethically. Remember, the power to manipulate AI systems also comes with a responsibility to use it wisely.