OpenAI Launches Web Crawler ‘GPTBot’ to Improve Future AI Models

OpenAI has introduced a new web crawling tool called “GPTBot” to gather data that could enhance the capabilities of its AI systems, from the current GPT-4 to the hotly anticipated GPT-5.

In a blog post announcing the tool, OpenAI says the web crawler could improve model accuracy, expand capabilities, and advance the overall AI ecosystem by collecting publicly available data from across the internet.

How GPTBot Web Crawler Works

The bot identifies itself through the user agent token “GPTBot” and the full user agent string: “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)”.
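For site administrators who want to spot these requests, a minimal sketch along the following lines (assuming any server setup or log parser that exposes the User-Agent header) checks for the GPTBot token. Note that the header alone trusts whatever the client reports, which is why the published IP ranges discussed below matter for verification.

# Minimal sketch (not OpenAI-provided code): flag requests whose User-Agent
# header carries the GPTBot token. Any framework or log parser that exposes
# the header value could call this function the same way.
def is_gptbot(user_agent: str) -> bool:
    """Return True if the request's User-Agent contains the GPTBot token."""
    return "GPTBot" in (user_agent or "")

# Example with the documented user agent string:
ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
      "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
print(is_gptbot(ua))  # True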

GPTBot purportedly filters out paywalled sources, content that violates OpenAI’s policies, and sources known to collect personally identifiable information. This lets it amass a large training dataset while avoiding sensitive material.

By allowing GPTBot to access a website, owners can contribute to this data pool and support the advancement of AI systems, though doing so is entirely optional.

Controlling GPTBot’s Access

Website owners can block the crawler entirely by adding the following to their robots.txt file:

User-agent: GPTBot
Disallow: /

Alternatively, they can choose to allow access to specific directories while restricting others.
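For example, the following robots.txt entry lets GPTBot crawl one path while keeping it out of another (the directory names here are placeholders to be replaced with a site’s own paths):

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/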

OpenAI also publishes the IP address ranges GPTBot crawls from, so web administrators can identify and verify the bot’s traffic on their sites.
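A rough sketch of such a verification check is shown below; the download URL and the one-CIDR-per-line format are assumptions made for illustration, so the authoritative list and format should be taken from OpenAI’s GPTBot documentation page.

import ipaddress
import urllib.request

# Assumed location and format of the published ranges (one CIDR per line);
# consult https://openai.com/gptbot for the authoritative source.
RANGES_URL = "https://openai.com/gptbot-ranges.txt"

def fetch_gptbot_networks(url: str = RANGES_URL):
    """Download the published ranges and parse them into network objects."""
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    return [ipaddress.ip_network(line.strip()) for line in lines if line.strip()]

def ip_is_gptbot(client_ip: str, networks) -> bool:
    """Return True if the client IP falls within any published GPTBot range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)

# Example usage (203.0.113.5 is a documentation placeholder, not a real GPTBot address):
# networks = fetch_gptbot_networks()
# print(ip_is_gptbot("203.0.113.5", networks))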


Ambition for the Future

The launch comes shortly after OpenAI filed a trademark application for “GPT-5” on July 18th, covering various AI model capabilities like speech and text processing.

CEO Sam Altman has tried to temper expectations, saying the company is “nowhere close” to beginning GPT-5 training. Multiple safety audits will need to occur first.

However, GPTBot shows OpenAI’s ambition to push boundaries and achieve technological breakthroughs. The additional web data could significantly strengthen current and next-generation models such as GPT-4 and GPT-5.

Controversy and Concerns

While OpenAI emphasises ethical data collection, GPTBot has sparked debate around copyright, ownership, and the commercial use of public online content.

Some argue publicly accessible data should be freely usable for AI training, much as a person learns from reading the web. Others believe that if OpenAI monetises scraped data, it should share the profits, since there is otherwise little incentive for content creators to allow crawling.

Questions have also arisen over how copyrighted media, images, videos and text found on crawled sites will be handled, since using such material in commercial systems without permission could constitute infringement.

Japan’s privacy regulator warned OpenAI about unauthorised data collection in June. OpenAI also faces a lawsuit alleging that private ChatGPT user data was improperly accessed; if the allegations are proven, the conduct could violate the Computer Fraud and Abuse Act.

Transparency and Responsible AI

As AI rapidly advances, transparency around how public data is used in proprietary systems lags behind. Honouring the robots.txt protocol is a good step, but the tech community wants greater clarity.

While aiming to push boundaries, OpenAI must continue balancing innovation with responsible and ethical AI development.

Being open about its practices and showing good faith can help maintain public trust.

Rebecca Taylor

Rebecca is our AI news writer. A graduate of Leeds University with an MA in International Journalism, she has a keen eye for the latest AI developments. Her passion for AI, combined with her journalistic expertise, brings insightful news stories to our readers.
