Agents can process and understand information from PDFs, text files, and websites to build their knowledge base. Once a document is successfully uploaded, the agent can answer questions based on the content of that document.
To add more data to an agent, go to the studio and select the agent you want to add data to, or create a new agent. Then click the Knowledge tab in the left menu and click the 'Add New Document' button to open the Add data window.
To add knowledge items to the agent from a website, and to get the best results, there are a handful of important configuration parameters to understand:
- Scrape: extracting the textual content from a given URL and incorporating it into the agent's knowledge base. Text that appears only in images, rather than as selectable text, cannot be scraped.
- Crawl: gathering all the links present on a page.
- Number of pages: the total number of pages you wish to scrape.
- Maximum link depth: the extent of the crawl. The initial page counts as depth 1, links found on that page are depth 2, and the pattern continues accordingly.
- Only scan domain: restricts the crawl to links within the same domain as the initial URL.
- Crawl the page: decides whether the page should be crawled. If this option is not selected, the agent will only scrape the page.
- Fetch method: how the page content is retrieved (direct, proxy, or render).
- Extraction method: how the text content is pulled from the HTML.
- Blocklist: excludes pages whose URL matches a given pattern. You can use wildcard or regex expressions; any link that matches an expression will not be scraped.
Understanding and configuring these parameters correctly can greatly enhance the effectiveness of the content extraction process.
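To make the interplay of these parameters concrete, here is a minimal, purely illustrative Python sketch of how a crawler could apply them. The function names and the stubbed `fetch_links`/`scrape_text` helpers are assumptions for the example, not the platform's actual implementation:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical stand-ins for the platform's internals.
def fetch_links(url):
    return []                     # in reality: every link found on the page

def scrape_text(url):
    return f"<text of {url}>"     # in reality: the selectable text of the page

def is_blocked(url, blocklist):
    # A URL is skipped if it matches any blocklist expression
    # (wildcard matching shown; regex expressions work analogously with re.match).
    return any(fnmatch(url, pattern) for pattern in blocklist)

def crawl(start_url, max_pages, max_depth, only_scan_domain, blocklist):
    start_domain = urlparse(start_url).netloc
    queue = [(start_url, 1)]      # the initial page counts as depth 1
    seen, knowledge = set(), []

    while queue and len(knowledge) < max_pages:   # number of pages to scrape
        url, depth = queue.pop(0)
        if url in seen or is_blocked(url, blocklist):
            continue
        seen.add(url)
        knowledge.append(scrape_text(url))        # scrape: extract the page text

        if depth >= max_depth:                    # maximum link depth reached
            continue
        for link in fetch_links(url):             # crawl: gather the links on the page
            if only_scan_domain and urlparse(link).netloc != start_domain:
                continue                          # stay within the starting domain
            queue.append((link, depth + 1))       # links found here are one level deeper

    return knowledge
```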
Let's break down a few scenarios:
Objective: Add a single page, such as a Wikipedia article, to the agent's knowledge without crawling.
- Add the specific article URL directly to the agent's knowledge.
- Ensure the crawl option is unchecked.
Objective: Add a specific article and all the links referenced in the article, even if they point to different domains.
- Add the specific article URL to the agent's knowledge.
- Check the crawl option and set the maximum link depth to 2.
- Ensure the "Only scan domain" option is unchecked. (The sketch below shows how this configuration maps onto the crawl parameters.)
Objective: Exclude pages in the /blog section of the website from scraping.
- Add the wildcard expression `https://www.mywebsite.com/blog/*` to the blocklist.
- Alternatively, use the regex expression `^https://www\.mywebsite\.com/blog/.*` in the blocklist. (Both patterns are checked in the snippet below.)
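If you want to sanity-check a pattern before adding it, the matching behaviour can be reproduced with Python's `re` and `fnmatch` modules (a rough stand-in for the blocklist matcher, with made-up URLs):

```python
import re
from fnmatch import fnmatch

wildcard = "https://www.mywebsite.com/blog/*"
regex = r"^https://www\.mywebsite\.com/blog/.*"

for url in [
    "https://www.mywebsite.com/blog/my-first-post",   # should be blocked
    "https://www.mywebsite.com/pricing",              # should still be scraped
]:
    blocked = fnmatch(url, wildcard) or bool(re.match(regex, url))
    print(url, "-> blocked" if blocked else "-> scraped")

# https://www.mywebsite.com/blog/my-first-post -> blocked
# https://www.mywebsite.com/pricing -> scraped
```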
Objective: Scrape only the pages in the /blog section and exclude every other page on the website.
- Add the wildcard expression `!https://www.mywebsite.com/blog/*` to the blocklist.
- Alternatively, use the regex expression `^(?!https://www\.mywebsite\.com/blog/).*$` in the blocklist. (See the check below.)
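Because this regex uses a negative lookahead, it matches, and therefore blocks, every URL that does not start with the /blog prefix, leaving only the blog pages to be scraped. A quick check with Python's `re` module (illustrative URLs only):

```python
import re

blocklist_regex = r"^(?!https://www\.mywebsite\.com/blog/).*$"

for url in [
    "https://www.mywebsite.com/blog/my-first-post",   # not matched -> scraped
    "https://www.mywebsite.com/pricing",              # matched -> blocked
    "https://www.mywebsite.com/about",                # matched -> blocked
]:
    blocked = bool(re.match(blocklist_regex, url))
    print(url, "-> blocked" if blocked else "-> scraped")
```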