
Web Crawler API

The Web Crawler API is a simple API that allows you to crawl websites and store the crawled data in a database. It uses GuzzleHttp to send HTTP requests and parses the returned HTML to extract links from web pages. The API is built with the Laravel framework.

Features

  • Crawls websites and stores the crawled data in the database.
  • Supports setting the depth of the crawling process.
  • Prevents duplicate URLs from being crawled.
  • Retrieves and saves the HTML content of crawled pages.
  • Extracts valid URLs from the crawled pages.
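
Taken together, these features describe a depth-limited crawl with duplicate prevention. The sketch below illustrates that flow in plain PHP with GuzzleHttp and DOMDocument; it is not the project's actual implementation, and the function name and structure are hypothetical:

    <?php
    // Hypothetical sketch of the crawl loop; not the project's actual code.
    use GuzzleHttp\Client;

    function crawl(string $url, int $depth, array &$visited, Client $client): void
    {
        // Duplicate prevention: skip URLs that were already crawled.
        if (isset($visited[$url])) {
            return;
        }
        $visited[$url] = true;

        // Fetch the page HTML with GuzzleHttp.
        $html = (string) $client->get($url)->getBody();
        // ... persist $url and $html here (the project stores them in MongoDB) ...

        // Stop descending once the requested depth is exhausted.
        if ($depth <= 0) {
            return;
        }

        // Parse the HTML and follow only valid absolute URLs.
        $dom = new DOMDocument();
        @$dom->loadHTML($html); // suppress warnings from malformed HTML
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = $anchor->getAttribute('href');
            if (filter_var($href, FILTER_VALIDATE_URL)) {
                crawl($href, $depth - 1, $visited, $client);
            }
        }
    }

    $visited = [];
    crawl('https://example.com', 1, $visited, new Client(['timeout' => 10]));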

Prerequisites

  • PHP >= 7.4
  • Composer
  • Laravel framework
  • MongoDB
  • Docker
  • Docker Compose
  • GuzzleHttp
  • MongoDB PHP driver (the mongodb.so extension)
  • jenssegers/mongodb package

Getting Started

  1. Clone the repository:

    git clone https://git.dayanhub.com/kfir/rank_exam
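
After cloning, a standard Laravel setup usually follows. These steps assume the repository uses the default Laravel layout; treat them as a guide rather than the project's documented procedure:

    cd rank_exam
    composer install
    cp .env.example .env
    php artisan key:generate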

Services

Server

Run the server - php artisan serve

MongoDB

Run MongoDB - docker-compose up -d
Run migrations - php artisan migrate
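
The repository's docker-compose.yml is expected to define the MongoDB service; a minimal equivalent is sketched below purely for illustration (the image tag, port mapping, and volume name are assumptions):

    services:
      mongodb:
        image: mongo:6
        ports:
          - "27017:27017"
        volumes:
          - mongo_data:/data/db
    volumes:
      mongo_data: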

Configuration

Use the .env file to set up the database connection.
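
With the jenssegers/mongodb package, the connection is typically configured through variables like the following. The database name here is a placeholder, and the exact variable names depend on the project's config/database.php:

    DB_CONNECTION=mongodb
    DB_HOST=127.0.0.1
    DB_PORT=27017
    DB_DATABASE=crawler
    DB_USERNAME=
    DB_PASSWORD=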

API Endpoints

GET /api/crawl:

Crawls a website and stores the crawled data in the database.
Parameters:
- `url` (required): The URL of the website to crawl.
- `depth` (optional): The depth of the crawling process (default: 0).
- `refresh` (optional): If set to 1, the crawler refreshes the stored results for an existing URL (default: false).
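
For example, to crawl example.com two levels deep and force a refresh (illustrative values):

    GET /api/crawl?url=https://example.com&depth=2&refresh=1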

GET /api:

Retrieves all crawled data from the database.

DELETE /api/crawl/{id}:

Deletes a specific crawled data record from the database.

DELETE /api/crawl:

Deletes all crawled data records from the database.