# Web Crawler API
The Web Crawler API is a simple API for crawling websites and storing the crawled data in a database. It uses GuzzleHttp to send HTTP requests and parses the HTML content to extract links from web pages. The API is built with the Laravel framework.
## Features
- Crawls websites and stores the crawled data in the database.
- Supports setting the depth of the crawling process.
- Prevents duplicate URLs from being crawled.
- Retrieves and saves the HTML content of crawled pages.
- Extracts valid URLs from the crawled pages.
## Prerequisites
- PHP >= 7.4
- Composer
- Laravel framework
- MongoDB
- Docker
- Docker Compose
- GuzzleHttp
- MongoDB PHP driver (extension - mongodb.so)
- jenssegers/mongodb package
## Getting Started
1. Clone the repository:
```bash
git clone https://git.dayanhub.com/kfir/rank_exam
```
## Services

### Server
Run the Laravel development server: `php artisan serve`

### MongoDB
Start MongoDB with Docker Compose: `docker-compose up -d`

Run the database migrations: `php artisan migrate`
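The repository's own `docker-compose.yml` is not reproduced here, but a minimal MongoDB service of the kind `docker-compose up -d` would start looks like the following sketch (the image tag, service name, and port mapping are assumptions, not the project's actual file):

```yaml
# Illustrative docker-compose.yml for a local MongoDB instance
services:
  mongo:
    image: mongo:6
    ports:
      - "27017:27017"   # default MongoDB port, matching the prerequisites
```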
## Configuration

Use the `.env` file to set up the database connection.
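With the `jenssegers/mongodb` package, typical `.env` values look like the sketch below. The variable names mirror Laravel's defaults; the database name `crawler` is an assumption, not the repository's actual value:

```ini
DB_CONNECTION=mongodb
DB_HOST=127.0.0.1
DB_PORT=27017
DB_DATABASE=crawler
DB_USERNAME=
DB_PASSWORD=
```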
## API Endpoints

### GET /api/crawl
Crawls a website and stores the crawled data in the database.
Parameters:
- `url` (required): The URL of the website to crawl.
- `depth` (optional): The depth of the crawling process (default: 0).
- `refresh` (optional): If set to `1`, the crawler refreshes the stored results for an already-crawled URL (default: false).
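Assuming the server is running locally at the default `php artisan serve` address (`http://localhost:8000` is an assumption; adjust to your setup), a crawl request can be sent with `curl`:

```bash
# Base URL of a local `php artisan serve` instance (assumed; adjust as needed)
BASE="http://localhost:8000"

# Crawl https://example.com two levels deep
REQUEST="$BASE/api/crawl?url=https://example.com&depth=2"
echo "$REQUEST"
curl -s "$REQUEST" || true  # prints the JSON response when the server is running

# Re-crawl a URL that is already stored
curl -s "$BASE/api/crawl?url=https://example.com&depth=2&refresh=1" || true
```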
### GET /api
Retrieves all crawled data from the database.
### DELETE /api/crawl/{id}
Deletes a specific crawled data record from the database.
### DELETE /api/crawl
Deletes all crawled data records from the database.
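The retrieval and deletion endpoints take no request body. Sketched `curl` calls follow; the base URL is an assumption, and `<id>` is a placeholder for a real record id returned by `GET /api`:

```bash
# Assumed local server address
BASE="http://localhost:8000"
LIST_URL="$BASE/api"
DELETE_ONE_URL="$BASE/api/crawl/<id>"   # <id> is a placeholder, not a real record
DELETE_ALL_URL="$BASE/api/crawl"

echo "$LIST_URL"
curl -s "$LIST_URL" || true                   # list all crawled records
curl -s -X DELETE "$DELETE_ONE_URL" || true   # delete one record by id
curl -s -X DELETE "$DELETE_ALL_URL" || true   # delete every record
```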