Extract Words from PDF using PHP

Source: Extract Text from PDF

Step1: Get and Access the License of PHP PDF API

 

ComPDFKit API users get 1000 free PDF API requests. Follow the steps below to get your license keys and start sending API requests.

 

  1. Register for ComPDFKit API to reach the dashboard. The dashboard shows your API keys, the progress of your API plan, and the status of your API requests.

 

Register ComPDFKit API

 

  2. Create a project and get the Public Key and Secret Key.

After your account is created, a default project is created automatically. You can create more projects to call ComPDFKit API. All supported PDF APIs are listed on the documentation pages.

 

Each project has its own Public Key and Secret Key. Make sure you use the keys that belong to the project you are calling.
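
Because every project has its own key pair, it can help to keep the keys out of the source code and load them per project. The sketch below is one possible approach, not part of the ComPDFKit API: the function name loadComPdfKeys and the environment variable names are assumptions made for illustration.

```php
<?php
// Hypothetical helper: reads a project's key pair from environment variables.
// The variable names (<PREFIX>_PUBLIC_KEY, <PREFIX>_SECRET_KEY) are assumptions.
function loadComPdfKeys(string $prefix = 'COMPDF'): array
{
    $publicKey = getenv($prefix . '_PUBLIC_KEY');
    $secretKey = getenv($prefix . '_SECRET_KEY');

    // getenv() returns false when the variable is not set.
    if ($publicKey === false || $publicKey === '' || $secretKey === false || $secretKey === '') {
        throw new RuntimeException("Missing {$prefix}_PUBLIC_KEY or {$prefix}_SECRET_KEY");
    }

    return ['publicKey' => $publicKey, 'secretKey' => $secretKey];
}
```

Using a different prefix for each project, for example loadComPdfKeys('INVOICES'), keeps the key pairs of several projects apart.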

 

ComPDFKit API Dashboard

 

 

Step2: Authenticate the PDF API for PDF Text Extraction

 

Replace publicKey and secretKey with your real keys to get the accessToken. Then use the accessToken to create a task, upload files, extract PDF words, and get the extracted text as a JSON file.

 

PHP code example to authenticate with the ComPDFKit PDF text extraction API:

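$publicKey = 'your_public_key'; // Placeholder: Public Key of your project
$secretKey = 'your_secret_key'; // Placeholder: Secret Key of your project
$params = [
    'publicKey' => $publicKey,
    'secretKey' => $secretKey
];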
$headers = ['Content-Type: application/json'];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/oauth/token',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => json_encode($params)
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
$accessToken = $result['data']['accessToken'];
$bearerToken = "Bearer $accessToken";
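
The authentication snippet reads the token straight from the decoded response, so a failed request or rejected keys would only show up later as a missing array key. A minimal sketch of a guard follows. It assumes only the data.accessToken field that the snippet already reads; the function name parseBearerToken is hypothetical.

```php
<?php
// Hypothetical helper: turns the raw curl_exec() result of the token request
// into the "Bearer ..." header value, or throws if no token is present.
function parseBearerToken($response): string
{
    // curl_exec() returns false when the transfer itself failed.
    if ($response === false) {
        throw new RuntimeException('Token request failed before a response arrived');
    }

    $result = json_decode($response, true);
    if (!is_array($result) || empty($result['data']['accessToken'])) {
        throw new RuntimeException('No accessToken in response: ' . $response);
    }

    return 'Bearer ' . $result['data']['accessToken'];
}
```

With this in place, $bearerToken = parseBearerToken($response); stops the script at the authentication step instead of sending later requests with an empty token.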


Step3: Create Task - Extract PDF Text

 

Use the accessToken obtained in the previous step. Set language to the language in which error information should be displayed (1: English, 2: Chinese). ComPDFKit PDF API parameters can be found on the Quick Start --> Request Description page.

 

The response data contains the taskId. PHP code example to create a PDF text extraction task:

$language = 1; // Language of error information: 1 = English, 2 = Chinese
$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/task/pdf/json?language=' . $language,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
$taskId = $result['data']['taskId'];

 

Step4: Upload Files for PDF Parser

 

Replace the following information in the PHP code:

 

  • PDF Files: The PDF you want to extract text from.

  • taskId: Obtained in the task creation step.

  • Language: The language in which error information is displayed.

  • accessToken: Obtained in the Authentication step.

 

ComPDFKit API provides AI, OCR, and other features. You can also set the following parameters in this step:

  • type: What content to extract (0: text, 1: table). Default 0.

  • isAllowOcr: Whether to enable OCR (1: yes, 0: no). Default 0.

  • isOnlyAiTable: Whether to use AI to recognize tables (1: yes, 0: no). Default 0.
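
The three options above are sent as one JSON string in the parameter field of the upload request. A minimal sketch of building that string from the listed defaults is shown below. The function name buildExtractParameter is hypothetical, and the 0/1 check covers only the three options listed here; any other keys are passed through unchanged.

```php
<?php
// Hypothetical helper: merges caller overrides into the documented defaults
// (type, isAllowOcr, isOnlyAiTable all default to 0) and returns JSON.
function buildExtractParameter(array $overrides = []): string
{
    $defaults = ['type' => 0, 'isAllowOcr' => 0, 'isOnlyAiTable' => 0];
    $merged = array_merge($defaults, $overrides);

    // The three listed options only accept 0 or 1.
    foreach (array_keys($defaults) as $name) {
        if ($merged[$name] !== 0 && $merged[$name] !== 1) {
            throw new InvalidArgumentException("$name must be 0 or 1");
        }
    }

    return json_encode($merged);
}
```

For example, buildExtractParameter(['isAllowOcr' => 1]) keeps text extraction (type 0) and only switches OCR on.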

 

PHP code example to upload PDFs for parsing:

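$pdfPath = 'path/to/your.pdf'; // Placeholder: local path of the PDF to upload
$params = [
    'taskId' => $taskId, // ID of your task
    'file' => new CURLFile($pdfPath), // File you need to process
    'language' => $language,
    'password' => '',
    // type: 0 = text, 1 = table; isAllowOcr: 1 = OCR on, 0 = OCR off
    'parameter' => json_encode(['type' => 1, 'isAllowOcr' => 1, 'isContainOcrBg' => 0])
];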
$headers = [
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/file/upload',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => $params
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
$fileKey = $result['data']['fileKey'];


Step5: Process and Extract Text From Uploaded PDF Files

 

Execute the task to extract words from the PDF you uploaded. Here is the PHP code example:

$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/execute/start?language=' . $language . '&taskId=' . $taskId,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
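
The request URLs in these steps are assembled by string concatenation, which works as long as the values need no escaping. As a hedged alternative, the sketch below lets PHP's http_build_query encode the query string. The function name buildApiUrl is hypothetical; the base URL is the one used in the examples above.

```php
<?php
// Hypothetical helper: joins an API path and query parameters into a full URL.
function buildApiUrl(string $path, array $query = []): string
{
    $base = 'https://api-server.compdf.com/server/v1';
    $url = $base . '/' . ltrim($path, '/');

    if ($query === []) {
        return $url;
    }

    // RFC 3986 encoding writes spaces as %20 instead of "+".
    return $url . '?' . http_build_query($query, '', '&', PHP_QUERY_RFC3986);
}
```

For the execute step this would be buildApiUrl('execute/start', ['language' => $language, 'taskId' => $taskId]).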


Step6: Get Task Information of PDF Text Extraction

 

Follow the PHP code example below to obtain the task information. Replace the needed information, such as taskId and accessToken. The parsed and extracted result is delivered as a JSON file, a structured data format that makes the extracted PDF text easy to reuse.

$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];

$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/task/taskInfo' . '?taskId=' . $taskId,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);
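
If the task is still running when the task information is first requested, the request may need to be repeated. The sketch below shows a generic polling loop under that assumption. It names no response fields, because the layout of the task information is not shown here: the caller supplies both the fetch callback (for example, the request above wrapped in a function) and the check that decides when the task is finished. The function name pollUntil is hypothetical.

```php
<?php
// Hypothetical helper: calls $fetch repeatedly until $isDone accepts the
// result, waiting $delaySeconds between attempts.
function pollUntil(callable $fetch, callable $isDone, int $maxAttempts = 10, int $delaySeconds = 2)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        if ($isDone($result)) {
            return $result;
        }
        // No need to wait after the final attempt.
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }

    throw new RuntimeException("Task not finished after $maxAttempts attempts");
}
```

Capping the number of attempts keeps a stuck task from blocking the script indefinitely.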
