Title: Interface with Google Cloud Document AI API
Description: R interface for the Google Cloud Services 'Document AI API' <https://cloud.google.com/document-ai/> with additional tools for output file parsing and text reconstruction. 'Document AI' is a powerful server-based OCR service that extracts text and tables from images and PDF files with high accuracy. 'daiR' gives R users programmatic access to this service and additional tools to handle and visualize the output. See the package website <https://dair.info/> for more information and examples.
Authors: Thomas Hegghammer [aut, cre]
Maintainer: Thomas Hegghammer <[email protected]>
License: MIT + file LICENSE
Version: 1.1.0
Built: 2024-11-13 22:17:12 UTC
Source: https://github.com/hegghammer/dair
Run when daiR is attached
.onAttach(libname, pkgname)

libname: name of library
pkgname: name of package
no return value, called for side effects
Creates a dataframe with the block bounding boxes identified by Document AI (DAI). Rows are blocks, in the order DAI proposes to read them. Columns are location variables such as page coordinates and page numbers.
build_block_df(object, type = "sync")

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
The dataframe variables are: page number, block number, confidence score, left boundary, right boundary, top boundary, and bottom boundary.
a block data frame
## Not run:
resp <- dai_sync("file.pdf")
block_df <- build_block_df(resp)

block_df <- build_block_df("pdf_output.json", type = "async")
## End(Not run)
Builds a token dataframe from the text OCRed by Document AI (DAI). Rows are tokens, in the order DAI proposes to read them. Columns are location variables such as page coordinates and block bounding box numbers.
build_token_df(object, type = "sync")

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
The location variables are: token, start index, end index, confidence, left boundary, right boundary, top boundary, bottom boundary, page number, and block number. Start and end indices refer to character position in the string containing the full text.
a token data frame
## Not run:
resp <- dai_sync("file.pdf")
token_df <- build_token_df(resp)

token_df <- build_token_df("pdf_output.json", type = "async")
## End(Not run)
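Since the start and end indices refer to character positions in the full text, any token can be recovered from the string returned by get_text(). A minimal sketch; the column names start_ind and end_ind are assumptions and may differ from the actual token data frame:

## Not run:
resp <- dai_sync("file.pdf")
text <- get_text(resp)
token_df <- build_token_df(resp)
# Recover the first token straight from the full text (column names assumed):
substr(text, token_df$start_ind[1], token_df$end_ind[1])
## End(Not run)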
Create processor
create_processor(
  name,
  type = "OCR_PROCESSOR",
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

name: a string; the proposed display name of the processor.
type: a string; one of "OCR_PROCESSOR", "FORM_PARSER_PROCESSOR", "INVOICE_PROCESSOR", or "US_DRIVER_LICENSE_PROCESSOR".
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
Creates a Document AI processor and returns the id of the newly created processor. Note that the proposed processor name may already be taken; if so, try again with another name. Consider storing the processor id in an environment variable named DAI_PROCESSOR_ID. For more information about processors, see the Google Document AI documentation at https://cloud.google.com/document-ai/docs/.
a processor id if successful, otherwise NULL.
## Not run:
proc_id <- create_processor("my-processor-123")
## End(Not run)
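Since dai_sync() and dai_async() read the processor id from the DAI_PROCESSOR_ID environment variable by default, it is convenient to set that variable right after creation. A minimal sketch; the .Renviron line is illustrative:

## Not run:
proc_id <- create_processor("my-processor-123")
Sys.setenv(DAI_PROCESSOR_ID = proc_id)  # current session only
# For persistence across sessions, add a line like this to your .Renviron:
# DAI_PROCESSOR_ID=<your processor id>
## End(Not run)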
Sends files from a Google Cloud Services (GCS) Storage bucket to the GCS Document AI v1 API for asynchronous (offline) processing. The output is delivered to the same bucket as JSON files containing the OCRed text and additional data.
dai_async(
  files,
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  proc_id = Sys.getenv("DAI_PROCESSOR_ID"),
  proc_v = NA,
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)

files: a vector or list of pdf filepaths in a GCS Storage bucket. Filepaths must include all parent bucket folder(s) except the bucket name.
dest_folder: the name of the GCS Storage bucket subfolder where you want the json output.
bucket: the name of the GCS Storage bucket where the files to be processed are located.
proj_id: a GCS project id.
proc_id: a Document AI processor id.
proc_v: one of 1) a processor version name, 2) "stable" for the latest processor from the stable channel, or 3) "rc" for the latest processor from the release candidate channel.
skip_rev: whether to skip human review; "true" or "false".
loc: a two-letter region code; "eu" or "us".
token: an access token generated by dai_token() or another auth function.
Requires a GCS access token and some configuration of the .Renviron file; see the package vignettes for details. Currently, a dai_async() call can contain a maximum of 50 files (a multi-page pdf counts as one file). You cannot have more than 5 batch requests and 10,000 pages undergoing processing at any one time. Maximum pdf document length is 2,000 pages. With long pdf documents, Document AI divides the JSON output into separate files ('shards') of 20 pages each. If you want longer shards, use dai_tab_async(), which accesses another API endpoint that allows for shards of up to 100 pages.
A list of HTTP responses
## Not run:
# With daiR configured on your system, several parameters are provided
# automatically, and you can pass simple calls, such as:
dai_async("my_document.pdf")

# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async("for_processing/pdfs/my_document.pdf")

# Bulk process by passing a vector of filepaths in the files argument:
dai_async(my_files)

# Specify a bucket subfolder for the json output:
dai_async(my_files, dest_folder = "processed")
## End(Not run)
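Because a single call is capped at 50 files, larger jobs must be chunked. A minimal sketch, assuming my_files is a long vector of bucket filepaths:

## Not run:
batches <- split(my_files, ceiling(seq_along(my_files) / 50))
responses <- lapply(batches, dai_async)
## End(Not run)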
Checks whether the user can obtain an access token for Google Cloud Services (GCS) using a service account key stored on file.
dai_auth(
  path = Sys.getenv("GCS_AUTH_FILE"),
  scopes = "https://www.googleapis.com/auth/cloud-platform"
)

path: path to a JSON file with a service account key.
scopes: GCS auth scopes for the token.
daiR takes a very parsimonious approach to authentication, with the native auth functions only supporting service account files. Those who prefer other authentication methods can pass those directly to the token parameter in the various functions that call the Document AI API.
no return value, called for side effects
## Not run:
dai_auth()
## End(Not run)
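As noted above, tokens obtained by other means can be passed directly to the token parameter of the processing functions. A minimal sketch, assuming the gargle package is installed; gargle::token_fetch() tries a sequence of credential sources:

## Not run:
my_token <- gargle::token_fetch(
  scopes = "https://www.googleapis.com/auth/cloud-platform"
)
resp <- dai_sync("file.pdf", token = my_token)
## End(Not run)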
Queries the Google Cloud Services (GCS) Document AI API about the status of a previously submitted asynchronous job and emits a sound notification when the job is complete.
dai_notify(response, loc = "eu", token = dai_token(), sound = 2)

response: an HTTP response object generated by dai_async().
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
sound: a number from 1 to 10 for the beepr sound selection (https://www.r-project.org/nosvn/pandoc/beepr.html).
no return value, called for side effects
## Not run:
response <- dai_async(myfiles)
dai_notify(response)
## End(Not run)
Queries the Google Cloud Services (GCS) Document AI API about the status of a previously submitted asynchronous job.
dai_status(response, loc = "eu", token = dai_token(), verbose = FALSE)

response: an HTTP response object generated by dai_async().
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
verbose: boolean; whether to output the full response.
If verbose was set to TRUE, an HTTP response object. If verbose was set to FALSE, a string summarizing the status.
## Not run:
# Short status message:
response <- dai_async(myfiles)
dai_status(response)

# Full status details:
response <- dai_async(myfiles)
status <- dai_status(response, verbose = TRUE)
## End(Not run)
Sends a single document to the Google Cloud Services (GCS) Document AI v1 API for synchronous (immediate) processing. Returns an HTTP response object containing the OCRed text and additional data.
dai_sync(
  file,
  proj_id = get_project_id(),
  proc_id = Sys.getenv("DAI_PROCESSOR_ID"),
  proc_v = NA,
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)

file: path to a single-page pdf or image file.
proj_id: a GCS project id.
proc_id: a Document AI processor id.
proc_v: one of 1) a processor version name, 2) "stable" for the latest processor from the stable channel, or 3) "rc" for the latest processor from the release candidate channel.
skip_rev: whether to skip human review; "true" or "false".
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
Requires a GCS access token and some configuration of the .Renviron file; see the package vignettes for details. Input files can be in either .pdf, .bmp, .gif, .jpeg, .jpg, .png, or .tiff format. PDF files can be up to five pages long. Extract the text from the response object with text_from_dai_response(). Inspect the entire response object with httr::content().
an HTTP response object.
## Not run:
response <- dai_sync("doc_page.pdf")

response <- dai_sync("doc_page.pdf",
  proc_v = "pretrained-ocr-v1.1-2022-09-12"
)
## End(Not run)
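To look beyond the extracted text, the parsed response body can be browsed directly. A minimal sketch of inspecting the output structure with httr::content():

## Not run:
response <- dai_sync("doc_page.pdf")
parsed <- httr::content(response)
str(parsed, max.level = 2)  # top levels of the Document AI output
## End(Not run)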
Produces an access token for Google Cloud Services (GCS).
dai_token(
  path = Sys.getenv("GCS_AUTH_FILE"),
  scopes = "https://www.googleapis.com/auth/cloud-platform"
)

path: path to a JSON file with a service account key.
scopes: GCS auth scopes for the token.
a GCS access token object (if credentials are valid) or a message (if not).
## Not run:
token <- dai_token()
## End(Not run)
Fetches the Google Cloud Services (GCS) user information associated with a service account key.
dai_user()
a list of user information elements
## Not run:
dai_user()
## End(Not run)
Delete processor
delete_processor(
  proc_id,
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

proc_id: a Document AI processor id.
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
no return value, called for side effects
## Not run:
delete_processor(proc_id = get_processors()$id[1])
## End(Not run)
Disable processor
disable_processor(
  proc_id,
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

proc_id: a Document AI processor id.
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
no return value, called for side effects
## Not run:
disable_processor(proc_id = get_processors()$id[1])
## End(Not run)
Plots the block bounding boxes identified by Document AI (DAI) onto images of the submitted document. Generates an annotated .png file for each page in the original document.
draw_blocks(
  object,
  type = "sync",
  prefix = NULL,
  dir = getwd(),
  linecol = "red",
  linewd = 3,
  fontcol = "blue",
  fontsize = 4
)

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
prefix: string to be prepended to the output png filename.
dir: path to the desired output directory.
linecol: color of the bounding box line.
linewd: width of the bounding box line.
fontcol: color of the box numbers.
fontsize: size of the box numbers.
Not vectorized, but documents can be multi-page.
no return value, called for side effects.
## Not run:
resp <- dai_sync("page.pdf")
draw_blocks(resp)

draw_blocks("page.json", type = "async")
## End(Not run)
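The draw_* functions are not vectorized over documents, so several output files are best handled with a loop. A minimal sketch, assuming jsons holds paths to dai_async() output files:

## Not run:
jsons <- c("doc1.json", "doc2.json")
for (j in jsons) draw_blocks(j, type = "async", dir = tempdir())
## End(Not run)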
Plots the entity bounding boxes identified by a Document AI form parser processor onto images of the submitted document. Generates an annotated .png file for each page in the original document.
draw_entities(
  object,
  type = "sync",
  prefix = NULL,
  dir = getwd(),
  linecol = "red",
  linewd = 3,
  fontcol = "blue",
  fontsize = 4
)

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
prefix: string to be prepended to the output png filename.
dir: path to the desired output directory.
linecol: color of the bounding box line.
linewd: width of the bounding box line.
fontcol: color of the box numbers.
fontsize: size of the box numbers.
Not vectorized, but documents can be multi-page.
no return value, called for side effects.
## Not run:
resp <- dai_sync("page.pdf")
draw_entities(resp)

draw_entities("page.json", type = "async")
## End(Not run)
Plots the line bounding boxes identified by Document AI (DAI) onto images of the submitted document. Generates an annotated .png file for each page in the original document.
draw_lines(
  object,
  type = "sync",
  prefix = NULL,
  dir = getwd(),
  linecol = "red",
  linewd = 3,
  fontcol = "blue",
  fontsize = 4
)

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
prefix: string to be prepended to the output png filename.
dir: path to the desired output directory.
linecol: color of the bounding box line.
linewd: width of the bounding box line.
fontcol: color of the box numbers.
fontsize: size of the box numbers.
Not vectorized, but documents can be multi-page.
no return value, called for side effects.
## Not run:
resp <- dai_sync("page.pdf")
draw_lines(resp)

draw_lines("page.json", type = "async")
## End(Not run)
Plots the paragraph bounding boxes identified by Document AI (DAI) onto images of the submitted document. Generates an annotated .png file for each page in the original document.
draw_paragraphs(
  object,
  type = "sync",
  prefix = NULL,
  dir = getwd(),
  linecol = "red",
  linewd = 3,
  fontcol = "blue",
  fontsize = 4
)

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
prefix: string to be prepended to the output png filename.
dir: path to the desired output directory.
linecol: color of the bounding box line.
linewd: width of the bounding box line.
fontcol: color of the box numbers.
fontsize: size of the box numbers.
Not vectorized, but documents can be multi-page.
no return value, called for side effects.
## Not run:
resp <- dai_sync("page.pdf")
draw_paragraphs(resp)

draw_paragraphs("page.json", type = "async")
## End(Not run)
Plots the token (i.e., word) bounding boxes identified by Document AI (DAI) onto images of the submitted document. Generates an annotated .png file for each page in the original document.
draw_tokens(
  object,
  type = "sync",
  prefix = NULL,
  dir = getwd(),
  linecol = "red",
  linewd = 3,
  fontcol = "blue",
  fontsize = 4
)

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
prefix: string to be prepended to the output png filename.
dir: path to the desired output directory.
linecol: color of the bounding box line.
linewd: width of the bounding box line.
fontcol: color of the box numbers.
fontsize: size of the box numbers.
Not vectorized, but documents can be multi-page.
no return value, called for side effects.
## Not run:
resp <- dai_sync("page.pdf")
draw_tokens(resp)

draw_tokens("page.json", type = "async")
## End(Not run)
Enable processor
enable_processor(
  proc_id,
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

proc_id: a Document AI processor id.
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
no return value, called for side effects
## Not run:
enable_processor(proc_id = get_processors()$id[1])
## End(Not run)
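Taken together, the processor management helpers support a simple lifecycle. A minimal sketch, assuming the project contains at least one processor:

## Not run:
id <- get_processors()$id[1]
disable_processor(id)
enable_processor(id)
delete_processor(id)
## End(Not run)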
This is a specialized function for use in connection with text reordering. It takes the output from the image annotation tool 'Labelme' (https://github.com/wkentaro/labelme) and turns it into a one-row data frame compatible with other 'daiR' functions for text reordering such as reassign_tokens2(). See the package vignette on text reconstruction for details.
from_labelme(json, page = 1)

json: a json file generated by 'Labelme'.
page: the number of the annotated page.
a data frame with location coordinates for the rectangle marked in 'Labelme'.
## Not run:
new_block <- from_labelme("document1_blocks.json")

new_block <- from_labelme("document5_blocks.json", 5)
## End(Not run)
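A minimal sketch of where from_labelme() fits in the reordering workflow described in the vignette, assuming a 'Labelme' annotation of page 5 of an asynchronously processed document (filenames are illustrative):

## Not run:
token_df <- build_token_df("document5.json", type = "async")
new_block <- from_labelme("document5_blocks.json", 5)
new_token_df <- reassign_tokens2(token_df, new_block, 5)
## End(Not run)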
Extracts the entities identified by a Document AI (DAI) form parser processor.
get_entities(object, type = "sync")

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
a list of dataframes, one per page
## Not run:
entities <- get_entities(dai_sync("file.pdf"))

entities <- get_entities("file.json", type = "async")
## End(Not run)
List ids of available processors of a given type
get_ids_by_type(
  type,
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

type: name of a processor type, e.g. "FORM_PARSER_PROCESSOR".
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
a vector of processor ids.
## Not run:
get_ids_by_type("OCR_PROCESSOR")
## End(Not run)
Get information about processor
get_processor_info(
  proc_id,
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

proc_id: a Document AI processor id.
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
Retrieves information about a processor. For more information about processors, see the Google Document AI documentation at https://cloud.google.com/document-ai/docs/.
a list.
## Not run:
info <- get_processor_info()

info <- get_processor_info(proc_id = get_processors()$id[1])
## End(Not run)
List available versions of processor
get_processor_versions(
  proc_id,
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

proc_id: a Document AI processor id.
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
a dataframe.
## Not run:
df <- get_processor_versions()

df <- get_processor_versions(proc_id = get_processors()$id[1])
## End(Not run)
List created processors
get_processors(proj_id = get_project_id(), loc = "eu", token = dai_token())

proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
Retrieves information about the processors that have been created in the current project and are ready for use. For more information about processors, see the Google Document AI documentation at https://cloud.google.com/document-ai/docs/.
a dataframe.
## Not run:
df <- get_processors()
## End(Not run)
Fetches the Google Cloud Services (GCS) project id associated with a service account key.
get_project_id(path = Sys.getenv("GCS_AUTH_FILE"))

path: path to the JSON file with your service account key.
a string with a GCS project id
## Not run:
project_id <- get_project_id()
## End(Not run)
Extracts tables identified by a Document AI form parser processor.
get_tables(object, type = "sync")

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
a list of data frames
## Not run:
tables <- get_tables(dai_sync("file.pdf"))

tables <- get_tables("file.json", type = "async")
## End(Not run)
Extracts the text OCRed by Document AI (DAI).
get_text(
  object,
  type = "sync",
  save_to_file = FALSE,
  dest_dir = getwd(),
  outfile_stem = NULL
)

object: either an HTTP response object from dai_sync() or the path to a JSON file from a dai_async() job.
type: one of "sync" or "async", depending on the function used to process the original document.
save_to_file: boolean; whether to save the text as a .txt file.
dest_dir: folder path for the .txt output file if save_to_file = TRUE.
outfile_stem: string to form the stem of the .txt output file.
a string (if save_to_file = FALSE)
## Not run:
text <- get_text(dai_sync("file.pdf"))

text <- get_text("file.json", type = "async", save_to_file = TRUE)
## End(Not run)
List versions of available processors of a given type
get_versions_by_type(
  type,
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

type: name of a processor type, e.g. "FORM_PARSER_PROCESSOR".
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
a message with the available version aliases and full names
## Not run:
get_versions_by_type("OCR_PROCESSOR")
## End(Not run)
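The full version names returned here can be passed to the proc_v parameter of the processing functions. A minimal sketch; the version string is the one from the dai_sync() example above:

## Not run:
get_versions_by_type("OCR_PROCESSOR")
resp <- dai_sync("file.pdf", proc_v = "pretrained-ocr-v1.1-2022-09-12")
## End(Not run)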
This helper function converts a vector of images to a single PDF.
image_to_pdf(files, pdf_name)

files: a vector of image files.
pdf_name: a string with the name of the new PDF.
Combines any number of image files of almost any type to a single PDF. The vector can consist of different image file types. See the 'Magick' package documentation https://cran.r-project.org/package=magick for details on supported file types. Note that on Linux, ImageMagick may not allow conversion to pdf for security reasons.
no return value, called for side effects
## Not run:
# Single file:
new_pdf <- file.path(tempdir(), "document.pdf")
image_to_pdf("document.jpg", new_pdf)

# A vector of image files:
image_to_pdf(images, new_pdf)
## End(Not run)
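A minimal sketch of using image_to_pdf() to prepare scans for OCR, assuming ImageMagick permits PDF conversion on your system and the combined document stays within dai_sync()'s page limit:

## Not run:
new_pdf <- file.path(tempdir(), "scans.pdf")
image_to_pdf(c("scan1.jpg", "scan2.png"), new_pdf)
resp <- dai_sync(new_pdf)
## End(Not run)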
Converts an image file to a base64-encoded binary .tiff file.
img_to_binbase(file)

file: path to an image file.
a base64-encoded string
## Not run:
img_encoded <- img_to_binbase("image.png")
## End(Not run)
Checks whether a string is a valid colour representation.
is_colour(x)

x: a string.
a boolean
## Not run:
is_colour("red")
is_colour("#12345")
## End(Not run)
Checks whether a file is a JSON file.
is_json(file)

file: a filepath.
a boolean
## Not run:
is_json("file.json")
## End(Not run)
Checks whether a file is a PDF file.
is_pdf(file)

file: a filepath.
a boolean
## Not run:
is_pdf("document.pdf")
## End(Not run)
List available processor types
list_processor_types(
  full_list = FALSE,
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token()
)

full_list: boolean; whether to return detailed information about each processor type (see Details).
proj_id: a GCS project id.
loc: a two-letter region code; "eu" or "us".
token: an authentication token generated by dai_token() or another auth function.
Retrieves information about the processors that can be created in the current project. With full_list = TRUE it returns a list with detailed information about each processor. With full_list = FALSE it returns a character vector with just the processor names. For more information about processors, see the Google Document AI documentation at https://cloud.google.com/document-ai/docs/.
list or character vector
## Not run:
avail_short <- list_processor_types()
avail_long <- list_processor_types(full_list = TRUE)
## End(Not run)
Creates a hOCR file from Document AI output.
make_hocr(type, output, outfile_name = "out.hocr", dir = getwd())

type: one of "sync" or "async", depending on the function used to process the original document.
output: either an HTTP response object (from dai_sync()) or the path to a JSON file (from dai_async()).
outfile_name: a string with the desired filename. Must end with either ".hocr" or ".xml".
dir: a string with the path to the desired output directory.
hOCR is an open standard of data representation for formatted text obtained from optical character recognition. It can be used to generate searchable PDFs and many other things. This function generates a file compliant with the official hOCR specification (https://github.com/kba/hocr-spec) complete with token-level confidence scores. It also works with non-Latin scripts and right-to-left languages.
no return value, called for side effects.
## Not run:
make_hocr(type = "async", output = "output.json")

resp <- dai_sync("file.pdf")
make_hocr(type = "sync", output = resp)

make_hocr(type = "sync", output = resp, outfile_name = "myfile.xml")
## End(Not run)
Merges text files from Document AI output shards into a single text file corresponding to the parent document.
merge_shards(source_dir = getwd(), dest_dir = getwd())

source_dir: folder path for input files.
dest_dir: folder path for output files.
The function works on .txt files generated from .json output files, not on .json files directly. It also presupposes that the .txt filenames have the same name stems as the .json files from which they were extracted. For the v1 API, this means files ending with "-0.txt", "-1.txt", "-2.txt", and so forth. The safest approach is to generate .txt files using get_text() with the save_to_file parameter set to TRUE.
no return value, called for side effects
## Not run:
merge_shards()

merge_shards(tempdir(), getwd())
## End(Not run)
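A minimal sketch of the recommended workflow, assuming the .json output files from a dai_async() job have been downloaded to the working directory:

## Not run:
jsons <- list.files(pattern = "\\.json$")
for (j in jsons) get_text(j, type = "async", save_to_file = TRUE)
merge_shards()
## End(Not run)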
Converts a PDF file to a base64-encoded binary .tiff file.
pdf_to_binbase(file)

file: path to a single-page pdf file.
a base64-encoded string
## Not run:
doc_encoded <- pdf_to_binbase("document.pdf")
## End(Not run)
This is a specialized function for use in connection with text reordering. It modifies a token dataframe by assigning new block bounding box values to a subset of tokens based on prior modifications made to a block dataframe.
reassign_tokens(token_df, block_df)

token_df: a dataframe generated by build_token_df().
block_df: a dataframe generated by build_block_df().
The token and block data frames provided as input must be from the same JSON output file.
a token data frame
## Not run:
new_token_df <- reassign_tokens(token_df, new_block_df)
## End(Not run)
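A minimal sketch of a split-and-reassign round trip, assuming block and token data frames built from the same JSON output file (see split_block() below):

## Not run:
block_df <- build_block_df("pdf_output.json", type = "async")
token_df <- build_token_df("pdf_output.json", type = "async")
new_block_df <- split_block(block_df, block = 7, cut_point = 33)
new_token_df <- reassign_tokens(token_df, new_block_df)
## End(Not run)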
This is a specialized function for use in connection with text reordering. It is designed to facilitate manual splitting of block boundary boxes and typically takes a one-row block dataframe generated by from_labelme().
reassign_tokens2(token_df, block, page = 1)

token_df: a data frame generated by build_token_df().
block: a one-row data frame of the same format as one generated by from_labelme().
page: the number of the page on which the block belongs.
a token data frame
## Not run:
new_token_df <- reassign_tokens2(token_df, new_block_df)

new_token_df <- reassign_tokens2(token_df, new_block_df, 5)
## End(Not run)
Tool to visually check the order of block bounding boxes after manual processing (e.g. block reordering or splitting). Takes as its main input a token dataframe generated with build_token_df(), reassign_tokens(), or reassign_tokens2(). The function plots the block bounding boxes onto images of the submitted document. Generates an annotated .png file for each page in the original document.
redraw_blocks(json, token_df, dir = getwd())

json: filepath of a JSON file obtained using dai_async().
token_df: a token data frame generated with build_token_df(), reassign_tokens(), or reassign_tokens2().
dir: path to the desired output directory.
Not vectorized, but documents can be multi-page.
no return value, called for side effects
## Not run:
redraw_blocks("pdf_output.json", revised_token_df, dir = tempdir())
## End(Not run)
This function 'splits' an existing block bounding box, in the sense of changing its coordinates, vertically or horizontally at a specified point. It takes a block data frame as input and modifies it. The splitting produces a new block, which is added to the data frame while the old block's coordinates are updated. The function returns a revised block data frame.
split_block(block_df, page = 1, block, cut_point, direction = "v")

block_df: a dataframe generated by build_block_df().
page: the number of the page where the split will be made. Defaults to 1.
block: the number of the block to be split.
cut_point: a number between 0 and 100, where 0 is the existing left/top limit and 100 is the existing right/bottom limit.
direction: "v" for vertical split or "h" for horizontal split. Defaults to "v".
a block data frame
## Not run:
new_block_df <- split_block(block_df = old_block_df, block = 7, cut_point = 33)
## End(Not run)
tables_from_dai_file() is deprecated; please use get_tables() instead.
tables_from_dai_file(file)

file: filepath of a JSON file obtained using dai_async().
a list of data frames
## Not run:
tables <- tables_from_dai_file("document.json")
## End(Not run)
tables_from_dai_response() is deprecated; please use get_tables() instead.
tables_from_dai_response(object)

object: an HTTP response object returned by dai_sync().
a list of data frames
## Not run:
tables <- tables_from_dai_response(response)
## End(Not run)