Screenshots to tables
What's all this then?
Imagine you have this problem: you need to migrate tabular data off of a software that you don't want to use anymore. The software displays it as tables, on the screen. However, there is no export function, and there is no database that you can access. What to do?
Today, I worked up a tool chain where you can take screenshots of your tabular data, convert it to data, and put it all together in Excel. Let me break it down.
OCR
There are a whole bunch of services that will use Optical Character Recognition of some sort of pick up text and numbers from images, and serve it to you as textual data. Many of these have a web-based interface where you can drag-and-drop images (.jpg, .png, etc) into it, and it will give you a table that you can download as .xlsx, .csv, or something else.
I was specifically looking for something with a web API, so that I can blast through a whole set of images without needing to manually upload, wait, download, and rename for each image. If it had a web API, I can write a script in PowerShell to do all that.
i cam across extracttable, which has an API with a method to send up images. It even gives code snippets in PowerShell. Brilliant!
PowerShell
This is the code that orchestrates the process of sending up a set of images from a folder, and then saving the results in another folder.
$headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
$headers.Add("x-api-key", $apiKey)
$source = dir .\source
$dest = ".\result\"
if (!(Test-Path $dest)) { New-Item -ItemType Directory $dest}
$source |
% {
$filename = $_.FullName
$name = $_.Name
$basename = $_.BaseName
Write-Host Sending off $name
$multipartContent = [System.Net.Http.MultipartFormDataContent]::new()
$multipartFile = $filename
$FileStream = [System.IO.FileStream]::new($multipartFile, [System.IO.FileMode]::Open)
$fileHeader = [System.Net.Http.Headers.ContentDispositionHeaderValue]::new("form-data")
$fileHeader.Name = "input"
$fileHeader.FileName = $filename
$fileContent = [System.Net.Http.StreamContent]::new($FileStream)
$fileContent.Headers.ContentDisposition = $fileHeader
$multipartContent.Add($fileContent)
$stringHeader = [System.Net.Http.Headers.ContentDispositionHeaderValue]::new("form-data")
$stringHeader.Name = "dup_check"
$stringContent = [System.Net.Http.StringContent]::new("False")
$stringContent.Headers.ContentDisposition = $stringHeader
$multipartContent.Add($stringContent)
$body = $multipartContent
$response = Invoke-RestMethod 'https://trigger.extracttable.com' -Method 'POST' -Headers $headers -Body $body # <---- this is the important bit
$json = $response | ConvertTo-Json -Depth 100
$json > ".\result\$basename.json"
Write-Host Saved in ".\result\$basename.json"
}
Power Query
After that part is done, we can put it all together in Power Query.
let
getTable = ( fileContent as binary ) as table =>
let
Source = Json.Document( fileContent ),
Tables = Source[Tables],
Tables1 = Tables{0},
TableJson = Tables1[TableJson] as record,
fieldnames = Record.FieldNames(TableJson),
rowOrder =
let
toNumber = List.Transform(fieldnames, each Number.From(_)),
sort = List.Sort(toNumber),
toText = List.Transform(sort, each Text.From(_))
in
toText,
reorder = Record.ReorderFields(TableJson, rowOrder),
listOfRows = Record.ToList(TableJson),
toTable = Table.FromRecords(listOfRows),
cols = Table.ColumnNames(toTable),
blanksAsNulls = Table.ReplaceValue(toTable,"",null,Replacer.ReplaceValue,cols),
trimTable = removeEmptyColumns(blanksAsNulls),
#"Promoted Headers" = Table.PromoteHeaders(trimTable, [PromoteAllScalars=true])
in
#"Promoted Headers",
folder = Folder.Files(resultsFolder),
jsonFiles = Table.SelectRows(folder, each [Extension] = ".json"),
tables = Table.TransformRows(
jsonFiles,
each getTable([Content])
),
combined = Table.Combine(tables)
in
combined