|
This function extracts the content from a page.
Four formats are supported: "Text", "SVG", "SVG+" and
"SVG+2".
Text is returned in layout order, which may not be the same as
reading order. For example, what looks like a space to a user may
simply be two items of text positioned apart from each other, or it
may be an actual space character. ABCpdf makes sensible assumptions
about how items of text should be combined, but many situations are
ambiguous.
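To illustrate why the combination of text items is ambiguous, here is a minimal Python sketch of one possible gap heuristic. It is not ABCpdf's algorithm; the function name `join_text_items`, the tuple layout and the `gap_ratio` threshold are all assumptions made for illustration.

```python
def join_text_items(items, gap_ratio=0.25):
    """Join (text, x_start, x_end, font_size) items on one line, left to right.

    A space is inserted when the horizontal gap between two items exceeds
    gap_ratio * font_size. The threshold is an arbitrary illustrative choice.
    """
    ordered = sorted(items, key=lambda item: item[1])
    result = ""
    previous_end = None
    for text, x_start, x_end, font_size in ordered:
        gap_is_wide = (
            previous_end is not None
            and x_start - previous_end > gap_ratio * font_size
        )
        if gap_is_wide:
            result += " "
        result += text
        previous_end = x_end
    return result


if __name__ == "__main__":
    # Two items separated by a 6 unit gap at font size 20.
    line = [("Hello", 0, 50, 20), ("World", 56, 110, 20)]
    print(join_text_items(line))                 # gap counts as a space
    print(join_text_items(line, gap_ratio=0.5))  # same gap, no space
```

The same pair of items yields different text depending on the threshold, which is the kind of ambiguity described above.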
SVG is an XML-based format for representing vector graphics.
Because SVG is standard XML, it is easy to parse and gives you the
precise position of each item of text on the page. The way that
ABCpdf constructs the SVG should make it easy to extract any
information you require. ABCpdf currently supports SVG text, paths
and image placeholders.
For example, a simple "Hello World" PDF might produce the
following SVG content:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<text x="0" y="76.8" font-size="96" font-family="Times-Roman"
>Hello World</text>
</svg>
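Because the output is standard XML, any XML parser can read it. The following Python sketch uses the standard library's `xml.etree.ElementTree` to pull the text items and their positions out of the sample above. The helper names `local_name` and `extract_text_items` are hypothetical, and the sketch assumes the SVG string has already been obtained from ABCpdf.

```python
import xml.etree.ElementTree as ET

# Sample output copied from the example above.
SVG_SAMPLE = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<text x="0" y="76.8" font-size="96" font-family="Times-Roman"
>Hello World</text>
</svg>"""


def local_name(tag):
    """Strip an XML namespace such as '{uri}text', if one is present."""
    return tag.rsplit("}", 1)[-1]


def extract_text_items(svg_source):
    """Return one dict per <text> element, in document order."""
    root = ET.fromstring(svg_source)
    items = []
    for element in root.iter():
        if local_name(element.tag) != "text":
            continue
        items.append({
            "text": element.text or "",
            "x": float(element.get("x", "0")),
            "y": float(element.get("y", "0")),
            "font_size": float(element.get("font-size", "0")),
            "font_family": element.get("font-family", ""),
        })
    return items


if __name__ == "__main__":
    for item in extract_text_items(SVG_SAMPLE):
        print(item)
```

The namespace stripping is a precaution only; the sample shown here carries no namespace declaration.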
SVG+ and SVG+2 are annotated forms of SVG which include details
of the PDF operators and how they relate to the items of content in
the SVG. They can be very useful if you are trying to deconstruct a
page and determine how objects in the PDF relate to objects in the
SVG. In SVG+, SVG elements appear before the pdf elements for their
generating operators, and the pdf elements for the Do operator on
Form XObjects are not generated. In SVG+2, SVG elements appear
after the pdf elements of their generating operators, and the pdf
elements for the Do operator on Form XObjects are generated.
For example, you could use SVG+ to identify the section of a PDF
stream that relates to a particular word on a page. You could then
replace the text show operator for that word with another one.
Effectively, you would be performing a low-level search and replace
on the PDF document. Note, however, that the layout would not
adjust automatically if, for example, you replaced a short word
with a long one.
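The splice itself can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ABCpdf code: `replace_operator` and `pdf_literal_string` are hypothetical helpers, the stream bytes are made up, the stream is assumed to be uncompressed, and the font is assumed to use a simple single-byte encoding. Reading the stream from the document and writing it back are outside the scope of the sketch.

```python
def replace_operator(stream, offset, length, new_operator):
    """Return a copy of stream with bytes [offset, offset + length) replaced."""
    if offset < 0 or length < 0 or offset + length > len(stream):
        raise ValueError("operator span lies outside the stream")
    return stream[:offset] + new_operator + stream[offset + length:]


def pdf_literal_string(text):
    """Encode text as a PDF literal string, escaping backslashes and parentheses.

    Assumes the characters map directly to single bytes (Latin-1), which
    is not true for every font or encoding.
    """
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return b"(" + escaped.encode("latin-1") + b")"


# Hypothetical uncompressed content stream, for illustration only.
STREAM = b"BT\r\n/F1 12 Tf\r\n(Hello) Tj\r\nET"

# In practice the offset and length would come from the annotated SVG;
# here they are located by searching the made-up stream.
old_operator = b"(Hello) Tj"
offset = STREAM.index(old_operator)
patched = replace_operator(
    STREAM, offset, len(old_operator), pdf_literal_string("Goodbye") + b" Tj"
)

if __name__ == "__main__":
    print(patched)
```

Because the replacement is two bytes longer than the original, every operator after it moves by two bytes, so offsets read earlier for later operators no longer apply to the patched stream.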
There is no official standard for SVG+, but if you are familiar
with the PDF specification it should be easy enough to
understand.
For example, a simple "Hello World" PDF might produce the
following annotated content:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="612" height="792" x="0" y="0">
<pdf pdf_Op="q" pdf_StreamID="5" pdf_StreamOffset="0"
pdf_StreamLength="1" />
<pdf pdf_Op="BT" pdf_StreamID="5" pdf_StreamOffset="3"
pdf_StreamLength="2" />
<pdf pdf_Op="0 Tr" pdf_StreamID="5" pdf_StreamOffset="7"
pdf_StreamLength="4" />
<pdf pdf_Op="/Fabc6 96 Tf" pdf_StreamID="5"
pdf_StreamOffset="13" pdf_StreamLength="12" />
<pdf pdf_Op="0 0 0 rg" pdf_StreamID="5" pdf_StreamOffset="27"
pdf_StreamLength="8" />
<pdf pdf_Op="1 0 0 1 0 715.2 Tm" pdf_StreamID="5"
pdf_StreamOffset="37" pdf_StreamLength="18" />
<pdf pdf_Op="0 Ts" pdf_StreamID="5" pdf_StreamOffset="57"
pdf_StreamLength="4" />
<text x="0" y="76.8" font-size="96" font-family="Times-Roman"
pdf_CTM="1 0 0 1 0 0" pdf_TM="1 0 0 1 0 715.2" pdf_Trm="96 0 0 96 0
715.2" pdf_Tf="Fabc6" pdf_Tz="100" pdf_Ts="0" pdf_w1000="5027"
pdf_Op="(Hello World) Tj" pdf_StreamID="5" pdf_StreamOffset="63"
pdf_StreamLength="16" >Hello World</text>
<pdf />
<pdf pdf_Op="ET" pdf_StreamID="5" pdf_StreamOffset="81"
pdf_StreamLength="2" />
<pdf pdf_Op="Q" pdf_StreamID="5" pdf_StreamOffset="85"
pdf_StreamLength="1" />
</svg>
The operators within the PDF stream are detailed in the SVG. For
example, the first 'q' operator is located in the stream with
object ID 5 at offset 0 and has a length of 1 byte. The 'Tj'
operator that shows "Hello World" is at offset 63 and has a length
of 16 bytes. The Current Transformation Matrix (CTM), the Text
Matrix (TM) and other important PDF state values are also shown.
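As a rough sketch of how these annotations might be consumed, the Python below lists every element that carries a `pdf_Op` attribute together with its stream ID, offset and length. The sample is an abridged copy of the annotated example, and `list_operators` and `svg_y_from_text_matrix` are hypothetical helper names. The second helper encodes an inference from this one sample only: with an identity CTM, the SVG y value appears to be the page height minus the vertical translation of the text matrix (792 - 715.2 = 76.8). Treat that relation as an assumption rather than a documented rule.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the annotated example (some <pdf> elements omitted).
ANNOTATED = """<svg width="612" height="792" x="0" y="0">
<pdf pdf_Op="q" pdf_StreamID="5" pdf_StreamOffset="0"
pdf_StreamLength="1" />
<pdf pdf_Op="1 0 0 1 0 715.2 Tm" pdf_StreamID="5"
pdf_StreamOffset="37" pdf_StreamLength="18" />
<text x="0" y="76.8" font-size="96" font-family="Times-Roman"
pdf_CTM="1 0 0 1 0 0" pdf_TM="1 0 0 1 0 715.2" pdf_Trm="96 0 0 96 0
715.2" pdf_Tf="Fabc6" pdf_Tz="100" pdf_Ts="0" pdf_w1000="5027"
pdf_Op="(Hello World) Tj" pdf_StreamID="5" pdf_StreamOffset="63"
pdf_StreamLength="16" >Hello World</text>
<pdf />
<pdf pdf_Op="Q" pdf_StreamID="5" pdf_StreamOffset="85"
pdf_StreamLength="1" />
</svg>"""


def list_operators(svg_source):
    """Return one dict per element that records a PDF operator."""
    root = ET.fromstring(svg_source)
    operators = []
    for element in root.iter():
        op = element.get("pdf_Op")
        if op is None:
            continue  # the <svg> root and the empty <pdf /> carry no operator
        operators.append({
            "op": op,
            "tag": element.tag,
            "stream_id": int(element.get("pdf_StreamID")),
            "offset": int(element.get("pdf_StreamOffset")),
            "length": int(element.get("pdf_StreamLength")),
        })
    return operators


def svg_y_from_text_matrix(page_height, pdf_tm):
    """Flip the vertical translation of a text matrix to a top-down y.

    Assumes an identity CTM; inferred from the sample, not documented.
    """
    vertical_translation = float(pdf_tm.split()[5])
    return page_height - vertical_translation


if __name__ == "__main__":
    for entry in list_operators(ANNOTATED):
        print(entry)
    print(round(svg_y_from_text_matrix(792, "1 0 0 1 0 715.2"), 1))
```

Note that the text show operator is reported on the `text` element itself rather than on a separate `pdf` element, so a consumer needs to inspect both kinds of element.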
|