Title: | Read Text Out of Modern Office Files |
---|---|
Description: | Reads in text from 'unstructured' modern Microsoft Office files (XML based files) such as Word (.docx) and PowerPoint (.pptx). This does not read in structured data (from Excel or Access) as there are many other great packages to that do so already. |
Authors: | Mark Ewing |
Maintainer: | Mark Ewing <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.0 |
Built: | 2025-03-01 05:52:23 UTC |
Source: | https://github.com/bmewing/readoffice |
Read data from a Modern Word File
read_docx(docx, tables = T, drawings = T, diagrams = T)
read_docx(docx, tables = T, drawings = T, diagrams = T)
docx |
The .docx file to read |
tables |
Should tables be processed from the document? |
drawings |
Should drawings be processed from the document? |
diagrams |
Should diagrams be processed from the document? |
Only accepts one file at a time and only .docx files. Modifying file extensions will not work.
List is comprised of named elements, one per 'section' (sections are recognized after a page break). If tables exist in the document and are processed, then the named list elements will be lists containing the text of paragraphs, drawings (if present and processed) and matrices holding the table structure. Otherwise, the list elements will contain vectors of the text processed.
Diagrams are typically what Microsoft calls 'SmartArt'
Named list with document contents
read_docx(docx = system.file('extdata','example.docx',package='readOffice')) read_docx(docx = system.file('extdata','example.docx',package='readOffice'),diagrams=FALSE)
read_docx(docx = system.file('extdata','example.docx',package='readOffice')) read_docx(docx = system.file('extdata','example.docx',package='readOffice'),diagrams=FALSE)
Read data from a Modern PowerPoint File
read_pptx(pptx, tables = T, drawings = T, diagrams = T)
read_pptx(pptx, tables = T, drawings = T, diagrams = T)
pptx |
The .pptx file to read |
tables |
Should tables be processed from the document? |
drawings |
Should drawings be processed from the document? |
diagrams |
Should diagrams be processed from the document? |
Only accepts one file at a time and only .pptx files. Modifying file extensions will not work.
The returned list contains named lists of the elements on the slide, each element of which is either a data.frame or a matrix containing the text and minor details about the structure on the page.
Data frames will contain the text in addition to the following columns: "Bulleted" indicates if the text is part of a bulleted or numbered list on the slide. "Hierarchy" indicates the tabbed depth of the element in a bulleted or numbered list (NA if not bulleted).
Alternatively, returns a matrix for tables on the slide.
List containing slide elements.
read_pptx(system.file('extdata','example.pptx',package='readOffice')) read_pptx(system.file('extdata','example.pptx',package='readOffice'),diagrams=FALSE)
read_pptx(system.file('extdata','example.pptx',package='readOffice')) read_pptx(system.file('extdata','example.pptx',package='readOffice'),diagrams=FALSE)