Thursday, August 11, 2011

Img2Txt.pl (beta forensic tool)

If you followed the Casey Anthony murder trial or my DF Source interview with Detective Sandra Osborne of the Orange County Sheriff's Office, you know the chloroform searches were a hot topic of discussion and carried real evidentiary value during the trial. See Part I and Part II of my interview with Det. Sandra Osborne.

Matthew Seyer (part of Richland College's digital forensics program) has created a Perl script called Img2Txt.pl [beta], which extracts text from images using the OCR program Tesseract. After Img2Txt.pl runs, the extracted text can be indexed and used in keyword searches. Matthew was inspired to write the script after watching the digital forensics presented in the Casey Anthony trial. He has created a YouTube video showing how the beta script works, and the demo even uses an image presented during the trial.

During a conversation, Mr. Seyer told me, "I think the best way that this script can be used as of now is to extract out all image files with EnCase and preserve the folder structure, then run the script on the root folder of the extracted images. The script should preserve the folder structure of the images and copy the text to the output folder; the text files are named the same as the image in the same folder just with a .txt added to it. You can then index the output folders and search for keywords with your tool of choice. This was the quickest way of keeping track of what image it came from that I could think of on the spot."
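
To make that workflow concrete, here is a rough Python sketch of the same idea. To be clear, this is not Matthew's Perl script and has not been checked against it. The function names, the list of image extensions, and the command-line arguments are my own assumptions for illustration. The only external piece it relies on is the standard `tesseract <image> <outputbase>` command, which writes its text to `<outputbase>.txt`.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Assumed set of image extensions; adjust to whatever was exported.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def text_path_for(image, image_root, output_root):
    """Map an image path to its text file: same relative folder, name plus '.txt'."""
    relative = Path(image).relative_to(image_root)
    return Path(output_root) / relative.parent / (relative.name + ".txt")


def ocr_tree(image_root, output_root):
    """Run tesseract on every image under image_root, mirroring the folder layout."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    failed = []
    for image in sorted(Path(image_root).rglob("*")):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text_file = text_path_for(image, image_root, output_root)
        text_file.parent.mkdir(parents=True, exist_ok=True)
        # tesseract appends ".txt" itself, so the output base is the image name.
        output_base = text_file.parent / image.name
        result = subprocess.run(["tesseract", str(image), str(output_base)])
        if result.returncode != 0:
            failed.append(image)
    return failed


if __name__ == "__main__":
    # Hypothetical usage: python ocr_tree.py <extracted_images_root> <output_root>
    for image in ocr_tree(sys.argv[1], sys.argv[2]):
        print("OCR failed:", image)
```

Because each text file keeps the image's full name and relative folder, a keyword hit in `out/Pictures/scan01.jpg.txt` points straight back to `Pictures/scan01.jpg` in the exported set.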


Matthew stated that Img2Txt.pl has not been peer reviewed or tested (this is where you come in). The tool is in beta, and Matthew is releasing it via DF Source to get feedback from the digital forensic community. Please give it a test drive and provide feedback. While the tool is still in its beta stage, with DF community support it could bring more context to digital forensic investigations and to how we approach keyword searches: pulling text from images, indexing that text, and then integrating the text data into keyword searches. (Disclaimer: As with any digital forensic tool, you must test and validate your findings. YMMV)
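
For anyone who wants to try the keyword step without a full indexing tool, here is a minimal Python sketch of a search over the extracted text. It assumes the output folder contains one `.txt` file per image, named as the image plus `.txt`, as Matthew describes. The function name `find_keyword` is hypothetical, and a plain substring match is only a stand-in for whatever indexing tool you normally use.

```python
from pathlib import Path


def find_keyword(output_root, keyword):
    """Return relative image paths whose extracted text contains the keyword."""
    output_root = Path(output_root)
    needle = keyword.lower()
    hits = []
    for text_file in sorted(output_root.rglob("*.txt")):
        # OCR output can contain odd bytes, so undecodable ones are replaced.
        text = text_file.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            # Dropping the trailing ".txt" recovers the image's name.
            hits.append(text_file.relative_to(output_root).with_suffix(""))
    return hits
```

Calling `find_keyword("out", "chloroform")` would list every exported image whose OCR text mentions the term, case-insensitively. OCR misreads will cause misses, so treat an empty result as "not found by this search" and not as proof the text is absent.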

Visit the Img2Txt.pl page for more information on this beta digital forensic tool.
Img2Txt-v1.0Beta.pl (hosted on DF Source)

Contact Matthew Seyer
About Matthew Seyer

1 comment:

Dennis said...

I find all of this very interesting. It's great to see applications of OCR software for "the good guys".

I've been doing some work regarding command-line OCR software and sensitive information, though more from the attacker's side.

Here is a link to a blog post with some earlier work I've done on the topic - Mining Sensitive Information with Command-Line OCR.

I will be expanding on this at the upcoming DerbyCon conference.

I would love to see more on the topic and open up conversations with anyone interested.

Good work and thanks so much for sharing the tool and the approach!