Tuesday, January 11, 2011

C#.net - Extract image from PDF file.

In this article I will show you how to extract images from a PDF file.

Step 1
First, download "ITextSharp.dll" from the following link.
http://sourceforge.net/projects/itextsharp/

Step 2
Create a Console application and name the solution ConExtractImagefromPDF.

Step 3
Add two assembly references to the project from Solution Explorer.

1. ITextSharp.dll

2. System.Drawing.dll

Step 4
Write a static method that extracts the images from a PDF file. It looks like this:

        /// <summary>
        ///  Extract the images from a PDF file and return them as Image objects
        /// </summary>
        /// <param name="PDFSourcePath">Full path of the source PDF file</param>
        /// <returns>List of the images found in the PDF</returns>
        private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
        {
            List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

            iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
            iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
            iTextSharp.text.pdf.PdfObject PDFObj = null;
            iTextSharp.text.pdf.PdfStream PDFStremObj = null;

            try
            {
                RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
                PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);

                for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
                {
                    PDFObj = PDFReaderObj.GetPdfObject(i);

                    if ((PDFObj != null) && PDFObj.IsStream())
                    {
                        PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                        iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);

                        if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                        {
                            try
                            {
                                iTextSharp.text.pdf.parser.PdfImageObject PdfImageObj =
                                    new iTextSharp.text.pdf.parser.PdfImageObject((iTextSharp.text.pdf.PRStream)PDFStremObj);

                                System.Drawing.Image ImgPDF = PdfImageObj.GetDrawingImage();

                                ImgList.Add(ImgPDF);
                            }
                            catch (Exception)
                            {
                                // Skip any image that cannot be converted to a System.Drawing.Image
                            }
                        }
                    }
                }
            }
            finally
            {
                // Always close the reader, even when extraction fails
                if (PDFReaderObj != null)
                {
                    PDFReaderObj.Close();
                }
            }
            return ImgList;
        }


Step 5
Write a static method that stores the extracted images in a folder. It looks like this:

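        /// <summary>
        ///  Extract the images from a PDF file and save each one in the ImageStore folder
        /// </summary>
        private static void WriteImageFile()
        {
            System.Console.WriteLine("Extracting images from the PDF file, please wait....");

            // Get the list of images (change this path to point at your own PDF file)
            List<System.Drawing.Image> ListImage = ExtractImages(@"C:\Users\Kishor\Desktop\TuterPDF\ASP.net\ASP.NET 3.5 Unleashed.pdf");

            // Image.Save fails if the target folder does not exist, so create it first
            String OutputFolder = System.IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "ImageStore");
            System.IO.Directory.CreateDirectory(OutputFolder);

            for (int i = 0; i < ListImage.Count; i++)
            {
                try
                {
                    // Write the image file
                    ListImage[i].Save(System.IO.Path.Combine(OutputFolder, "Image" + i + ".jpeg"), System.Drawing.Imaging.ImageFormat.Jpeg);
                    System.Console.WriteLine("Image" + i + ".jpeg written successfully");
                }
                catch (Exception ex)
                {
                    // Report the failure and continue with the next image
                    System.Console.WriteLine("Image" + i + ".jpeg could not be written: " + ex.Message);
                }
            }
        }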
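
The method above saves every image as JPEG, and several readers ask in the comments how to keep the original image type. One possible approach, sketched here under assumptions, is to look at the first bytes of the encoded image data and choose a file extension from the well-known file signatures. The names `ImageSignature` and `GuessExtension` are hypothetical and are not part of iTextSharp or System.Drawing. The sketch assumes you already have the encoded image as a byte array; how you obtain those bytes depends on your iTextSharp version and is not shown here.

```csharp
using System;

public static class ImageSignature
{
    // Hypothetical helper: returns a file extension based on well-known
    // file signatures ("magic numbers"), or ".bin" if none of them matches.
    public static string GuessExtension(byte[] data)
    {
        if (data == null) throw new ArgumentNullException("data");

        if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return ".jpeg";
        if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return ".png";
        if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return ".gif";    // "GIF8"
        if (StartsWith(data, 0x42, 0x4D)) return ".bmp";                // "BM"
        if (StartsWith(data, 0x49, 0x49, 0x2A, 0x00)) return ".tiff";   // TIFF, little-endian
        if (StartsWith(data, 0x4D, 0x4D, 0x00, 0x2A)) return ".tiff";   // TIFF, big-endian

        return ".bin";
    }

    // True when data begins with exactly the given signature bytes.
    private static bool StartsWith(byte[] data, params byte[] signature)
    {
        if (data.Length < signature.Length) return false;

        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

The returned extension could then replace the fixed ".jpeg" suffix when building the output file name. Data that matches none of the signatures falls back to ".bin" rather than being mislabeled.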

Step 6
Call the above method from the Main method. It looks like this:

        static void Main(string[] args)
        {
            try
            {
                WriteImageFile(); // write image file
            }
            catch (Exception ex)
            {
                System.Console.WriteLine(ex.Message);  
            }
        }
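
The PDF path used in WriteImageFile is hard-coded to a folder on the author's machine, so it has to be edited before the program runs anywhere else. As a minimal sketch, assuming you would rather pass the path on the command line, a helper like the one below could validate the first argument. The names `PdfPathArgument` and `TryGetPdfPath` are made up for illustration. Using the result would also mean giving WriteImageFile a path parameter, which the code in this article does not have.

```csharp
using System;
using System.IO;

public static class PdfPathArgument
{
    // Hypothetical helper: returns true and the full path when args[0]
    // names an existing file with a .pdf extension; otherwise returns
    // false and a message suitable for printing to the console.
    public static bool TryGetPdfPath(string[] args, out string pdfPath, out string error)
    {
        pdfPath = null;
        error = null;

        if (args == null || args.Length == 0)
        {
            error = "Usage: ConExtractImagefromPDF <path-to-pdf>";
            return false;
        }

        string candidate = args[0];

        if (!string.Equals(Path.GetExtension(candidate), ".pdf", StringComparison.OrdinalIgnoreCase))
        {
            error = "Not a .pdf file: " + candidate;
            return false;
        }

        if (!File.Exists(candidate))
        {
            error = "File not found: " + candidate;
            return false;
        }

        pdfPath = Path.GetFullPath(candidate);
        return true;
    }
}
```

In Main, a failed check could print the error message and return before any extraction starts, so a wrong path is reported immediately.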

Full Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConExtractImagefromPDF
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                WriteImageFile(); // write image file
            }
            catch (Exception ex)
            {
                System.Console.WriteLine(ex.Message);  
            }
        }

        #region Methods

        /// <summary>
        ///  Extract the images from a PDF file and return them as Image objects
        /// </summary>
        /// <param name="PDFSourcePath">Full path of the source PDF file</param>
        /// <returns>List of the images found in the PDF</returns>
        private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
        {
            List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

            iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
            iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
            iTextSharp.text.pdf.PdfObject PDFObj = null;
            iTextSharp.text.pdf.PdfStream PDFStremObj = null;

            try
            {
                RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
                PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);

                for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
                {
                    PDFObj = PDFReaderObj.GetPdfObject(i);

                    if ((PDFObj != null) && PDFObj.IsStream())
                    {
                        PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                        iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);

                        if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                        {
                            try
                            {
                                iTextSharp.text.pdf.parser.PdfImageObject PdfImageObj =
                                    new iTextSharp.text.pdf.parser.PdfImageObject((iTextSharp.text.pdf.PRStream)PDFStremObj);

                                System.Drawing.Image ImgPDF = PdfImageObj.GetDrawingImage();

                                ImgList.Add(ImgPDF);
                            }
                            catch (Exception)
                            {
                                // Skip any image that cannot be converted to a System.Drawing.Image
                            }
                        }
                    }
                }
            }
            finally
            {
                // Always close the reader, even when extraction fails
                if (PDFReaderObj != null)
                {
                    PDFReaderObj.Close();
                }
            }
            return ImgList;
        }


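        /// <summary>
        ///  Extract the images from a PDF file and save each one in the ImageStore folder
        /// </summary>
        private static void WriteImageFile()
        {
            System.Console.WriteLine("Extracting images from the PDF file, please wait....");

            // Get the list of images (change this path to point at your own PDF file)
            List<System.Drawing.Image> ListImage = ExtractImages(@"C:\Users\Kishor\Desktop\TuterPDF\ASP.net\ASP.NET 3.5 Unleashed.pdf");

            // Image.Save fails if the target folder does not exist, so create it first
            String OutputFolder = System.IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "ImageStore");
            System.IO.Directory.CreateDirectory(OutputFolder);

            for (int i = 0; i < ListImage.Count; i++)
            {
                try
                {
                    // Write the image file
                    ListImage[i].Save(System.IO.Path.Combine(OutputFolder, "Image" + i + ".jpeg"), System.Drawing.Imaging.ImageFormat.Jpeg);
                    System.Console.WriteLine("Image" + i + ".jpeg written successfully");
                }
                catch (Exception ex)
                {
                    // Report the failure and continue with the next image
                    System.Console.WriteLine("Image" + i + ".jpeg could not be written: " + ex.Message);
                }
            }
        }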
        #endregion
    }
}



49 comments:

  1. Nice article..Thanks for Sharing code..
    you help me man..........

  2. was really looking for it, tested it and it works beautifully.
    but wanted instead to extract the files, convert each page of a PDF image, can anyone help?

  3. Peterbunny use this following link.
    I hope this article will help you.

    http://bytescout.com/products/developer/pdfextractorsdk/how-to-extract-images-from-pdf-page-by-page-in-c-%2523

  4. kishor bhai ek number...........

  5. Hi great code but I have some pdf and i can't extract images because the Images objects are not jpeg or bitmap.Can you help me?

  6. Can you tell me about Image type(Extension)?????

  7. Hi kishor..

    Even im facing the same problem. I have a pdf having image which has filter type "CCITFaxDecode". So im not able to extract the image out of the pdf. Cud u please help me with this....

    Thank you
    Mugdha

  8. Same problem for me...

  9. thank you for sharing the code, I can't extract images I'm using .tif

  10. Fantastic!! I got the code up and running in ten minutes and it worked perfectly. Saved me hours of work. I did not use the class as static in a console app but simply pulled the two main functions to read and write and plopped them in my existing class and ran it. Worked first time. Thank you Kishor.

  11. HI I am getting error "parameter is invalid".
    Please help

  12. thanks for shared :)

  13. Thanks that is what I need. But for CCITT images the .net framework has no the support... however to have the stream with usefull data is the main first step.

  14. I am getting error
    "A generic error occurred in GDI+."
    Please help me.

    Replies
    1. Can you specify more details??????????

  15. hi Kishor Naik

    Can You help me please. i Developing a System which read text from PDF Image. I try to contact you, but i cannot find out any contact details. please dude help me. i stuck here. i try lot of ways. my email address is ganeshrasi.lk@gmail.com.
    please replay me.

  16. Hi Kishore

    Do you have code for extracting text from pdf into a text file, using any other package except iTextSharp dll. as iTextsharp is getting it in left to right manner, which is spoiling the textual information. for example I am not getting the address in one string, its embedding the right side text into the address if it is on the left side. i want it should maintain the location information to get the data from the tables. it'll great if you could help me. thanks in advance.
    if you have the code pls send it to my email id simyg17@gmail.com

    Thanks,
    Simy

  17. The code is not working. It shows the error as 'Parameter is not valid'. Pls help

    Replies
    1. Can you send your Code In My Mail ID???

      kishor.naik011.net@gmail.com

  18. I too get a parameter is not valid exception thrown when trying to run this code.

    the invalid parameter is 'MS' in line

    "Image ImgPDF = Image.FromStream(MS);"

    Any advice on what the cause is would be greatly appreciated.

    Replies
    1. First thanks for great code and great discussion.
      I had the invalid parameter exception too, I added your code and it fixed it but I got another exception:
      "Color Depth one is not supported".
      Any idea what that is.
      appreciate your help

    2. Sorry i did not reply because i am busy in projects.

      Can you pass your solution to my mail ID

  19. Hello Kishore,
    I am able to run the program but the image count is showing zero. Does it means Images from PDF not being extracted or am I missing something.
    P.S. My PDF file comes from scan pages.

    Replies
    1. Kishor, will you have a chance to upload your new code?

    2. Sorry for late Reply.
      I updated Code in My Solution Project and Blog.

  20. Thanks Kishor... Its really a nice work. But all images are saving as jpeg. Is it possible to save with its original extension then may be the quality of the images will be same as source images.

    Thanking again.

    Replies
    1. Yes you can identify original extension of image by using System.Drawing.Image object.

      here is Extension Method of System.Drawing.Image.

      public static class Extension
      {
          #region Methods

          public static System.Drawing.Imaging.ImageFormat GetImageFormat(this System.Drawing.Image ImageFormatObj)
          {
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Jpeg))
                  return System.Drawing.Imaging.ImageFormat.Jpeg;
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Bmp))
                  return System.Drawing.Imaging.ImageFormat.Bmp;
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Png))
                  return System.Drawing.Imaging.ImageFormat.Png;
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Emf))
                  return System.Drawing.Imaging.ImageFormat.Emf;
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Exif))
                  return System.Drawing.Imaging.ImageFormat.Exif;
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Gif))
                  return System.Drawing.Imaging.ImageFormat.Gif;
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Icon))
                  return System.Drawing.Imaging.ImageFormat.Icon;
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.MemoryBmp))
                  return System.Drawing.Imaging.ImageFormat.MemoryBmp;
              if (ImageFormatObj.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Tiff))
                  return System.Drawing.Imaging.ImageFormat.Tiff;
              else
                  return System.Drawing.Imaging.ImageFormat.Wmf;
          }

          #endregion
      }

      Call Extension Method Like this

      System.Drawing.Image ImgPDF = PdfImageObj.GetDrawingImage();

      System.Drawing.Imaging.ImageFormat ImageFormatObj = ImgPDF.GetImageFormat();

  21. i have a requirement where i need to get dimension and top left coordinate of image, is that posible with iText

    Replies
    1. I think you have to use this library....

      http://bitmiracle.com/pdf-library/help/extract-image-coordinates.aspx

  22. Sir do you know how to determine if the image extracted is grayscale or truecolor? Need help please. Thank you!

    Replies
    1. I had created a method for you which detect image is grayscale or TrueColor.

      Please forward your Mail-ID. so I will send whole Solution on your mail ID with some instruction.

    2. Sir here is my Mail-ID : softwareengineer.eighteen@gmail.com. My project is to count the pages in a pdf and count the images then detect if it is gray or rgb or cmyk. please help me. thank you so much.

    3. Hi Sir Kishor, can you also provide me with the same solution as John asked? Here's my email-ID : magscy@gmail.com

      Thank you so much!

  23. nice article, however often you need to do the opposite - get pdf converted to image. It's not possible to convert pdf to image with iText and I used Apitron PDF Rasterizer for .NET for this task

  24. Great Thanks!
    Alistair in England

  25. i have one issue how do i get one by one pdf page image

  26. How can i extract image details from a image and store in database?

  27. I'm not a developer, i always use this free online service to extract image from pdf online

    Replies
    1. rasteredge can provide youc# add comments to pdf reader, and download it to try it free on rasteredge page http://www.rasteredge.com/how-to/csharp-imaging/pdf-html5-feature-annotate/

  28. hi, kishor, can u plz help me to extract image from specific coordinates point of pdf file using c# and store the image in database..... i need this as soon as possible.......... plz send me the code on this E-Mail ID - pioneer.shanky@gmail.com

  29. Hello everyone.
    Can anyone help me in saving the images with their original names.

    Thank you

  30. hi, kishor, can u plz tell me how to find coordinates of extracted image

    Replies
    1. c# coolmuster pdf image extractor on pag ehttp://www.rasteredge.com/how-to/csharp-imaging/pdf-text-extract/

  31. Here is the link for you to c# .net extract text from pdf. Hope this gives you a start on rasteredge page ttp://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-text/
