
python - Reading images in Spark using PySpark

Hi there, I have a lot of images (in the low millions) that I need to run classification on. I am using Spark and managed to read all the images into a big RDD in the format (filename1, content1), (filename2, content2), ...

images = sc.wholeTextFiles("hdfs:///user/myuser/images/image/00*")  

However, I am really confused about what to do with the unicode representation of the image.

Here is an example of one image/file:

(u'hdfs://NameService/user/myuser/images/image/00product.jpg', u'\ufffd\ufffd\ufffd\ufffd\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00\ufffd\ufffd\x01\x1eExif\x00\x00II*\x00\x08\x00\x00\x00\x08\x00\x12\x01\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00\x1a\x01\x05\x00\x01\x00\x00\x00n\x00\x00\x00\x1b\x01\x05\x00\x01\x00\x00\x00v\x00\x00\x00(\x01\x03\x00\x01\x00\x00\x00\x02\x00\x00\x001\x01\x02\x00\x0b\x00\x00\x00~\x00\x00\x002\x01\x02\x00\x14\x00\x00\x00\ufffd\x00\x00\x00\x13\x02\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00i\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00GIMP 2.8.2\x00\x002013:07:29 10:41:35\x00\x07\x00\x00\ufffd\x07\x00\x04\x00\x00\x000220\ufffd\ufffd\x02\x00\x04\x00\x00\x00407\x00\x00\ufffd\x07\x00\x04\x00\x00\x000100\x01\ufffd\x03\x00\x01\x00\x00\x00\ufffd\ufffd\x00\x00\x02\ufffd\x04\x00\x01\x00\x00\x00\x04\x04\x00\x00\x03\ufffd\x04\x00\x01\x00\x00\x00X\x01\x00\x00\x05\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00\x02\x00\x01\x00\x02\x00\x04\x00\x00\x00R98\x00\x02\x00\x07\x00\x04\x00\x00\x000100\x00\x00\x00\x00\ufffd\ufffd\x04_http://ns.adobe.com/xap/1.0/\x00<?xpacket begin='\ufeff' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

 <rdf:Description xmlns:exif='http://ns.adobe.com/exif/1.0/'>
  <exif:Orientation>Top-left</exif:Orientation>
  <exif:XResolution>96</exif:XResolution>
  <exif:YResolution>96</exif:YResolution>
  <exif:ResolutionUnit>Inch</exif:ResolutionUnit>
  <exif:Software>ACD Systems Digital Imaging</exif:Software>
  <exif:DateTime>2013:07:29 10:37:00</exif:DateTime>
  <exif:YCbCrPositioning>Centered</exif:YCbCrPositioning>
  <exif:ExifVersion>Exif Version 2.2</exif:ExifVersion>
  <exif:SubsecTime>407</exif:SubsecTime>
  <exif:FlashPixVersion>FlashPix Version 1.0</exif:FlashPixVersion>
  <exif:ColorSpace>Uncalibrated</exif:ColorSpace>
  

Looking closer, there are actually some characters in there that look like metadata, for example:

...
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description xmlns:exif='http://ns.adobe.com/exif/1.0/'>
  <exif:Orientation>Top-left</exif:Orientation>
  <exif:XResolution>96</exif:XResolution>
  <exif:YResolution>96</exif:YResolution>
...

My previous experience is with the SciPy package and related functions like imread, where the input is usually a filename. Now I am really lost as to what this unicode data means and how I can transform it into a format that I am familiar with.
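
For reference, the filename-based workflow I am used to looks roughly like this (scipy.misc.imread from the older SciPy releases; the local path is just a placeholder):

from scipy import misc                             # older SciPy; imread needs PIL installed
img = misc.imread('/local/path/00product.jpg')     # filename in, numpy ndarray out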

Can anyone share with me how I can read this unicode data into a SciPy image (ndarray)?


1 Reply


Your data looks like the raw bytes of a real image file (JPEG?). The problem with your data is that it should be bytes, not unicode. You have to figure out how to convert from unicode to bytes; that is a whole can of worms full of encoding traps, but you may get lucky with img.encode('iso-8859-1'). I don't know, and I won't deal with that in this answer.
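
The \ufffd characters in your dump suggest the bytes were decoded as UTF-8 with undecodable bytes replaced, and that replacement is lossy, so the round trip may simply not be possible. A minimal sketch of the problem (Python 2, as in the rest of this answer):

>>> raw = b'\xff\xd8\xff\xe0\x00\x10JFIF'       # the first bytes of a real JPEG
>>> text = raw.decode('utf-8', 'replace')       # roughly what wholeTextFiles hands you
>>> text
u'\ufffd\ufffd\ufffd\ufffd\x00\x10JFIF'
>>> text.encode('iso-8859-1')                   # U+FFFD has no latin-1 byte, so this blows up
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)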

The raw data for a PNG image looks like this:

rawdata = '\x89PNG\r\n\x1a\n\x00\x00...\x00\x00IEND\xaeB`\x82'

Once you have it in bytes, you can create a PIL image from the raw data and read it into a NumPy array:

>>> from StringIO import StringIO
>>> from PIL import Image
>>> import numpy as np
>>> np.asarray(Image.open(StringIO(rawdata)))

array([[[255, 255, 255,   0],
        [255, 255, 255,   0],
        [255, 255, 255,   0],
        ...,
        [255, 255, 255,   0],
        [255, 255, 255,   0],
        [255, 255, 255,   0]]], dtype=uint8)

All you need to make it work on Spark is SparkContext.binaryFiles:

>>> images = sc.binaryFiles("path/to/images/")
>>> image_to_array = lambda rawdata: np.asarray(Image.open(StringIO(rawdata)))
>>> images.values().map(image_to_array)
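
If you are on Python 3 (or just prefer the io module), StringIO gives way to io.BytesIO; a minimal sketch that also keeps the filenames next to the decoded arrays, assuming Pillow and NumPy are installed on the executors:

>>> import io
>>> import numpy as np
>>> from PIL import Image
>>> to_array = lambda rawdata: np.asarray(Image.open(io.BytesIO(rawdata)))
>>> images = sc.binaryFiles("path/to/images/")
>>> images.mapValues(to_array)                  # RDD of (filename, ndarray) pairs

io.BytesIO also exists on Python 2, so this variant is a drop-in replacement for the StringIO version above.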
