Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Parsing large JSON with Java/Gson: can't read the JSON structure

I'm trying to parse, using Java and Gson, a large (about 10 GB) database dump in JSON format from MusicBrainz.org.

The JSON file has the structure shown below: there is no '[' or ']' to indicate an array of objects, and no ',' between the objects. I don't know why, but that is how the file is.

{
    "id": "d0ab06e1-751a-414b-a976-da72670391b1",
    "name": "Arcing Wires",
    "sort-name": "Arcing Wires"
}
{
    "id": "6f0c2c16-dd7e-4268-a484-bc7b2ac78108",
    "name": "Another",
    "sort-name": "Another"
}
{
    "id": "e062b6cd-5506-47b0-afdb-72f4279ec38c",
    "name": "Agent S",
    "sort-name": "Agent S"
}

and this is the code that I'm using:

try (JsonReader jsonReader = new JsonReader(
        new InputStreamReader(
                new FileInputStream(jsonFilePath), StandardCharsets.UTF_8))) {
    Gson gson = new GsonBuilder().create();
    jsonReader.beginArray();
    while (jsonReader.hasNext()) {
        Artist mapped = gson.fromJson(jsonReader, Artist.class);
        // TODO do something with the object
    }
    jsonReader.endArray();
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

and the class that I mapped is this:

public class Artist {

    @SerializedName("id")
    public String id;
    @SerializedName("name")
    public String name;
    @SerializedName("sort-name")
    public String sortName;

}

the error I'm getting:

Exception in thread "main" java.lang.IllegalStateException: Expected BEGIN_ARRAY but was BEGIN_OBJECT at line 1 column 2 path $
at com.google.gson.stream.JsonReader.beginArray(JsonReader.java:350)
at DBLoader.parse(DBLoader.java:39)
at DBLoader.main(DBLoader.java:23)

I believe Gson expects a different structure from the one in the file, but I don't understand how to handle this kind of JSON, with no commas and no brackets. Any clues? Thanks.


1 Reply

By default, a JSON document holds exactly one top-level value (and yes, a document with a single top-level value is valid JSON). JSON streaming formats, however, concatenate multiple JSON values into one stream by various conventions and rely on the consumer being able to split them apart.

Gson's JsonReader supports a so-called lenient mode, enabled with setLenient(true), that turns off the "one top-level value only" restriction (and relaxes a few other rules that are irrelevant to the question). With lenient mode on, you can read JSON values one by one. This works for line-delimited JSON and for concatenated JSON because in both formats the values are separated by zero or more whitespace characters, which Gson ignores. The more exotic variants, record separator-delimited JSON and length-prefixed JSON, are therefore not supported.

Your code fails because it assumes that the stream contains a single JSON array, and it does not: the file is a stream of top-level objects that does not conform to JSON array syntax, so beginArray() fails on the very first token.
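To make the lenient-mode idea concrete, here is a minimal sketch that reads concatenated objects with a callback instead of collecting them, which matters for a file of this size. The class name LenientReadSketch, the method forEachArtist and the nested Artist copy are illustrative names, not part of Gson; only JsonReader, JsonToken, setLenient and Gson.fromJson come from the library.

```java
import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

public final class LenientReadSketch {

    // Same shape as the Artist class from the question, repeated to keep the sketch self-contained.
    static final class Artist {
        @SerializedName("id") String id;
        @SerializedName("name") String name;
        @SerializedName("sort-name") String sortName;
    }

    // Hypothetical helper: feeds every top-level object to the callback and returns how many were read.
    static long forEachArtist(final Reader reader, final Consumer<? super Artist> action) throws IOException {
        final Gson gson = new Gson();
        long count = 0;
        try (JsonReader jsonReader = new JsonReader(reader)) {
            // Lenient mode lifts the "exactly one top-level value" restriction.
            jsonReader.setLenient(true);
            // No beginArray()/endArray(): the input is not a JSON array.
            while (jsonReader.peek() != JsonToken.END_DOCUMENT) {
                final Artist artist = gson.fromJson(jsonReader, Artist.class);
                action.accept(artist);
                count++;
            }
        }
        return count;
    }

    public static void main(final String[] args) throws IOException {
        final String json = "{\"id\": \"1\", \"name\": \"Arcing Wires\", \"sort-name\": \"Arcing Wires\"}\n"
                + "{\"id\": \"2\", \"name\": \"Another\", \"sort-name\": \"Another\"}";
        final long count = forEachArtist(new StringReader(json), artist -> System.out.println(artist.name));
        System.out.println(count + " objects read");
    }
}
```

For the real dump, the StringReader would be replaced by the InputStreamReader over the FileInputStream from the question; only one Artist is held in memory at a time.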

A simple, generic helper for JSON streams might look like this. It uses the Stream API because it is richer than Iterator, but it is only meant to show the idea; you can easily adapt it to iterators, callbacks, observable streams, or whatever you like:

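import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;
import lombok.experimental.UtilityClass;

import javax.annotation.WillNotClose;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

@UtilityClass
public final class JsonStreamSupport {

    public static <T> Stream<T> parse(@WillNotClose final JsonReader jsonReader, final Function<? super JsonReader, ? extends T> readElement) {
        // Remember the caller's setting so it can be restored when the stream is closed
        final boolean isLenient = jsonReader.isLenient();
        jsonReader.setLenient(true);
        // The number of elements is unknown up front, hence Long.MAX_VALUE
        final Spliterator<T> spliterator = new Spliterators.AbstractSpliterator<T>(Long.MAX_VALUE, Spliterator.ORDERED) {
            @Override
            public boolean tryAdvance(final Consumer<? super T> action) {
                try {
                    final JsonToken token = jsonReader.peek();
                    if ( token == JsonToken.END_DOCUMENT ) {
                        return false;
                    }
                    // TODO: read more elements in batch
                    final T element = readElement.apply(jsonReader);
                    action.accept(element);
                    return true;
                } catch ( final IOException ex ) {
                    throw new UncheckedIOException(ex);
                }
            }
        };
        // The lenient flag is restored only if the returned stream is closed
        return StreamSupport.stream(spliterator, false)
                .onClose(() -> jsonReader.setLenient(isLenient));
    }

}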

And then:

JsonStreamSupport.<Artist>parse(jsonReader, jr -> gson.fromJson(jr, Artist.class))
        .forEach(System.out::println);

Output (assuming Artist has Lombok-generated toString()):

Artist(id=d0ab06e1-751a-414b-a976-da72670391b1, name=Arcing Wires, sortName=Arcing Wires)
Artist(id=6f0c2c16-dd7e-4268-a484-bc7b2ac78108, name=Another, sortName=Another)
Artist(id=e062b6cd-5506-47b0-afdb-72f4279ec38c, name=Agent S, sortName=Agent S)
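The same idea adapts to a plain Iterator, as mentioned above. The following is a rough sketch under that assumption, without Lombok or the Stream API; JsonIteratorSketch, iterate and the nested Artist are made-up names for illustration, and the caller remains responsible for closing the JsonReader.

```java
import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;

import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

public final class JsonIteratorSketch {

    // Same shape as the Artist class from the question, repeated to keep the sketch self-contained.
    static final class Artist {
        @SerializedName("id") String id;
        @SerializedName("name") String name;
        @SerializedName("sort-name") String sortName;
    }

    // Hypothetical helper: lazily reads one top-level value per next() call; does not close the reader.
    static <T> Iterator<T> iterate(final JsonReader jsonReader, final Function<? super JsonReader, ? extends T> readElement) {
        jsonReader.setLenient(true);
        return new Iterator<T>() {
            @Override
            public boolean hasNext() {
                try {
                    // peek() does not consume the token, so calling hasNext() repeatedly is safe
                    return jsonReader.peek() != JsonToken.END_DOCUMENT;
                } catch (final IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return readElement.apply(jsonReader);
            }
        };
    }

    public static void main(final String[] args) throws IOException {
        final Gson gson = new Gson();
        final String json = "{\"name\": \"Arcing Wires\"}\n{\"name\": \"Another\"}\n{\"name\": \"Agent S\"}";
        try (JsonReader jsonReader = new JsonReader(new StringReader(json))) {
            final Iterator<Artist> artists = iterate(jsonReader, jr -> gson.fromJson(jr, Artist.class));
            while (artists.hasNext()) {
                System.out.println(artists.next().name);
            }
        }
    }
}
```

Unlike the Stream version, this sketch leaves the reader in lenient mode; restoring the previous setting would need an explicit step once iteration is finished.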

How many bytes does JSON streaming actually save, such that the service you are consuming chose it over a regular JSON array? I don't know.

