Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

java - how to write a valid decoding file based on a given .proto, reading from a .pb

Based on the answer to this question I'm thinking that I've provided my .pb file with a "faulty decoder".

This is the data I'm trying to decode.

This is my .proto file.

Based on the ListPeople.java example provided in the Java tutorial documentation, I tried to write something similar to start picking apart that data. I wrote this:

import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document;
import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document.Sentence;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;


public class ListDocument
{
    // Iterates through all sentences in the Document and prints their tokens.
    static void Print(Document document)
    {
        for ( Sentence sentence: document.getSentencesList() )
        {
            for(int i=0; i < sentence.getTokensCount(); i++)
            {
                System.out.println(" getTokens(" + i + "): " + sentence.getTokens(i) );
            }
        }
    }

    // Main function:  Reads the entire document from a file and prints all
    //   the tokens inside.
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage:  ListDocument DOCUMENT_FILE");
            System.exit(-1);
        }

        // Read the existing document.
        Document document =
                Document.parseFrom(new FileInputStream(args[0]));

        Print(document);
    }
}

But when I run that, I get this error:

Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
    at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
    at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:194)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:210)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:215)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at cc.refectorie.proj.relation.protobuf.DocumentProtos$Document.parseFrom(DocumentProtos.java:4770)
    at ListDocument.main(ListDocument.java:40)

So, as I said above, I think this has to do with me not properly defining the decoder. Is there some way to look at the .proto file I'm trying to use and figure out how to just read off all that data?

Is there some way to look at that .proto file and see what I'm doing wrong?

These are the first few lines of the file I want to read:

ü
&/guid/9202a8c04000641f8000000003221072&/guid/9202a8c04000641f80000000004cfd50NA"?

S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1850511.xml.pb?€€€??????€€€?????"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"`str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"]str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Rstr:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON"Adep:[NMOD]->|PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Sstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Pstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Estr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON*?The occasion was suitably exceptional : a reunion of the 1970s-era Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums ."?
S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1849689.xml.pb?€€€?????…€€€?????"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"cstr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"`str:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Ustr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON"Cdep:[NMOD]->|PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Vstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Sstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Hstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON*êTonight he brings his energies and expertise to the Miller Theater for the festival 's thrilling finale : a reunion of the 1970s Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums .a
&/guid/9202a8c04000641f80000000004cfd50&/guid/9202a8c04000641f8000000003221072NA"ù

EDIT


I was told that another researcher used the following file to parse these files. Is it possible that I could use it?

package edu.stanford.nlp.kbp.slotfilling.multir;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;

import edu.stanford.nlp.kbp.slotfilling.classify.MultiLabelDataset;
import edu.stanford.nlp.kbp.slotfilling.common.Log;
import edu.stanford.nlp.kbp.slotfilling.multir.DocumentProtos.Relation;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.util.ErasureUtils;
import edu.stanford.nlp.util.HashIndex;
import edu.stanford.nlp.util.Index;

/**
 * Converts Hoffmann's data in protobuf format to our MultiLabelDataset
 * @author Mihai
 *
 */
public class ProtobufToMultiLabelDataset {
  static class RelationAndMentions {
    String arg1;
    String arg2;
    Set<String> posLabels;
    Set<String> negLabels;
    List<Mention> mentions;

    public RelationAndMentions(String types, String a1, String a2) {
      arg1 = a1;
      arg2 = a2;
      String [] rels = types.split(",");
      posLabels = new HashSet<String>();
      for(String r: rels){
        if(! r.equals("NA")) posLabels.add(r.trim());
      }
      negLabels = new HashSet<String>(); // will be populated later
      mentions = new ArrayList<Mention>();
    }
  };

  static class Mention {
    List<String> features;
    public Mention(List<String> feats) {
      features = feats;
    }
  }

    public static void main(String[] args) throws Exception {
      String input = args[0];

      InputStream is = new GZIPInputStream(
        new BufferedInputStream
        (new FileInputStream(input)));

      toMultiLabelDataset(is);
      is.close();
    }

    public static MultiLabelDataset<String, String> toMultiLabelDataset(InputStream is) throws IOException {
      List<RelationAndMentions> relations = toRelations(is, true);
      MultiLabelDataset<String, String> dataset = toDataset(relations);
      return dataset;
    }

    public static void toDatums(InputStream is,
        List<List<Collection<String>>> relationFeatures,
        List<Set<String>> labels) throws IOException {
      List<RelationAndMentions> relations = toRelations(is, false);
      toDatums(relations, relationFeatures, labels);
    }

    private static void toDatums(List<RelationAndMentions> relations,
        List<List<Collection<String>>> relationFeatures,
      List<Set<String>> labels) {
    for(RelationAndMentions rel: relations) {
      labels.add(rel.posLabels);
      List<Collection<String>> mentionFeatures = new ArrayList<Collection<String>>();
      for(int i = 0; i < rel.mentions.size(); i ++){
        mentionFeatures.add(rel.mentions.get(i).features);
      }
      relationFeatures.add(mentionFeatures);
    }
    assert(labels.size() == relationFeatures.size());
    }

    public static List<RelationAndMentions> toRelations(InputStream is, boolean generateNegativeLabels) throws IOException {
      //
      // Parse the protobuf
      //
    // all relations are stored here
    List<RelationAndMentions> relations = new ArrayList<RelationAndMentions>();
    // all known relations (without NIL)
    Set<String> relTypes = new HashSet<String>();
    Map<String, Map<String, Set<String>>> knownRelationsPerEntity =
      new HashMap<String, Map<String,Set<String>>>();
    Counter<Integer> labelCountHisto = new ClassicCounter<Integer>();
    Relation r = null;
    while ((r = Relation.parseDelimitedFrom(is)) != null) {
      RelationAndMentions relation = new RelationAndMentions(
          r.getRelType(), r.getSourceGuid(), r.getDestGuid());
      labelCountHisto.incrementCount(relation.posLabels.size());
      relTypes.addAll(relation.posLabels);
      relations.add(relation);

      for(int i = 0; i < r.getMentionCount(); i ++) {
        DocumentProtos.Relation.RelationMentionRef mention = r.getMention(i);
        // String s = mention.getSentence();
        relation.mentions.add(new Mention(mention.getFeatureList()));
      }

      for(String l: relation.posLabels) {
        addKnownRelation(relation.arg1, relation.arg2, l, knownRelationsPerEntity);
      }
    }
    Log.severe("Loaded " + relations.size() + " relations.");
    Log.severe("Found " + relTypes.size() + " relation types: " + relTypes);
    Log.severe("Label count histogram: " + labelCountHisto);

    Counter<Integer> slotCountHisto = new ClassicCounter<Integer>();
    // ... (listing truncated)


1 Reply


Update: the confusion here comes down to two points:

  • the root object is Relation, not Document (in fact, only Relation and RelationMentionRef are even used)
  • the .pb file actually contains multiple objects, each varint-delimited, i.e. prefixed by its length expressed as a varint

As such, Relation.parseDelimitedFrom should work. Processing it manually, I get:

test-multiple.pb, 96678 Relation objects parsed
testNegative.pb, 94917 Relation objects parsed
testPositive.pb, 1950 Relation objects parsed
trainNegative.pb, 63596 Relation objects parsed
trainPositive.pb, 4700 Relation objects parsed
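
For illustration, here is a minimal sketch of what "varint-delimited" means at the byte level, using only the standard library. The class and method names (DelimitedSplitter, readLength, split) are made up for this example and are not part of protobuf; in practice you would loop on the generated Relation.parseDelimitedFrom(in) until it returns null, as the researcher's code in the question does.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a varint-delimited stream into raw message payloads. */
class DelimitedSplitter {

    /** Reads one varint length prefix; returns -1 if the stream ends before its first byte. */
    static int readLength(InputStream in) throws IOException {
        long value = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b < 0) {
                if (shift == 0) return -1;              // clean end: no more messages
                throw new EOFException("stream ended inside a length prefix");
            }
            value |= (long) (b & 0x7F) << shift;        // low 7 bits carry data
            if ((b & 0x80) == 0) {                      // high bit clear: last byte
                if (value > Integer.MAX_VALUE) {
                    throw new IOException("length prefix too large");
                }
                return (int) value;
            }
        }
        throw new IOException("length prefix longer than 5 bytes");
    }

    /** Returns the payload bytes of every length-prefixed message in the stream. */
    static List<byte[]> split(InputStream in) throws IOException {
        List<byte[]> messages = new ArrayList<>();
        int length;
        while ((length = readLength(in)) >= 0) {
            byte[] payload = new byte[length];
            int off = 0;
            while (off < length) {                      // read() may return fewer bytes than asked
                int n = in.read(payload, off, length - off);
                if (n < 0) throw new EOFException("stream ended inside a message");
                off += n;
            }
            messages.add(payload);
        }
        return messages;
    }
}
```

Each returned payload is one serialized message, which could then be handed to Relation.parseFrom(payload). Wrapping the FileInputStream in a BufferedInputStream (and a GZIPInputStream for the .gz files) before calling split avoids byte-at-a-time disk reads.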

Old; outdated; exploratory:

I extracted your 4 documents and ran them through a little test rig:

        ProcessFile("testNegative.pb");
        ProcessFile("testPositive.pb");
        ProcessFile("trainNegative.pb");
        ProcessFile("trainPositive.pb");

where ProcessFile first dumps the first 10 bytes as hex, and then tries to process the file via a ProtoReader. Here are the results:

Processing: testNegative.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt

Yep; agreed; DC is wire-type 4 (end-group). Taken on its own the byte reads as field 27 (as the full two-byte varint DC 16 it is field 363); your document does not define that field, and even if it did, it is meaningless to start with an end-group.

Processing: testPositive.pb
d5 0f 0a 26 2f 67 75 69 64 2f
> Document
250: Fixed32, Unexpected field
14: Fixed32, Unexpected field
6: String, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt

Here we can't see the offending data in the hex dump, but again: the initial fields look nothing like your data, and the reader readily confirms that the data is corrupt.

Processing: trainNegative.pb
d1 09 0a 26 2f 67 75 69 64 2f
> Document
154: Fixed64, Unexpected field
7: Fixed64, Unexpected field
6: Variant, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt

Same as above.

Processing: trainPositive.pb
cf 75 0a 26 2f 67 75 69 64 2f
> Document
1881: 7, Unexpected field
Invalid wire-type; this usually means you have over-written a file without truncating or setting the length; see http://stackoverflow.com/q/2152978/23354

CF 75 is a two-byte varint with wire-type 7 (which is not defined in the specification).
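
These readings can be reproduced by hand. A protobuf tag is a varint whose low three bits are the wire type and whose remaining bits are the field number. The sketch below decodes the first two bytes of each file that way; the class and method names (TagDecoder, decodeVarint, fieldNumber, wireType) are hypothetical and not a protobuf API. Read as a length prefix instead, which is what the update at the top concludes, the same varint is simply the size in bytes of the first Relation.

```java
/** Hypothetical helper: interprets the leading bytes of a file as a protobuf varint. */
class TagDecoder {

    /** Decodes a base-128 varint from the given bytes, least significant group first. */
    static long decodeVarint(int... bytes) {
        long value = 0;
        int shift = 0;
        for (int b : bytes) {
            value |= (long) (b & 0x7F) << shift;    // low 7 bits carry data
            shift += 7;
            if ((b & 0x80) == 0) return value;      // high bit clear: last byte
        }
        throw new IllegalArgumentException("varint is not terminated");
    }

    /** Field number: everything above the low three bits of the tag. */
    static long fieldNumber(long tag) {
        return tag >>> 3;
    }

    /** Wire type: the low three bits of the tag. */
    static int wireType(long tag) {
        return (int) (tag & 7);
    }

    public static void main(String[] args) {
        String[] files = {"testNegative.pb", "testPositive.pb", "trainNegative.pb", "trainPositive.pb"};
        int[][] headers = {{0xDC, 0x16}, {0xD5, 0x0F}, {0xD1, 0x09}, {0xCF, 0x75}};
        for (int i = 0; i < files.length; i++) {
            long v = decodeVarint(headers[i]);
            System.out.printf("%s: varint=%d field=%d wireType=%d%n",
                    files[i], v, fieldNumber(v), wireType(v));
        }
    }
}
```

This gives field 363 with wire type 4 for dc 16 (field 27 if only the first byte 0xDC is considered), 250 with wire type 5 for d5 0f, 154 with wire type 1 for d1 09, and 1881 with wire type 7 for cf 75; the last three agree with the field numbers and wire types ProtoReader reported. As length prefixes the same values are 2908, 2005, 1233 and 15055 bytes.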

Your data is well and truly garbage. Sorry.


And with the bonus round of test-multiple.pb from comments (after gz decompression):

Processing: test-multiple.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt

This starts identically to testNegative.pb, and hence fails for exactly the same reason.

