Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
18.0k views
in Technique[技术] by (71.8m points)

How to use HIGH_COMPRESSION in Lucene.Net 4.8

I'm trying to compress the index size as much as possible, Any help please? https://lucenenet.apache.org/docs/4.8.0-beta00013/api/core/Lucene.Net.Codecs.Compressing.CompressionMode.html#Lucene_Net_Codecs_Compressing_CompressionMode_HIGH_COMPRESSION

public class LuceneIndexer
    {
        private Analyzer _analyzer = new ArabicAnalyzer(Lucene.Net.Util.LuceneVersion.LUCENE_48);
        private string _indexPath;
        private Directory _indexDirectory;
        public IndexWriter _indexWriter;

        public LuceneIndexer(string indexPath)
        {
            this._indexPath = indexPath;
            _indexDirectory = new SimpleFSDirectory(new System.IO.DirectoryInfo(_indexPath));
        }

        public void BuildCompleteIndex(IEnumerable<Document> documents)
        {
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Lucene.Net.Util.LuceneVersion.LUCENE_48, _analyzer) { OpenMode = OpenMode.CREATE_OR_APPEND };
            indexWriterConfig.MaxBufferedDocs = 2;
            indexWriterConfig.RAMBufferSizeMB = 128;
            indexWriterConfig.MaxThreadStates = 2;

            _indexWriter = new IndexWriter(_indexDirectory, indexWriterConfig);

            _indexWriter.AddDocuments(documents);

            _indexWriter.Flush(true, true);
            _indexWriter.Commit();

            _indexWriter.Dispose();
        }

        
        public IEnumerable<Document> Search(string searchTerm, string searchField, int limit)
        {
            IndexReader indexReader = DirectoryReader.Open(_indexDirectory);
            var searcher = new IndexSearcher(indexReader);
            var termQuery = new TermQuery(new Term(searchField, searchTerm)); // Lucene.Net.Util.LuceneVersion.LUCENE_48, searchField, _analyzer
            var hits = searcher.Search(termQuery, limit).ScoreDocs;

            var documents = new List<Document>();
            foreach (var hit in hits)
            {
                documents.Add(searcher.Doc(hit.Doc));
            }

            _analyzer.Dispose();
            return documents;
        }

    }

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The first thing to know is that there are many aspects to the "Lucene Index". When not using compound files, this manifests in the various files that are created. Just looking at two of those, we can talk about the inverted index which is called postings and we can talk about the stored documents. Of these two, there aren't any readily available tunable settings regarding the compression of the inverted index as best I can tell.

The HIGH_COMPRESSION mode relates to the stored fields. If you are not storing fields and you are only using Lucene.Net to create an inverted index then doing work to turn on high compression for stored fields won't reduce the size of the "Lucene Index".

That said, if you are storing fields and want to use high compression on that stored fields data, then you will need to create your own codec that has high compression turned on for stored fields. And to do that, you will first need a Stored fields class that has high compression turned on. Below are those two classes followed by a unit test that uses this new codec that I have written for you. I haven't tried this code on a large amount of data to see the effect, I leave that for you as an exercise, but this should point the way to getting your stored fields compressed with High Compression.

/*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

public sealed class Lucene41StoredFieldsHighCompressionFormat : CompressingStoredFieldsFormat {
        /// <summary>
        /// Sole constructor. </summary>
        public Lucene41StoredFieldsHighCompressionFormat()
            : base("Lucene41StoredFieldsHighCompression", CompressionMode.HIGH_COMPRESSION, 1 << 14) {
        }
    }

Here is a custom codec to use this High Compression format:

/*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

    using Lucene40LiveDocsFormat = Lucene.Net.Codecs.Lucene40.Lucene40LiveDocsFormat;
    using Lucene41StoredFieldsFormat = Lucene.Net.Codecs.Lucene41.Lucene41StoredFieldsFormat;
    using Lucene42NormsFormat = Lucene.Net.Codecs.Lucene42.Lucene42NormsFormat;
    using Lucene42TermVectorsFormat = Lucene.Net.Codecs.Lucene42.Lucene42TermVectorsFormat;
    using PerFieldDocValuesFormat = Lucene.Net.Codecs.PerField.PerFieldDocValuesFormat;
    using PerFieldPostingsFormat = Lucene.Net.Codecs.PerField.PerFieldPostingsFormat;

    /// <summary>
    /// Implements the Lucene 4.6 index format, with configurable per-field postings
    /// and docvalues formats.
    /// <para/>
    /// If you want to reuse functionality of this codec in another codec, extend
    /// <see cref="FilterCodec"/>.
    /// <para/>
    /// See <see cref="Lucene.Net.Codecs.Lucene46"/> package documentation for file format details.
    /// <para/>
    /// @lucene.experimental 
    /// </summary>
    // NOTE: if we make largish changes in a minor release, easier to just make Lucene46Codec or whatever
    // if they are backwards compatible or smallish we can probably do the backwards in the postingsreader
    // (it writes a minor version, etc).
    [CodecName("Lucene46HighCompression")]
    public class Lucene46HighCompressionCodec : Codec {
        private readonly StoredFieldsFormat fieldsFormat = new Lucene41StoredFieldsHighCompressionFormat();    //<--This is the only line different then the stock Lucene46Codec
        private readonly TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
        private readonly FieldInfosFormat fieldInfosFormat = new Lucene46FieldInfosFormat();
        private readonly SegmentInfoFormat segmentInfosFormat = new Lucene46SegmentInfoFormat();
        private readonly LiveDocsFormat liveDocsFormat = new Lucene40LiveDocsFormat();

        private readonly PostingsFormat postingsFormat;

        private class PerFieldPostingsFormatAnonymousInnerClassHelper : PerFieldPostingsFormat {
            private readonly Lucene46HighCompressionCodec outerInstance;

            public PerFieldPostingsFormatAnonymousInnerClassHelper(Lucene46HighCompressionCodec outerInstance) {
                this.outerInstance = outerInstance;
            }

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public override PostingsFormat GetPostingsFormatForField(string field) {
                return outerInstance.GetPostingsFormatForField(field);
            }
        }

        private readonly DocValuesFormat docValuesFormat;

        private class PerFieldDocValuesFormatAnonymousInnerClassHelper : PerFieldDocValuesFormat {
            private readonly Lucene46HighCompressionCodec outerInstance;

            public PerFieldDocValuesFormatAnonymousInnerClassHelper(Lucene46HighCompressionCodec outerInstance) {
                this.outerInstance = outerInstance;
            }

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public override DocValuesFormat GetDocValuesFormatForField(string field) {
                return outerInstance.GetDocValuesFormatForField(field);
            }
        }

        /// <summary>
        /// Sole constructor. </summary>
        public Lucene46HighCompressionCodec()
            : base() {
            postingsFormat = new PerFieldPostingsFormatAnonymousInnerClassHelper(this);
            docValuesFormat = new PerFieldDocValuesFormatAnonymousInnerClassHelper(this);
        }

        public override sealed StoredFieldsFormat StoredFieldsFormat => fieldsFormat;

        public override sealed TermVectorsFormat TermVectorsFormat => vectorsFormat;

        public override sealed PostingsFormat PostingsFormat => postingsFormat;

        public override sealed FieldInfosFormat FieldInfosFormat => fieldInfosFormat;

        public override sealed SegmentInfoFormat SegmentInfoFormat => segmentInfosFormat;

        public override sealed LiveDocsFormat LiveDocsFormat => liveDocsFormat;

        /// <summary>
        /// Returns the postings format that should be used for writing
        /// new segments of <paramref name="field"/>.
        /// <para/>
        /// The default implementation always returns "Lucene41"
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public virtual PostingsFormat GetPostingsFormatForField(string field) {
            // LUCENENET specific - lazy initialize the codec to ensure we get the correct type if overridden.
            if (defaultFormat == null) {
                defaultFormat = Lucene.Net.Codecs.PostingsFormat.ForName("Lucene41");
            }
            return defaultFormat;
        }

        /// <summary>
        /// Returns the docvalues format that should be used for writing
        /// new segments of <paramref name="field"/>.
        /// <para/>
        /// The default implementation always returns "Lucene45"
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public virtual DocValuesFormat GetDocValuesFormatForField(string field) {
            // LUCENENET specific - lazy initialize the codec to ensure we get the correct type if overridden.
            if (defaultDVFormat == null) {
                defaultDVFormat = Lucene.Net.Codecs.DocValuesFormat.ForName("Lucene45");
            }
            return defaultDVFormat;
        }

        public override sealed DocValuesFormat DocValuesFormat => docValuesFormat;

        // LUCENENET specific - lazy initialize the codecs to ensure we get the correct type if overridden.
        private PostingsFormat defaultFormat;
        private DocValuesFormat defaultDVFormat;

        private readonly NormsFormat normsFormat = new Lucene42NormsFormat();

        public override sealed NormsFormat NormsFormat => normsFormat;
    }

Thanks to @NightOwl888, I now understand that you will also need to register the new Codec at startup like so:

Codec.SetCodecFactory(new DefaultCodecFactory {
    CustomCodecTypes = new Type[] { typeof(Lucene46HighCompressionCodec) }
});

Here is a unit test to demonstrate use of the High Compression Codec:

public class TestCompression {


        [Fact]
        public void HighCompression() {
            FxTest.Setup();

            Directory indexDir = new RAMDirectory();

            Analyzer standardAnalyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

            IndexWriterConfig indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, standardAnalyzer);
            indexConfig.Codec = new Lucene46HighCompressionCodec();     //<--------Install the High Compression codec.

            indexConfig.UseCompoundFile = true;

            IndexWriter writer = new IndexWriter(indexDir, indexConfig);

            //souce: https://github.com/apache/lucenenet/blob/Lucene.Net_4_8_0_beta00006/src/Lucene.Net/Search/SearcherFactory.cs
            SearcherManager searcherManager = new SearcherManager(writer, applyAllDeletes: true, new SearchWarmer());

            Document doc = new Document();
            doc.Add(new StringField("examplePrimaryKey", "001", Field.Store.YES));
            doc.Add(new TextField("exampleField", "Unique gifts are great gifts.", Field.Store.YES));
            writer.AddDocument(doc);

            doc = new Document();
            doc.Add(new StringField("examplePrimaryKey", "002", Field.Store.YES));
            doc.Add(new TextField("exampleField", "Everyone is gifted.", Field.Store.YES));
            writer.AddDocument(doc);

            doc = new Document();
            doc.Add(new StringField("examplePrimaryKey", "003", Field.Store.YES));
            doc.Add(new TextField("exampleField"

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...