Wikipedia periodically exports all of the content on their site, providing a nice corpus for performance testing. I downloaded their most recent English XML export: it uncompresses to a healthy 21 GB of plain text! Then I fully indexed this with Lucene's current trunk (to be 4.0): it took 13 minutes and 9 seconds, or 95.8 GB/hour - not bad!

Here are the details: I first pre-process the XML file into a single-line file, whereby each doc's title, date, and body are written to a single line, and then index from this file, so that I measure "pure" indexing cost. Note that a real app would likely have a higher document creation cost here, perhaps having to pull documents from a remote database or from separate files, run filters to extract text from PDFs or MS Office docs, etc.

I use Lucene's contrib/benchmark package to do the indexing; here's the alg I used:

    analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
    content.source = org.apache.lucene.benchmark.byTask.feeds.LineDocSource
    docs.file = /lucene/enwiki-20100904-pages-articles.txt

I've done a few things to speed up the indexing: I increased IndexWriter's RAM buffer from the default 16 MB to 256 MB.

Other things to note: the index has only 4 fields - title, date, body, and docid. I use StandardAnalyzer, and I include the time to close the index, which means IndexWriter waits for any running background merges to complete. I tokenize the body field but don't store it, and I store the title and date fields but don't tokenize them. There is no field truncation taking place, since that is now disabled by default - every token in every Wikipedia article is being indexed.
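To make that field setup concrete, here is a minimal sketch of what the equivalent configuration looks like if you write directly against the 4.0-era IndexWriter API instead of driving it through contrib/benchmark. The class name, index path, sample document values, and the handling of the docid field are illustrative assumptions, not the benchmark code itself:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class WikiIndexer {

        // Index one pre-parsed Wikipedia doc; title/date/body/docid mirror
        // the four fields described above.
        static void addDoc(IndexWriter writer, String docid, String title,
                           String date, String body) throws Exception {
            Document doc = new Document();
            // title and date: stored but not tokenized (StringField indexes
            // the whole value as a single token and can also store it)
            doc.add(new StringField("title", title, Field.Store.YES));
            doc.add(new StringField("date", date, Field.Store.YES));
            // body: tokenized through the analyzer but not stored
            doc.add(new TextField("body", body, Field.Store.NO));
            // docid: the post doesn't say how it is handled; storing it
            // untokenized is an assumption here
            doc.add(new StringField("docid", docid, Field.Store.YES));
            writer.addDocument(doc);
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical index path, not the one used in the benchmark
            Directory dir = FSDirectory.open(new File("/tmp/wiki-index"));

            IndexWriterConfig iwc =
                new IndexWriterConfig(Version.LUCENE_40,
                                      new StandardAnalyzer(Version.LUCENE_40));
            // Raise the RAM buffer from the default 16 MB to 256 MB
            iwc.setRAMBufferSizeMB(256.0);

            IndexWriter writer = new IndexWriter(dir, iwc);
            try {
                addDoc(writer, "1", "Example title", "20100904",
                       "Example body text to be tokenized and indexed.");
                // ... feed every line of the pre-processed single-line file here ...
            } finally {
                // close() waits for any running background merges to finish,
                // which is why closing the index is part of the measured time
                writer.close();
            }
        }
    }

StringField and TextField bake in exactly the "stored but not tokenized" versus "tokenized but not stored" choices described above, so the only knob left to turn for this test is the RAM buffer size on IndexWriterConfig.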