I'm trying to use a Lucene 4 regexp query to find social security numbers. If the field is analyzed with the StandardAnalyzer or the EnglishAnalyzer, is there still some way to match strings like 222-33-4444 or 222 33 4444?
As far as I can see, these analyzers split the SSN into its components as separate tokens, and then there's no way to catch consecutive matches for the 3 components. Ideally, I'd like 222 33 4444 to match something like "/[0-9]{3}/ /[0-9]{2}/ /[0-9]{4}/", but it doesn't seem to, perhaps because phrase queries do not work with regexps (yes?). Any suggestions?
femtoRgon :
If you simply have a field of identifiers, or some such, use a StringField, or some other untokenized field, in which case a plain RegexpQuery is easy to define.

If you are trying to pull them out of a full-text field, which must be tokenized (and I assume this is the case), you can use the SpanQuery API to construct the appropriate query:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.spans.*;

    // One regexp per token; the wrapper lets a multi-term query take part in a span query.
    SpanQuery span1 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{3}")));
    SpanQuery span2 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{2}")));
    SpanQuery span3 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{4}")));

    // slop = 0 and inOrder = true: the three tokens must be adjacent and in this order.
    Query query = new SpanNearQuery(new SpanQuery[] {span1, span2, span3}, 0, true);

    TopDocs hits = searcher.search(query, maxResults);
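To make the matching rule concrete, here is a minimal plain-Java sketch of what a SpanNearQuery with slop 0 and in-order matching amounts to: each regexp must fully match one token, and the three matching tokens must sit in consecutive positions. This is a simulation, not Lucene code. The class `SsnSpanSketch`, its methods, and the crude tokenizer are hypothetical stand-ins, and the real analyzers tokenize differently in some cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java simulation (NOT Lucene code); every name here is hypothetical.
class SsnSpanSketch {

    // One pattern per token, mirroring the three wrapped regexp queries.
    private static final Pattern[] PARTS = {
        Pattern.compile("[0-9]{3}"),
        Pattern.compile("[0-9]{2}"),
        Pattern.compile("[0-9]{4}")
    };

    // Crude stand-in for an analyzer: split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase());
            }
        }
        return tokens;
    }

    // True if three consecutive tokens fully match the three patterns in order
    // (the analogue of slop = 0, inOrder = true).
    static boolean containsSsn(String text) {
        List<String> tokens = tokenize(text);
        for (int i = 0; i + PARTS.length <= tokens.size(); i++) {
            boolean allMatch = true;
            for (int j = 0; j < PARTS.length; j++) {
                if (!PARTS[j].matcher(tokens.get(i + j)).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsSsn("SSN: 222-33-4444"));
        System.out.println(containsSsn("SSN: 222 33 4444"));
        System.out.println(containsSsn("call 2223 33 4444"));
    }
}
```

Because each pattern has to match a whole token, a four-digit first token such as 2223 is rejected, and an intervening word between the components breaks the match.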
2014-03-30T23:23:45
l'L'l :
You can use the INTERVAL flag:

    /<000-999>/ /<00-99>/ /<0000-9999>/

> INTERVAL
2014-03-29T02:43:51