Searching for Social security number using Lucene 4 regexp

Ask Time:2014-03-29T10:14:39         Author:user1001630

I'm trying to use a Lucene 4 RegexpQuery to find social security numbers. If the field is analyzed using the StandardAnalyzer or the EnglishAnalyzer, is there still some way to match strings like 222-33-4444 or 222 33 4444?

As far as I can see, these analyzers tokenize the components of the SSN, and then there's no way to catch consecutive matches for the three components. Ideally, I'd like 222 33 4444 to match something like "/[0-9]{3}/ /[0-9]{2}/ /[0-9]{4}/", but it doesn't seem to, perhaps because phrase queries do not work with regexps (yes?). Any suggestions?

Author: user1001630. Reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/22726165/searching-for-social-security-number-using-lucene-4-regexp
femtoRgon :

If you simply have a field of identifiers or the like, use a StringField or some other untokenized field, in which case a plain RegexpQuery is simple enough to define.

If you are trying to pull them out of a full-text field, which must be tokenized (and I assume this is the case), you can use the SpanQuery API to construct the appropriate query:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.spans.*;

    // Each regexp is matched against one whole indexed term of the "text" field.
    SpanQuery span1 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{3}")));
    SpanQuery span2 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{2}")));
    SpanQuery span3 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{4}")));

    // Slop 0 with inOrder = true: the three terms must be adjacent and in this order.
    Query query = new SpanNearQuery(new SpanQuery[] {span1, span2, span3}, 0, true);

    searcher.search(query, maxResults);
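To make the idea behind the span query concrete, here is a minimal Lucene-free sketch in plain Java. It is an illustration under assumptions, not the library's implementation: the class `SsnTokenSketch` and its `tokenize` and `containsSsn` methods are hypothetical names, and the tokenizer is a crude stand-in that splits on anything other than letters and digits, not the real StandardAnalyzer. It shows that each pattern has to match one whole token, and that the three matching tokens have to sit at consecutive positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical, Lucene-free illustration of per-token regex matching
// at consecutive positions (roughly what slop 0, in-order span matching asks for).
final class SsnTokenSketch {
    // One pattern per token; each must match an entire token.
    private static final Pattern[] PARTS = {
        Pattern.compile("[0-9]{3}"),
        Pattern.compile("[0-9]{2}"),
        Pattern.compile("[0-9]{4}")
    };

    // Assumed stand-in for an analyzer: split on any run of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // True if three adjacent tokens, in order, match the three patterns.
    static boolean containsSsn(String text) {
        List<String> tokens = tokenize(text);
        for (int i = 0; i + PARTS.length <= tokens.size(); i++) {
            boolean allMatch = true;
            for (int j = 0; j < PARTS.length; j++) {
                if (!PARTS[j].matcher(tokens.get(i + j)).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsSsn("SSN: 222-33-4444"));
        System.out.println(containsSsn("call 222 33 4444 today"));
        System.out.println(containsSsn("222 and 33 and 4444"));
    }
}
```

Because the match is purely positional, any run of a 3-digit, a 2-digit, and a 4-digit token qualifies, so text such as "555 222 33 4444" would also be reported. Filtering out such false positives would need an additional check on the stored text.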
2014-03-30T23:23:45
l'L'l :

You can use the INTERVAL flag:

    /<000-999>/ /<00-99>/ /<0000-9999>/

> INTERVAL
2014-03-29T02:43:51