Saturday, April 5, 2014

LeetCode: Regular Expression Matching

Implement regular expression matching with support for '.' and '*'.

'.' Matches any single character.
'*' Matches zero or more of the preceding element.

The matching should cover the entire input string (not partial).

The function prototype should be:
bool isMatch(const char *s, const char *p)

Some examples:
isMatch("aa","a") → false
isMatch("aa","aa") → true
isMatch("aaa","aa") → false
isMatch("aa", "a*") → true
isMatch("aa", ".*") → true
isMatch("ab", ".*") → true
isMatch("aab", "c*a*b") → true


We can use recursive calls to process the matching.

C++ Code:

/*
 * func: regular_expression_match
 * goal: to check if the string is matched with the current expression
 * @param s: a pointer to the string
 * @param p: a pointer to the pattern
 * return: true or false
 */
/*
 * Idea: use recursion to check every case
 * Code reference: http://leetcode.com/2011/09/regular-expression-matching.html
 */
bool regular_expression_match(const char *s, const char *p){
    //If pattern string reaches the end
    if(*p == '\0'){
        return *s == '\0';
    }
    
    //next char is not '*':
    if(*(p+1) != '*'){
        //If the current char of two string matches, then go to the next one
        //Else return false
        return ((*p == *s) || (*p == '.' && *s != '\0')) && regular_expression_match(s+1, p+1);
    }
    
    //If current char matches and next patter char is *
    while((*p == *s) || (*p == '.' && *s != '\0')){
        //Consider the current char occurs 0 times, like "ab" by "a*ab"
        if( regular_expression_match(s, p+2)){
            return true;
        }
        //Succeed match one char
        ++s;
    }
    
    //Case 1: current char doesn't match, which means the above while loop won't run
    //Case 2: After matching '*', matching the remaining part
    return regular_expression_match(s, p+2);
}

For python case, it took me a very long time since when the above code was converted to python it cannot pass some case like "aaaaaaaaaaaaab", "a*a*a*a*a*a*a*a*a*a*c". Below is a method I found online, the trick is to simplify the pattern before matching. The method is very smart. For the previous case, it will be reduced to "a*c", which eliminates many recursions. But its matching part is a little more complicated than the best way I found.(200ms vs 174 ms)

Python Code 1(referenced online):

#link: http://codesays.com/2014/solution-to-regular-expression-matching-by-leetcode/
class Solution:
    # @return a pre-processed pattern string
    # Make the pattern compact. For example, we will shorten "a*.*" and "a*a*" into
    #   ".*" and "a*" respectively.
    def _compile(self, p):
        result = ""
        while len(p) != 0:
            if len(p) >= 2 and p[1] == "*":
                # Meet with a new *, without * neighbors
                if len(result) < 2 or result[-1] != "*":        
                    result += p[0:2]
                # Meet with a same * segment as previous one
                elif result[-2] == p[0] or result[-2] == ".":   
                    pass
                # Reduce all * segments around .*
                elif p[0] == ".":
                    while len(result) >= 2 and result[-1] == "*":   
                        result = result[:-2]
                    result += p[:2]
                # Meet with a different * segment
                else:                                           
                    result += p[0:2]
                p = p[2:]
            else:
                result += p[0]
                p = p[1:]
        return result

    # @return True if s matches p completely
    #         False else.
    def _isMatch(self, s, p):
        # Both string and pattern reach the end. Nothing left to deal with.
        if len(s) == 0 and len(p) == 0:     
            return True
        # Pattern ends. But string left some substring unmatched
        elif len(p) == 0:                   
            return False
        # Meet with a * segment in pattern
        elif len(p) >= 2 and p[1] == "*":   
            return self._isMatchStar(s, p[2:], p[0])
        # String ends. But pattern remains some part. And the remaining part is not
        # a * segment. In other words, the remaining part DO need some subtring to
        # match. But we do not have any more.
        elif len(s) == 0:                   
            return False
        # Single character matches. Recursively check the next position.
        elif s[0] == p[0] or p[0] == ".":  
            return self._isMatch(s[1:], p[1:])
        # Single character fails to match.
        else:                               
            return False

    # @return True if s matches (ch)*p completely
    #         False else.
    def _isMatchStar(self, s, p, ch):
        # Works like a DFS algorithm
        while True:
            # Simply try to skip the * segment. That is,the * segment does not consume
            # any thing in this round.
            if self._isMatch(s, p):         
                return True
            # The * segment would like to, but has noting to consume.
            elif len(s) == 0:               
                return False
            # This * segment consume one quanlified character in the string
            elif s[0] == ch or ch == ".":   
                s = s[1:]
            # This * segment would like to, but has no quanlified content to consume.
            else:                           
                return False

    def isMatch(self, s, p):
        assert p == "" or p[0] != "*"
        return self._isMatch(s, self._compile(p))

Python Code 2(based on solution referenced online):

# @return a pre-processed pattern string
# Make the pattern compact. For example, we will shorten "a*.*" and "a*a*" into
#   ".*" and "a*" respectively.
def pattern_simplify(p):
    result = ""
    while len(p) != 0:
        if len(p) >= 2 and p[1] == "*":
            # Meet with a new *, without * neighbors
            if len(result) < 2 or result[-1] != "*":
                result += p[0:2]
            # Meet with a same * segment as previous one
            elif result[-2] == p[0] or result[-2] == ".":
                pass
            # Reduce all * segments around .*
            elif p[0] == ".":
                while len(result) >= 2 and result[-1] == "*":
                    result = result[:-2]
                result += p[:2]
            # Meet with a different * segment
            else:
                result += p[0:2]
            p = p[2:]
        else:
            result += p[0]
            p = p[1:]
    return result


# func: regular expression matching
# @param s: the input string
# @param p: the input pattern
# @return: True of False
def is_match(s, p):

    #when p goes to the end
    if len(p) == 0:
        return len(s) == 0

    #If p's length is 1
    if len(p) == 1 or p[1] != '*':
        #If current pattern char doesn't match, return false
        if len(s) < 1 or (p[0] != '.' and s[0] != p[0]):
            return False
        #Continue to match the next one
        return is_match(s[1:], p[1:])
    else:
        i = -1
        while i < len(s) and (i < 0 or p[0] == '.' or p[0] == s[i]):
            if is_match(s[i+1:], p[2:]):
                return True
            i += 1
        return False

def isMatch(s, p):
    return is_match(s, pattern_simplify(p))

Sometimes the test system is like a jerk. However it can help us know more about how to deal with jerk.

Reference Link:

1.Regular Expression Matching: http://leetcode.com/2011/09/regular-expression-matching.html

2.Solution to Regular Expression Matching by LeetCode: http://codesays.com/2014/solution-to-regular-expression-matching-by-leetcode/

3.LeetCode – Regular Expression Matching in Java: http://www.programcreek.com/2012/12/leetcode-regular-expression-matching-in-java/

No comments:

Post a Comment