CHAPTER 2 ■ DATA MANIPULATION
Solution
Use regular expressions to ensure that the input data follows the correct structure and contains only
valid characters for the expected type of information.
How It Works
When a user inputs data to your application or your application reads data from a file, it’s good practice
to assume that the data is bad until you have verified its accuracy. One common validation requirement
is to ensure that data entries such as e-mail addresses, telephone numbers, and credit card numbers
follow the pattern and content constraints expected of such data. Obviously, you cannot be sure the
data entered is genuine until you use it, because you usually have no known-correct values to compare
it against. However, ensuring the data has the correct structure and content is a good first step toward
determining whether the input is accurate. Regular expressions provide an excellent mechanism for
evaluating strings for the presence of patterns, and you can use this to your advantage when validating
input data.
The first thing you must do is figure out the regular expression syntax that will correctly match the
structure and content of the data you are trying to validate. This is by far the most difficult aspect of
using regular expressions. Many resources exist to help you with regular expressions, such as The
Regulator (http://osherove.com/tools) and RegExDesigner.NET, by Chris Sells
(www.sellsbrothers.com/tools/#regexd). The RegExLib.com web site (www.regxlib.com) also provides
hundreds of useful prebuilt expressions.
Regular expressions are constructed from two types of elements: literals and metacharacters.
Literals represent specific characters that appear in the pattern you want to match. Metacharacters
provide support for wildcard matching, ranges, grouping, repetition, conditionals, and other control
mechanisms. Table 2-2 describes some of the more commonly used regular expression metacharacter
elements. (Consult the .NET SDK documentation for a full description of regular expressions. A good
starting point is http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx.)
Table 2-2. Commonly Used Regular Expression Metacharacter Elements
Element   Description
.         Specifies any character except a newline character (\n)
\d        Specifies any decimal digit
\D        Specifies any nondigit
\s        Specifies any whitespace character
\S        Specifies any non-whitespace character
\w        Specifies any word character
\W        Specifies any nonword character
^         Specifies the beginning of the string or line
\A        Specifies the beginning of the string
$         Specifies the end of the string or line
\z        Specifies the end of the string
|         Matches one of the expressions separated by the vertical bar (pipe symbol); for example,
          AAA|ABA|ABB will match one of AAA, ABA, or ABB (the expression is evaluated left to right)
[abc]     Specifies a match with one of the specified characters; for example, [AbC] will match A, b, or C,
          but no other characters
[^abc]    Specifies a match with any one character except those specified; for example, [^AbC] will not
          match A, b, or C, but will match B, F, and so on
[a-z]     Specifies a match with any one character in the specified range; for example, [A-C] will match
          A, B, or C
( )       Identifies a subexpression so that it’s treated as a single element by the regular expression
          elements described in this table
?         Specifies one or zero occurrences of the previous character or subexpression; for example, A?B
          matches B and AB, but not AAB
*         Specifies zero or more occurrences of the previous character or subexpression; for example,
          A*B matches B, AB, AAB, AAAB, and so on
+         Specifies one or more occurrences of the previous character or subexpression; for example,
          A+B matches AB, AAB, AAAB, and so on, but not B
{n}       Specifies exactly n occurrences of the preceding character or subexpression; for example, A{2}
          matches only AA
{n,}      Specifies a minimum of n occurrences of the preceding character or subexpression; for
          example, A{2,} matches AA, AAA, AAAA, and so on, but not A
{n,m}     Specifies a minimum of n and a maximum of m occurrences of the preceding character or
          subexpression; for example, A{2,4} matches AA, AAA, and AAAA, but not A or AAAAA
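To make a few of the entries in Table 2-2 concrete, the following minimal sketch passes some of the table's example patterns to the static Regex.IsMatch method. The MetacharacterDemo class and its FullMatch helper are hypothetical names used only for this illustration; the helper wraps each pattern in ^(?:...)$ so that the whole input must match, not just part of it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name is an assumption for illustration only.
public static class MetacharacterDemo
{
    // Wraps the pattern in ^(?:...)$ so that the entire input must match,
    // not just a substring of it.
    public static bool FullMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        // ? : zero or one occurrence of the previous element.
        Console.WriteLine(FullMatch("A?B", "AB"));           // True
        Console.WriteLine(FullMatch("A?B", "AAB"));          // False

        // + : one or more occurrences.
        Console.WriteLine(FullMatch("A+B", "AAAB"));         // True
        Console.WriteLine(FullMatch("A+B", "B"));            // False

        // {n,m} : between n and m occurrences.
        Console.WriteLine(FullMatch("A{2,4}", "AAA"));       // True
        Console.WriteLine(FullMatch("A{2,4}", "AAAAA"));     // False

        // [^abc] : any one character except those listed.
        Console.WriteLine(FullMatch("[^AbC]", "B"));         // True
        Console.WriteLine(FullMatch("[^AbC]", "b"));         // False

        // | : alternation between whole expressions.
        Console.WriteLine(FullMatch("AAA|ABA|ABB", "ABA"));  // True
    }
}
```

Without the surrounding anchors, a pattern such as A{2,4} would also succeed on AAAAA, because IsMatch only requires that a match exists somewhere in the input.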
The more complex the data you are trying to match, the more complex the regular expression syntax
becomes. For example, ensuring that input contains only numbers or is of a minimum length is trivial,
but ensuring a string contains a valid URL is extremely complex. Table 2-3 shows some examples of
regular expressions that match against commonly required data types.
Table 2-3. Commonly Used Regular Expressions
Input Type                Description                                               Regular Expression
Numeric input             The input consists of one or more decimal digits; for     ^\d+$
                          example, 5 or 5683874674.
Personal identification   The input consists of four decimal digits; for example,   ^\d{4}$
number (PIN)              1234.
Simple password           The input consists of six to eight word characters; for   ^\w{6,8}$
                          example, ghtd6f or b8c7hogh.
Credit card number        The input consists of data that matches the pattern of    ^\d{4}-?\d{4}-?\d{4}-?\d{4}$
                          most major credit card numbers; for example,
                          4921835221552042 or 4921-8352-2155-2042.
E-mail address            The input consists of an Internet e-mail address. The     ^[\w-]+@([\w-]+\.)+[\w-]+$
                          [\w-]+ expression indicates that each address element
                          must consist of one or more word characters or hyphens;
                          for example, somebody@adatum.com.
HTTP or HTTPS URL         The input consists of an HTTP-based or HTTPS-based URL;   ^https?://([\w-]+\.)+[\w-]+(/[\w./?%=]*)?$
                          for example, http://www.apress.com.
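As a quick check of the kinds of expressions listed in Table 2-3, the sketch below runs several of them against sample input. The CommonPatterns class and its member names are hypothetical and exist only for this illustration; the credit card and e-mail patterns are written here with an optional hyphen between digit groups and with hyphens allowed in each address element, respectively.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical holder for the patterns; all names are assumptions.
public static class CommonPatterns
{
    public const string Numeric = @"^\d+$";
    public const string Pin = @"^\d{4}$";
    public const string SimplePassword = @"^\w{6,8}$";
    public const string CreditCard = @"^\d{4}-?\d{4}-?\d{4}-?\d{4}$";
    public const string Email = @"^[\w-]+@([\w-]+\.)+[\w-]+$";

    // Returns true if the input matches the given pattern.
    public static bool IsValid(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid(Numeric, "5683874674"));              // True
        Console.WriteLine(IsValid(Pin, "12345"));                       // False: five digits
        Console.WriteLine(IsValid(SimplePassword, "ghtd6f"));           // True
        Console.WriteLine(IsValid(CreditCard, "4921-8352-2155-2042"));  // True
        Console.WriteLine(IsValid(CreditCard, "4921835221552042"));     // True
        Console.WriteLine(IsValid(Email, "somebody@adatum.com"));       // True
        Console.WriteLine(IsValid(Email, "somebody@adatum"));           // False: no dot after @
    }
}
```

These patterns check structure only. A string that passes the credit card pattern, for example, is not necessarily a real card number.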
Once you know the correct regular expression syntax, create a new
System.Text.RegularExpressions.Regex object, passing a string containing the regular expression to the Regex
constructor. Then call the IsMatch method of the Regex object and pass the string that you want to
validate. IsMatch returns a bool value indicating whether the Regex object found a match in the string.
The regular expression syntax determines whether the Regex object will match against only the full string
or match against patterns contained within the string. (See the ^, \A, $, and \z entries in Table 2-2.)
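The effect of the anchors can be seen in the following minimal sketch, which is an illustration rather than part of the recipe (the AnchorDemo class and its method names are assumptions). It also shows one detail worth knowing when validating input: in .NET, $ matches just before a final newline as well as at the very end of the string, whereas \z matches only at the very end.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; all names are assumptions for illustration.
public static class AnchorDemo
{
    // Unanchored: succeeds if four digits appear anywhere in the input.
    public static bool ContainsFourDigits(string input)
    {
        return Regex.IsMatch(input, @"\d{4}");
    }

    // Anchored with ^ and $: $ also matches just before a final newline.
    public static bool IsPinLenient(string input)
    {
        return Regex.IsMatch(input, @"^\d{4}$");
    }

    // Anchored with \A and \z: only the very start and very end match.
    public static bool IsPinStrict(string input)
    {
        return Regex.IsMatch(input, @"\A\d{4}\z");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsFourDigits("abc1234def"));  // True
        Console.WriteLine(IsPinLenient("abc1234def"));        // False
        Console.WriteLine(IsPinLenient("1234"));              // True
        Console.WriteLine(IsPinLenient("1234\n"));            // True
        Console.WriteLine(IsPinStrict("1234\n"));             // False
    }
}
```

If a trailing newline must be rejected, the \A and \z form is the safer choice for validation.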
The Code
The ValidateInput method shown in the following example tests any input string to see if it matches a
specified regular expression.
using System;
using System.Text.RegularExpressions;

namespace Apress.VisualCSharpRecipes.Chapter02
{
    class Recipe02_05
    {
        public static bool ValidateInput(string regex, string input)
        {
            // Create a new Regex based on the specified regular expression.
            Regex r = new Regex(regex);

            // Test if the specified input matches the regular expression.
            return r.IsMatch(input);
        }

        public static void Main(string[] args)
        {
            // Test the input from the command line. The first argument is the
            // regular expression, and the second is the input.
            Console.WriteLine("Regular Expression: {0}", args[0]);
            Console.WriteLine("Input: {0}", args[1]);
            Console.WriteLine("Valid = {0}", ValidateInput(args[0], args[1]));

            // Wait to continue.
            Console.WriteLine("\nMain method complete. Press Enter");
            Console.ReadLine();
        }
    }
}
Usage
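To execute the example, run Recipe02-05.exe and pass the regular expression and data to test as
command-line arguments. Enclose the regular expression in double quotes so that the command shell does
not interpret characters such as ^, which is the escape character in the Windows command prompt. For
example, to test for a correctly formed e-mail address, type the following:
Recipe02-05 "^[\w-]+@([\w-]+\.)+[\w-]+$" myname@mydomain.com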
The result would be as follows:
Regular Expression: ^[\w-]+@([\w-]+\.)+[\w-]+$
Input: myname@mydomain.com
Valid = True
Notes
You can use a Regex object repeatedly to test multiple strings, but you cannot change the regular
expression tested for by a Regex object. You must create a new Regex object to test for a different pattern.
Because the ValidateInput method creates a new Regex instance each time it’s called, it never reuses
a Regex object. A more suitable alternative in this case is to use a static overload of the IsMatch
method, as shown in the following variant of the ValidateInput method:
// Alternative version of the ValidateInput method that does not create
// Regex instances.
public static bool ValidateInput(string regex, string input)
{
    // Test if the specified input matches the regular expression.
    return Regex.IsMatch(input, regex);
}
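When the same pattern is applied to many inputs, the opposite design also works: create the Regex once and keep it. The following is a minimal sketch, assuming a hypothetical PinValidator class that holds a four-digit PIN pattern in a static field; none of these names come from the recipe.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical validator; the class and member names are assumptions.
public static class PinValidator
{
    // Created once and reused for every call. The pattern of a Regex
    // instance cannot be changed after construction.
    private static readonly Regex PinRegex = new Regex(@"^\d{4}$");

    public static bool IsValidPin(string input)
    {
        return PinRegex.IsMatch(input);
    }

    public static void Main()
    {
        string[] candidates = { "1234", "12a4", "98765", "0000" };

        // The same Regex object tests every candidate string.
        foreach (string candidate in candidates)
        {
            Console.WriteLine("{0} -> {1}", candidate, IsValidPin(candidate));
        }
    }
}
```

Of the four candidates, only 1234 and 0000 are reported as valid. A different pattern would need its own Regex instance.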
2-6. Use Compiled Regular Expressions
Problem
You need to minimize the impact on application performance that arises from using complex regular
expressions frequently.
Solution
When you instantiate the System.Text.RegularExpressions.Regex object that represents your regular
expression, specify the Compiled option of the System.Text.RegularExpressions.RegexOptions
enumeration to compile the regular expression to Microsoft Intermediate Language (MSIL).
How It Works
By default, when you create a Regex object, the regular expression pattern you specify in the constructor
is compiled to an intermediate form (not MSIL). Each time you use the Regex object, the runtime
interprets the pattern’s intermediate form and applies it to the target string. With complex regular
expressions that are used frequently, this repeated interpretation process can have a detrimental effect
on the performance of your application.
By specifying the RegexOptions.Compiled option when you create a Regex object, you force the .NET
runtime to compile the regular expression to MSIL instead of the interpreted intermediary form. This
MSIL is just-in-time (JIT) compiled by the runtime to native machine code on first execution, just like
regular assembly code. You use a compiled regular expression in the same way as you use any Regex
object; compilation simply results in faster execution.
However, a couple of downsides offset the performance benefits provided by compiling regular
expressions. First, the JIT compiler needs to do more work, which introduces delays during JIT
compilation. This is most noticeable if you create your compiled regular expressions as your application
starts up. Second, the runtime cannot unload a compiled regular expression once you have finished with
it. Unlike with a normal regular expression, the runtime’s garbage collector will not reclaim the
memory used by the compiled regular expression. The compiled regular expression will remain in
memory until your program terminates or you unload the application domain in which the compiled
regular expression is loaded.
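Given these trade-offs, one reasonable arrangement is to construct a compiled Regex once and reuse it, so that the compilation cost is paid a single time. The following minimal sketch assumes a hypothetical EmailChecker class (the names are not from the recipe) and shows that the compiled and interpreted forms of the same pattern return the same results. Actual timings depend on the pattern, the input, and the runtime, so the sketch compares results rather than claiming a specific speedup.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical checker; the class and member names are assumptions.
public static class EmailChecker
{
    private const string EmailPattern = @"^[\w-]+@([\w-]+\.)+[\w-]+$";

    // Default form: the pattern is interpreted each time it is used.
    private static readonly Regex Interpreted = new Regex(EmailPattern);

    // Compiled to MSIL. Constructed once and reused for every call.
    private static readonly Regex Compiled =
        new Regex(EmailPattern, RegexOptions.Compiled);

    public static bool IsEmailInterpreted(string input)
    {
        return Interpreted.IsMatch(input);
    }

    public static bool IsEmailCompiled(string input)
    {
        return Compiled.IsMatch(input);
    }

    public static void Main()
    {
        string[] inputs = { "myname@mydomain.com", "not-an-address", "a@b.c" };

        // Both forms are used in exactly the same way and agree on each input.
        foreach (string input in inputs)
        {
            Console.WriteLine("{0}: interpreted={1}, compiled={2}",
                input, IsEmailInterpreted(input), IsEmailCompiled(input));
        }
    }
}
```

Whether the Compiled option pays off is best decided by measuring the application in question, since the benefit grows with how often the expression is evaluated.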
As well as compiling regular expressions in memory, the static Regex.CompileToAssembly method
allows you to create a compiled regular expression and write it to an external assembly. This means that
you can create assemblies containing standard sets of regular expressions, which you can use from
multiple applications. To compile a regular expression and persist it to an assembly, take the following
steps:
1. Create a System.Text.RegularExpressions.RegexCompilationInfo array large
   enough to hold one RegexCompilationInfo object for each of the compiled
   regular expressions you want to create.
2. Create a RegexCompilationInfo object for each of the compiled regular
   expressions. Specify values for its properties as arguments to the object
   constructor. The following are the most commonly used properties:
   • IsPublic, a bool value that specifies whether the generated regular
     expression class has public visibility
   • Name, a String value that specifies the class name
   • Namespace, a String value that specifies the namespace of the class
   • Pattern, a String value that specifies the pattern that the regular expression
     will match (see recipe 2-5 for more details)
   • Options, a System.Text.RegularExpressions.RegexOptions value that
     specifies options for the regular expression
3. Create a System.Reflection.AssemblyName object. Configure it to represent the
   name of the assembly that the Regex.CompileToAssembly method will create.
4. Execute Regex.CompileToAssembly, passing the RegexCompilationInfo array
   and the AssemblyName object.
This process creates an assembly that contains one class declaration for each compiled regular
expression—each class derives from Regex. To use the compiled regular expression contained in the
assembly, instantiate the regular expression you want to use and call its methods as if you had simply
created it with the normal Regex constructor. (Remember to add a reference to the assembly when you
compile the code that uses the compiled regular expression classes.)
The Code
This line of code shows how to create a Regex object that is compiled to MSIL instead of the usual
intermediate form:
Regex reg = new Regex(@"[\w-]+@([\w-]+\.)+[\w-]+", RegexOptions.Compiled);
The following example shows how to create an assembly named MyRegEx.dll, which contains two
regular expressions named PinRegex and CreditCardRegex:
using System;
using System.Reflection;
using System.Text.RegularExpressions;

namespace Apress.VisualCSharpRecipes.Chapter02
{
    class Recipe02_06
    {