Regex (Regular Expressions) – Text Analysis, Pattern Recognition, Email & Phone Number
This post is a definition of terms for regular expressions – including exam questions and tags.
In a Nutshell
Regular expressions (Regex) are patterns used to identify, extract, or validate specific character strings in texts. They are a powerful tool for automated text analysis and data processing.
Compact Technical Description
Regular expressions define search patterns for character strings and enable efficient analysis of large volumes of text. They are available in almost all programming languages as well as in many command-line tools (e.g. grep, sed). Typical tasks include extracting structured data (such as email addresses and phone numbers), validating input formats, and transforming character strings. Regex consist of literal characters, metacharacters (., *, +, ?, []) and anchors such as ^ and $. Complex patterns can be built with groups () and alternation |.
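The three typical tasks named above (extraction, validation, transformation) can be sketched with Python's built-in re module. The sample text and the date format are assumptions chosen only for illustration.

```python
import re

# Hypothetical sample text containing two ISO-style dates.
text = "Order 66 shipped on 2024-03-15, order 67 on 2024-04-01."

# Extraction: find every date of the form YYYY-MM-DD.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Validation: fullmatch succeeds only if the whole string fits the pattern.
is_valid_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-03-15") is not None

# Transformation: reorder the captured groups into DD.MM.YYYY.
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3.\2.\1", text)

print(dates)
print(is_valid_date)
print(reformatted)
```

Note that the pattern only checks the shape of the date, not whether the month or day is plausible.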
Exam-Relevant Key Points
- Regex enable pattern recognition in character strings
- They use a specific syntax of literals and metacharacters
- Used for validation, extraction and replacement of text patterns
- Directly usable in many programming languages (e.g. Java, Python, JavaScript)
- Increase efficiency when processing large amounts of data
- Faulty regex can lead to security vulnerabilities or performance problems
- Carefully constructed regex (e.g. without nested quantifiers) reduce the risk of ReDoS (Regular Expression Denial of Service)
- Regex patterns must be documented and tested
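The ReDoS point can be illustrated with a small sketch. The pattern ^(a+)+$ is a common textbook example of a nested quantifier and is not taken from this post. In a backtracking engine such as Python's re, it can need exponentially many steps on a long input that almost matches, so the sketch deliberately avoids running it on such input.

```python
import re

# Risky: the nested quantifier (a+)+ lets the engine split a run of "a"
# in exponentially many ways before it can report a failed match.
risky = re.compile(r"^(a+)+$")

# Safer: a single quantifier accepts exactly the same strings.
safe = re.compile(r"^a+$")

# Both patterns agree on short inputs.
for sample in ["a", "aaaa", "aaab", ""]:
    assert bool(risky.match(sample)) == bool(safe.match(sample))

# The safe pattern rejects a long near-match immediately; the risky
# pattern is deliberately not run on this input.
hostile = "a" * 50 + "!"
print(safe.match(hostile))  # None
```

Rewriting a pattern this way is one option. Limiting input length before matching is another common safeguard.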
Core Components
1. Literal characters (a, b, c, …)
2. Metacharacters (., *, +, ?, {n,m})
3. Character classes ([A-Z], [0-9])
4. Group formation ((...))
5. Alternatives (|)
6. Anchors (^ for start, $ for end)
7. Escaping (\.)
8. Lookaheads (?=...)
9. Greedy vs. Lazy Matching
10. Match and replace tests in test frameworks
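Several of the components listed above can be seen side by side in a short Python sketch. All sample strings are made up for illustration.

```python
import re

# Group + alternation: capture which animal appears.
m = re.search(r"(dog|cat)s?", "two cats")
animal = m.group(1)  # "cat"

# Escaping: \. matches a literal dot, while . matches any character.
literal_dot = re.findall(r"\d\.\d", "1.5 and 2x5")  # ['1.5']
any_char = re.findall(r"\d.\d", "1.5 and 2x5")      # ['1.5', '2x5']

# Lookahead: match digits only when "px" follows, without consuming "px".
pixels = re.findall(r"\d+(?=px)", "10px 20em 30px")  # ['10', '30']

# Greedy vs. lazy: .* takes as much as possible, .*? as little as necessary.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)  # ['<b>one</b><b>two</b>']
lazy = re.findall(r"<b>.*?</b>", html)   # ['<b>one</b>', '<b>two</b>']
```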
Practical Example
# Example: regex for extracting email addresses (Python raw string)
regex = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
Explanation: This pattern recognizes typical email addresses by combining character classes, quantifiers, and literals.
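A minimal sketch of how this pattern might be applied in Python follows. The name EMAIL_PATTERN and the sample text are assumptions for illustration. The pattern is a heuristic for typical addresses, not a complete implementation of the email address syntax.

```python
import re

# Same pattern as above, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Hypothetical input text.
text = "Contact anna@example.com or support@mail.example.org for details."

# findall returns every non-overlapping match as a list of strings.
emails = EMAIL_PATTERN.findall(text)
print(emails)  # ['anna@example.com', 'support@mail.example.org']
```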
Advantages and Disadvantages
Advantages
- High flexibility and precision in text analysis
- Usable across platforms
- Ideal for validation and parsing
Disadvantages
- Complex syntax, difficult to read
- Prone to errors with incorrect escaping or greedy matching
- Performance problems with inefficient patterns
Typical Exam Questions (with Short Answer)
- Regex used for? Search, analysis and processing of character strings based on predefined patterns.
- Regex for phone number? \+49\s\d{3,5}\s\d{4,}
- Start and end in regex? ^ (start), $ (end)
- Formulate alternative character strings? With the alternation character |, e.g. dog|cat
- Character . in regex means? Any single character except line breaks.
- Greedy vs. Lazy Matching? Greedy matches as much as possible, lazy as little as necessary.
- Regex security-relevant? Poorly written regex can enable ReDoS attacks.
- Test complex regular expressions? With tools like regex101, unit tests and realistic test data.
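The phone-number and testing answers can be combined into a small sketch: the pattern from the list above is anchored with ^ and $ and then checked against a table of sample inputs. The helper name is_phone and the test data are hypothetical.

```python
import re

# Pattern from the phone-number answer, anchored so the whole input must match.
PHONE = re.compile(r"^\+49\s\d{3,5}\s\d{4,}$")

def is_phone(value: str) -> bool:
    """Return True if value looks like '+49 <3-5 digits> <4+ digits>'."""
    return PHONE.search(value) is not None

# Hypothetical sample data: expected result per input.
cases = {
    "+49 171 1234567": True,
    "+49 30 1234567": False,        # area code too short for \d{3,5}
    "+49 171 123": False,           # last block shorter than 4 digits
    "0171 1234567": False,          # missing +49 prefix
    "call +49 171 1234567": False,  # extra text before the number
}

for value, expected in cases.items():
    assert is_phone(value) == expected, value
```

The second case shows why realistic test data matters: a two-digit area code is rejected by \d{3,5}, which may or may not be intended.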
Most Important Sources
- https://regex101.com
- https://developer.mozilla.org/de/docs/Web/JavaScript/Guide/Regular_Expressions
- https://www.regular-expressions.info