
3 posts tagged with "performance"


Accelerating Unicode string processing with SIMD in Velox

· 8 min read
Ping Liu
Software Engineer
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta

TL;DR

We optimized two Unicode string helpers — cappedLengthUnicode and cappedByteLengthUnicode — by replacing byte-by-byte utf8proc_char_length calls with a SIMD-based scanning loop. The new implementation processes register-width blocks at a time: pure-ASCII blocks skip in one step, while mixed blocks use bitmask arithmetic to count character starts. Both helpers now share a single parameterized template, eliminating code duplication.

On a comprehensive benchmark matrix covering string lengths from 4 to 1024 bytes and ASCII ratios from 0% to 100%, we measured 2–15× speedups across most configurations, with no regressions on Unicode-heavy inputs. The optimization benefits all callers of these helpers, including the Iceberg truncate transform and various string functions.

Why SIMD, why here

Velox is a high-performance execution engine, and SIMD is one of the primary ways we deliver that performance across compute-heavy paths. This is not a one-off micro-optimization; it is a principle we apply consistently when the data and algorithms make SIMD a natural fit. String processing is one of those places: it is ubiquitous in query execution and often dominated by byte-wise work that maps well to vectorized scanning.

In this post, we describe how we applied this principle to two Unicode helpers that sit on many string processing paths, and how we balanced speed, correctness, and maintainability.

The old implementation calls utf8proc_char_length on every UTF-8 character. That is robust, but it is inefficient for long ASCII segments, where every character is only one byte.
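For reference, the byte-by-byte approach looks roughly like this. This is a simplified sketch: the real helper calls utf8proc_char_length, while the hand-rolled charLength below is an illustrative stand-in.

```cpp
#include <cstdint>
#include <string_view>

// Illustrative stand-in for utf8proc_char_length: byte length of the
// UTF-8 character whose lead byte is 'lead'.
int charLength(uint8_t lead) {
  if (lead < 0x80) return 1;           // ASCII.
  if ((lead & 0xE0) == 0xC0) return 2; // 2-byte sequence.
  if ((lead & 0xF0) == 0xE0) return 3; // 3-byte sequence.
  if ((lead & 0xF8) == 0xF0) return 4; // 4-byte sequence.
  return 1;                            // Invalid lead byte: advance by 1.
}

// Sketch of the pre-SIMD helper: walk the string one character at a
// time, stopping once maxChars characters have been counted.
int64_t cappedLengthScalar(std::string_view input, int64_t maxChars) {
  int64_t chars = 0;
  size_t i = 0;
  while (i < input.size() && chars < maxChars) {
    i += charLength(static_cast<uint8_t>(input[i]));
    ++chars;
  }
  return chars;
}
```

Every iteration touches one character, so an all-ASCII string of N bytes costs N loop iterations even though each step advances by exactly one byte.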

What these helpers do

  • cappedLengthUnicode returns the number of UTF-8 characters in a string, stopping early once maxChars is reached.
  • cappedByteLengthUnicode returns the byte position of the Nth character, also capped by maxChars. This is the key primitive for truncating strings by character count without re-parsing the entire string.

Consider a UTF-8 string that mixes multi-byte characters and ASCII:

input = "你好abc世界"  // 15 bytes, 7 characters

With maxChars = 3:

  • cappedLengthUnicode returns 3.
  • cappedByteLengthUnicode returns 7 (the byte position after the third character, "a").

Both need to handle mixed ASCII and multi-byte UTF-8 inputs, invalid bytes, and early-exit behavior. These helpers are called on the Unicode code path — the path used when the input is not known to be pure ASCII at the call site. They are hot in several places, most notably the Iceberg truncate transform, which must find the byte position of the Nth character to truncate partition keys. When we profiled the Iceberg write benchmark, cappedByteLengthUnicode stood out as a clear hot spot.

Byte-by-byte UTF-8 scanning becomes a bottleneck on large, ASCII-heavy strings, but we cannot assume pure ASCII because inputs may mix ASCII and multi-byte characters.

The SIMD design

We introduced a shared helper, detail::cappedUnicodeImpl, and parameterized it by whether it should return characters or bytes. The key idea is to process the string in SIMD-width blocks using xsimd:

  1. ASCII fast path. Load a register-width block and check whether any byte has its high bit set. If not, the whole block is ASCII. We can skip the entire block and advance the counters in one step.
  2. Mixed UTF-8 blocks. For a block containing non-ASCII bytes, we count character starts using (byte & 0xC0) != 0x80, convert the predicate into a bitmask, and use popcount to get the number of UTF-8 characters in the block.
  3. Finding the Nth character. For cappedByteLengthUnicode, if the maxChars target falls inside a mixed block, we use ctz on the bitmask to locate the Nth character start and add a small inline helper (getUtf8CharLength) to compute the byte length of that UTF-8 character.
  4. Remainder handling. For the tail, we keep two loops:
    • A non-ASCII remainder loop that avoids ASCII fast-path mispredicts.
    • A branching ASCII loop that is cheaper than the branchless version for short strings.

The journey: what we tried and learned

The final implementation emerged through several rounds of iteration and experimentation. Here are the key insights from that process.

Optimize at the core, not the call site

The initial version added the SIMD fast path directly in the Iceberg truncate function. But the real bottleneck was in cappedByteLengthUnicode itself. Moving the optimization into the shared helper meant every call site benefits — not just Iceberg.

Avoid switching between code paths

The first SIMD version only accelerated the leading ASCII prefix: it scanned SIMD blocks until it hit the first non-ASCII byte, then fell back to scalar for the rest of the string. This is useless if the first character is non-ASCII but the rest are ASCII. The obvious fix — re-entering the SIMD path after each non-ASCII block — caused a ~20% regression on mixed and Unicode-heavy inputs due to the overhead of switching between SIMD and scalar modes. The solution was a single SIMD loop that handles both ASCII and mixed blocks, avoiding frequent code-path switches and extra branching.

Unify the two helpers

cappedLengthUnicode and cappedByteLengthUnicode have nearly identical logic — the only difference is what they return. Factoring them into a shared detail::cappedUnicodeImpl<returnByteLength> template eliminated the code duplication and ensures both helpers stay in sync.

Short strings need special care

After the SIMD path was generalized, the expanded benchmark matrix revealed regressions on very short strings (4 bytes). The fix was simple: skip the SIMD loop entirely when the string is shorter than one register width. Short strings go straight to the scalar path, which avoids the overhead of loading and testing a partial SIMD register.

A branchless getUtf8CharLength

The scalar remainder loop originally called utf8proc_char_length, which involves several branches. A branchless alternative using std::countl_zero compiles to a handful of instructions:

auto n = std::countl_zero((~static_cast<uint32_t>(c) << 24) | 1);
return (n == 0 || n > 4) ? 1 : n;

This improved performance on the short-string and remainder paths.

Confidence through comprehensive testing

Since cappedLengthUnicode is widely used across Velox (unlike cappedByteLengthUnicode which is limited to Iceberg and Spark), the change needed high confidence. The PR added tests for multi-byte UTF-8 sequences that straddle SIMD boundaries, continuation bytes in the scalar tail, and strings that are exactly one register width.

Benchmark results

We added StringCoreBenchmark.cpp to measure both helpers across a matrix of string lengths (4, 16, 64, 256, 1024 bytes) and ASCII ratios (0%, 1%, 25%, 50%, 75%, 99%, 100%). Each benchmark processes 10,000 strings with maxChars = 128.

ASCII ratio indicates the percentage of characters in each string that are single-byte ASCII; the remaining characters are multi-byte UTF-8. The following table highlights representative speedups from the full matrix (measured on an Apple M3):

| String Length | ASCII Ratio | cappedByteLengthUnicode | cappedLengthUnicode |
| --- | --- | --- | --- |
| 16 | 50% | +9.7× | +7.7× |
| 64 | 50% | +15.3× | +11.3× |
| 64 | 100% | +6.3× | +7.3× |
| 256 | 25% | +9.2× | +6.3× |
| 256 | 100% | +9.0× | +7.2× |
| 1024 | 50% | +6.1× | +7.4× |
| 1024 | 99% | +2.3× | +2.4× |
| 1024 | 100% | +4.9× | +7.6× |
| 4 | 0% (Unicode-only) | −1.25× | −1.08× |
| 4 | 100% (short ASCII) | −1.07× | −1.02× |

The small regressions on 4-byte strings are within noise for such tiny inputs. For strings ≥ 16 bytes — which dominate real workloads — speedups range from 2× to over 15×, with the largest gains on mixed-content strings where the SIMD loop handles both ASCII and non-ASCII blocks without falling back to scalar.

We also validated the results on an x86_64 server (Intel Xeon Skylake with AVX-512) and observed similar improvement patterns. A full set of raw numbers is available in the perf data gist and summarized in the PR.

Takeaways

  • Measure first, optimize second. The optimization started from a real hot spot found in the Iceberg write benchmark — not from speculation.
  • Optimize the core, not the symptom. Rather than patching the call site, we pushed the improvement into the shared helper so that all callers benefit.
  • SIMD is a first-class tool in Velox for performance-critical paths. String processing — ubiquitous in query execution and dominated by byte-wise work — is a natural fit for vectorized scanning.
  • Avoid switching between code paths. Re-entering the SIMD path after every non-ASCII block introduced extra branching and switching overhead; a unified loop that handles both ASCII and mixed blocks was consistently faster.
  • Branchless is not always faster. A single branchless remainder loop was slower than two tailored loops on short strings.
  • Avoid duplication. The two helpers share a single parameterized implementation, keeping behavior consistent and reducing maintenance burden.
  • Data-driven iteration. We tried multiple strategies and compared them against a realistic benchmark matrix. We kept the version that improved the common case without regressing Unicode-heavy workloads.
  • Comprehensive benchmarks across different string lengths and ASCII mixing ratios are essential for validating correctness-sensitive optimizations.

Acknowledgements

Special thanks to Masha Basmanova and Yuhta for their guidance and detailed code reviews that shaped the SIMD design and pushed for comprehensive benchmarking.

Improve LIKE's performance

· 5 min read
James Xu
Software Engineer @ Alibaba

What is LIKE?

LIKE is a very useful SQL operator. It is used to do string pattern matching. The following examples for LIKE usage are from the Presto doc:

SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name)
WHERE name LIKE '%b%'
--returns 'abc' and 'bcd'

SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name)
WHERE name LIKE '_b%'
--returns 'abc'

SELECT * FROM (VALUES ('a_c'), ('_cd'), ('cde')) AS t (name)
WHERE name LIKE '%#_%' ESCAPE '#'
--returns 'a_c' and '_cd'

These examples show the basic usage of LIKE:

  • Use % to match zero or more characters.
  • Use _ to match exactly one character.
  • If we need to match % and _ literally, we can specify an escape char to escape them.

When we use Velox as the backend to evaluate Presto queries, the LIKE operation is translated into a Velox function call; e.g., name LIKE '%b%' becomes like(name, '%b%'). Internally, Velox converts the pattern string into a regular expression and uses the RE2 regular expression library to do the matching. RE2 is a very good regular expression library: it is fast and safe, which gives Velox's LIKE function good baseline performance. But some commonly used simple patterns can be handled with plain C++ string operations instead of a regex. For example, the pattern hello% matches inputs that start with hello, which can be implemented by a direct memory comparison of the prefix ('hello' in this case) bytes of the input:

// Match the first 'length' characters of string 'input' and prefix pattern.
bool matchPrefixPattern(
    StringView input,
    const std::string& pattern,
    size_t length) {
  return input.size() >= length &&
      std::memcmp(input.data(), pattern.data(), length) == 0;
}

It is much faster than using RE2: our benchmark shows a 750× speedup. We can do similar optimizations for some other patterns:

  • %hello: matches inputs that end with hello. It can be optimized by direct memory comparison of suffix bytes of the inputs.
  • %hello%: matches inputs that contain hello. It can be optimized by using std::string_view::find to check whether inputs contain hello.
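These two fast paths can be sketched in the same spirit as matchPrefixPattern. The names and signatures below are illustrative, not Velox's actual code:

```cpp
#include <cstring>
#include <string_view>

// Sketch of the '%hello' fast path: compare the last suffix.size()
// bytes of the input against the suffix.
bool matchSuffixPattern(std::string_view input, std::string_view suffix) {
  return input.size() >= suffix.size() &&
      std::memcmp(
          input.data() + input.size() - suffix.size(),
          suffix.data(),
          suffix.size()) == 0;
}

// Sketch of the '%hello%' fast path: substring search instead of regex.
bool matchSubstringPattern(std::string_view input, std::string_view needle) {
  return input.find(needle) != std::string_view::npos;
}
```

Like the prefix case, both are pure byte-level operations, so they work unchanged on inputs containing multi-byte UTF-8 characters.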

These simple patterns are straightforward to optimize. There are some more relaxed patterns that are not so straightforward:

  • hello_velox%: matches inputs that start with 'hello', followed by any character, then followed by 'velox'.
  • %hello_velox: matches inputs that end with 'hello', followed by any character, then followed by 'velox'.
  • %hello_velox%: matches inputs that contain both 'hello' and 'velox', and there is a single character separating them.

Although these patterns look similar to the previous ones, they are not as straightforward to optimize. Here _ matches any single character, so we cannot simply use memory comparison to do the matching. And if the user's input is not pure ASCII, _ might match more than one byte, which makes the implementation even more complex. Also note that the above patterns are just for illustration; actual patterns can be more complex, e.g. h_e_l_l_o, so a trivial algorithm will not work.

Optimizing Relaxed Patterns

We optimized these patterns as follows. First, we split the pattern into a list of sub-patterns; e.g., hello_velox% is split into the sub-patterns hello, _, velox, %. Because there is a % at the end, we classify it as a kRelaxedPrefix pattern: we still do prefix matching, but it is no longer trivial, since we need to match three sub-patterns:

  • kLiteralString: hello
  • kSingleCharWildcard: _
  • kLiteralString: velox

For kLiteralString we simply do a memory comparison:

if (subPattern.kind == SubPatternKind::kLiteralString &&
    std::memcmp(
        input.data() + start + subPattern.start,
        patternMetadata.fixedPattern().data() + subPattern.start,
        subPattern.length) != 0) {
  return false;
}

Note that since it is a memory comparison, it handles both pure ASCII inputs and inputs that contain Unicode characters.

Matching _ is more complex, considering that Unicode inputs contain variable-length multi-byte characters. Fortunately, there is an existing library that provides Unicode-related operations: utf8proc. It provides functions that tell us whether a byte is the start of a character, how many bytes the current character consists of, and so on. So to match a sequence of _, our algorithm is:

if (subPattern.kind == SubPatternKind::kSingleCharWildcard) {
  // Match every single char wildcard.
  for (auto i = 0; i < subPattern.length; i++) {
    if (cursor >= input.size()) {
      return false;
    }

    auto numBytes = unicodeCharLength(input.data() + cursor);
    cursor += numBytes;
  }
}

Here:

  • cursor is the index in the input we are trying to match.
  • unicodeCharLength is a function that wraps a utf8proc function to determine how many bytes the current character consists of.

So the logic is basically: repeatedly compute the size of the current character and skip over it.

This does not seem complex, but note that this logic is wasteful for pure-ASCII input, where every character is one byte. To match a sequence of _, we do not need to compute the size of each character in a loop; in fact, for pure-ASCII input we do not need to explicitly match _ at all. We can use the following logic instead:

for (const auto& subPattern : patternMetadata.subPatterns()) {
  if (subPattern.kind == SubPatternKind::kLiteralString &&
      std::memcmp(
          input.data() + start + subPattern.start,
          patternMetadata.fixedPattern().data() + subPattern.start,
          subPattern.length) != 0) {
    return false;
  }
}

It only matches each kLiteralString sub-pattern at the right position in the input; _ is matched automatically (actually, skipped) and needs no explicit handling. With this optimization we get a 40× speedup for kRelaxedPrefix patterns and a 100× speedup for kRelaxedSuffix patterns.
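Putting the pieces together, a minimal sketch of a kRelaxedPrefix matcher might look as follows. The types and names here are hypothetical and simplified; Velox's actual implementation (with its ASCII fast path and pattern metadata) differs:

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sub-pattern: either a literal string or a run of '_'.
struct SubPattern {
  bool isWildcard;      // true for a run of '_', false for a literal.
  std::string literal;  // Used when isWildcard is false.
  size_t wildcardCount = 0; // Number of '_' when isWildcard is true.
};

// Byte length of the UTF-8 character starting at 'lead' (simplified
// stand-in for the utf8proc-based unicodeCharLength).
int charLen(uint8_t lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1;
}

// Match sub-patterns left to right against the start of 'input'; a
// trailing '%' is implied, so leftover input is accepted.
bool matchRelaxedPrefix(
    std::string_view input,
    const std::vector<SubPattern>& subs) {
  size_t cursor = 0;
  for (const auto& sub : subs) {
    if (!sub.isWildcard) {
      if (input.size() - cursor < sub.literal.size() ||
          input.compare(cursor, sub.literal.size(), sub.literal) != 0) {
        return false;
      }
      cursor += sub.literal.size();
    } else {
      // Each '_' consumes exactly one (possibly multi-byte) character.
      for (size_t i = 0; i < sub.wildcardCount; ++i) {
        if (cursor >= input.size()) {
          return false;
        }
        cursor += charLen(static_cast<uint8_t>(input[cursor]));
      }
    }
  }
  return true;
}
```

For hello_velox%, subs would be a "hello" literal, one wildcard, and a "velox" literal; the kRelaxedSuffix case is symmetric, anchoring at the end of the input instead.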

Thank you Masha Basmanova for spending a lot of time reviewing the code.

Learnings from optimizing try_cast

· 4 min read
Laith Sakka
Software Engineer @ Meta

One of the queries shadowed internally at Meta was much slower in Velox compared to Presto (2 CPU days vs. 4.5 CPU hours). Initial investigation identified that the overhead was related to casting empty strings inside a try_cast.

In this blog post, I summarize my learnings from investigating and optimizing try_cast.


Start and end results

Initial benchmark:

name                                 total time
try_cast(empty_string_col as int)    4.88s
try_cast(valid_string_col as int)    2.15ms

The difference between casting a valid and an invalid input is huge (>1000×), although ideally casting an invalid string should just set a null and should not be that expensive.

Benchmark results after optimization:

name                                 total time
try_cast(empty_string_col as int)    1.24ms
try_cast(valid_string_col as int)    2.15ms

Sources of regression

The investigation revealed several factors that contributed to the huge gap, summarized below along with their approximate significance.

Error logs overhead.

Whenever a VeloxUserError was thrown, an error log used to be generated. However, those errors are expected to either (1) get converted to null if thrown from within a try, or (2) surface to the user otherwise. Hence, there is no need for that expensive logging.

Moreover, each failing row used to generate two log messages because VELOX_USER_FAIL was called twice. Disabling logging for user errors saved 2.6s of the 4.88s.

Throwing overhead.

Each time a row was cast, four exceptions were thrown:

  1. From within the Folly library.
  2. From Cast in Conversions.h, which caught the exception thrown by Folly, converted it to a Velox exception, and threw that.
  3. From the castToPrimitive function, which caught the exception and threw a new one with more context.
  4. Finally, a fourth throw came from applyToSelectedNoThrow, which caught the exception and called toVeloxException to check the exception type and re-throw.

These throws were addressed and avoided as follows:

  1. When the input is an empty string, avoid calling Folly by directly checking whether the input is empty.
  2. Remove the catch-and-rethrow from Conversions.h.
  3. Introduce setVeloxExceptionError, which can be used to set the error directly in evalContext without throwing (it does not call toVeloxException).
  4. Optimize applyToSelectedNoThrow to call setVeloxExceptionError when it catches a Velox exception.

With all those changes, throwing exceptions is completely avoided when casting empty strings. This takes the runtime down to 382.07ms, but it is still much higher than 2.15ms.

Velox exception construction overhead.

Constructing a Velox exception is expensive, even when nothing is thrown at all! Luckily, this can be avoided for try_cast, since the output can be set directly to null without using the errorVector to track errors. Doing so brings the benchmark runtime down to 1.24ms.


Follow up tasks

After all the changes, we have the following performance numbers for other patterns of similar expressions (much better than before, but still with a lot of room for optimization).

try_cast(empty_string_col as int)       1.24ms    808.79
try_cast(invalid_string_col as int)     393.61ms  2.54
try(cast(empty_string_col as int))      375.82ms  2.66
try(cast(invalid_string_col as int))    767.74ms  1.30

All of these can be optimized to have the same runtime cost as the first expression, 1.24ms.

To do that, two things are needed:

1) Tracking errors for try should not require constructing exceptions

The way errors are tracked when evaluating a try expression is by setting values in an ErrorVector, which is a vector of VeloxException pointers. This forces the construction of a Velox exception for each failing row, but that is not needed for try expressions: only the row numbers need to be tracked (to be eventually converted to nulls), not the actual errors.

This can be changed so that errors are tracked using a selectivity vector. It's worth noting that for other expressions, such as conjuncts, the actual errors are needed, so we need to distinguish between the two cases.

This would help optimize any try(x) expression where x throws for a large number of rows.
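A minimal sketch of the idea, tracking only which rows failed rather than materializing an exception per row (all names here are hypothetical; Velox's actual selectivity vector and vector types differ):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical error tracker for try(x): one bit per row marks failure,
// instead of an ErrorVector holding a VeloxException pointer per row.
struct TryErrorTracker {
  std::vector<bool> failed; // Acts like a selectivity vector of error rows.

  explicit TryErrorTracker(size_t numRows) : failed(numRows, false) {}

  // Recording a failure is a bit flip -- no exception is constructed.
  void markError(size_t row) {
    failed[row] = true;
  }

  // At the end of try evaluation, convert failed rows to nulls in the
  // (illustrative) result vector.
  void applyNulls(std::vector<std::optional<int32_t>>& result) const {
    for (size_t i = 0; i < result.size(); ++i) {
      if (failed[i]) {
        result[i] = std::nullopt;
      }
    }
  }
};
```

The per-failure cost drops from "allocate and populate an exception object" to a single bit write, which is why this pays off when x throws for a large number of rows.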

2) Use a throw-free cast library

Avoiding throws, and instead returning a boolean, would allow us to directly set null in try_cast and avoid the construction of exceptions completely.

While this is now done for empty strings, it is not done for all other types of errors. Folly provides a non-throwing API, folly::tryTo, that can be tried for this purpose. According to the Folly documentation, on the error path you can expect tryTo to be roughly three orders of magnitude faster than the throwing to, and to completely avoid any lock contention arising from stack unwinding.
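To illustrate the shape of a throw-free cast, here is a sketch using the standard library's std::from_chars, which is analogous in spirit to folly::tryTo: failures come back as an error code rather than an exception. This is not Velox's code.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Throw-free string-to-int cast: empty, malformed, partially consumed,
// or out-of-range inputs all map to std::nullopt, with no throw anywhere.
std::optional<int32_t> tryCastToInt(std::string_view s) {
  int32_t value;
  auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
  if (ec != std::errc{} || ptr != s.data() + s.size()) {
    return std::nullopt;
  }
  return value;
}
```

With this shape, try_cast can turn a failed parse into a null output directly, so the error path costs about the same as the success path.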