Notes on a Bug Caused by Unicode Encoding

Background

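The scenario: a customer copied some text from a PDF and pasted it into the search box. The text was the four Chinese characters 工时规范 ("working-hours specification"). The search returned nothing, yet typing the same text by hand worked fine. The two requests were: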

https://xxx?name=%E5%B7%A5%E6%97%B6%E8%A7%84%E8%8C%83
https://xxx?name=%E2%BC%AF%E6%97%B6%E8%A7%84%E8%8C%83

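The links above are shown in URL-encoded (percent-encoded) form.

Comparing them shows that the encoded bytes differ even though the rendered text looks identical. Percent-encoding writes each byte as two hexadecimal digits, and only the first three bytes differ, that is, the character 工:

  • E5B7A5 (typed by hand)
  • E2BCAF (pasted from the PDF)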
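
The byte-level difference can be reproduced locally. The sketch below is not part of the original investigation; it uses Python's standard urllib.parse.unquote to decode the two query values from the requests above and compares them code point by code point. The helper name code_points is made up for illustration.

```python
from urllib.parse import unquote

# Query values copied from the two requests above.
typed = unquote("%E5%B7%A5%E6%97%B6%E8%A7%84%E8%8C%83")
pasted = unquote("%E2%BC%AF%E6%97%B6%E8%A7%84%E8%8C%83")


def code_points(text):
    """Return the Unicode code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(typed == pasted)      # False
print(code_points(typed))   # ['U+5DE5', 'U+65F6', 'U+89C4', 'U+8303']
print(code_points(pasted))  # ['U+2F2F', 'U+65F6', 'U+89C4', 'U+8303']
```

Only the first code point differs, which matches the three differing bytes in the URLs.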

Investigation

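The first question: why do different encodings render as the same character?
URL encoding is normally applied to UTF-8 bytes, so the two sequences can be looked up on the encode_utf8 site: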

Character   UTF-8 (decimal)   UTF-8 (hex)   Unicode code point (decimal)   Unicode code point (hex)
工          15054757          E5B7A5        24037                          5DE5
⼯          14859439          E2BCAF        12079                          2F2F

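My mental model of an encoding is a key-value map: the key is the code and the value is the character. Unicode maps a code point to a character, and UTF-8 maps that same code point to a sequence of one to four bytes, so the two differ only in the range and size of their keys.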
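
To make the relationship between the two kinds of key concrete, here is a small sketch that recomputes the rows of the table above from the characters themselves. The function name table_row is hypothetical. It relies only on str.encode and ord from the standard library.

```python
def table_row(ch):
    """Return (UTF-8 as decimal, UTF-8 as hex, code point decimal, code point hex)."""
    utf8 = ch.encode("utf-8")        # the bytes that get percent-encoded in a URL
    code_point = ord(ch)             # the Unicode key for the character
    return (
        int.from_bytes(utf8, "big"),
        utf8.hex().upper(),
        code_point,
        f"{code_point:04X}",
    )


print(table_row("\u5de5"))  # (15054757, 'E5B7A5', 24037, '5DE5')
print(table_row("\u2f2f"))  # (14859439, 'E2BCAF', 12079, '2F2F')
```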

With symbl, you can look up which character a Unicode code point stands for, in other words get(k).

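The two characters turn out to belong to different Unicode blocks: U+5DE5 is in CJK Unified Ideographs, while U+2F2F is in Kangxi Radicals.

In other words, both 工 glyphs are Chinese, but one is a radical (a component used to index characters) and the other is the ordinary character we normally type.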
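
The same lookup can be done offline with Python's unicodedata module, as a rough stand-in for the symbl site. The helper name describe is made up for this sketch. The decomposition field is the interesting part: it records that the radical is a compatibility variant of the ordinary character.

```python
import unicodedata


def describe(ch):
    """Return (official name, general category, decomposition mapping)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


print(describe("\u5de5"))  # ('CJK UNIFIED IDEOGRAPH-5DE5', 'Lo', '')
print(describe("\u2f2f"))  # ('KANGXI RADICAL WORK', 'So', '<compat> 5DE5')
```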
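I was using ChatGPT to help with the investigation. When I gave it the two characters, it told me the difference was full-width versus half-width, so I went off to read about that distinction. That was clearly not the cause, which means ChatGPT can mislead on this kind of question.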

Approach

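Once the cause was known, the next step was finding a fix. It turns out Unicode defines normalization for exactly this. As I understand it, normalization addresses compatibility: some characters look alike, and the problem is not limited to Chinese. To the computer they have different code points, but to a human they may be the same character.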

>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'

>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True

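The Java equivalent is just as short. Note that it uses NFKC rather than NFC: the Kangxi radical maps to 工 only through a compatibility decomposition, which NFC leaves untouched.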

Normalizer.normalize(argStr, Normalizer.Form.NFKC);
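
The choice of form matters for this bug. The following sketch uses Python's unicodedata.normalize as a stand-in for the Java Normalizer call (both implement the standard Unicode normalization forms) to check which form actually folds the radical into the ordinary character. It also shows a side effect worth knowing before applying NFKC to every parameter: full-width letters and digits are folded to ASCII as well.

```python
import unicodedata

radical = "\u2f2f"    # KANGXI RADICAL WORK, as pasted from the PDF
ideograph = "\u5de5"  # the ordinary character 工

# NFC only applies canonical mappings, so the radical survives.
print(unicodedata.normalize("NFC", radical) == ideograph)   # False
# NFKC also applies compatibility mappings, so the radical becomes 工.
print(unicodedata.normalize("NFKC", radical) == ideograph)  # True

# Side effect: NFKC folds full-width forms too.
fullwidth = "\uff21\uff22\uff23\uff11\uff12\uff13"
print(unicodedata.normalize("NFKC", fullwidth))  # ABC123
```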

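So there is a basic fix, but how should it be integrated into the project? The project uses Spring Boot, with Spring MVC for the web layer. The question is where to put the logic. The simplest option is to normalize whichever field causes trouble, but as such fixes accumulate, the same snippet ends up duplicated in many places. For cross-cutting concerns like this, the obvious Spring answer is AOP: write an aspect that intercepts every argument and normalizes it. That would work, but it is rather blunt, and I wondered whether there was a more elegant way.
That requires answering one question first: how does Spring turn the URL-encoded request into text for us? There should be two steps:

  • Decode the URL-encoded byte stream into text
  • Bind the text to the method's parameters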
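
The two steps can be sketched in a few lines of Python. This is only an analogy for what the servlet container and Spring MVC do, not their implementation. The names search and dispatch are hypothetical.

```python
import inspect
from urllib.parse import parse_qs, urlsplit


def search(name):
    """Hypothetical handler standing in for a controller method."""
    return f"searching for {name}"


def dispatch(url, handler):
    # Step 1: percent-decode the query string into text (UTF-8 by default).
    decoded = parse_qs(urlsplit(url).query)
    # Step 2: bind the decoded values to the handler's parameters by name.
    kwargs = {
        param: decoded[param][0]
        for param in inspect.signature(handler).parameters
        if param in decoded
    }
    return handler(**kwargs)


result = dispatch("https://xxx?name=%E5%B7%A5%E6%97%B6%E8%A7%84%E8%8C%83", search)
print(result == "searching for \u5de5\u65f6\u89c4\u8303")  # True
```

Normalization could in principle be added at either step. The rest of this post works on the second one.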

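In Spring MVC, the binding work is done by HandlerMethodArgumentResolver. The interface has many default implementations that resolve arguments in different situations.
See, for example, RequestMappingHandlerAdapter#getDefaultArgumentResolvers():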

private List<HandlerMethodArgumentResolver> getDefaultArgumentResolvers() {
    List<HandlerMethodArgumentResolver> resolvers = new ArrayList<>();

    // Annotation-based argument resolution
    resolvers.add(new RequestParamMethodArgumentResolver(getBeanFactory(), false));
    resolvers.add(new RequestParamMapMethodArgumentResolver());
    resolvers.add(new PathVariableMethodArgumentResolver());
    resolvers.add(new PathVariableMapMethodArgumentResolver());
    resolvers.add(new MatrixVariableMethodArgumentResolver());
    resolvers.add(new MatrixVariableMapMethodArgumentResolver());
    resolvers.add(new ServletModelAttributeMethodProcessor(false));
    resolvers.add(new RequestResponseBodyMethodProcessor(getMessageConverters(), this.requestResponseBodyAdvice));
    resolvers.add(new RequestPartMethodArgumentResolver(getMessageConverters(), this.requestResponseBodyAdvice));
    resolvers.add(new RequestHeaderMethodArgumentResolver(getBeanFactory()));
    resolvers.add(new RequestHeaderMapMethodArgumentResolver());
    resolvers.add(new ServletCookieValueMethodArgumentResolver(getBeanFactory()));
    resolvers.add(new ExpressionValueMethodArgumentResolver(getBeanFactory()));
    resolvers.add(new SessionAttributeMethodArgumentResolver());
    resolvers.add(new RequestAttributeMethodArgumentResolver());

    // Type-based argument resolution
    resolvers.add(new ServletRequestMethodArgumentResolver());
    resolvers.add(new ServletResponseMethodArgumentResolver());
    resolvers.add(new HttpEntityMethodProcessor(getMessageConverters(), this.requestResponseBodyAdvice));
    resolvers.add(new RedirectAttributesMethodArgumentResolver());
    resolvers.add(new ModelMethodProcessor());
    resolvers.add(new MapMethodProcessor());
    resolvers.add(new ErrorsMethodArgumentResolver());
    resolvers.add(new SessionStatusMethodArgumentResolver());
    resolvers.add(new UriComponentsBuilderMethodArgumentResolver());

    // Custom arguments
    if (getCustomArgumentResolvers() != null) {
        resolvers.addAll(getCustomArgumentResolvers());
    }

    // Catch-all
    resolvers.add(new RequestParamMethodArgumentResolver(getBeanFactory(), true));
    resolvers.add(new ServletModelAttributeMethodProcessor(true));

    return resolvers;
}

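HandlerMethodArgumentResolver declares two methods:

  • supportsParameter decides whether the resolver supports a given parameter and returns a boolean
  • resolveArgument performs the actual resolution and returns an Object, the resolved value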

public interface HandlerMethodArgumentResolver {

    /**
     * Whether the given {@linkplain MethodParameter method parameter} is
     * supported by this resolver.
     * @param parameter the method parameter to check
     * @return {@code true} if this resolver supports the supplied parameter;
     * {@code false} otherwise
     */
    boolean supportsParameter(MethodParameter parameter);

    /**
     * Resolves a method parameter into an argument value from a given request.
     * A {@link ModelAndViewContainer} provides access to the model for the
     * request. A {@link WebDataBinderFactory} provides a way to create
     * a {@link WebDataBinder} instance when needed for data binding and
     * type conversion purposes.
     * @param parameter the method parameter to resolve. This parameter must
     * have previously been passed to {@link #supportsParameter} which must
     * have returned {@code true}.
     * @param mavContainer the ModelAndViewContainer for the current request
     * @param webRequest the current request
     * @param binderFactory a factory for creating {@link WebDataBinder} instances
     * @return the resolved argument value, or {@code null} if not resolvable
     * @throws Exception in case of errors with the preparation of argument values
     */
    @Nullable
    Object resolveArgument(MethodParameter parameter, @Nullable ModelAndViewContainer mavContainer,
            NativeWebRequest webRequest, @Nullable WebDataBinderFactory binderFactory) throws Exception;

}

Clearly, the class that handles the @RequestParam annotation is RequestParamMethodArgumentResolver. Its resolveName method looks like this:

protected Object resolveName(String name, MethodParameter parameter, NativeWebRequest request) throws Exception {
    HttpServletRequest servletRequest = request.getNativeRequest(HttpServletRequest.class);

    if (servletRequest != null) {
        Object mpArg = MultipartResolutionDelegate.resolveMultipartArgument(name, parameter, servletRequest);
        if (mpArg != MultipartResolutionDelegate.UNRESOLVABLE) {
            return mpArg;
        }
    }

    Object arg = null;
    MultipartRequest multipartRequest = request.getNativeRequest(MultipartRequest.class);
    if (multipartRequest != null) {
        List<MultipartFile> files = multipartRequest.getFiles(name);
        if (!files.isEmpty()) {
            arg = (files.size() == 1 ? files.get(0) : files);
        }
    }
    if (arg == null) {
        String[] paramValues = request.getParameterValues(name);
        if (paramValues != null) {
            arg = (paramValues.length == 1 ? paramValues[0] : paramValues);
        }
    }
    return arg;
}

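A quick read shows that the value ultimately comes from request.getParameterValues(name); following that call further leads into Tomcat's code.
I wondered whether there was some encoding-related setting that could hook into this step. I did find the HttpProperties configuration class, but it only configures the character encoding and has nothing to do with normalization.
So the goal became giving RequestParamMethodArgumentResolver the ability to normalize, and the only way to do that is to work on RequestParamMethodArgumentResolver itself.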

Is there an extension point for customization?

Reading the code further, the class extends AbstractNamedValueMethodArgumentResolver, which has a method handleResolvedValue with the following signature:


/**
 * Invoked after a value is resolved.
 * @param arg the resolved argument value
 * @param name the argument name
 * @param parameter the argument parameter type
 * @param mavContainer the {@link ModelAndViewContainer} (may be {@code null})
 * @param webRequest the current request
 */
protected void handleResolvedValue(@Nullable Object arg, String name, MethodParameter parameter,
        @Nullable ModelAndViewContainer mavContainer, NativeWebRequest webRequest) {
}

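This looks like an extension point, but unfortunately it is a void method. Java passes arguments by value and String is immutable, so even an override of this method has no way to hand back a new arg.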
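
The limitation is easy to demonstrate. Python strings are immutable as well, and reassigning a parameter only rebinds the local name, so the following is a loose analogy rather than Java semantics. Both function names are hypothetical stand-ins for the Spring methods.

```python
import unicodedata


def handle_resolved_value(arg):
    """Analogue of the void hook: it can inspect arg but returns nothing."""
    # This rebinds the local name only; the caller's string is untouched.
    arg = unicodedata.normalize("NFKC", arg)


def resolve_name(raw):
    """Analogue of resolveName: whatever it returns is what the caller uses."""
    return unicodedata.normalize("NFKC", raw)


value = "\u2f2f"
handle_resolved_value(value)
print(value == "\u2f2f")                # True: the hook changed nothing
print(resolve_name(value) == "\u5de5")  # True: a returned value can replace it
```

This is why the change has to go into a method that returns the argument.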

Defining a new RequestParamMethodArgumentResolver

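Two problems need solving:

  • Define a new RequestParamMethodArgumentResolver subclass that overrides resolveName.
  • Work out how to take precedence over the default RequestParamMethodArgumentResolver.

A simple implementation: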

import java.text.Normalizer;

import org.springframework.beans.factory.config.ConfigurableBeanFactory;
import org.springframework.core.MethodParameter;
import org.springframework.web.context.request.NativeWebRequest;
import org.springframework.web.method.annotation.RequestParamMethodArgumentResolver;

public class RequestParamMethodNormalizedArgumentResolver extends RequestParamMethodArgumentResolver {

    public RequestParamMethodNormalizedArgumentResolver(boolean useDefaultResolution) {
        super(useDefaultResolution);
    }

    public RequestParamMethodNormalizedArgumentResolver(ConfigurableBeanFactory beanFactory, boolean useDefaultResolution) {
        super(beanFactory, useDefaultResolution);
    }

    @Override
    protected Object resolveName(String name, MethodParameter parameter, NativeWebRequest request) throws Exception {
        Object arg = super.resolveName(name, parameter, request);
        // Apply NFKC normalization to single String values only; multi-valued
        // parameters (String[]) and multipart files are returned unchanged.
        if (arg instanceof String && !((String) arg).isEmpty()) {
            arg = Normalizer.normalize((String) arg, Normalizer.Form.NFKC);
        }
        return arg;
    }
}

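For the second problem, getDefaultArgumentResolvers above shows that the order is hardcoded. Since the first resolver that supports a parameter is the one that gets used, it is enough to put the new resolver at the front of the list at the right moment: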

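import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.springframework.beans.BeansException;
import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.BeanFactoryAware;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.config.ConfigurableBeanFactory;
import org.springframework.context.annotation.Configuration;
import org.springframework.lang.Nullable;
import org.springframework.web.method.support.HandlerMethodArgumentResolver;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurationSupport;
import org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter;

@Configuration
public class WebMvcConfig extends WebMvcConfigurationSupport implements BeanFactoryAware, InitializingBean {
    @Nullable
    private ConfigurableBeanFactory beanFactory;

    @Autowired
    private RequestMappingHandlerAdapter requestMappingHandlerAdapter;

    @Override
    public void setBeanFactory(BeanFactory beanFactory) throws BeansException {
        if (beanFactory instanceof ConfigurableBeanFactory) {
            this.beanFactory = (ConfigurableBeanFactory) beanFactory;
        }
    }

    @Override
    public void afterPropertiesSet() {
        // Put the normalizing resolver first so it is consulted before the defaults.
        List<HandlerMethodArgumentResolver> argumentResolvers = requestMappingHandlerAdapter.getArgumentResolvers();
        List<HandlerMethodArgumentResolver> newArgumentResolvers = new ArrayList<>();
        newArgumentResolvers.add(new RequestParamMethodNormalizedArgumentResolver(this.beanFactory, false));
        newArgumentResolvers.addAll(argumentResolvers);
        requestMappingHandlerAdapter.setArgumentResolvers(Collections.unmodifiableList(newArgumentResolvers));
    }
}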
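
To see why the position in the list matters, here is a deliberately simplified Python model of a resolver chain in which the first resolver that claims a parameter wins. The Resolver class, resolve_argument, and both resolver instances are invented for this sketch. It mirrors the ordering behaviour only and says nothing about Spring's actual classes.

```python
import unicodedata


class Resolver:
    """Toy resolver: a label plus a function applied to the raw value."""

    def __init__(self, label, transform):
        self.label = label
        self.transform = transform

    def supports(self, parameter):
        # Both resolvers in this sketch claim every parameter.
        return True

    def resolve(self, raw):
        return self.transform(raw)


def resolve_argument(resolvers, parameter, raw):
    """Use the first resolver in the list that supports the parameter."""
    for resolver in resolvers:
        if resolver.supports(parameter):
            return resolver.label, resolver.resolve(raw)
    raise LookupError(f"no resolver for {parameter}")


default = Resolver("default", lambda raw: raw)
normalizing = Resolver("normalizing", lambda raw: unicodedata.normalize("NFKC", raw))

appended = resolve_argument([default, normalizing], "name", "\u2f2f")
prepended = resolve_argument([normalizing, default], "name", "\u2f2f")
print(appended[0])   # default: the custom resolver is never reached
print(prepended[0])  # normalizing
```

A resolver added after the defaults is shadowed by the default one that already handles the same parameters, which is why the configuration above rebuilds the list with the new resolver at the front.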

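For the second problem I also asked ChatGPT for code. What it produced only registers a custom resolver, which still ends up after the default implementations. ChatGPT cannot be trusted completely; it helps as a source of ideas, but the code it provides will not necessarily work as-is.
With that, normalization of URL query parameters is done.
This only covers query parameters, of course. Parameters in the body, and even in headers and the path, each need their own handling if they matter.
In fact, I think that if the front end normalized input before submitting the request, most of these problems would be avoided at the source.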

Reflections

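Two thoughts in hindsight:

  • Perhaps this is a pseudo-problem: the characters the user submitted really were "wrong"; they just look very much like the right ones. Is it necessary to normalize on the user's behalf? The experience would probably be much better, since in practice very few people actually search for radicals. My question is whether a wrong input should produce a right output. This trade-off comes up often in design: does a so-called optimization end up masking an "error"? Personally, I feel the right output should not be returned.
  • ChatGPT is less useful than one might imagine. Its answers can serve as a direction to explore, but parts of them may simply be wrong. Developers still need to rely on their own experience: keep asking questions, dig for the root cause, organize their thinking, and keep exploring. Never rely entirely on ChatGPT's answers.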

References