


Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+-------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
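
The table can also be read as an encoding recipe: pick the row by code point range, then fill the x positions with the code point's bits, highest bits first. Below is a minimal sketch of that reading. `EncodeUtf8` is a hypothetical helper named here for illustration, not something from the argcv library, and for brevity it does not reject surrogate code points.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one code point row by row, as in the table.
// Returns an empty string for values above U+10FFFF.
// Assumption: surrogates (U+D800..U+DFFF) are not filtered out.
std::string EncodeUtf8(std::uint32_t cp) {
  std::string out;
  if (cp <= 0x7F) {
    out += static_cast<char>(cp);                          // 0xxxxxxx
  } else if (cp <= 0x7FF) {
    out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
  } else if (cp <= 0xFFFF) {
    out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
  } else if (cp <= 0x10FFFF) {
    out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
  }
  return out;
}
```

For example, U+4F60 (你) falls in the third row, so it comes out as three bytes.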

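A single-byte character has its highest bit set to 0, and the remaining 7 bits hold the content. For a character longer than one byte, the first byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0 (in other words, the lead byte spends byte count + 1 bits on the length marker). Every byte from the second one on starts with 10. Because of this scheme, the original UTF-8 design could go up to 6 bytes per character (beyond that the lead byte has no room left). RFC 3629, where the table above comes from, caps it at 4 bytes, which is enough to reach U+10FFFF.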
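
As a rough illustration of the rule above, the length of a sequence can be read off its first byte with a few bit masks. `Utf8SeqLength` is a hypothetical name used only for this sketch, and it assumes the 4-byte limit shown in the table.

```cpp
#include <cstddef>

// Hypothetical helper (not from the argcv library): returns how many bytes
// the UTF-8 sequence starting with `lead` occupies, or 0 if `lead` cannot
// start a sequence (a continuation byte 10xxxxxx, or the pattern 11111xxx).
std::size_t Utf8SeqLength(unsigned char lead) {
  if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // not a valid lead byte
}
```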

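Once you know the pattern, splitting a UTF-8 string into characters is easy. First look at the lead byte to get the length of the current character in bytes, then cut that many bytes starting from the current position. I wrote a demo here; take a look if you're interested.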


#include <cstdio>
#include <string>
#include <vector>

#include "argcv/cxx/str/str_helper.h"

using std::string;
using std::vector;

using namespace argcv;

int main(int argc, char* argv[]) {
  // Split the string into one element per UTF-8 character.
  vector<string> elems = Utf8Split("abcd\u00A0你好世界123\n");
  // Print each character's index, text, and length in bytes.
  for (size_t i = 0; i < elems.size(); i++) {
    printf("%zu (%s):%zu\n", i,
      elems[i].c_str(), elems[i].length());
  }
  return 0;
}


$ ./run_example
0 (a):1
1 (b):1
2 (c):1
3 (d):1
4 ( ):2
5 (你):3
6 (好):3
7 (世):3
8 (界):3
9 (1):1
10 (2):1
11 (3):1
12 (
):1
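
For readers without the argcv headers at hand, here is a self-contained sketch of the same cutting idea. `SplitUtf8Sketch` is a hypothetical stand-in and not the actual `Utf8Split` implementation. As an assumption of this sketch, an invalid lead byte is emitted as a single 1-byte element, and a sequence truncated at the end of the string is clamped rather than read past the end.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for Utf8Split; not the argcv implementation.
// Cuts `s` into one element per UTF-8 sequence based on each lead byte.
std::vector<std::string> SplitUtf8Sketch(const std::string& s) {
  std::vector<std::string> out;
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 1;  // ASCII, or fallback for an invalid lead byte
    if ((lead & 0xE0) == 0xC0) {
      len = 2;  // 110xxxxx
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3;  // 1110xxxx
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4;  // 11110xxx
    }
    // Clamp so a truncated trailing sequence does not run past the end.
    if (len > s.size() - i) len = s.size() - i;
    out.push_back(s.substr(i, len));
    i += len;
  }
  return out;
}
```

This version trusts the lead byte and does not check that the following bytes really start with 10, so it is only suitable for input that is already known to be valid UTF-8.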




Yuan · July 30, 2015 at 19:26


mooc · July 8, 2015 at 17:22


    yu · July 8, 2015 at 23:07

    @mooc huh? But this code doesn't use any templates.

      mooc · July 8, 2015 at 23:16

      @yu QAQ, my mistake, I meant vector... though the vector container counts as a template too, right?

        yu · July 8, 2015 at 23:25

        @mooc Yes, but it's not a template I wrote.

        I did write the perceptron with templates, but sadly it was too shallow and nobody read it...


          mooc · July 8, 2015 at 23:29

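          @yu Ah, I'm far behind you; I still have a lot to learn from my seniors :) Apart from the programming ones, I can't understand a single one of your posts.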

            yu · July 8, 2015 at 23:36

            @mooc Assuming you're in your last year of high school, I'm six or seven years older than you.
