python - How to find the positions of a list of substrings in a string?

Given a string:

"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."

and a list of substrings:

['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

Desired output:

>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
        (34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
        (68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
        (110, 119), (120, 122), (123, 131), (131, 132)]

Explanation of the output: each tuple is a (start, end) pair of indices into the string s. For example, the first substring, "The", is s[0:3].

So if we loop over all the integer tuples in the desired output (call it out), we get back the list of substrings:

>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

I've tried:

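def find_offsets(tokens, s):
    index = 0  # Position in s just past the previous match.
    offsets = []
    for token in tokens:
        # Search only the remainder of s, then convert back to an absolute offset.
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets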

Is there another way to find the positions of a list of substrings in a string?

Best answer
If we know nothing about the substrings, there is no way around rescanning the entire text for each one.
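
A minimal sketch of that general case, assuming the fragments may appear in any order and any number of times. The helper name find_all_spans is hypothetical and not part of the original answer; it simply rescans the full text for every fragment with str.find:

```python
def find_all_spans(text, fragments):
    """Map each distinct fragment to every (start, end) span where it occurs.

    Hypothetical helper for illustration: the whole text is rescanned
    once per distinct fragment.
    """
    result = {}
    for fragment in set(fragments):
        if not fragment:
            continue  # An empty string matches everywhere; skip it.
        spans = []
        start = text.find(fragment)
        while start != -1:
            spans.append((start, start + len(fragment)))
            # Resume one character later so overlapping matches are found too.
            start = text.find(fragment, start + 1)
        result[fragment] = spans
    return result
```

For instance, find_all_spans('foo in bar in', ['in']) gives {'in': [(4, 6), (11, 13)]}. A repeated fragment maps to several spans, so this form cannot say which occurrence a given token refers to.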

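If, as the data suggests, we know that these are sequential fragments of the text, given in text order, it is easy to scan only the rest of the text after each match. There is no point in slicing the text each time, though: str.index accepts a start position.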

def spans(text, fragments):
    result = []
    point = 0  # Current position in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result

Test:

>>> spans('foo in bar', ['foo', 'in', 'bar'])
[(0, 3), (4, 6), (7, 10)]

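This assumes that every fragment is present in the text, at the right place. Your output format doesn't give an example of how a mismatch should be reported. Using .find instead of .index (which raises ValueError on a miss) could help, but only partially.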
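
One way to see why .find helps only partially: it returns -1 instead of raising, so the loop can record a miss and carry on, but it still cannot tell a fragment that is truly absent from one that is merely out of order. A sketch, assuming a miss is reported as None and the scan position is left unchanged; the name spans_or_none and this reporting convention are hypothetical:

```python
def spans_or_none(text, fragments):
    """Like an ordered span search, but report a miss as None (assumed convention)."""
    result = []
    point = 0  # Current position in the text.
    for fragment in fragments:
        found_start = text.find(fragment, point)
        if found_start == -1:
            # Not found at or after point: record the miss, keep the position.
            result.append(None)
            continue
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result
```

Here spans_or_none('foo in bar', ['bar', 'foo']) gives [(7, 10), None]: 'foo' is in the text, but only before the current position, so it is reported the same way as a fragment that does not occur at all.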
