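Readwise Reader does not recognize the inline commentary in annotated-edition Chinese Epubs. To make these books easier to read in Readwise Reader, and because the commentary was hard to copy out cleanly separated from the main text when I annotated the book in Calibre, the markup of the commentary needs to be changed.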

The markup is replaced with regular-expression find and replace, keeping the commentary text itself.

| Find Regex | Replace Regex |
| --- | --- |
| `<span class="xiu"><span class="ord">绣<span class="ord0">旁<\/span><\/span>(.*?)<\/span>` | `「绣旁:\1」` |
| `<span class="xiu"><span class="ord">绣<span class="ord0">眉<\/span><\/span>(.*?)<\/span>` | `「绣眉:\1」` |
| `<span class="xiu"><span class="ord">绣<span class="ord0">夹<\/span><\/span>(.*?)<\/span>` | `「绣夹:\1」` |
| `<span class="jia"><span class="ord">张<span class="ord0">旁<\/span><\/span>(.*?)<\/span>` | `「张旁:\1」` |
| `<span class="jia"><span class="ord">张<span class="ord0">眉<\/span><\/span>(.*?)<\/span>` | `「张眉:\1」` |
| `<span class="jia"><span class="ord">张<span class="ord0">夹<\/span><\/span>(.*?)<\/span>` | `「张夹:\1」` |
| `<div class="quote">\s(<p>)<span class="ord">张<span class="ord0">回<\/span><\/span>` | `\1「张回:` |
| `(<\/p>)\s<\/div>(\s<p>词曰)` | `」\1\2` |
| `(<\/p>)\s<\/div>(\s<p>诗曰)` | `」\1\2` |
| `<div class="wenlong">\s(<p>)<span class="ord">文<span class="ord0">回<\/span><\/span>` | `\1「文回:` |
| `(<\/p>\s)<\/div>\s(<\/body>)` | `」\1\2` |

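The main technique here is that whatever a pair of parentheses () captures can be referred to in the replacement as \ followed by the group's number, which is how the commentary text is preserved.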
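
The capture-and-backreference idea can be tried outside the editor. Below is a minimal sketch in Python, chosen only for illustration because its `re.sub` accepts the same `\1` replacement syntax; the sample string is made up and merely shaped like the markup the first rule targets.

```python
import re

# Made-up sample line shaped like the markup the first table row targets.
html = ('text before<span class="xiu"><span class="ord">绣'
        '<span class="ord0">旁</span></span>annotation body</span>text after')

# (.*?) is capture group 1: the commentary text between the label and </span>.
pattern = r'<span class="xiu"><span class="ord">绣<span class="ord0">旁<\/span><\/span>(.*?)<\/span>'
# \1 in the replacement re-inserts whatever group 1 matched.
replacement = r'「绣旁:\1」'

result = re.sub(pattern, replacement, html)
print(result)
```

The tags around the commentary are dropped, while the text captured by group 1 survives inside the 「」 brackets.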

| Find Regex | Replace Regex |
| --- | --- |
| `<span class="kt"><img alt="庚辰本" class="font_patch" src="../Images/image00844.gif"/>(.*?)</span>` | `「庚:\1」` |
| `<span class="kt"><img alt="甲戌本" class="font_patch" src="../Images/image00842.gif"/>(.*?)</span>` | `「甲:\1」` |
| `<span class="kt"><img alt="戚序本" class="font_patch" src="../Images/image00845.gif"/>(.*?)</span>` | `「戚:\1」` |
| `<span class="kt"><img alt="己卯本" class="font_patch" src="../Images/image00843.gif"/>(.*?)</span>` | `「己:\1」` |
| `<span class="kt"><span class="red"><img alt="庚辰本" class="font_patch" src="../Images/image00844.gif"/>(.*?)</span></span>` | `「庚:\1」` |
| `<img alt="甲戌本" class="font_patch" src="../Images/image00842.gif"/><img alt="侧批" class="font_patch" src="../Images/image00851.gif"/>((?!img).*?)(</span>)` | `「甲侧:\1」\2` |
| `<img alt="庚辰本" class="font_patch" src="../Images/image00844.gif"/><img alt="侧批" class="font_patch" src="../Images/image00851.gif"/>((?!img).*?)(</span>)` | `「庚侧:\1」\2` |
| `<img alt="甲戌本" class="font_patch" src="../Images/image00842.gif"/><img alt="眉批" class="font_patch" src="../Images/image00850.gif"/>((?!img).*?)(</span>)` | `「甲眉:\1」\2` |
| `<img alt="甲戌本" class="font_patch" src="../Images/image00842.gif"/><img alt="夹批" class="font_patch" src="../Images/image00852.gif"/>((?!img).*?)(</span>)` | `「甲夹:\1」\2` |
| `<img alt="庚辰本" class="font_patch" src="../Images/image00844.gif"/><img alt="眉批" class="font_patch" src="../Images/image00850.gif"/>((?!img).*?)(</span>)` | `「庚眉:\1」\2` |
| `<img alt="蒙府本" class="font_patch" src="../Images/image00846.gif"/><img alt="侧批" class="font_patch" src="../Images/image00851.gif"/>((?!img).*?)(</span>)` | `「蒙侧:\1」\2` |
| `<img alt="庚辰本" class="font_patch" src="../Images/image00844.gif"/><img alt="夹批" class="font_patch" src="../Images/image00852.gif"/>((?!img).*?)(</span>)` | `「庚夹:\1」\2` |
| `<img alt="戚序本" class="font_patch" src="../Images/image00845.gif"/><img alt="夹批" class="font_patch" src="../Images/image00852.gif"/>((?!img).*?)(</span>)` | `「戚夹:\1」\2` |
| `<span class="small kt red">((?!span).*?)</span>` | `\1` |
| `<span class="x-small">((?!span).*?)</span>` | `\1` |
| `<span class="small kt">((?!span).*?)</span>` | `\1` |
| `<span class="red">((?!span).*?)</span>` | `\1` |
| `<img alt="戚序本" class="font_patch" src="../Images/image00845.gif"/><img alt="夹批" class="font_patch" src="../Images/image00852.gif"/>((?!img).*?)(</p>)` | `「戚夹:\1」\2` |
| `<img alt="甲辰本" class="font_patch" src="../Images/image00849.gif"/><img alt="夹批" class="font_patch" src="../Images/image00852.gif"/>((?!img).*?)(</p>)` | `「甲夹:\1」\2` |
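
The rules above are applied one after another, each on the output of the previous one. As a rough sketch of that process in Python (an illustration only: `RULES`, `apply_rules`, and the sample string are hypothetical names and data, and only two rows of the table are included):

```python
import re

# Two (find, replace) pairs copied from the table above, in table order.
RULES = [
    (r'<span class="kt"><img alt="庚辰本" class="font_patch" src="../Images/image00844.gif"/>(.*?)</span>',
     r'「庚:\1」'),
    (r'<span class="red">((?!span).*?)</span>', r'\1'),
]

def apply_rules(text, rules=RULES):
    """Run every rule over the text in order, feeding each result to the next rule."""
    for find, replace in rules:
        text = re.sub(find, replace, text)
    return text

# Made-up sample containing one labelled comment and one plain red span.
sample = ('A<span class="kt"><img alt="庚辰本" class="font_patch" '
          'src="../Images/image00844.gif"/>note</span>B<span class="red">C</span>')
converted = apply_rules(sample)
print(converted)
```

Keeping the pairs in a list makes the order explicit, which matters when one rule's output is another rule's input.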

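The (?!span) here is meant to guard against nested <span></span> elements, where the matched text would not be a properly paired HTML tag; in other words, no extra <span> tag is allowed inside the matched <span></span>. Note that a negative lookahead written once in front of .*? is only tested at that single position, so rejecting an inner <span> anywhere in the captured text requires repeating the check before every character, as in ((?:(?!<span).)*?).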
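
To see how a lookahead behaves on nested spans, here is a small Python sketch. The sample strings are made up, and the second pattern is a variant added here for comparison, not one of the rules in the table. It contrasts a lookahead written once before `.*?` with one repeated before every character.

```python
import re

# Hypothetical samples: a red span containing another span, and a flat one.
nested = '<span class="red">outer <span class="x-small">inner</span> tail</span>'
flat = '<span class="red">plain text</span>'

# Lookahead written once: tested only at the position right after the opening tag.
once = re.compile(r'<span class="red">((?!span).*?)</span>')
# Lookahead repeated before every character: no "<span" may start inside the capture.
tempered = re.compile(r'<span class="red">((?:(?!<span).)*?)</span>')

# The single lookahead passes, and the lazy match stops at the inner </span>.
once_capture = once.search(nested).group(1)
# The repeated lookahead refuses the nested case entirely.
tempered_match = tempered.search(nested)
# A flat span is still unwrapped as intended.
flat_result = tempered.sub(r'\1', flat)

print(once_capture)
print(tempered_match)
print(flat_result)
```

With the repeated check, nested spans are simply skipped, so they can be handled by running the inner-span rules first or by fixing them by hand.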

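The HTML structure of the commentary in 红楼梦 (Dream of the Red Chamber) is very messy; whatever these rules did not cover was proofread by hand.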

Regex reference: Regular expressions - JavaScript | MDN

