当前位置: 首页 > news >正文

怎建立自己网站做淘宝客如何查询百度收录情况

怎建立自己网站做淘宝客,如何查询百度收录情况,做头像的网站横杆带字,wordpress 测试数据返回的是文档解析分段内容组成的列表,分段内容默认chunk_size: int 250, chunk_overlap: int 50,250字分段,50分段处保留后面一段的前50字拼接即窗口包含下下一段前面50个字划分 from typing import Union, Listimport jieba import recla…

返回的是文档解析分段内容组成的列表,分段内容默认chunk_size: int = 250, chunk_overlap: int = 50,250字分段,50分段处保留后面一段的前50字拼接即窗口包含下下一段前面50个字划分

from typing import Union, Listimport jieba
import reclass SentenceSplitter:def __init__(self, chunk_size: int = 250, chunk_overlap: int = 50):self.chunk_size = chunk_sizeself.chunk_overlap = chunk_overlapdef split_text(self, text: str) -> List[str]:if self._is_has_chinese(text):return self._split_chinese_text(text)else:return self._split_english_text(text)def _split_chinese_text(self, text: str) -> List[str]:sentence_endings = {'\n', '。', '!', '?', ';', '…'}  # 句末标点符号chunks, current_chunk = [], ''for word in jieba.cut(text):if len(current_chunk) + len(word) > self.chunk_size:chunks.append(current_chunk.strip())current_chunk = wordelse:current_chunk += wordif word[-1] in sentence_endings and len(current_chunk) > self.chunk_size - self.chunk_overlap:chunks.append(current_chunk.strip())current_chunk = ''if current_chunk:chunks.append(current_chunk.strip())if self.chunk_overlap > 0 and len(chunks) > 1:chunks = self._handle_overlap(chunks)return chunksdef _split_english_text(self, text: str) -> List[str]:# 使用正则表达式按句子分割英文文本sentences = re.split(r'(?<=[.!?])\s+', text.replace('\n', ' '))chunks, current_chunk = [], ''for sentence in sentences:if len(current_chunk) + len(sentence) <= self.chunk_size or not current_chunk:current_chunk += (' ' if current_chunk else '') + sentenceelse:chunks.append(current_chunk)current_chunk = sentenceif current_chunk:  # Add the last chunkchunks.append(current_chunk)if self.chunk_overlap > 0 and len(chunks) > 1:chunks = self._handle_overlap(chunks)return chunksdef _is_has_chinese(self, text: str) -> bool:# check if contains chinese charactersif any("\u4e00" <= ch <= "\u9fff" for ch in text):return Trueelse:return Falsedef _handle_overlap(self, chunks: List[str]) -> List[str]:# 处理块间重叠overlapped_chunks = []for i in range(len(chunks) - 1):chunk = chunks[i] + ' ' + chunks[i + 1][:self.chunk_overlap]overlapped_chunks.append(chunk.strip())overlapped_chunks.append(chunks[-1])return overlapped_chunkstext_splitter = SentenceSplitter()def load_file(filepath):print("filepath:",filepath)if filepath.endswith(".md"):contents = extract_text_from_markdown(filepath)elif filepath.endswith(".pdf"):contents = extract_text_from_pdf(filepath)elif filepath.endswith('.docx'):contents = extract_text_from_docx(filepath)else:contents = extract_text_from_txt(filepath)return contentsdef extract_text_from_pdf(file_path: str):"""Extract text content from a PDF file."""import PyPDF2contents = []with open(file_path, 'rb') as f:pdf_reader = PyPDF2.PdfReader(f)for page in pdf_reader.pages:page_text = page.extract_text().strip()raw_text = [text.strip() for text in page_text.splitlines() if text.strip()]new_text = ''for text in raw_text:new_text += textif text[-1] in ['.', '!', '?', '。', '!', '?', '…', ';', ';', ':', ':', '”', '’', ')', '】', '》', '」','』', '〕', '〉', '》', '〗', '〞', '〟', '»', '"', "'", ')', ']', '}']:contents.append(new_text)new_text = ''if new_text:contents.append(new_text)return contentsdef extract_text_from_txt(file_path: str):"""Extract text content from a TXT file."""with open(file_path, 'r', encoding='utf-8') as f:contents = [text.strip() for text in f.readlines() if text.strip()]return contentsdef extract_text_from_docx(file_path: str):"""Extract text content from a DOCX file."""import docxdocument = docx.Document(file_path)contents = [paragraph.text.strip() for paragraph in document.paragraphs if paragraph.text.strip()]return contentsdef extract_text_from_markdown(file_path: str):"""Extract text content from a Markdown file."""import markdownfrom bs4 import BeautifulSoupwith open(file_path, 'r', encoding='utf-8') as f:markdown_text = f.read()html = markdown.markdown(markdown_text)soup = BeautifulSoup(html, 'html.parser')contents = [text.strip() for text in soup.get_text().splitlines() if text.strip()]return contentstexts = load_file(r"C:\Users\lo***山市城市建筑外立面管理条例.docx")
print(texts)

在这里插入图片描述

http://www.ds6.com.cn/news/40103.html

相关文章:

  • 网站做谷歌推广有效果吗seo自学网
  • 书店商城网站html模板下载百度搜索热度排名
  • 足球网站怎么做如何在百度上做免费推广
  • 做博客网站需要工具吗国际时事新闻
  • 天津百度建网站bt磁力猪
  • 贸易公司网站建哪家网络公司比较好
  • 文化建设基金管理有限公司网站媒体营销
  • 成都金牛网站建设公司aso网站
  • 用代码做一号店网站怎么做如何做推广和引流
  • 做任务赚q红包的网站2024百度下载
  • 网站建设公司用5g衡阳网站优化公司
  • 我的世界充值网站怎么做app001推广平台官网
  • 域名解析到别人网站怎么建造自己的网站
  • 难道做网站的工资都不高吗品牌策划的五个步骤
  • 网站不备案访问网络营销评价的名词解释
  • cms代码做网站百度在线提问
  • 微信怎么建小程序网站关键词在线优化
  • 个体户年报网上申报流程福州seo公司
  • 西宁个人网站建设怎样创建网站
  • 哪个网站专做童装批发网络营销软件推广
  • 网站留言系统是怎么做的推广平台有哪些?
  • 做任务赚钱网站有哪些帮收款的接单平台
  • 兰州有做百度网站的吗品牌营销策划有限公司
  • 合肥网站推广公司哪家好网页设计的流程
  • 江苏网站优化企业推广软文
  • 北辰正方建设集团有限公司官方网站google推广一年的费用
  • 深圳网站建设智能 乐云践新武汉好的seo优化网
  • 网站建设行业广告语全网营销系统是不是传销
  • 峨山网站建设seo营销专员
  • 湘西州建设银行网站南京百度推广开户