Not all data comes in a tabular format. As we enter the era of big data, data arrives in a wide variety of formats, including images, text, graphs, and more.

Because the format varies so much from one dataset to another, preprocessing the data into a form a machine can read is essential. In this article, I will show how to preprocess text data in Python using the NLTK library and the built-in re module.
1. Lowercase the text
Before we start processing the text, it is best to lowercase all of its characters. We do this to avoid case-sensitivity issues in the later steps.

Suppose we want to remove stop words from a string. The usual approach is to split the string, filter out the stop words, and join the remaining words back into a sentence. Without lowercasing, a stop word such as "The" will not match the lowercase entry "the" in the stop-word list, so it survives the filter and the string comes back unchanged. That is why lowercasing the text matters (see the sketch after the code below).

This is easy to do in Python. The code looks like this:
# Example
x = "Watch This Airport Get Swallowed Up By A Sandstorm In Under A Minute http://t.co/TvYQczGJdy"
# Lowercase the text
x = x.lower()
print(x)
>>> watch this airport get swallowed up by a sandstorm in under a minute http://t.co/tvyqczgjdy
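To see why this matters, here is a minimal sketch (assuming the NLTK stopwords corpus is already downloaded) showing that the English stop-word list contains only lowercase entries, so capitalized stop words go undetected:

from nltk.corpus import stopwords
stop_words = stopwords.words("english")
# Every entry in the list is lowercase, so a capitalized
# stop word is not recognized until the text is lowercased.
print("The" in stop_words)
>>> False
print("the" in stop_words)
>>> True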
 
2. Remove Unicode characters
Some texts contain Unicode characters that become unreadable when viewed as ASCII. Most of the time these are emojis or other non-ASCII symbols. To remove them, we can use code like this:
# Example
x = "Reddit Will Now QuarantineÛ_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP"
# Remove Unicode characters
x = x.encode('ascii', 'ignore').decode()
print(x)
>>> Reddit Will Now Quarantine_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP
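The errors argument of encode controls what happens to each non-ASCII character. A small sketch of the two common choices (the sample string here is my own):

# 'ignore' silently drops each non-ASCII character,
# while 'replace' substitutes a '?' for it.
s = "caf\u00e9 \u2764"
print(s.encode('ascii', 'ignore').decode())
>>> caf
print(s.encode('ascii', 'replace').decode())
>>> caf? ?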
 
3. Remove stop words
Stop words are words that contribute little to the meaning of a text, so we can remove them. To get a stop-word list, we can download one from the NLTK library. The implementation looks like this:
import nltk
from nltk.corpus import stopwords
# Download the stopwords corpus
# (nltk.download() with no argument opens the interactive downloader for everything)
nltk.download('stopwords')
stop_words = stopwords.words("english")
# Example
x = "America like South Africa is a traumatised sick country - in different ways of course - but still messed up."
# Remove stop words
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa traumatised sick country - different ways course - still messed up.
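The stop-word list is a plain Python list, so you can extend it with your own tokens before filtering. A small sketch; the extra tokens 'u' and 'rt' are hypothetical additions often dropped when cleaning tweets:

from nltk.corpus import stopwords
stop_words = stopwords.words("english")
# Append domain-specific tokens to the default list.
stop_words.extend(['u', 'rt'])
x = "rt u love sandstorms"
print(' '.join([word for word in x.split(' ') if word not in stop_words]))
>>> love sandstorms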
 
4. Remove mentions, hashtags, links, and similar tokens
Besides Unicode characters and stop words, several other kinds of tokens need to be removed, including mentions, hashtags, links, and punctuation.

These are hard to strip if we rely only on a fixed set of characters, so instead we match the patterns of the tokens we want using regular expressions (regex).

A regex is a special string that describes a pattern and matches any text associated with that pattern. We can search for or remove such patterns with Python's built-in re module. The implementation looks like this:
import re
import string

# Remove mentions
x = "@DDNewsLive @NitishKumar and @ArvindKejriwal can't survive without referring @@narendramodi . Without Mr Modi they are BIG ZEROS"
x = re.sub(r"@\S+", " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
# Remove URLs
x = "Severe Thunderstorm pictures from across the Mid-South http://t.co/UZWLgJQzNS"
x = re.sub(r"https*\S+", " ", x)
print(x)
>>> Severe Thunderstorm pictures from across the Mid-South
# Remove hashtags
x = "Are people not concerned that after #SLAB's obliteration in Scotland #Labour UK is ripping itself apart over #Labourleadership contest?"
x = re.sub(r"#\S+", " ", x)
print(x)
>>> Are people not concerned that after obliteration in Scotland UK is ripping itself apart over contest?
# Remove apostrophes together with the characters that follow them (e.g. 's, 't)
x = "Notley's tactful yet very direct response to Harper's attack on Alberta's gov't. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli"
x = re.sub(r"\'\w+", '', x)
print(x)
>>> Notley tactful yet very direct response to Harper attack on Alberta gov. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli
# Remove punctuation
x = "In 2014 I will only smoke crqck if I becyme a mayor. This includes Foursquare."
x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
print(x)
>>> In 2014 I will only smoke crqck if I becyme a mayor  This includes Foursquare
# Remove numbers (and any word containing a digit)
x = "C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980... http://t.co/tNI92fea3u http://t.co/czBaMzq3gL"
x = re.sub(r'\w*\d+\w*', '', x)
print(x)
>>> C- specially modified to land in a stadium and rescue hostages in Iran in ... http://t.co/ http://t.co/
# Collapse runs of two or more whitespace characters into a single space
x = " and can't survive without referring . Without Mr Modi they are BIG ZEROS"
x = re.sub(r'\s{2,}', " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
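When these patterns run over thousands of documents, it is slightly faster and tidier to compile each one once with re.compile and reuse it; a minimal sketch with the mention pattern (the sample string is my own):

import re
# Compile the pattern once, then reuse it across many strings.
mention_pattern = re.compile(r"@\S+")
print(mention_pattern.sub(" ", "@user thanks for the report"))
>>> thanks for the report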
 
5. Combining the functions
Now that we have walked through each preprocessing step, let's apply them to a list of texts. If you look at the steps closely, you will see that they build on one another, so we should wrap them in a single function that applies everything in order. Before applying the preprocessing steps, here are the sample texts:
- Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
 - Forest fire near La Ronge Sask. Canada
 - All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
 - 13,000 people receive #wildfires evacuation orders in California
 - Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
 
To preprocess the list of texts, we run all the steps above in sequence inside one function. The code is as follows:
# In case of import errors
# ! pip install nltk
# ! pip install textblob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
import string
from nltk.corpus import stopwords

# If the corpus is missing, download the NLTK stopwords data
nltk.download('stopwords')

df = pd.read_csv('train.csv')
stop_words = stopwords.words("english")

def text_preproc(x):
    x = x.lower()
    x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)
    x = re.sub(r'@\S+', ' ', x)
    x = re.sub(r'#\S+', ' ', x)
    x = re.sub(r'\'\w+', '', x)
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x

df['clean_text'] = df.text.apply(text_preproc)
 
The preprocessing above produces the following results:
- deeds reason may allah forgive us
 - forest fire near la ronge sask canada
 - residents asked place notified officers evacuation shelter place orders expected
 - people receive evacuation orders california
 - got sent photo ruby smoke pours school
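If you do not have the train.csv file at hand, the same function works on a plain Python list; a quick sketch reusing two of the sample texts above:

samples = [
    "Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all",
    "Forest fire near La Ronge Sask. Canada",
]
# Apply the same pipeline without a DataFrame.
for s in samples:
    print(text_preproc(s))
>>> deeds reason may allah forgive us
>>> forest fire near la ronge sask canada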
 
Those are the concrete steps for preprocessing text with Python. I hope they help you solve problems involving text data, making the data more consistent and your models more accurate.