How to build a complex text analysis app in a month: a Meteor development experience

Building a Meteor application

by Jeffrey Flynt

How I built a complex text analysis app in a month

How it started

I was in a week-long humanities intensive learning workshop at the University of Texas at Austin, in a Text Analysis session. One of the participants asked:

“Why doesn’t a software developer make this easier instead of us having to know R or Python?”

I felt comfortable in the session, as I had previous experience with both. But I could definitely understand the sentiment coming from users not comfortable writing commands to view quick outputs.

I am a research associate with the Quantitative Criticism Lab (QCL) at UT Austin. The principal investigators suggested I take the course. This course definitely helped me refine and discover new skills in natural language processing (NLP).

Inadvertently, I put this issue on the back burner while I focused on other toolkits and projects. Later, while attending a classical studies workshop in Boston, I noticed that many attendees were frustrated by the lack of simpler tools for text analysis and visualization.

The team I am on at UT Austin was developing a web-based stylometric toolkit for multiple languages, but there was not yet a full-featured option for English.

Granted, there are options like Voyant. But there are no readily available solutions that offer features such as named entity extraction, part-of-speech (POS) tagging, word segmentation from noisy text, and sentiment analysis to people without prior programming knowledge. Together with the frustration I had seen at both workshops, this reinforced the idea of rolling out a simpler app for NLP.

Where do I start?

While waiting to board the plane, my mind started reeling with where I should start. I ended up settling on creating the User Interface. My reasoning for this was that it would make it easier to create the functions after I figured out the workflow for the user.

Once the captain said it was alright to use portable electronics, I pulled out my laptop and got to work. I’m sure the guy sitting next to me thought I was hacking something, given all the time I spent in the console.

By the time I landed from the five-hour flight, I had finished the Login, Registration, Forgot Password, and Corpus Builder screens.

You might ask how it is possible to finish all of that, including the corresponding JavaScript functions and testing. A good practice I learned early on to reduce development time is to keep boilerplate code around for exactly these situations.

My drop-in code usually consists of:

Registration / Login
Notifications
Visualization (Chart.js, Bootstrap Tables, Handsontable.js)
Routing

Another approach to reduce development time, especially when you are working on a project solo, is to use pre-made admin templates for the UI. The Themeforest Admin Website Templates section has some great UI elements that I use in my projects. I shaved more than 50% off my usual development time by using pre-built assets.

Granted, I had to know my way around HTML, CSS, and jQuery. But with these assets already designed, I only had to worry about placement and data integration.

My framework of choice is Meteor.js. Meteor is a JavaScript framework that sits on top of Node.js. I am using MongoDB as my database and Python for the heavy NLP tasks.

For those of you who are familiar with Meteor: I opted not to use publications and subscriptions, only methods and calls, and I used dynamic imports for all third-party libraries running on the client. This helped boost performance. I also used workers for any client code that needs to manipulate strings. This got me down to a bundle size of around 500 KB.
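To make the dynamic import idea concrete, here is a minimal sketch of a cached lazy loader. The names `lazyLoad` and `loadedModules` are hypothetical and are not taken from the app; the sketch only assumes a bundler that supports the standard `import()` expression, which Meteor does.

```javascript
// Cache of in-flight or finished loads, keyed by a label of our choosing.
const loadedModules = new Map();

// `loader` is any function that returns a Promise,
// typically () => import("some-library").
function lazyLoad(name, loader) {
  if (!loadedModules.has(name)) {
    // Store the Promise itself so concurrent callers share one download.
    loadedModules.set(name, loader());
  }
  return loadedModules.get(name);
}

// Hypothetical usage on the client: fetch the charting code only when
// a chart is about to be drawn, instead of shipping it in the main bundle.
// lazyLoad("chart", () => import("chart.js")).then((mod) => { /* render */ });
```

Because the library is requested only on first use, it stays out of the initial bundle, which is the effect described above.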

I settled on Textalytic for the name and secured the website.

Oh, you thought it would be simple?

I assumed this would all be pretty straightforward given my previous experience working on NLP-related toolkits with the QCL. But there are always those gotcha moments.

I wanted users to be able to view highlighted named entities within their corpus. I first had to find a fast, JavaScript-compatible library to extract named entities.

At first I used compromise.js. This worked pretty well to an extent, but the speed on relatively large texts left something to be desired.

I then settled on SpaCy, but this was in Python. I had never had to integrate two different languages before, so it was off to Stack Overflow for some learning.

After getting SpaCy working with JavaScript, I encountered two issues with SpaCy. The first was that SpaCy would not return accurate counts.

Users were able to view frequency of nouns, adjectives, verbs, and so on. SpaCy would return 31 instances of “car” but when doing a manual count, I would get 44.

At first I had Python handle returning the frequency of nouns.

I ended up opting to just return the raw noun array and have JavaScript return the top 10 nouns.

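A minimal sketch of what the JavaScript side of that split could look like, assuming the Python process sends back a plain array of noun strings. The function name `topNouns` and the decision to lowercase before counting are assumptions for illustration, not details from the app.

```javascript
// `nouns` is the raw array of noun strings returned by the Python side.
function topNouns(nouns, limit = 10) {
  const counts = new Map();
  for (const noun of nouns) {
    // Assumption: "Car" and "car" should be counted as the same noun.
    const key = noun.toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    // Highest count first; ties are broken alphabetically for stable output.
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([noun, count]) => ({ noun, count }));
}
```

Counting in one place, over the full array, makes it easy to check the totals against a manual count.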

This led to accurate counts for nouns.

The second issue was with named entities. Most text analysis models, if not all, will not get 100% accuracy on named entities. To supplement SpaCy, I imported a large list of named entities taken from Wikidata into MongoDB.

The text is run through SpaCy, which returns an array of found entities. MongoDB then produces a large array of around 150k entities, which is sent along with the text to a function that matches each entity against word boundaries. Be warned: a regex that accounts for punctuation and word boundaries will cause many headaches.

These two arrays are then filtered and de-duplicated to produce a final array of entities. This method seemed to be the fastest, returning results in ~5 seconds.
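The matching and merging steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's code: the function names are hypothetical, entities are plain strings, and building one regex per entity is not tuned for a list of 150k entries.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the entries of `knownEntities` that occur in `text`
// as whole words or phrases.
function matchKnownEntities(text, knownEntities) {
  return knownEntities.filter((entity) => {
    // Lookarounds stand in for \b so that entities which start or end
    // with punctuation (for example "U.S.") are still bounded correctly.
    const pattern = new RegExp(
      `(?<![\\p{L}\\p{N}])${escapeRegExp(entity)}(?![\\p{L}\\p{N}])`,
      "u"
    );
    return pattern.test(text);
  });
}

// Combine the model's entities with the list matches, dropping duplicates
// (compared case-insensitively) while keeping first-seen order.
function mergeEntities(modelEntities, listEntities) {
  const seen = new Set();
  const merged = [];
  for (const entity of [...modelEntities, ...listEntities]) {
    const trimmed = entity.trim();
    const key = trimmed.toLowerCase();
    if (key && !seen.has(key)) {
      seen.add(key);
      merged.push(trimmed);
    }
  }
  return merged;
}
```

For example, "London" matches in "in London." but "Londoner" does not match "Londoners", because the character after the candidate is still a letter.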

This method provided better coverage, and therefore greater accuracy, than SpaCy’s 85.85% alone.

Can it be simpler, please?

Many of the tutorials for NLP tasks call for users to pre-process the text before analyzing. I wanted users to have a simpler approach.

With the Corpus Builder, users are able to type text, copy and paste it, or select a file from their computer or Dropbox.

Pre-processing

I now had to account for parsing different file types, sanitizing user input, and offering point-and-click options for pre-processing.

For pre-processing, users can remove blank lines, stop words, duplicate lines, punctuation, extra spaces, line breaks, and lines containing a user-selected word.

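As a rough illustration of how such point-and-click options might map onto code, here is a minimal sketch. The option names, the tiny stop word set, and the order of the steps are all assumptions made for the example.

```javascript
// A deliberately tiny stop word list, for illustration only.
const DEFAULT_STOP_WORDS = new Set(["a", "an", "and", "of", "the", "to"]);

function preprocess(text, options = {}) {
  let lines = text.split(/\r?\n/);

  if (options.removeLinesContaining) {
    // Drops lines that contain the given text (a plain substring check).
    const needle = options.removeLinesContaining.toLowerCase();
    lines = lines.filter((line) => !line.toLowerCase().includes(needle));
  }
  if (options.removePunctuation) {
    lines = lines.map((line) => line.replace(/\p{P}/gu, ""));
  }
  if (options.removeStopWords) {
    lines = lines.map((line) =>
      line
        .split(" ")
        .filter((word) => !DEFAULT_STOP_WORDS.has(word.toLowerCase()))
        .join(" ")
    );
  }
  if (options.collapseSpaces) {
    lines = lines.map((line) => line.replace(/\s+/g, " ").trim());
  }
  if (options.removeBlankLines) {
    lines = lines.filter((line) => line.trim() !== "");
  }
  if (options.removeDuplicateLines) {
    lines = [...new Set(lines)];
  }
  // Joining with a space instead of "\n" removes the line breaks.
  return lines.join(options.removeLineBreaks ? " " : "\n");
}
```

Each option is independent, so a checkbox in the UI can simply toggle the matching key.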

To keep in line with making text analysis easier, the user doesn’t have to do any of this. Depending upon the feature selected at run time, the back-end decides how to best handle the text.

Text Transformation

When taking the text analysis class, it seemed we did a lot of text transforming before starting to test various models.

I first tried integrating the Dropbox API into the app. I had assumed this was the only way to get this functionality. I was wrong: Dropbox has a component called Chooser, which allows users to bring their documents into the app without me spending more time adding API calls.

While in the text analysis class, users had to download text files from someone’s Google Drive or download ready-made corpora from NLTK. This took up quite a bit of time while waiting for everyone to have the files downloaded and imported.

For users who only want to test out features or get an understanding of how text analysis works, I opted to include a library of public domain literary works that they can add to their corpora. I am hopeful that this will lower the barrier to entry for beginning users.

For advanced users, I wanted them to have options and not be tied down to a default configuration. I implemented custom stop words, custom top occurring word lists, and more. Some users may want to search for the frequency of words that end in “-ing”, so I threw that in too.
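A suffix search like the “-ing” example could be sketched as below. The function name `suffixFrequency` and the tokenization rule are assumptions for illustration; note that a plain suffix match also picks up words such as "king".

```javascript
// Count how often each word ending in `suffix` appears in `text`.
function suffixFrequency(text, suffix) {
  // Accept both "-ing" and "ing".
  const wanted = suffix.replace(/^-/, "").toLowerCase();
  const counts = new Map();
  // Treat runs of letters and apostrophes as words.
  const words = text.toLowerCase().match(/[\p{L}']+/gu) || [];
  for (const word of words) {
    // Require the word to be longer than the suffix itself.
    if (word.length > wanted.length && word.endsWith(wanted)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}
```

Filtering by part of speech first (for example, keeping only verbs) would be one way to exclude false hits like "king".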

While adding these options, I had to account for extra spaces, transform the user’s input into a usable array, set limits on how big a custom list could get, and so forth.
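A minimal sketch of that input handling, assuming the custom list arrives as one free-form string. The cap of 500 words and the name `parseCustomList` are hypothetical values chosen for the example.

```javascript
const MAX_CUSTOM_WORDS = 500; // hypothetical cap, not the app's real limit

// Turn free-form user input into a clean, de-duplicated array of words.
function parseCustomList(input, maxWords = MAX_CUSTOM_WORDS) {
  const words = input
    .split(/[\s,]+/) // split on any run of whitespace or commas
    .map((word) => word.trim().toLowerCase())
    .filter((word) => word.length > 0); // drop empties from stray separators
  const unique = [...new Set(words)];
  if (unique.length > maxWords) {
    throw new RangeError(
      `Custom list has ${unique.length} words; the limit is ${maxWords}.`
    );
  }
  return unique;
}
```

Rejecting an oversized list outright, rather than silently truncating it, lets the UI tell the user exactly what went wrong.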

Do Not Block the Loop!

I didn’t want users to be limited to viewing the normalized frequency of nouns and verbs, so I ended up adding relative and subordinate clauses.

I’m currently testing more complex cases such as dangling modifiers, direct objects, and parallel structures.

Performance

I was excited to have this in place and that the results were returning pretty quickly. Then I started thinking about performance. I brought up the site on my laptop and desktop, and then proceeded to run an analysis on some very large corpora from both at once. As you might expect, my results weren’t returning as fast as when I was the only one running a search.

The issue was that my long-running functions were blocking the main event loop. I needed to offload these tasks to a separate process to keep Node responsive. I tried for hours to get functions running on another process.

Finally I found Napa.js from Microsoft. It was really simple to integrate and I didn’t have to change any of my functions.

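The sketch below illustrates the general idea of moving CPU-heavy work off the main event loop. It does not use Napa.js; it substitutes Node's built-in `worker_threads` module as a stand-in, and the word-counting task and the name `countWordsOffThread` are hypothetical.

```javascript
const { Worker } = require("worker_threads");

// Source of the worker script, evaluated in a separate thread.
const workerSource = `
  const { parentPort, workerData } = require("worker_threads");
  // CPU-heavy stand-in: count word frequencies for a large text.
  const counts = new Map();
  for (const word of workerData.text.toLowerCase().split(/\\s+/)) {
    if (word) counts.set(word, (counts.get(word) || 0) + 1);
  }
  parentPort.postMessage(counts);
`;

// Runs the count in a worker and resolves with a Map of word -> count.
// The main thread stays free to serve other requests in the meantime.
function countWordsOffThread(text) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { text },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
    worker.once("exit", (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}
```

A call such as `countWordsOffThread(corpusText).then((counts) => ...)` returns immediately, so one user's large corpus no longer stalls everyone else's requests.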

The app was now running smoothly with large corpora analyzed by multiple users. However, there is always a “but”!

When running searches with corpora that consisted of a very large body of ~500k words, Python would throw a ValueError. SpaCy has a set limit of 1,000,000 characters in a single string, which is modifiable. Naturally, I split the corpus into chunks.

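One way to do that splitting is sketched below, assuming the goal is simply to keep each chunk under a character limit without cutting words in half. The function name `splitIntoChunks` is hypothetical, and the article does not say on which side (JavaScript or Python) the splitting happened.

```javascript
// Split `text` into chunks of at most `maxChars` characters,
// breaking on whitespace where possible.
function splitIntoChunks(text, maxChars) {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new RangeError("maxChars must be a positive integer");
  }
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    // Skip whitespace so every chunk begins on a word.
    if (/\s/.test(text[start])) {
      start += 1;
      continue;
    }
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      // Walk back to the nearest whitespace so no word is cut in half.
      let cut = end;
      while (cut > start && !/\s/.test(text[cut])) cut -= 1;
      // If no whitespace was found, the word is longer than maxChars
      // and has to be cut at the limit.
      if (cut > start) end = cut;
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}
```

Splitting on whitespace can still separate a sentence, or a multi-word entity, across two chunks, so splitting on paragraph or sentence boundaries may be preferable where accuracy matters.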

Since this is a free app supported by myself, and server resources could get expensive, I opted to set hard limits of 1,000,000 corpus words per account and 50,000 words per corpus. A user can run an analysis over a group of corpora, but each corpus is analyzed individually. This should help prevent the server from maxing out on computationally intensive functions.

POS Tagging

Part-of-speech tagging visualized in a meaningful way was something I knew I had to have in the app. SpaCy returned the POS tags for each word in a large array of objects without issue, but this wasn’t helpful for the user. I had to transform this array into a visually pleasing format.

Compromise.js has a nice format for doing this, which I got the inspiration from.

I placed that array into a loop that added color tags based on the POS, transformed the new array into a string, and updated the page with the result.
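A minimal sketch of such a loop, assuming each token is an object with `text` and `pos` fields. The field names, the color palette, and the function names are hypothetical.

```javascript
// Hypothetical palette: one color per part-of-speech tag.
const POS_COLORS = new Map([
  ["NOUN", "#1f77b4"],
  ["VERB", "#d62728"],
  ["ADJ", "#2ca02c"],
]);

// Escape text before inserting it into HTML.
function escapeHtml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap each tagged token in a colored <span>; tokens whose tag has no
// color are emitted as plain (escaped) text.
function colorizeTokens(tokens) {
  return tokens
    .map(({ text, pos }) => {
      const safe = escapeHtml(text);
      const color = POS_COLORS.get(pos);
      return color
        ? `<span style="color:${color}" title="${pos}">${safe}</span>`
        : safe;
    })
    .join(" ");
}
```

Joining with a single space is a simplification: it puts a space before punctuation tokens, so a real version would want to preserve the original whitespace.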

Conclusion

In the span of a month, the app was in good shape to be released. I have since made various changes for optimization and other tweaks. I’m trying to stay away from adding npm modules unless I have to. Everything was written in vanilla JavaScript with the exception of the visualization libraries and toastr notifications. By doing this, the codebase is leaner and I don’t have to worry about when the maintainer of said project is going to do x.

Towards the end of this project, I started thinking: “Who would use this?” “Is this app actually good enough?” “Did I mess up somewhere and it’ll be forever tarnished?”

I suppressed those thoughts and figured that even if it failed, I had learned a heck of a lot that I probably wouldn’t have learned doing something else.

You can waste a lot of time trying to optimize every function. I learned quickly to abandon the notion of trying to write functions in the latest ES syntax. I did, however, focus on the performance of various functions, more so for user experience.

One of the best time-saving strategies was to use GitLab’s CI/CD pipeline, and it’s free!

Instead of manually building the bundle, stopping the service, uploading and so forth, I just did one commit in GitKraken. GitLab handles everything else on the server.

There was a learning curve in getting NGINX set up with multiple instances, load balancing, and sticky sessions. But there are so many resources out there to help you along the way, such as freeCodeCamp, Stack Overflow, and DigitalOcean’s blog section.

I am constantly thinking of new features to add that may be of use to users. Document summarization, custom machine learning models, and argument/stance detection are a few features I plan to add over the summer. If you’re interested in an NLP feature that might be useful, please let me know in the comments section. Thanks for reading!
