(by Eric Forsyth, Jane Lin, and Craig Martell)
License and Legal Issues
This corpus is distributed solely for non-commercial, non-profit educational and research use. It is a derivative compilation work of multiple works whose copyrights are held by the respective original authors.
How to get the NPS Chat Corpus
The NPS Chat Corpus is part of the Natural Language Toolkit (NLTK) distribution. NLTK is a “platform for building Python programs to work with human language data.” It includes both the whole NPS Chat Corpus, as well as a number of modules for working with the data.
Description of the NPS Chat Corpus
The NPS Chat Corpus, Release 1.0 consists of 10,567 posts out of approximately 500,000 posts we have gathered from various online chat services in accordance with their terms of service. Future releases will contain more posts from more domains. New releases will be announced and described at
The posts included in Release 1.0 have been:
1) Hand privacy masked;
2) Part-of-speech tagged; and
3) Dialogue-act tagged.
The posts have been privacy masked in two ways. Firstly, all usernames have been changed to generic names of the form "UserN", where N is a unique integer consistently used for each respective poster across all files. Secondly, the posts have been read by humans to remove other personally identifiable information.
All of the recordings included with this release are from age-specific chat rooms. Each file is a recording from one of these chat rooms for a short period on a particular day. The filename contains all this information plus the number of posts contained in the file. For example, the file 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006. All data released here was gathered in 2006. Please note that within each file usernames get prepended with the date and chat-room portions of the filename. So, UserN becomes 10-19-20sUserN.
As of release 1.0, the POS tag set is the Penn Treebank tag set. We are in the process of modifying the tag set to make it more chat specific, and hope to implement these changes in future releases. For example, conversational initialisms, like LOL, BRB, should have their own tag (CI). But we currently simply tag them as interjections (UH).
The dialogue-act tags are Accept, Bye, Clarify, Continuer, Emotion, Emphasis, Greet, No Answer, Other, Reject, Statement, System, Wh-Question, Yes Answer, Yes/No Question. (See  and , below.)
Here is a sample post from the corpus:
<Post class="whQuestion" user="11-08-teensUser117">whats balck and white and red all over?<terminals>
<t pos="WP" word="whats"/>
<t pos="^JJ" word="balck"/>
<t pos="CC" word="and"/>
<t pos="JJ" word="white"/>
<t pos="CC" word="and"/>
<t pos="JJ" word="red"/>
<t pos="DT" word="all"/>
<t pos="IN" word="over"/>
<t pos="." word="?"/>
Author, Citation and Contact Information
Please address all questions, comments and suggestions to Craig Martell (firstname.lastname@example.org).
Files in the corpus
 Eric N. Forsyth and Craig H. Martell, "Lexical and Discourse Analysis of
Online Chat Dialog," Proceedings of the First IEEE International Conference on
Semantic Computing (ICSC 2007), pp. 19-26, September 2007.
 T. Wu, F. M. Khan, T. A. Fisher, L. A. Shuler and W. M. Pottenger,
"Posting act tagging using transformation-based learning," Proceedings of the
Workshop on Foundations of Data Mining and Discovery, IEEE International
Conference on Data Mining, December 2002.
 A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P.
Taylor, R. Martin, C. Van Ess-Dykema and M. Meteer, "Dialogue act modeling for
automatic tagging and recognition of conversational speech," Computational
Linguistics, vol. 26, no. 3, pp. 339-373, 2000.
"Material contained herein is made available for the purpose of peer review and discussion and does not necessarily reflect the views of the Department of the Navy or the Department of Defense."
Last modified: 25 June 2008