public class TextDocument extends LinkedList<Token> implements Document
Constructor and Description |
---|
TextDocument() |
Modifier and Type | Method and Description |
---|---|
String | getContent() Returns the cleaned content of this document represented as a string. |
String | getContentStartingAtToken(Token start, SurfaceFormLevel l) Returns a string containing all tokens starting at the token start until the end of the list. |
String | getPOSTaggedContent() Returns the uncleaned content with POS tags in the form word1/pos1 word2/pos2 ... |
String | getRawContent() Returns the uncleaned content, i.e., as originally retrieved, of this document represented as a string. |
List<Token> | getTokensStartingAtToken(Token start, boolean ignorePunctuation) Returns a list containing all successive tokens from this document starting at the given start token. |
List<Token> | getTokensStartingAtToken(Token start, int numberOfTokens, boolean ignorePunctuation) Returns a list containing numberOfTokens successive tokens from this document starting at the given start token. |
static void | main(String[] args) |
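
The counting rule behind ignorePunctuation in the getTokensStartingAtToken overloads can be illustrated with a small self-contained sketch. It does not use the DL-Learner classes: Tok, its punctuation flag, and collect are hypothetical stand-ins. The sketch also assumes that the start token is located by identity, that every token is counted when ignorePunctuation is false, and that collection stops as soon as the requested count is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenWindowSketch {

    /** Hypothetical stand-in for the library's Token class. */
    static final class Tok {
        final String text;
        final boolean punctuation;

        Tok(String text, boolean punctuation) {
            this.text = text;
            this.punctuation = punctuation;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    /**
     * Collects tokens from doc beginning at start. With ignorePunctuation set,
     * punctuation tokens are added to the result but do not count towards
     * numberOfTokens (assumed semantics, see the lead-in).
     */
    static List<Tok> collect(List<Tok> doc, Tok start, int numberOfTokens,
                             boolean ignorePunctuation) {
        List<Tok> result = new ArrayList<>();
        int counted = 0;
        boolean started = false;
        for (Tok t : doc) {
            if (!started && t == start) {
                started = true; // identity match, an assumption of this sketch
            }
            if (!started) {
                continue;
            }
            if (counted >= numberOfTokens) {
                break; // requested number of counted tokens reached
            }
            result.add(t);
            if (!(ignorePunctuation && t.punctuation)) {
                counted++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Tok hello = new Tok("Hello", false);
        List<Tok> doc = Arrays.asList(
                hello,
                new Tok(",", true),
                new Tok("world", false),
                new Tok("!", true),
                new Tok("Bye", false));

        // Comma is returned but not counted: three tokens (Hello , world).
        System.out.println(collect(doc, hello, 2, true));
        // Every token is counted: two tokens (Hello ,).
        System.out.println(collect(doc, hello, 2, false));
    }
}
```

Under these assumptions, the overload without numberOfTokens corresponds to collecting until the end of the document.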
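Methods inherited from class java.util.LinkedList
add, add, addAll, addAll, addFirst, addLast, clear, clone, contains, descendingIterator, element, get, getFirst, getLast, indexOf, lastIndexOf, listIterator, offer, offerFirst, offerLast, peek, peekFirst, peekLast, poll, pollFirst, pollLast, pop, push, remove, remove, remove, removeFirst, removeFirstOccurrence, removeLast, removeLastOccurrence, set, size, spliterator, toArray, toArray
Methods inherited from class java.util.AbstractSequentialList
iterator
Methods inherited from class java.util.AbstractList
equals, hashCode, listIterator, subList
Methods inherited from class java.util.AbstractCollection
containsAll, isEmpty, removeAll, retainAll, toString
Methods inherited from interface java.util.List
containsAll, equals, hashCode, isEmpty, iterator, listIterator, removeAll, replaceAll, retainAll, sort, subList
Methods inherited from interface java.util.Collection
parallelStream, removeIf, stream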
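
The behaviour summarized for getContentStartingAtToken, a string built from the start token to the end of the list using the surface form for a given level, can be sketched as follows. This is only an illustration under assumptions: the constants of SurfaceFormLevel are not listed on this page, so the Level enum, the Tok class, and the space-separated joining are hypothetical choices, not the library's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

class ContentFromTokenSketch {

    /** Hypothetical levels; the real SurfaceFormLevel constants may differ. */
    enum Level { RAW, STEMMED }

    /** Hypothetical stand-in for Token, holding one surface form per level. */
    static final class Tok {
        final String raw;
        final String stemmed;

        Tok(String raw, String stemmed) {
            this.raw = raw;
            this.stemmed = stemmed;
        }

        String form(Level level) {
            return level == Level.STEMMED ? stemmed : raw;
        }
    }

    /** Joins the chosen surface forms from start (inclusive) to the end of doc. */
    static String contentFrom(List<Tok> doc, Tok start, Level level) {
        StringJoiner joiner = new StringJoiner(" "); // separator is an assumption
        boolean started = false;
        for (Tok t : doc) {
            if (t == start) {
                started = true; // start is the first token in the result
            }
            if (started) {
                joiner.add(t.form(level));
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Tok cats = new Tok("cats", "cat");
        List<Tok> doc = Arrays.asList(
                new Tok("The", "the"),
                cats,
                new Tok("were", "were"),
                new Tok("running", "run"));

        System.out.println(contentFrom(doc, cats, Level.RAW));     // cats were running
        System.out.println(contentFrom(doc, cats, Level.STEMMED)); // cat were run
    }
}
```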
public TextDocument()
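public String getContent()
Description copied from interface: Document
Returns the cleaned content of this document represented as a string. The uncleaned content is available via Document.getRawContent(). Methods for retrieving more specialized content formats might be implemented by the actual implementations.
Specified by: getContent in interface Document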
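public String getRawContent()
Description copied from interface: Document
Returns the uncleaned content, i.e., as originally retrieved, of this document represented as a string.
Specified by: getRawContent in interface Document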
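public String getPOSTaggedContent()
Description copied from interface: Document
Returns the uncleaned content with POS tags in the form word1/pos1 word2/pos2 ...
Specified by: getPOSTaggedContent in interface Document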
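
The word1/pos1 word2/pos2 layout returned by getPOSTaggedContent can be pictured with a short sketch. It is not the library's code: TaggedTok and posTagged are hypothetical names, the Penn Treebank style tags are sample data, and a single space between pairs is an assumption based on the format shown in the description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

class PosTaggedSketch {

    /** Hypothetical token carrying a word and its part-of-speech tag. */
    static final class TaggedTok {
        final String word;
        final String pos;

        TaggedTok(String word, String pos) {
            this.word = word;
            this.pos = pos;
        }
    }

    /** Renders tokens as word/pos pairs separated by single spaces. */
    static String posTagged(List<TaggedTok> tokens) {
        StringJoiner joiner = new StringJoiner(" ");
        for (TaggedTok t : tokens) {
            joiner.add(t.word + "/" + t.pos);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<TaggedTok> tokens = Arrays.asList(
                new TaggedTok("The", "DT"),
                new TaggedTok("cat", "NN"),
                new TaggedTok("sleeps", "VBZ"));

        System.out.println(posTagged(tokens)); // The/DT cat/NN sleeps/VBZ
    }
}
```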
public String getContentStartingAtToken(Token start, SurfaceFormLevel l)
Returns a string containing all tokens starting at the token start until the end of the list. The surface forms according to the level l are used to build the string.
Parameters:
start - token to start building the string at, i.e., the first token in the returned string
l - level of surface forms to use
public List<Token> getTokensStartingAtToken(Token start, int numberOfTokens, boolean ignorePunctuation)
Returns a list containing numberOfTokens successive tokens from this document starting at the given start token. If ignorePunctuation is set, tokens which represent punctuation are added to the result but not counted for the number of tokens.
Parameters:
start - token to start collecting tokens from the document
numberOfTokens - number of tokens to collect from the document
ignorePunctuation - if true, punctuation tokens are not counted towards the number of tokens to return
public List<Token> getTokensStartingAtToken(Token start, boolean ignorePunctuation)
Returns a list containing all successive tokens from this document starting at the given start token. If ignorePunctuation is set, tokens which represent punctuation are added to the result but not counted for the number of tokens.
Parameters:
start - token to start collecting tokens from the document
ignorePunctuation - if true, punctuation tokens are not counted towards the number of tokens to return
DL-Learner is licensed under the terms of the GNU General Public License.
Copyright © 2007-2019 Jens Lehmann