Foundations of Statistical Natural Language Processing
Christopher D. Manning
Hinrich Schütze
The MIT Press
Cambridge, Massachusetts
London, England
Second printing, 1999
© 1999 Massachusetts Institute of Technology
Second printing with corrections, 2000
All rights reserved. No part of this book may be reproduced in any form by any
electronic or mechanical means (including photocopying, recording, or
information storage and retrieval) without permission in writing from the publisher.
Typeset in 10/13 Lucida Bright by the authors using LaTeX2e.
Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Information
Manning, Christopher D.
Foundations of statistical natural language processing / Christopher D.
Manning, Hinrich Schütze.
p. cm.
Includes bibliographical references (p. ) and index.
ISBN 0-262-13360-1
1. Computational linguistics-Statistical methods. I. Schütze, Hinrich.
II. Title.
P98.5.S83M36 1999
99-21137
410’.285-dc21
CIP
Brief Contents
I Preliminaries 1
1 Introduction 3
2 Mathematical Foundations 39
3 Linguistic Essentials 81
4 Corpus-Based Work 117
II Words 149
5 Collocations 151
6 Statistical Inference: n-gram Models over Sparse Data 191
7 Word Sense Disambiguation 229
8 Lexical Acquisition 265
III Grammar 315
9 Markov Models 317
10 Part-of-Speech Tagging 341
11 Probabilistic Context Free Grammars 381
12 Probabilistic Parsing 407
IV Applications and Techniques 461
13 Statistical Alignment and Machine Translation 463
14 Clustering 495
15 Topics in Information Retrieval 529
16 Text Categorization 575
Contents
List of Tables xv
List of Figures xxi
Table of Notations xxv
Preface xxix
Road Map xxxv
I Preliminaries 1
1 Introduction 3
1.1 Rationalist and Empiricist Approaches to Language 4
1.2 Scientific Content 7
1.2.1 Questions that linguistics should answer 8
1.2.2 Non-categorical phenomena in language 11
1.2.3 Language and cognition as probabilistic phenomena 15
1.3 The Ambiguity of Language: Why NLP Is Difficult 17
1.4 Dirty Hands 19
1.4.1 Lexical resources 19
1.4.2 Word counts 20
1.4.3 Zipf’s laws 23
1.4.4 Collocations 29
1.4.5 Concordances 31
1.5 Further Reading 34
1.6 Exercises 35
2 Mathematical Foundations 39
2.1 Elementary Probability Theory 40
2.1.1 Probability spaces 40
2.1.2 Conditional probability and independence 42
2.1.3 Bayes’ theorem 43
2.1.4 Random variables 45
2.1.5 Expectation and variance 46
2.1.6 Notation 47
2.1.7 Joint and conditional distributions 48
2.1.8 Determining P 48
2.1.9 Standard distributions 50
2.1.10 Bayesian statistics 54
2.1.11 Exercises 59
2.2 Essential Information Theory 60
2.2.1 Entropy 61
2.2.2 Joint entropy and conditional entropy 63
2.2.3 Mutual information 66
2.2.4 The noisy channel model 68
2.2.5 Relative entropy or Kullback-Leibler divergence 72
2.2.6 The relation to language: Cross entropy 73
2.2.7 The entropy of English 76
2.2.8 Perplexity 78
2.2.9 Exercises 78
2.3 Further Reading 79
3 Linguistic Essentials 81
3.1 Parts of Speech and Morphology 81
3.1.1 Nouns and pronouns 83
3.1.2 Words that accompany nouns: Determiners and adjectives 87
3.1.3 Verbs 88
3.1.4 Other parts of speech 91
3.2 Phrase Structure 93
3.2.1 Phrase structure grammars 96
3.2.2 Dependency: Arguments and adjuncts 101
3.2.3 X’ theory 106
3.2.4 Phrase structure ambiguity 107