Language learning materials by Steve Pepper

This is a revision of Fairbanks, Gair and De Silva's "Colloquial Sinhalese", which was first published in 1968 and reprinted (with minor corrections) in 1981 and 1993. The book has been typeset from scratch based on a scan of the original typewritten publication. Minor typographical errors have been corrected (but new ones have doubtless been introduced). Additional material includes a comprehensive index of grammatical notes and a key to exercises (for the first 13 lessons).
The original PDF can be found on Library Genesis. This draft covers all 36 lessons using a romanization scheme that has been slightly amended from the original version. The Sinhala script is introduced incrementally from lesson 5 onwards.
This draft has been proofread by a native speaker up to Lesson 12. Updated versions will be made available as the proofreading proceeds. Please read the preface to the 2nd edition for some background to the present work and some caveats regarding the text.
A six-page summary of basic Sinhala grammar, by my Sinhalese alter-ego ගම්මිරිස්. Feedback and corrections most welcome.
Dissertations by Steve Pepper

This dissertation establishes ‘binominal lexeme’ as a comparative concept and discusses its cross-linguistic typology and semantics. Informally, a binominal lexeme is a noun-noun compound or functional equivalent; more precisely, it is a lexical item that consists primarily of two thing-morphs between which there exists an unstated semantic relation.
Examples of binominals include Mandarin Chinese 铁路 (tiělù) [iron road], French chemin de fer [way of iron] and Russian железная дорога (želez.naja doroga) [iron.ADJZ road]. All of these combine a word denoting ‘iron’ and a word denoting ‘road’ or ‘way’ to denote the meaning ‘railway’. In each case, the unstated semantic relation is one of COMPOSITION: a railway is conceptualized as a road that is composed, or made, of iron. However, three different morphosyntactic strategies are employed: compounding, prepositional phrase and relational adjective. In this study, I explore the range of such strategies used by a worldwide sample of languages to express a set of 100 meanings from various semantic domains, resulting in a classification consisting of nine different morphosyntactic types.
I also investigate the semantic relations found in the data and develop a classification called the Hatcher-Bourque system that operates at two levels of granularity, together with a tool for classifying binominals, the Bourquifier. The classification is extended to other subfields of language, including metonymy and lexical semantics, and beyond language to the domain of Topic Maps and knowledge representation, resulting in a proposal for a general model of associative relations called the PHAB model.
Among the other findings of the research are: universals concerning the recruitment of anchoring nominal modification strategies; a method for comparing non-binary typologies; the non-universality (despite its predominance) of compounding; and a scale of frequencies for semantic relations which may provide insights into the associative nature of human thought.

The study described in this master's dissertation is an investigation into the nature of nominal compounding in the Cameroonian language Nizaa (ISO 639-3 code sgi), based on data collected in the field by Professor Rolf Theil of the University of Oslo in the 1980s.
It shows that Nizaa occupies a unique position among those languages for which compounding has so far been investigated, in that it exhibits no clear preference for either right-headed or left-headed nominal compounds; rather it has both kinds of compounds in approximately equal measure.
A simple statistical analysis reveals some significant differences between the two kinds of compound, and this is confirmed by an analysis of the semantic relations between their constituents. On the basis of these relations it is shown that left-headed compounds correspond to adjectival noun phrases and right-headed compounds to possessives. Functional and cognitive explanations for these facts are proposed within the general framework of cognitive linguistics, and with particular reference to metaphor theory, construction grammar, grammaticalization theory and Cognitive Grammar, and predictions are made as to the likelihood of other, as yet unstudied, languages exhibiting the same “unusual” feature.

This study investigates cross-linguistic influence (‘transfer’) in Norwegian interlanguage using predictive data mining technology and with a focus on lexical transfer. The impetus for the present work came from the publication of a series of studies (Jarvis & Crossley 2012) that explore the ‘detection-based approach’ to language transfer.
The following research questions are addressed:
1. Can data mining techniques be used to identify the L1 background of Norwegian language learners on the basis of their use of lexical features of the target language?
2. If so, what are the best predictors of L1 background?
3. And can those predictors be traced to cross-linguistic influence?
The study utilizes data from Norsk andrespråkskorpus (ASK), the Norwegian Second Language Corpus housed at the University of Bergen, and draws on resources from the ASKeladden project. The source data consists of texts written by 1,736 second language learners of Norwegian from ten different L1 backgrounds, and a control corpus of 200 texts written by native speakers. Word frequencies computed from this data are analysed using multivariate statistical methods that include analysis of variance and linear discriminant analysis, and the results are subjected to contrastive analysis.
The combination of discriminant analysis and contrastive analysis produces all three types of evidence called for by Jarvis (2000) in his methodological requirements for language transfer research: intragroup homogeneity, intergroup heterogeneity and cross-language congruity. Well-known transfer effects, such as the tendency for Russian learners to omit indefinite articles, are confirmed, and other, more subtle patterns of learner language are revealed, such as the tendency amongst Dutch learners to overuse the modal verb skal to a far greater extent than other learners. In addition to confirming the reality of lexical transfer, these results provide abundant material for further research, while the methodology employed can be harnessed in many areas of linguistic research.
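As a rough illustration of the detection-based idea described above, the following Python sketch predicts a learner group from the relative frequencies of a few function words. It is only a toy under stated assumptions: it substitutes a simple nearest-centroid rule for the linear discriminant analysis used in the study, and the sentences, the group labels (`group_x`, `group_y`) and all function names are invented for illustration and are not taken from ASK.

```python
from collections import Counter
import math

# Hypothetical feature set: two Norwegian function words mentioned in the abstract.
VOCABULARY = ["en", "skal"]

# Invented toy texts (NOT corpus data); labels are placeholder learner groups.
TOY_TEXTS = [
    ("group_x", "jeg skal kjøpe en bil og jeg skal reise"),
    ("group_x", "vi skal spise en middag"),
    ("group_y", "jeg kjøper bil og reiser til byen"),
    ("group_y", "vi spiser middag hjemme"),
]


def relative_frequencies(text, vocabulary):
    """Return the relative frequency of each vocabulary word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in vocabulary]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(labelled_texts, vocabulary):
    """Compute one mean frequency vector (centroid) per group label."""
    by_label = {}
    for label, text in labelled_texts:
        by_label.setdefault(label, []).append(
            relative_frequencies(text, vocabulary)
        )
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(text, centroids, vocabulary):
    """Assign the text to the group whose centroid is nearest (Euclidean)."""
    vector = relative_frequencies(text, vocabulary)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


CENTROIDS = train(TOY_TEXTS, VOCABULARY)
print(predict("hun skal lese en bok", CENTROIDS, VOCABULARY))
print(predict("hun leser bok", CENTROIDS, VOCABULARY))
```

In a real study the features with the greatest discriminating power would then be inspected one by one and compared against the learners' L1, which is the contrastive-analysis step the abstract describes.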
Apps by Steve Pepper
An Excel-based tool for the computer-assisted analysis of semantic relations in noun-noun compounds and other binominal lexemes, based on the Hatcher-Bourque classification (Pepper 2020; 2022).
Papers (Linguistics) by Steve Pepper

Binominal lexemes in cross-linguistic perspective, 2023
A key feature of binominal lexemes is the unstated (or underspecified) relation, ℜ, that pertains between the two major constituents. The nature of ℜ (the kinds of relations) has been the topic of considerable research during recent decades. While early studies focused almost exclusively on English, the last few years have seen a spate of work on other languages. Unfortunately, this work has been uncoordinated and each researcher entering the field has tended to devise their own classification, making it difficult to compare results and advance our understanding of the phenomenon. This is a pity, because such an understanding has the potential to provide insights into the nature of concept combination and the associative character of human thought. The purpose of this chapter is to present a well-documented, systematic classification of semantic relations that operates at multiple levels of granularity and is suitable for reuse across languages.

Binominal lexemes in cross-linguistic perspective, 2023
This chapter starts by demonstrating the need for the comparative concept 'binominal lexeme' in order to cover both 'noun-noun compounds' and their 'functional equivalents' (§1). To complement this informal definition, four different, but compatible definitions of binominal lexeme are developed: functional, onomasiological, formal and typological (§2). Although couched in a variety of terms based on different theoretical frameworks, these have essentially identical extensions. In §3 a nine-way classification of binominal strategies is presented, together with the mnemonics used throughout this volume: jxt, cmp, der, cls; prp, gen, adj, con, and dbl. These nine types are represented on a two-dimensional grid that captures the number of markers, the locus of marking and the degree of fusion. The grid reveals two lacunae or "missing types": prn and nml. Whereas the first of these probably exists somewhere in the world's languages, the second seems to be a logical impossibility. §4 discusses types that are intermediate between the nine main types and the grammaticalization pathways that produce them. It then goes on to examine the relationship between binominal constructions and adnominal possessives, and introduces a new methodology, based on the Pwav scale, for comparing two non-binary constructions. This leads to the formulation of two Greenbergian universals concerning binominals and nominal modification.

Binominal lexemes in cross-linguistic perspective, 2023
Concept-naming is one of the most fundamental activities performed by speakers, who need either ready-made labels to talk about entities or devices to build new labels (be they rules or processes, schemas or analogical mechanisms). Knowing how languages perform the basic function of creating labels to name concepts, especially complex concepts, is crucial to understanding their creative potential in building new (potentially stable) categories and, more generally, to understanding how they (may) categorize reality, and refer to it. What are the strategies employed by languages for naming complex concepts? How do they differ cross-linguistically, and what are the limits of their variation? Are there strategies that are more widespread than others, or even universal? These are questions for lexical typology and/or word-formation typology, but what we know about the typology of complex concept naming is very limited compared to what we know about domains like word order or inflectional morphology. There may be different reasons behind this state of affairs. Analysing all of them falls outside the scope of the present introduction: we will just discuss some factors that we deem relevant for our current purposes.

The Mental Lexicon, 2020
There have been many attempts at classifying the semantic modification relations (ℜ) of N + N compounds but this work has not led to the acceptance of a definitive scheme, so that devising a reusable classification is a worthwhile aim. The scope of this undertaking is extended to other binominal lexemes, i.e. units that contain two thing-morphemes without explicitly stating ℜ, like prepositional units, N + relational adjective units, etc. The 25-relation taxonomy of Bourque (2014) was tested against over 15,000 binominal lexemes from 106 languages and extended to a 29-relation scheme ("Bourque2") through the introduction of two new reversible relations. Bourque2 is then mapped onto Hatcher's (1960) four-relation scheme (extended by the addition of a fifth relation, Similarity, as "Hatcher2"). This results in a two-tier system usable at different degrees of granularity. On account of its semantic proximity to compounding, metonymy is then taken into account, following Janda's (2011) suggestion that it plays a role in word formation; Peirsman and Geeraerts' (2006) inventory of 23 metonymic patterns is mapped onto Bourque2, confirming the identity of metonymic and binominal modification relations. Finally, Blank's (2003) and Koch's (2001) work on lexical semantics justifies the addition to the scheme of a third, superordinate level which comprises the three Aristotelian principles of similarity, contiguity and contrast.
Word-formation in the languages of the world
The Cameroonian language Nizaa (ISO 639 code ‘sgi’) is unusual in that it has both left-headed and right-headed nominal compounds in approximately equal measure. Furthermore, there is no evidence to suggest that this state of affairs can be attributed to language contact or diachronic word order changes, as is the case with other languages that exhibit this feature. An investigation into the semantics of the two compound types in Nizaa prompts us to revise and refine two of the major achievements of the Morbo/Comp project conducted at the University of Bologna (Guevara et al. 2006), namely, the “Canonical Head Position hypothesis” (Scalise & Fábregas 2010) and the basic tripartite classification of compounds that represents the current state of the art (Bisetto & Scalise 2005; 2009).
Replication data for: Windmills, Nizaa and the typology of binominal compounds
Word-formation in the Languages of the World, 2016
This data set consists of 500+ nominal compounds from the African language Nizaa (sgi; Niger-Congo, Cameroon). It is based on an unpublished word list collected by Rolf Theil (genannt Endresen) of the University of Oslo in the 1980s. Each compound and its constituents are glossed and annotated for word class, and 201 transparent noun-noun compounds are annotated for head position and semantic relation. The data set was originally prepared for the author's 2010 MA dissertation, which sought to explain the presence of both left-headed and right-headed nominal compounds in Nizaa. It was updated and revised in conjunction with the publication of his 2016 article Windmills, Nizaa and the typology of binominal compounds.
dictionaria/sidaama: Sidaama dictionary
Yri, Kjell Magne and Pepper, Steve. 2019. Sidaama dictionary. Dictionaria 6. 1-3578

Language Documentation and Description 9, 199-218, 2011
What is the role of ontologies in language documentation theory and practice? This paper clarifies the meaning of the term ‘ontology’ in the context of information management and the Web, and emphasizes the importance of distinguishing between knowledge representation and knowledge organization. It then examines how the term ‘ontology’ has been applied in the field of linguistics, focusing on a particular kind of ontology that is regarded as especially relevant in the context of language documentation. The General Ontology for Linguistic Description (GOLD) is presented in some detail, along with criticisms that have been raised against it. Finally it is suggested that the discipline of language documentation has more need for a knowledge organization system, and a shared thesaurus, than for an ontology-based knowledge representation system.
Maximizing classification accuracy in native language identification
Presentations (Linguistics) by Steve Pepper

The nature of the semantic relation ℜ has been the subject of considerable research, often with each new researcher reinventing the classificatory wheel (see Hacken 2016 for a recent summary). We focus on two classification schemes of the “reductionist” type (Søgaard 2005) which operate at different levels of granularity: Hatcher’s (1960) system of four logical relations and Bourque’s (2014) 25-way empirically-derived classification. Following Arnaud (2016), we show how these two systems can be mapped together into a two-tiered system (the “Hatcher-Bourque classification”). We argue that this resolves the dispute regarding the number of relations involved. That number depends on the requirements of the analysis, and the degree of granularity can range from one (as suggested by Bauer 1979) to unlimited (as opined by Jespersen 1942). Our resulting two-tiered system has been tested against a database of over 3,700 noun-noun compounds and their functional equivalents from 106 languages.
A short presentation of the comparative concept 'binominal lexemes' with a focus on lexico-constructional patterns.
In this talk given at the Typology and Universals in Word-Formation IV conference (Košice, 2018) I take issue with recent changes to the classification of onomasiological types. I argue that the changes, introduced in Körtvélyessy et al. (2015) and Štekauer (2016), are based on inconsistent criteria and destroy backwards compatibility unnecessarily. I propose an alternative classification and suggest ways in which it might be further extended.
A preliminary attempt to create a formal taxonomy of binominal lexemes. Talk given at the Department of Linguistics, Stockholm University, 15 March 2018.
An early presentation on semantic relations in binominal lexemes, given at SLE 2017 in Zürich, Switzerland.
defining and assigning unique global identifiers for arbitrary subjects on the World Wide Web in order to solve the problem of information overload. It presents the case for Published Subjects and published subject indicators (PSIs) being the best solution to this problem, and briefly characterizes the strengths and weaknesses of alternative approaches. It ends with a call to action.
NOTE: This paper has been largely superseded by The TAO of Topic Maps.
Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Texas: SIL International. Online version: http://archive.ethnologue.com/16/.
Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info.)
This project is a typological study of the mechanisms available in the languages of the world for naming complex concepts that involve an (unspecified) relation between two entities. The prototypical mechanism serving this function is noun-noun compounding, as in Eng. rail.way [N.N], but there are many others, including prepositional phrases [N PREP N] (Fr. chemin de fer ‘railway’), relational adjectives [N.ADJZ N] (Rus. želez.naja doroga ‘railway’) and more. Using an onomasiological approach and data originally collected for the World Loanword Database, this project seeks to chart the diversity of formal and semantic mechanisms exhibited by such “binominal lexemes”, and investigate their correlations with genetic, areal and typological features of language. The longer term interest of the project is in understanding how this aspect of word-formation reflects the associative nature of human thought.