==============================
  Release Notes for LSA 0.67
       August 1, 2012
==============================


# # # # # # # # # # # # # # # # # # # # # # # # 
Open Issues
# # # # # #

Wishlist & Unresolved
---------------------

-- Robert Koblischke:
* svd(...,sparse=SVDLIBCInterface("/home/fwild/svdlibc"))

Here SVDLIBCInterface is a subclass of SparseSVDInterface that provides
methods for transforming the matrix, for computing the SVD, and for
transforming the SVD result into the R-LSA structure.
--- ///

* error handling for empty files (no term docs)?
* error handling for empty textvectors?
* bugfix necessary: textmatrix with controlled
  vocabulary fails if no term is left
  
* maybe GF*IDF boundaries instead of frequency 
  boundaries

* generalise architecture with text-processing chain

* input document sanitizing routines or at least a
  testing environment that can tell which files will
  produce errors.

* corpora package: global weights should always come
  from the original textmatrix, especially when folding
  in additional texts (see essay scoring example).
  
* corpora package: should decide automatically whether
  it is a textfile, a directory, or a string.

* normalized local weights! (Phillip Türtscher)

* integrate with tm package

* add phrase detection to textvector function ("my phrase"),
  should not strip any special chars and should stay
  case sensitive (Neal Snider, Stanford)

* rewrite fold_in 
    dqs = t(docvecs) %*% LSAspace$tk %*% diag(LSAspace$sk)
    dtm = LSAspace$tk %*% diag(LSAspace$sk) %*% t(dqs)
  with 
    crossprod(x,y) = t(x) %*% y
    tcrossprod(x,y) = x %*% t(y)

* replace 0 with . in the print routine

* check: are HTML special entities (e.g. ’) removed by the XML tag removal?
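
The fold_in rewrite proposed in the wishlist above can be checked on toy data. The following is a minimal sketch in base R, not the package's actual fold_in() code: docvecs and LSAspace are small random stand-ins (assumptions) that only mimic the shapes used in the note.

```r
set.seed(1)
# Toy stand-ins (assumptions): 5 terms, 3 documents, 2 latent dimensions.
docvecs  <- matrix(runif(15), nrow = 5, ncol = 3)
LSAspace <- list(tk = matrix(runif(10), nrow = 5, ncol = 2), sk = c(2, 1))

# Formulation as written in the wishlist item.
dqs_old <- t(docvecs) %*% LSAspace$tk %*% diag(LSAspace$sk)
dtm_old <- LSAspace$tk %*% diag(LSAspace$sk) %*% t(dqs_old)

# Same products written with crossprod() / tcrossprod().
dqs_new <- crossprod(docvecs, LSAspace$tk) %*% diag(LSAspace$sk)
dtm_new <- tcrossprod(LSAspace$tk %*% diag(LSAspace$sk), dqs_new)

# Both formulations should agree up to floating-point tolerance.
same <- isTRUE(all.equal(dqs_old, dqs_new)) &&
        isTRUE(all.equal(dtm_old, dtm_new))
```

crossprod() and tcrossprod() avoid forming the explicit transpose, which is why they are usually preferred for larger matrices.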


# # # # # # # # # # # # # # # # # # # # # # # # 
Changes
# # # #

Changes in 0.70 to 0.72
-----------------------

o   Fridolin Wild (2014-03-30)
     * fixed smaller warnings, including the binding problem of the
       textvector variables alnumx and specialchars; renamed HISTORY
       to ChangeLog; moved the tm dependency to Suggests


Changes in 0.69
---------------

o   Fridolin Wild (2014-03-28)
     * fixed remaining SnowballC dependencies
     * fixed documentation, namespace


Changes in 0.68
---------------

o   Fridolin Wild (2014-03-21)
     * fixed SnowballC dependency
     * added Arabic stopword list stopwords_ar


Changes in 0.67
---------------

o   Fridolin Wild (2012-08-01)
     * fixed some warnings for the package build


Changes in 0.66
---------------

o   Fridolin Wild (2012-07-23)
      * added textvector support for Vietnamese and Polish
        (thanks to Grażyna Paliwoda-Pękosz, Cracow University
        of Economics, for the Polish request, and to Hien Pham
        for the Vietnamese one)
      

Changes in 0.65
---------------

o   Fridolin Wild (2010-10-07)
       * new Dutch stopword list (thanks to Adriana Berlanga, Open University
         of the Netherlands, and Jan Hensgens, AURUS)


Changes in 0.64
---------------

o   Fridolin Wild (2009-09-14)
       * associate.R and cosine.R updated to deal with subsets
         (thanks to Yue Shan, National Cheng Kung University, Taiwan)

Changes in 0.63
---------------

o   Fridolin Wild (2009-09-04)
        * Rstem replaced with Snowball (in textmatrix, query),
          thanks to Kurt Hornik.
        * patched test routines to work on all OS

Changes in 0.62
---------------

o   Fridolin Wild (2009-05-28)
	* French stopwords added (thanks to Haykel Demnati, ISG Tunis)

Changes in 0.61
---------------

o   Fridolin Wild (2008-11-26)
	* phrase detection added (thanks to Eileen Hlavka,
	  PRGS, RAND Corporation and Neal Snider, Stanford).
	  So far, phrases can be provided as a character vector;
	  the textvector routine converts them into the
	  format word1_word2_word3 (and the like) and
	  replaces them in the texts. When phrases are
	  used, underscores are no longer stripped from
	  the texts!
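
The phrase replacement described above can be illustrated with a minimal sketch in base R. The helper join_phrases() is hypothetical and is not the package's textvector() implementation; it only shows the word1_word2 rewriting idea.

```r
# Hypothetical helper (assumption), not part of the lsa package:
# rewrite each phrase in the text as word1_word2_... so that it
# survives later tokenisation as a single term.
join_phrases <- function(txt, phrases) {
  for (p in phrases) {
    joined <- gsub(" ", "_", p, fixed = TRUE)   # "my phrase" -> "my_phrase"
    txt <- gsub(p, joined, txt, fixed = TRUE)   # literal, case-sensitive match
  }
  txt
}

res <- join_phrases("this is my phrase in a text", c("my phrase"))
# res is "this is my_phrase in a text"
```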

Changes in 0.60
---------------

o   Fridolin Wild (2008-09-04)
	* fixed bug in dimcalc_share() when using
	  extremely small dimensions (e.g. 2).

o   Fridolin Wild (2008-08-31)
	* fixed encoding problems on Windows

o   Kurt Hornik (2008-03-12)
    * T => TRUE, F => FALSE


Changes in 0.59
---------------

o   Fridolin Wild, Kurt Hornik (2007-12-18)
    * several patches for encoding problems

o   Fridolin Wild (2007-12-11)
	
	* bug fix (R crashed when calling the lsa_corpus
	  demo): the essay scoring demo now loads data
	  files to avoid this Unicode problem (seems
	  to be a bug in R).
	
	* stopword lists converted to .rda data files
	
	* unicode bugfix in tests
	
	* Unicode bugfix for German umlaut
	  conversion from HTML entities in textvector()
	  
	* demo index readlines bugfix (two blank lines
	  added)
	
	* landauer demo: X was using dimcalc_share()
	  instead of dimcalc_raw()

o   Fridolin Wild (2007-11-28)

	* Dutch stopword list added (thanks to Marco Kalz,
	  Open University Netherlands)
	
	* UTF-8 support enforced in stopword list, package
	  description, textmatrix
	  
	* stemming bug fixed (stemming was _after_
	  filtering by controlled vocabulary)
	
	* testing routine added for one-term matrices
	
	* special characters cleaned in textvector()
	
	* Optimised support for Arabic Buckwalter transliterations
	  (referring to the earlier request of Neal Snider,
	  Stanford, below). The following characters are now
	  treated as alphanumerics: ' $ | _ - ~ > < & { } * `
	  
	* UTF-8-conformant umlaut replacement in textmatrix()
	
	* added warning for 'empty' files (empty after filtering)
	  to textvector()
	

Changes in 0.58
---------------

o   Fridolin Wild (2006-08-01)

    * added simple <XML> tag handling: tags are automatically
      removed (requested by Simon Lin, Northwestern @ Feb 23, 2006)

    * added Arabic support for Buckwalter transliterations
      (requested by Neal Snider, Stanford @ Feb 21, 2006)

    * changed textmatrix() / textvector() standard language
      to English

    * textmatrix can now automatically remove terms with only
      numbers (requested by Simon Lin, Northwestern @ Feb 23, 2006)

    * extended special character stripping ('#', '+', ...)

    * added upper and lower boundaries for global frequencies

    * demo for essay scoring added
	
    * data set with essays (corpus.6) added


o   Fridolin Wild (2006-07-31)

    * added random sample function for corpus selection.
      The index can be returned to allow for re-use of the sample.

    * added dimcalc_fraction()

    * added support to textmatrix() to run not only over directories,
      but also over a single file or a vector of files (or a mixed
      vector with files and directories)

    * added maxWordLength filtering

    * added maxDocFreq filtering


o   Jeff Verhulst (2006-04-21)

    * bugfix: print.textmatrix()
      bug appeared: 2 Jan 2006 (Claudia Mayr)
      fix provided by: Jeff Verhulst, J&J Pharma R&D IM (2006)


Changes in 0.57 (first public release)
--------------------------------------

o   2005-11-23:

    * a lot of minor changes to make documentation better
	
    * smaller code changes
	
    * renamed core functions to lsa(), as.textmatrix(), fold_in()

o   2005-11-22:
    * chose NOT to integrate separator lines (would break the handling!)
    * changed summary.textmatrix from matrix to vector output

o   2005-11-12:

    * documentation refactored, added documentation for several new methods.
	
    * removed meanmax.R (doesn't fit the package)
	
    * checked query() to ensure it's working

o   2005-11-11:

    * bugfix of textmatrix() to work properly with the vocabulary list
    * textmatrix(): integrated the vocabulary order/sort functions...

o   2005-11-08:

    * added high-level functions:
	   
       * lsa_fold-in
	   
       * lsa
    
    * refactoring:

        * eliminated pseudo_docs
          -> integrate into textmatrix

        * connections for textmatrix turned out to be impossible

        * summary method

        * print method

        * rewrote "pseudo_docs" to table / factor

        * added vocabulary filter to textmatrix / textvector

        * in triples.r: use of "With(environment, { bla })"
          turned out to be impossible

        * getTriples: use of "return list(S=S, P=P, O=O)"
          turned out to be impossible
    

o   2005-10-04: added nchar(..., type="chars") to count characters, not bytes
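
The difference between counting characters and counting bytes matters for non-ASCII text. A minimal sketch in base R follows; the example word is an assumption chosen for illustration and is written with \u escapes so that it is stored as UTF-8 regardless of locale.

```r
# "Gruesse" spelled with u-umlaut and sharp s, written with Unicode
# escapes so the string is UTF-8 encoded independent of the locale.
x <- "Gr\u00fc\u00dfe"

n_chars <- nchar(x, type = "chars")  # counts characters: 5
n_bytes <- nchar(x, type = "bytes")  # counts UTF-8 bytes: 7
```

Length-based filters such as a minimum or maximum word length should therefore use the character count.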


Changes in 0.47
---------------

o   2005-08-26: 

    * renamed dt_triples to textvector and dt_matrix to textmatrix


Changes in 0.46
---------------

o   2005-08-25:

    * added "\\[|\\]|\\{|\\}" to gsub in textvector
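
As an illustration of what this pattern matches, here is a minimal sketch in base R. The empty replacement string and the sample input are assumptions; the changelog does not say what textvector substitutes for the matched characters.

```r
# The pattern matches any single square or curly bracket.
pattern <- "\\[|\\]|\\{|\\}"

# Assumption: brackets are simply deleted here for demonstration.
cleaned <- gsub(pattern, "", "a[b]{c}")
# cleaned is "abc"
```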

--------------------------------------------------