Using 'external' word2vec models

kayman Member Posts: 662 Unicorn
edited August 2019 in Help
Hi there, I might be overlooking something, but when I tried to use some word2vec models generated in Colab I got the error message below:

Message: Expected a space in the first line of file '/data/models/word2vec/tryout': 'タcgensim.models.word2vec'

Does this mean we cannot really use word2vec models generated outside of RapidMiner, even if they are generated 'according to the rules'?
Or is there a way to bypass these errors?

In the end I could regenerate the model in RapidMiner as well, but that is less friendly on server resources and limits sharing across different apps, so I'd prefer to create the model once and share it many times.
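
The error message itself hints at what is in the file. The original word2vec format starts with a plain header line holding the vocabulary size and the vector size separated by a space, whereas the first line reported above begins with a non-text byte followed by 'cgensim.models.word2vec', which is what a Python pickle of a gensim object looks like. A minimal sketch for checking which of the two a file resembles, using only the Python standard library (the function name sniff_model_format is made up for illustration, and the check is a heuristic, not a validator):

```python
import re

# Header line of the original word2vec C format: "<vocab_size> <vector_size>"
_W2V_HEADER = re.compile(rb"^\d+ \d+\r?$")


def sniff_model_format(path):
    """Guess whether a file looks like word2vec C format or a Python pickle."""
    with open(path, "rb") as fh:
        # Cap the read in case the file contains no newline at all.
        first_line = fh.readline(1024)
    if _W2V_HEADER.match(first_line.rstrip(b"\n")):
        return "word2vec"
    # Pickle protocol 2 and later start with the PROTO opcode, byte 0x80.
    if first_line[:1] == b"\x80":
        return "pickle"
    return "unknown"
```

A result of "pickle" would suggest the model was written with gensim's own save method rather than in the word2vec format.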

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,388 RM Data Scientist
    Hi @kayman,

    Did you try the read word2vec model operator on the file?

    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • kayman Member Posts: 662 Unicorn

    It is the read word2vec model operator that gives me the error.
    It doesn't work the other way around either. Creating a word2vec model in RapidMiner works fine within RM, but the model fails when I try to use it in a Python workflow with Gensim: I get an UnpicklingError stating 'invalid load key'.

    I guess they are not fully compatible after all?
  • kayman Member Posts: 662 Unicorn
    Hi @mschmitz,

    Gensim also uses binaries, but I went through some of the documentation, which states:

    The training algorithms were originally ported from the C package https://code.google.com/p/word2vec/ and extended with additional functionality and optimizations over the years.

    So those changes are probably what causes the incompatibility.

    Thanks anyway. I'll dig deeper into the documentation to see whether there is a way to get a more 'core' output format; otherwise I'll just work with duplicates.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,388 RM Data Scientist
    Hi @kayman,
    We will probably touch word2vec "soon". I'll add Gensim compatibility to the to-do list.
    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • kayman Member Posts: 662 Unicorn
    Thanks @mschmitz

    Anyway, some further digging into the gensim code taught me that the models are interchangeable after all, as long as you use the alternative load and save methods on KeyedVectors.

    For those interested, a RapidMiner-generated model can be loaded with gensim like this:
    myModel = gensim.models.KeyedVectors.load_word2vec_format(path_to_model_from_RapidMiner, binary=True)
    Saving a model from gensim for use with RapidMiner requires save_word2vec_format, rather than the 'gensim optimised' save procedures.
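
The two directions described above can be sketched as a pair of small helpers. This is only a sketch under a few assumptions: the helper names export_for_rapidminer and load_from_rapidminer are made up for illustration, the model is assumed to be a trained gensim Word2Vec object whose vectors live on model.wv, and binary=True is assumed for the export because that is what the load call above uses for RapidMiner files.

```python
def export_for_rapidminer(model, path, binary=True):
    """Write a trained gensim Word2Vec model in the plain word2vec format.

    Hypothetical helper. binary=True is an assumption, chosen to match
    the load call used for RapidMiner-generated files.
    """
    # model.wv is the KeyedVectors object holding the trained vectors.
    # save_word2vec_format() writes the word2vec layout, whereas
    # model.save() pickles the whole gensim object.
    model.wv.save_word2vec_format(path, binary=binary)


def load_from_rapidminer(path, binary=True):
    """Load a word2vec-format file (e.g. one written by RapidMiner) into gensim."""
    # Imported here so the export helper has no import-time dependency;
    # calling this function requires gensim to be installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=binary)
```

Note that the word2vec format stores only the word vectors, so an object loaded this way is a KeyedVectors instance that can be queried but not trained further.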