Tuts 4 You

Looking for Regular Expression lib


LCF-AT


Hello,

At the moment I'm trying to learn how to work with RegEx syntax, and I have to say it's fun and seems very useful. I also like the online site regex101 for testing and checking patterns and for getting results and info in real time. (I would like an offline tool similar to regex101 [not Notepad++].) :)

Now my question is whether there are already any RegEx libs for MASM that I could use in my own apps when I need to check text sources for something. With RegEx that would be much better than searching manually with small, limited patterns. Ideally it would be a lib with functions that accept the same pattern syntax. Does anyone know of something recent?

greetz


I'm not sure about libraries for MASM, but I found RegexBuddy invaluable when I was learning regex, and I still use it for testing, creating and debugging complex regexes, which I've used for grabbing certain content from HTML. It's not free, but I would recommend it.

Edited by NOP

Hi,

Thanks for the link. I found some other free offline tools to test and play with, like The Regex Coach and .NET Regular Expression Designer. I also found a lib that should work with MASM, "transcribed from regex.h by _Servil_", but I get strange linker messages and failed to use it. :(

regex.lib(39243.obj) : warning LNK4229: invalid directive '/comment:Intel(R) C++ Compiler for 32-bit applications, Version 7.1 Build 20030307Z' encountered; ignored
....

I can't get past this, and it also wants many other C libs. So I'm still looking for another lib I could use.

PS: One more question about RegEx. I see many tools and add-ons that use it and let you enter your own RegEx strings to search with. My question is: how does the engine read, for example, the page source of a website?

Let's say I use a RegEx string that starts with the ^ sign, which means the match has to begin at the start of a line, like (^https?:\/\/). As a result I get all links found on a page that start with http:// or https://. If I check the page source manually, I find them as href="Link". Now let's say I want to do it manually and read the page source into a buffer: do I then need to format it somehow before I pass it to a RegEx function to get the same results? So far I couldn't test it, since I have no working RegEx lib and functions.

greetz


When your app "grabs" a webpage, the returned result will be HTML. It is up to you to create a regex that grabs only the result(s) you expect, based on your needs. HTML links rarely start on a new line, and you should not rely on any sort of formatting. HTML follows standards, so certain things always appear in standard tags and attributes. Using () to capture just the part you want, you could do something like this to return links...

href="(.+?)"
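
A minimal sketch of how that pattern could be applied to a page source that is already in a buffer. The sample HTML and the variable names are made up for illustration; JavaScript is used only because the other code in this thread is JavaScript, and the same pattern should behave the same way in a PCRE-style engine.

```javascript
// Sample page source; the markup is made up for illustration.
const html = `
<p>Some text <a href="https://example.com/a.zip">file</a> and
<a class="x" href="http://example.com/folder/index.php?id=1">page</a></p>
`;

// Same pattern as above, with the "g" flag so that every match is returned.
const hrefPattern = /href="(.+?)"/g;

// matchAll yields one match per link; index 1 holds the captured group.
const links = Array.from(html.matchAll(hrefPattern), (m) => m[1]);

console.log(links);
```

Note that this only catches double-quoted href values; single-quoted or unquoted attributes would need a broader pattern.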

 


Hi again,

I'm just asking to be safe and to know how to deal with it. For example, last week I tested a FF add-on called Download Them All, and I also downloaded the plugin container add-on for DTA, which allows you to write your own plugins for any site using RegEx patterns. I wasn't happy with it because I had trouble creating my own plugins and always got format errors, but I still wanted that custom RegEx feature in some add-on, and after looking further I found another one called ImageHost Grabber. I downloaded it and used it with an older FF version from before Quantum, and it worked for a while until FF disabled it because of signature issues. Setting xpinstall.signatures.required to false didn't help either, and I needed to go back to the even older FF version 44 to keep it enabled. Then I looked into the add-on and its host XML file, where I found many sites. In that file there is a field for the URL pattern itself and a field for the RegEx search pattern.

Example of one:

<host id="4chan">
<urlpattern>^https?:\/\/.+\.4chan\.org\/.+\/src\/.+</urlpattern>
<searchpattern><![CDATA[function(pageData, pageUrl) {
	return {imgUrl: pageUrl, status: "OK"};
	}]]></searchpattern>
</host>

As you can see, it checks the site itself against the URL RegEx pattern, and whatever matches gets called and downloaded if the status is OK. All clear so far. My question now is how it downloads / checks the page source of this site to compare it with the URL pattern. The URL pattern above uses the ^ sign, which means it only matches at the start of a line with nothing before it. If I view the page source in Firefox, I can see all the matching links, but of course they don't start at the beginning of a line; they are just somewhere inside it. The page source looks the same if I download it manually into a buffer, so if I used the pattern above with the ^ sign on it, it would match nothing. Clear so far, right? Back to my question: if I see it right, the add-on seems to use some PHP function to download the page source, or maybe to go through the entire page source URL by URL (I don't use PHP and don't know it very well), to get the stripped URL addresses out.

So should I do it the same way and first format the page source somehow so that only the URLs are left? I don't use PHP and don't know any function or format option yet that gets only the URLs out of an entire page source, so maybe I should write my own function for this.

There is also another add-on I found called LinkGopher, which can extract all links from any site and shows them line by line. So I'm just asking what makes more sense.

greetz


How about the PCRE lib for MASM32? Google will get you a few hits. I haven't tried the lib myself, however.

 

Offtopic:

1 hour ago, LCF-AT said:

So I'm just asking what makes more sense.

Not reinventing the wheel makes more sense. But you always prefer to do it your own way.. ;)


Hi,

I already use this lib for testing but have some trouble with it. First, after some testing I see a problem with the extra parameter for the RegEx flags (the part after the pattern in /***/flags, such as global or multiline). In the include file PCRE81S.inc I found a value called PCRE_MULTILINE (equ 00000002h) and tried to use it with the pcre_compile function, but it doesn't return all the results I get on the regex101 page using PCRE. I also don't understand why there are so many flags I can set with that function; I thought I could control that only, or mainly, through the RegEx pattern string itself. Now it's more complex again and no longer handled the same way as in the online tool.

In the PCRE documentation I can read this about modifiers:

https://www.pcre.org/original/doc/html/pcretest.html

Perl-compatible modifiers

The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre[16|32]_compile() is called. These four modifier letters have the same effect as they do in Perl.

I don't see any global modifier here that I could use. Why? For my case I would need /gm. Is the global flag not supported in this version? I don't understand that yet.
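
As far as the PCRE documentation describes it, "global" is not a compile option at all: pcre_exec() returns one match per call, and tools that offer /g get it by calling the match function again, starting where the previous match ended. The sketch below shows that loop idea in JavaScript (findAll is a hypothetical helper, not a PCRE function). With the PCRE lib the equivalent would presumably be a loop around pcre_exec() that passes the end offset of the last match (ovector[1]) as the next start offset, which is an assumption not tested against the MASM32 build mentioned here.

```javascript
// Hypothetical helper: collects every match of `pattern` in `subject`.
function findAll(pattern, flags, subject) {
  // JavaScript needs the "g" flag so that exec() resumes from lastIndex;
  // the loop below is what actually makes the search "global".
  const re = new RegExp(pattern, flags.includes("g") ? flags : flags + "g");
  const results = [];
  let match;
  while ((match = re.exec(subject)) !== null) {
    results.push({ text: match[0], offset: match.index });
    // An empty match would never advance, so step forward manually.
    if (match[0].length === 0) re.lastIndex++;
  }
  return results;
}

const source = "http://a.example/\nsome text\nhttps://b.example/\n";

// Without multiline, ^ only matches at the very start of the buffer.
const withoutM = findAll("^https?://", "", source);
// With multiline, ^ also matches after every line break.
const withM = findAll("^https?://", "m", source);

console.log(withoutM.length, withM.length);
```

The multiline flag only changes where ^ and $ may match; finding more than one match is always a matter of searching again from the end of the previous one.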

PS: "Not reinventing the wheel makes more sense" - :drive1: Maybe you're right, but as you already know, I mostly have trouble understanding this stuff. Don't ask me why, I don't know either. I think I just think differently than others. Not sure whether that's better or not. ;)

greetz

On 2/19/2018 at 6:18 PM, LCF-AT said:

My question now is how it downloads / checks the page source of this site to compare it with the URL pattern. The URL pattern above uses the ^ sign, which means it only matches at the start of a line with nothing before it. If I view the page source in Firefox, I can see all the matching links, but of course they don't start at the beginning of a line; they are just somewhere inside it. The page source looks the same if I download it manually into a buffer, so if I used the pattern above with the ^ sign on it, it would match nothing. Clear so far, right? Back to my question: if I see it right, the add-on seems to use some PHP function to download the page source, or maybe to go through the entire page source URL by URL (I don't use PHP and don't know it very well), to get the stripped URL addresses out.
 

Browser plugins use JavaScript, which can set up handlers and extract the links directly from the document, since the page source has already been processed by your browser.

One of the plugins you mention above uses this JavaScript function to extract all links...

function extractLinks() {
  const links = [];

  for (let index = 0; index < document.links.length; index++) {
    links.push(decodeURI(document.links[index].href));
  }

  return links.length ? links : null;
};

As you can see, the links are read from the href attributes of the document's links and saved to the links array.

You are getting confused by the different uses of regex. It sounds like you want to work directly with the HTML source and extract the links from it. For that you would use a regex like the one I explained above, without ^, to grab all the links.
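
To illustrate the difference: a pattern with ^ works when it is tested against one URL string at a time, but not against raw HTML. This sketch assumes the add-on tests its urlpattern against individual link URLs, which is not confirmed anywhere in this thread; the sample links are hypothetical.

```javascript
// URL pattern copied from the 4chan host entry quoted earlier in the thread.
const urlPattern = /^https?:\/\/.+\.4chan\.org\/.+\/src\/.+/;

// Hypothetical list, standing in for what document.links would provide.
const pageLinks = [
  "https://i.4chan.org/b/src/12345.jpg",
  "http://example.com/testfile.zip",
];

// Each URL is its own string, so ^ anchors at the start of that URL.
const matching = pageLinks.filter((url) => urlPattern.test(url));

// The same link inside raw HTML does not start the string, so ^ fails.
const rawHtml = '<a href="https://i.4chan.org/b/src/12345.jpg">img</a>';
const matchesRawHtml = urlPattern.test(rawHtml);

console.log(matching, matchesRawHtml);
```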

I'll try to explain, with an example, when you would use ^ in your case.

If you were working with a list of URLs, either grabbed from HTML or from a text file etc., then you may have links like this with a query string / folders / files...

http://example.com/redirect?to=http://example2.com
http://example.com/testfile.zip
http://example.com/folder/index.php?http://example3.com

If, for example, you wanted to extract the base URL without any extra path / filename / query string, then you might run a regex that matches text starting with http:// and ending with /. But if you do not use ^ to indicate that you only want to match from the start, it can also match unwanted URLs further along in the string.
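
A small sketch of that effect, using the three sample URLs above. The pattern here stops at the first / or ? instead of requiring a trailing slash; that is an assumption made so that the difference shows up with these particular URLs, and the variable names are made up.

```javascript
const urls = [
  "http://example.com/redirect?to=http://example2.com",
  "http://example.com/testfile.zip",
  "http://example.com/folder/index.php?http://example3.com",
];

// Anchored: only the host at the very start of each string can match.
const anchored = /^http:\/\/[^\/?]+/;
// Unanchored and global: any http:// inside the string can match too.
const unanchored = /http:\/\/[^\/?]+/g;

const anchoredHits = urls.map((url) => url.match(anchored)[0]);
const unanchoredHits = urls.flatMap((url) => url.match(unanchored));

console.log(anchoredHits);
console.log(unanchoredHits);
```

With ^ every entry yields only its own base URL; without it, the URLs embedded in the query strings (example2.com and example3.com) are returned as well.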

Quote

So should I do it the same way and first format the page source somehow so that only the URLs are left? I don't use PHP and don't know any function or format option yet that gets only the URLs out of an entire page source, so maybe I should write my own function for this.

You could write your own browser plugin if you know JavaScript, or edit an existing one for your own needs (license permitting ;) ), or write a program to grab the HTML and regex the content to extract what you need. There shouldn't be any need to write your own grabber from scratch; there are plenty of libraries to deal with URLs, server responses, grabbing HTML etc. I'm not sure about MASM, but there is Indy for Delphi, for example.

:)

 


Hi again,

Yes, I already thought something like that: they extract the links by some other means before using RegEx. In my case I just have the entire source without separately extracted links, so I can't use ^ on this raw source. My question is whether it makes sense to extract the links the same way first or not. In some HTML sources the links don't always use the entire http path, so you can't see the full address in those parts of the source.
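
Links without the full http path are relative links, and they have to be resolved against the address of the page they came from. A minimal sketch using the standard URL class available in browsers and Node.js; the page address and href values are hypothetical. In a native Windows app something like WinINet's InternetCombineUrl might serve the same purpose, though that is untested here.

```javascript
// Hypothetical address of the page the HTML was downloaded from.
const pageUrl = "http://example.com/folder/index.php";

// href values as they might appear in the raw source.
const hrefs = [
  "testfile.zip",            // relative to the current folder
  "/img/pic.jpg",            // relative to the site root
  "//cdn.example.com/a.js",  // same scheme, different host
  "http://example2.com/x",   // already absolute
];

// new URL(relative, base) resolves each href against the page address.
const absolute = hrefs.map((href) => new URL(href, pageUrl).href);

console.log(absolute);
```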

I still haven't got the PCRE lib working so that it searches with a global flag. :( Without that it's useless for me so far.

I would really like to write some simple add-ons of my own, even though I don't have much to do with JavaScript. Since Firefox Quantum there are 3 simple add-ons I can't use anymore (one of them called SearchWP), and I would rewrite them or make them work with FFQ if I could.

greetz
